DESIGNER: Synthetic Data for LLM Reasoning
Mar 1, 2026 00:00 · 6112 words · 13 minute read
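Original paper: DESIGNER: Design-Logic-Guided Multidisciplinary Data Synthesis for LLM Reasoning
Background: existing reasoning datasets often lack disciplinary breadth, reasoning depth, and diversity.
DESIGNER (short for DESIGN-logic-guidEd Reasoning) is a design-logic-guided pipeline for synthesizing reasoning data. It uses large volumes of raw documents (a book corpus and a web corpus) to generate challenging questions across many disciplines. It introduces the concept of “design logic”, which guides an LLM to imitate how human educators write questions and makes large-scale, automated synthesis of difficult questions possible. An LLM reverse-engineers and abstracts more than 120,000 design logics from questions in different disciplines. Matching these design logics to source documents produces reasoning questions whose type and difficulty level can be controlled.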
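1 Introduction
LLMs still lag behind human experts in university-level, discipline-specific reasoning, mainly because large-scale, high-quality, diverse training data is lacking. Existing datasets focus mostly on mathematics and programming, and open resources for other disciplines are scarce. This scarcity limits the development of multidisciplinary reasoning in LLMs.
Synthesizing data with LLMs is an effective remedy for data scarcity. Existing question synthesis methods fall into two categories: query-centric and document-centric. Query-centric methods expand seed questions by rewriting them, adding constraints, or incorporating chain-of-thought reasoning, but they are limited by the seed pool and by inherent model biases. Document-centric methods generate questions from unstructured (web, books) and structured (knowledge graph) source documents, which ensures broad disciplinary coverage grounded in real knowledge. However, they struggle to control difficulty and diversity, and often degenerate into factual recall. Meanwhile, post-training and even mid-training rely heavily on exam-style data. This raises a key question: how can large numbers of high-quality, multidisciplinary exam questions be synthesized quickly while controlling their difficulty, diversity, and question type?
The way human experts construct questions reflects a “design logic”: turning basic knowledge points into complex, context-rich questions that require multi-stage reasoning. DESIGNER simulates this process by matching logics to raw corpora and synthesizing multidisciplinary questions.
The paper proposes DESIGNER, a design-logic-guided pipeline for synthesizing reasoning data. Its core is the concept of Design Logic, which abstracts the thought process that human education experts follow when designing exam questions.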
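- Large book and web corpora are annotated and filtered along multiple dimensions to build a high-quality source material library. From a question bank of hundreds of millions of questions, a diverse set of difficult questions is clustered and sampled, and an LLM reverse-engineers and abstracts more than 120,000 structured design logics from them to build a reusable design logic library.
- Question synthesis uses a two-stage retrieve-then-generate mechanism:
  - Vector-similarity retrieval produces coarse-grained candidate logics for each source document.
  - An LLM performs a fine-grained evaluation to select the best logic, then generates a reasoning question from the source document by strictly following that logic's steps.
This approach addresses the lack of guiding principles in earlier data synthesis methods and can automatically generate large numbers of diverse, difficult exam questions.
Comprehensive comparison and ablation experiments on the Qwen3 and Llama3 model families confirm that the synthetic data is effective: it significantly improves the multidisciplinary reasoning ability of LLMs.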
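2 Data Processing
Three data sources:
- A proprietary question bank
- A book corpus
- A web corpus
All three are aligned to a taxonomy of 75 disciplines.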
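2.1 Question bank
Using the prompts below, Qwen3-30B-A3B (non-thinking mode) annotated more than 150 million questions from the proprietary question bank with discipline, difficulty, and question type. To obtain a high-quality, diverse subset, embeddings are computed with Qwen3-Embedding-4B and K-means clustering is applied within each discipline, with the number of clusters chosen by a silhouette search. From each cluster, questions are sampled at a fixed difficulty ratio of 3:2:1 (Very Hard : Hard : Medium), and each discipline's count is kept consistent with the overall question bank distribution. If there are not enough high-difficulty questions, lower-difficulty questions fill the gap so that each discipline still reaches its total. This process yields a collection of 132,409 questions used to extract design logics. Any question bank, whatever its initial quality, can be filtered into a high-quality, diverse subset with this pipeline.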
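The per-cluster sampling step can be sketched as follows. This is a minimal illustration, not the authors' code: the `(question_id, difficulty)` schema, the function name `sample_cluster`, the integer rounding of the 3:2:1 shares, and the choice to backfill from the hardest remaining tier first are all assumptions, since the text only states the ratio and that lower-difficulty questions fill any gap.

```python
import random

# Difficulty tiers from hardest to easiest; the 3:2:1 ratio covers the first three.
TIERS = ["Very Hard", "Hard", "Medium", "Easy"]
RATIO = {"Very Hard": 3, "Hard": 2, "Medium": 1}


def sample_cluster(questions, quota, seed=0):
    """Pick up to `quota` question ids from one cluster.

    `questions` is a list of (question_id, difficulty) pairs (hypothetical schema).
    """
    rng = random.Random(seed)
    pools = {tier: [qid for qid, diff in questions if diff == tier] for tier in TIERS}
    for pool in pools.values():
        rng.shuffle(pool)

    total = sum(RATIO.values())
    picked = []
    # First pass: take each tier's share of the quota, as far as its pool allows.
    for tier, weight in RATIO.items():
        want = quota * weight // total
        take = min(want, len(pools[tier]))
        picked += [pools[tier].pop() for _ in range(take)]
    # Second pass (assumed policy): fill any shortfall, hardest remaining tier first.
    for tier in TIERS:
        while len(picked) < quota and pools[tier]:
            picked.append(pools[tier].pop())
    return picked
```

With a quota of 6 and only one Very Hard question available, the first pass takes 1 Very Hard, 2 Hard, and 1 Medium, and the second pass tops the sample up with 2 more Hard questions.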
Discipline classification prompt
You are a professional multidisciplinary data labeling expert specializing in the classification of multidisciplinary academic questions. Please select the ONE most relevant label from the given list of discipline labels for the input question data. For question data that you cannot determine, use the “Unknown Discipline” label. Please directly output “labels”: “(the label you selected)”.
# List of Discipline Labels
[‘Mathematics’, ‘Biology’, ‘Chemistry’, ‘Physics’, ‘Computer Science and Technology’, ‘Philosophy’, ‘Psychology’, ‘Business Administration’, ‘Clinical Medicine’, ‘Economics’, ‘Law’, ‘Political Science’, ‘Statistics’, ‘Electrical Engineering’, ‘Geography’, ‘Mechanical Engineering’, ‘Basic Medicine’, ‘Information and Communication Engineering’, ‘Sociology’, ‘Materials Science and Engineering’, ‘Pharmacy’, ‘Public Health and Preventive Medicine’, ‘Mechanics’, ‘Astronomy’, ‘World History’, ‘Bioengineering’, ‘English and Foreign Languages’, ‘Chemical Engineering and Technology’, ‘Electronic Science and Technology’, ‘Environmental Science and Engineering’, ‘Nuclear Science and Technology’, ‘Control Science and Engineering’, ‘Management Science and Engineering’, ‘Education’, ‘Geophysics’, ‘Art and Design’, ‘Agricultural Engineering’, ‘Aerospace Science and Technology’, ‘Atmospheric Sciences’, ‘Chinese Language and Literature’, ‘Civil Engineering’, ‘Ecology’, ‘Geology’, ‘Nursing’, ‘Optical Engineering’, ‘Public Administration’, ‘Journalism and Communication’, ‘Physical Education’, ‘Marine Sciences’, ‘Safety Science and Engineering’, ‘Architecture’, ‘Transportation Engineering’, ‘Power Engineering and Engineering Thermophysics’, ‘Food Science and Engineering’, ‘Archaeology’, ‘Biomedical Engineering’, ‘Chinese History’, ‘Veterinary Medicine’, ‘Instrument Science and Technology’, ’Hydraulic Engineering’, ‘Stomatology’, ‘Urban and Rural Planning’, ‘Petroleum and Natural Gas Engineering’, ‘Naval Architecture and Ocean Engineering’, ‘Surveying and Mapping Science and Technology’, ‘History of Science and Technology’, ‘Agricultural Resources and Environment’, ‘Remote Sensing Science and Technology’, ‘Information Resources Management’, ‘Mining Engineering’, ‘Forensic Medicine’, ‘Ethnology’, ‘Textile Science and Engineering’, ‘Geological Resources and Geological Engineering’, ‘Animal Husbandry’, ‘Other’, ‘Non-disciplinary’, ‘Unknown Discipline’]
# Example 1
Input: “Consider a photon traveling at the speed of light. How does the photon experience space, and what are the implications of relativistic beaming on its perception of spatial dimensions? Provide a detailed explanation, including any relevant mathematical derivations and physical principles.”
Output: “labels”: “Physics”
# Example 2
Input: “A heavy pole, of mass M and length L, is freely hinged to a wall at the point O. A rope connects the other end of the pole, B, to a fixed point A on the wall above O. The system is in equilibrium, with the pole making an angle of θ with the horizontal, and the rope making an angle of α with the horizontal. Explore how the system’s parameters (M, L, θ, α) affect its equilibrium and stability.”
Output: “labels”: “Mechanics”
# Example 3
Input: “If John rented a car for $150 and had to buy 8 gallons of gas at $3.50 per gallon to fill it up, and the final expense is $0.50 per mile, how much did it cost him to drive 320 miles?”
Output: “labels”: “Mathematics”
# Input Question Data
Input: “{text}”
Output:
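The model is given a list of discipline labels and three examples, and is asked to classify the given question, which replaces the {text} placeholder.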
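Because the prompt asks for a bare `“labels”: “…”` line, the reply needs to be parsed and checked against the label list before it is trusted. A minimal sketch, assuming the model may emit either curly or straight quotes; the function name `parse_discipline` is hypothetical and `ALLOWED` is only a small illustrative subset of the full list in the prompt.

```python
import re

# Illustrative subset only; the full label list is the one given in the prompt.
ALLOWED = {
    "Mathematics", "Physics", "Mechanics",
    "Other", "Non-disciplinary", "Unknown Discipline",
}

# Accept straight or curly quotes around the key and the value.
_LABEL_RE = re.compile(r'labels["“”\']?\s*:\s*["“”\']?([^"“”\'\n]+)')


def parse_discipline(response, allowed=ALLOWED):
    """Return the predicted label, or 'Unknown Discipline' if it cannot be validated."""
    match = _LABEL_RE.search(response)
    label = match.group(1).strip() if match else ""
    return label if label in allowed else "Unknown Discipline"
```

Falling back to “Unknown Discipline” for unparseable or out-of-list replies mirrors the prompt's own instruction for undeterminable questions, though how the original pipeline treats malformed outputs is not stated.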
Difficulty classification prompt
You are an expert in education and examination, specializing in classifying the difficulty levels of multidisciplinary questions. For the given question, please evaluate its difficulty based on the complexity and length of the reasoning required to answer it. Label it as one of the following: **Easy**, **Medium**, **Hard**, or **Very Hard**. Please directly output “Difficulty: (Your chosen label)”.
# Example 1
Input: “Consider a photon traveling at the speed of light. How does the photon experience space, and what are the implications of relativistic beaming on its perception of spatial dimensions? Provide a detailed explanation, including any relevant mathematical derivations and physical principles.”
Output: “Difficulty: Very Hard”
# Example 2
Input: “A heavy pole, of mass M and length L, is freely hinged to a wall at the point O. A rope connects the other end of the pole, B, to a fixed point A on the wall above O. The system is in equilibrium, with the pole making an angle of θ with the horizontal, and the rope making an angle of α with the horizontal. Explore how the system’s parameters (M, L, θ, α) affect its equilibrium and stability.”
Output: “Difficulty: Hard”
# Example 3
Input: “If John rented a car for $150 and had to buy 8 gallons of gas at $3.50 per gallon to fill it up, and the final expense is $0.50 per mile, how much did it cost him to drive 320 miles?”
Output: “Difficulty: Easy”
# Given Question
Input: “{text}”
Output:
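The model is given three examples and asked to classify the difficulty of the given question, which replaces the {text} placeholder.
Question type classification prompt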
You are an expert in education and examination, specializing in classifying question types. For the given question, please evaluate its question type and label it as one of the following: **Problem-solving question**, **Multiple-choice question**, **Proof question**, or **Other question types**. For any question that you cannot determine, use the “Other question types” label. Please directly output “Question type: (Your chosen label)”.
# Example 1
Input:
“Determine the number of $k$-letter sequences composed of the letters $A$ and $B$ such that the sequence contains at least two consecutive $A$’s.”
Output:
“Question type: Problem-solving question”
# Example 2
Input:
“Consider the function $f(x) = \dfrac{e^x}{x}$. The value of the integral
$$
I = \int_{1}^{\infty} \left( \frac{e^x}{x} - \frac{e^{-x}}{x} \right) dx
$$
is ___.”
Output:
“Question type: Other question types”
# Example 3
Input:
“Given that $a \in \{-1, 2, \tfrac{1}{2}, 3, \tfrac{1}{3}\}$, if $f(x) = x^a$ is an odd function and is monotonically increasing on $(0, +\infty)$, then the possible values of the real number $a$ are ( ).
A: $-1, 3$
B: $\tfrac{1}{3}, 3$
C: $-1, \tfrac{1}{3}, 3$
D: $\tfrac{1}{3}, \tfrac{1}{2}, 3$”
Output:
“Question type: Multiple-choice question”
# Given Question
Input: “{text}”
Output:
The model is given three examples and then asked to classify the type of the given question, which replaces the {text} placeholder.
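One practical detail when filling the `{text}` placeholder: this prompt's few-shot examples contain LaTeX braces such as `\dfrac{e^x}{x}`, so Python's `str.format` would treat them as format fields and fail. A minimal sketch of a safer substitution, where `TEMPLATE` is a shortened stand-in for the real prompt and `render` is a hypothetical helper:

```python
# Shortened stand-in for the prompt above; it keeps one LaTeX example with braces.
TEMPLATE = (
    "Example input: $f(x) = \\dfrac{e^x}{x}$\n"
    "# Given Question\n"
    "Input: “{text}”\n"
    "Output:"
)


def render(template, question):
    """Substitute only the literal token {text}; every other brace is left untouched."""
    return template.replace("{text}", question)


prompt = render(TEMPLATE, "Prove that $\\sqrt{2}$ is irrational.")
```

Plain `str.replace` also leaves braces inside the inserted question alone, which matters for a question bank that is heavy on LaTeX.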
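2.2 Book corpus
The corpus is processed chapter by chapter: chapters longer than 5,000 words are split into smaller chunks and deduplicated with MinHash. Discipline labels are assigned by a ModernBERT-large classifier fine-tuned for discipline classification; readability is predicted by a BERT-based model and used to filter out incoherent or poorly organized text; helpfulness (0-5) is scored by fineweb-edu-classifier to quantify educational value. After unreadable snippets are removed, the remaining candidates are ranked by helpfulness score and sampled so that discipline quotas are proportional to the frequency distributions of the book corpus and the question bank. The result is 3 million high-quality snippets, the vast majority with helpfulness ≥ 2.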
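The split-then-deduplicate step can be illustrated with a small, dependency-free MinHash. This is a sketch under stated assumptions, not the pipeline's implementation: the chunk size, the 3-word shingles, the 64 hash functions, the 0.8 threshold, and all function names are illustrative choices that the text does not specify.

```python
import hashlib


def split_chapter(text, max_words=5000):
    """Split a chapter into chunks of at most `max_words` words (assumed chunking rule)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def shingles(text, k=3):
    """Set of k-word shingles; short texts yield a single shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def _hash(shingle, seed):
    # A salted 64-bit hash stands in for one random permutation.
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash(text, num_perm=64):
    """Signature = minimum hash value of the shingle set under each salted hash."""
    sh = shingles(text)
    return [min(_hash(s, seed) for s in sh) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedup(chunks, threshold=0.8):
    """Keep a chunk only if it is not a near-duplicate of an already kept one."""
    kept, signatures = [], []
    for chunk in chunks:
        sig = minhash(chunk)
        if all(estimated_jaccard(sig, other) < threshold for other in signatures):
            kept.append(chunk)
            signatures.append(sig)
    return kept
```

The pairwise comparison in `dedup` is quadratic and only suitable for a demonstration; at corpus scale MinHash is normally combined with locality-sensitive hashing so that only candidates sharing a signature band are compared.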
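2.3 Web corpus
The FineFineWeb corpus is filtered by reasoning content and relabeled by discipline: Qwen3-30B-A3B (non-thinking mode) scores 6.5 billion texts on a 5-point rubric (below), and texts scoring ≥ 3 are kept. The same model then relabels the retained texts so that they conform to the 75-discipline taxonomy.
Reasoning-based filtering prompt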
You will be provided with text from the internet.
Evaluate the following text extract for its potential usefulness for studying reasoning process. Use the following 5-point scoring system described below. Start from 0, points are accumulated based on the satisfaction of each criterion:
(1) Add 1 point if the extract contains any reasoning or thinking process.
(2) Add 1 point if the extract contains any explicit subgoal setting, where the writer breaks down the problem into smaller, intermediate goals. Subgoal setting might look like:
• “First, we need to find ..., then we can determine ...”
• “To solve ..., let’s first ..., then ...”
• “Let’s tackle ... in three parts: (1) ..., (2) ..., and (3) ...”
• “To ..., I’ll first ..., then ...”
(3) Add 1 point if the extract contains any verification steps. We want to mark instances where the writer explicitly checks their own work, such as by comparing the result to a known value or by checking the result of a calculation. Verification steps might look like:
• “Let’s check ...”
• “To verify this is correct, I’ll ...”
• “Let’s test ... with a simple case: ...”
• “To ensure this solution is valid, I’ll check if ...”
(4) Add 1 point if the text contains any backtracking behavior, where the writer realizes a path won’t work and explicitly goes back to try a different approach. An example of backtracking is: “Let me try again”, “Wait”, “I made a mistake”, or “we need to try a different sequence of operations”. We want to mark instances where the writer abandons a thought and backtracks to a previous computation.
(5) Add 1 point if the text contains any backward-chaining behavior, where the writer is working towards a goal but starts from the goal and works backward. It might like:
• “To solve ..., let’s start with what we want to prove: ...Let’s verify this.”
• “If we want to find ..., let’s start with the desired result and work backward.”
• “To determine ..., I know the result ... Working backward from this final state using
# Task Format
Format your response in markdown as follows
## Thoughts
[Brief description describing what behavior was noticed and where subgoal setting may have occurred, less than 100 words]
## Final score
[total points]
# Text to evaluate for reasoning degree
{text}
# Response
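Turning the model's markdown reply into a keep-or-drop decision could look like the sketch below. It assumes the reply follows the requested `## Final score` layout, with the number optionally wrapped in brackets; `parse_score` and `keep` are hypothetical helper names, and treating unparseable replies as rejections is an assumption rather than something the text states.

```python
import re

# Matches "## Final score" followed by an integer, optionally in brackets.
_SCORE_RE = re.compile(r"##\s*Final score\s*\[?\s*(\d+)", re.IGNORECASE)


def parse_score(response):
    """Return the score as an int in 0..5, or None if it is missing or out of range."""
    match = _SCORE_RE.search(response)
    if not match:
        return None
    score = int(match.group(1))
    return score if 0 <= score <= 5 else None


def keep(response, threshold=3):
    """Keep a text only when a valid score at or above the threshold was parsed."""
    score = parse_score(response)
    return score is not None and score >= threshold
```

For example, a reply ending in `## Final score` followed by `3` is kept, while a reply with score `2`, or with no parseable score, is dropped.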
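3 Data Synthesis
Phase 2 and Phase 3 in Figure 2 show the overall data synthesis workflow.
3.1 Design logic extraction
Human educators design exam questions through structured steps, turning knowledge points into complex challenges rather than simple factual recall. A typical process:
- Identify the objective
- Construct the context
- Design the reasoning path
- Formulate the answer
- Add distractors
- Validate the question
Solvers must therefore carry out multi-step reasoning that goes beyond memorization. The paper proposes a question synthesis method based on design logic:
- Using the prompt below, instruct an LLM (DeepSeek-R1-0528) to analyze real questions
- Infer the designer's thought process
- Trace how the question is built from its knowledge points
- Abstract the underlying design principles and express them as a Mermaid diagram
This yields a reusable pool of design logics that guides the generation of new questions from source material.
Design logic extraction prompt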
You are an expert educator and a specialist in exam question design. Below, I have provided an exam question. Your task is to deduce the thought process of the question designer. Analyze how they constructed this question based on the relevant knowledge points. You need to go beyond the specific details of the question and its knowledge points to abstract and summarize the underlying design logic and principles behind the question.
The goal is for me to be able to use this abstracted design logic to create other high-quality, challenging questions that require complex logical reasoning for different knowledge points and source materials.
**Finally, you must organize the abstracted question-design logic you have summarized into English Mermaid format.**
--- Analyze the Question Design Logic from the Following Question ---
**Question:**
{text}
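The model is instructed to reverse-engineer the thought process behind the given question (which replaces the {text} placeholder) and to express the abstracted logic in Mermaid format.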
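Since the reply mixes free-form analysis with the final diagram, the Mermaid part has to be isolated before it can be stored in the logic pool. A minimal sketch, assuming the model wraps the diagram in a fenced block tagged `mermaid` (the prompt asks for Mermaid format but does not mandate a fence); `extract_mermaid` is a hypothetical helper.

```python
import re

# Build the three-backtick fence programmatically so this snippet can itself be fenced.
_FENCE = "`" * 3
_MERMAID_RE = re.compile(_FENCE + r"mermaid\s*\n(.*?)" + _FENCE, re.DOTALL)


def extract_mermaid(response):
    """Return the last fenced Mermaid block in the reply, or None if there is none."""
    blocks = _MERMAID_RE.findall(response)
    return blocks[-1].strip() if blocks else None
```

Taking the last block reflects the prompt's wording that the diagram comes “finally”, after the analysis; replies without a diagram return `None` and can be retried or discarded.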
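3.2 Design logic deduplication
To increase the diversity of design principles, the extracted logics are deduplicated by semantic similarity. Each logic is embedded with Qwen3-Embedding-4B, and the pairwise similarities form a matrix \(S \in \mathbb{R}^{n \times n}\). Within each discipline, a graph \(G = (V, E)\) is built in which nodes represent logics and edges connect node pairs with \(S_{ij} \ge \tau\). The connected components of the graph correspond to groups of redundant design logics. From each group, the item with the highest total similarity to the other members is kept. With \(\tau = 0.85\), this graph-based deduplication procedure (algorithm below) yields 125,328 unique design logics.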
Graph-based Deduplication via Centroid Selection
Input: A set of items \(\mathcal{D} = \{d_1, \ldots, d_n\}\), a similarity matrix \(S \in \mathbb{R}^{n \times n}\), a similarity threshold \(\tau\).
Output: A deduplicated set of representative items \(\mathcal{R}\).
Initialize an undirected graph \(G = (V, E)\) where \(V = \{1, \ldots, n\}\) and \(E = \varnothing\).
Initialize the set of representatives \(\mathcal{R} = \varnothing\).
// Build a similarity graph where nodes are items and edges connect similar items.
for i = 1 to n do
    for j = i + 1 to n do
        if S_{ij} > τ then
            Add edge (i, j) to E.
        end if
    end for
end for
// Identify clusters of duplicates by finding connected components.
Let 𝒞 ← FindConnectedComponents(G).
// Select the most representative item (centroid) from each cluster.
for each connected component C ∈ 𝒞 do
    Find centroid index i* = argmax_{i ∈ C} ∑_{j ∈ C, j ≠ i} S_{ij}.
    Add item d_{i*} to 𝓡.
end for
return 𝓡.
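The listing above translates fairly directly into Python. The sketch below uses a union-find structure in place of an explicit graph, which is an implementation choice of this example rather than something the algorithm prescribes; it uses the strict `>` comparison from the listing, and the name `dedup_by_centroid` is hypothetical.

```python
def dedup_by_centroid(items, S, tau):
    """Graph-based deduplication: one representative per connected component.

    `S` is a symmetric n x n similarity matrix given as nested lists.
    """
    n = len(items)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Adding edge (i, j) is equivalent to merging the components of i and j.
    for i in range(n):
        for j in range(i + 1, n):
            if S[i][j] > tau:
                parent[find(i)] = find(j)

    # Group node indices by the root of their component.
    components = {}
    for i in range(n):
        components.setdefault(find(i), []).append(i)

    # Centroid: the member with the largest summed similarity to the other members.
    representatives = []
    for members in components.values():
        best = max(members, key=lambda i: sum(S[i][j] for j in members if j != i))
        representatives.append(items[best])
    return representatives
```

Note that connected components chain transitively: if items 0 and 1 are similar and items 1 and 2 are similar, all three land in one group even when the similarity between 0 and 2 is below the threshold, and the centroid rule then tends to keep the middle item.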
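3.3 Question synthesis
To avoid the combinatorial explosion of exhaustively matching design logics against text snippets, a retrieval-augmented approach is used. For each discipline-specific corpus, Qwen3-Embedding-4B with a task-specific instruction computes embeddings of the text snippets and the design logics, and the cosine similarity between them is computed from these embeddings. The five most similar design logics are kept as candidates.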
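The coarse retrieval stage reduces to a top-k search by cosine similarity. A minimal sketch, assuming the embeddings have already been computed elsewhere and are passed in as plain lists of floats; the function names are hypothetical, and a real system at this scale would use a vectorized or approximate nearest-neighbor search rather than a Python loop.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_logics(snippet_vec, logic_vecs, k=5):
    """Indices of the k design logics most similar to the snippet, best first."""
    scored = [(cosine(snippet_vec, vec), idx) for idx, vec in enumerate(logic_vecs)]
    # The sort is stable, so ties keep their original index order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]
```

Retrieval costs one similarity computation per (snippet, logic) pair within a discipline, but the expensive LLM call is then made once per snippet over five candidates instead of once per pair.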
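DeepSeek-R1-0528 is then prompted to:
- Select the most suitable logic from the five candidates
- Synthesize a graduate-level exam question by strictly following its steps
These two stages form a coarse-to-fine ranking: similarity-based retrieval provides coarse recall, and the LLM refines the match to ensure precise alignment between text and logic, which improves question quality.
The LLM also generates a concise reference answer for each question.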
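Question deduplication and decontamination
A two-stage filtering pipeline is applied:
- Deduplicate with MinHash
- Run 13-gram decontamination against all evaluation benchmarks to prevent leakage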
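The 13-gram check in the second stage can be sketched as a set lookup. This is an illustrative version only: lowercasing, whitespace tokenization, and the helper names are assumptions, since the text does not say how tokens are defined or normalized.

```python
def ngrams(text, n=13):
    """Set of n-token windows; texts shorter than n tokens yield an empty set."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts, n=13):
    """Collect every n-gram that appears in any benchmark item."""
    index = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def is_contaminated(question, index, n=13):
    """A question is flagged if it shares at least one n-gram with the index."""
    return not ngrams(question, n).isdisjoint(index)
```

A synthesized question that reproduces any run of 13 consecutive benchmark tokens is flagged, while a 12-token overlap passes, which keeps short common phrases from triggering removals.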
Design logic retrieval instruction
Given a book snippet, retrieve the most suitable question-design logic in Mermaid format for creating a challenging exam question from the book snippet.
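This is the task-specific instruction used to retrieve the design logic best suited to a given text snippet. Qwen3-Embedding-4B computes embeddings of the snippets and the design logics, enabling similarity-based retrieval.
3.4 Response synthesis
To show that the synthesized questions can strengthen a model's CoT ability, Qwen3-235B-A22B-Thinking-2507-FP8 generates a long CoT response for each synthesized question. These question-response pairs are then used for supervised fine-tuning (SFT). Given high-quality questions, any model can generate the responses, including stronger future models that achieve higher accuracy.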