An Easy Way to Copy Human Reasoning

Big thanks to Lilian Weng’s blog for sparking many ideas here.

Since the Dartmouth Workshop, humanity has relentlessly pursued the “general thinking machine.” For humans, reasoning is a remarkably natural endeavor. We possess rapid judgment capabilities, often termed intuition, allowing us to assess the nature of a situation in mere seconds. Concurrently, we are equipped with the capacity for long-term planning and profound deliberation, enabling us to dedicate years, or even decades, to proving a mathematical theorem or discovering a new law of the universe (cf. Kahneman 2013).

However, replicating these multifaceted reasoning capabilities in machines has proven immensely challenging. Early symbolic AI sought to mechanize reasoning by constructing systems built on explicit symbols, predefined rules, and structured knowledge bases, often employing static planning mechanisms. Although foundational, such approaches showed limited adaptability when faced with uncertainty, the vast scope of commonsense knowledge, or the need to learn novel patterns, which hindered their ability to generalize across diverse and open-ended domains.

With the recent development of LLMs, the field has found that a combination of techniques enables neural networks to simulate human-like reasoning processes, thereby reproducing certain aspects of human reasoning ability. Moreover, these methods demonstrate notable transferability, applying learned reasoning-like skills effectively to diverse tasks, including ones not seen during their original training.

Fine-Tuning with Latent Processes

One core idea is “Latent Variable Modeling” (Weng et al. 2025), where a latent variable $z$ is introduced to explain the complex distribution of a visible variable $y$.

For instance, consider a math problem with statement $x$. The latent variable $z$ can be regarded as the intermediate problem-solving process, while the visible variable $y$ serves as the ground truth answer. We can then reformulate the original probabilistic model $P(y\mid x)$ into a richer expression for optimization:

\[P(y\mid x) = \sum\limits_z P(y\mid x,z)P(z\mid x)\]

In this framework, the latent variable $z$, which is characterized by a distribution (e.g., $P(z\mid x)$ as used in the formula), represents a series of intermediate steps taken to solve problem $x$ and arrive at answer $y$. For these intermediate steps, we can adopt the now-standard term chain-of-thought (CoT) (Wei et al. 2022).

Example: Let $x$ be the problem statement “solve $2a + 5 = 13$ for $a$,” let the latent $z$ be a reasoning path such as “subtract 5 from both sides to get $2a = 8$, then divide by 2,” and let $y$ be the ground-truth answer $a = 4$.
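To make the marginalization concrete, here is a minimal Python sketch that evaluates $P(y\mid x) = \sum_z P(y\mid x,z)P(z\mid x)$ for this toy problem. The two reasoning paths and all probabilities are made up for illustration, not drawn from any model.

```python
# Toy illustration of P(y|x) = sum_z P(y|x, z) P(z|x).
# The two reasoning paths and their probabilities are made up for illustration.
paths = {
    "subtract 5, then divide by 2":   {"p_z_given_x": 0.7, "p_y_given_xz": 0.99},
    "guess-and-check small integers": {"p_z_given_x": 0.3, "p_y_given_xz": 0.60},
}

p_y_given_x = sum(v["p_z_given_x"] * v["p_y_given_xz"] for v in paths.values())
print(f"P(y|x) = {p_y_given_x:.3f}")  # 0.7*0.99 + 0.3*0.60 = 0.873
```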

Another core technique is supervised fine-tuning (SFT; Radford et al. 2018). This typically involves a dataset $D$ of input prompts $x$ (e.g., questions) and corresponding ground truth outputs $y$ (e.g., answers). The objective is to maximize the log-likelihood of the model producing the correct output:

\[L_{\text{SFT}}=\sum\limits_{(x,y)\in D} \log P(y\mid x)\]

Now, we can enhance SFT by incorporating hypothesized intermediate reasoning steps, i.e., the chain-of-thought steps denoted by $z$. Even if these CoT steps are not directly observed in our training dataset $D$ (which still consists of $(x,y)$ pairs), we can model $P(y\mid x)$ by marginalizing over all possible latent reasoning paths $z$. The SFT objective then becomes to maximize this marginal log-likelihood:

\[L_{\text{SFT}} = \sum\limits_{(x,y)\in D} \log \left( \sum_{z} P(y\mid x,z)\,P(z\mid x) \right)\]

This objective encourages the model to achieve a high probability for the correct answer $y$ by implicitly considering various underlying reasoning paths $z$ that could lead from $x$ to $y$. It acknowledges that for a given pair $(x,y)$, there might be multiple valid reasoning paths (sequences of CoT steps) $z$. For example, a math problem may have more than one way to arrive at the final answer, and this formulation accounts for these possibilities by summing over all potential latent paths.
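As a rough sketch (not the exact training recipe from any of the cited papers), the inner sum can be approximated by enumerating a handful of candidate reasoning paths and combining their per-path log-probabilities with a log-sum-exp. The inputs below are hypothetical stand-ins for scores a language model would assign to each path.

```python
import math

def marginal_log_likelihood(log_p_z_given_x, log_p_y_given_xz):
    """Approximate log P(y|x) = log sum_z P(y|x,z) P(z|x).

    Both arguments are lists of log-probabilities, one entry per candidate
    reasoning path z (hypothetical model scores). The approximation is crude:
    it treats the listed paths as if they covered the whole sum over z.
    """
    joint = [lz + ly for lz, ly in zip(log_p_z_given_x, log_p_y_given_xz)]
    m = max(joint)  # log-sum-exp trick for numerical stability
    return m + math.log(sum(math.exp(j - m) for j in joint))

# Two candidate paths with made-up scores:
print(marginal_log_likelihood([-0.36, -1.20], [-0.01, -0.51]))
```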

Constructing Reasoning Data

While a system’s internal latent variable $z$ (i.e., its intermediate thought processes) cannot be directly observed, we can nonetheless approximate or emulate such reasoning steps. This can be achieved by manually constructing or synthetically generating datasets composed of explicit CoT examples $(x,z_{gt},y)$, where $z_{gt}$ represents a desired sequence of reasoning steps. Such curated CoT data can then be leveraged to more directly guide and supervise the model’s training.
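For illustration, an explicit CoT training example $(x, z_{gt}, y)$ might be stored and serialized as follows; the field names and the prompt template are assumptions, not a fixed standard.

```python
# A single hand-constructed CoT example (x, z_gt, y); field names are illustrative.
example = {
    "question": "Solve 2a + 5 = 13 for a.",
    "cot": "Subtract 5 from both sides: 2a = 8. Divide both sides by 2: a = 4.",
    "answer": "a = 4",
}

# One common way to turn it into an SFT target: concatenate x, z_gt, and y,
# so the model is trained to emit its reasoning before the final answer.
sft_text = (
    f"Question: {example['question']}\n"
    f"Reasoning: {example['cot']}\n"
    f"Answer: {example['answer']}"
)
print(sft_text)
```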

STEM Application Focus

Earlier studies (e.g., Ling et al. 2017; Cobbe et al. 2021; Kojima et al. 2022), along with more recent research (e.g., o1-preview; DeepSeek-AI 2025; Ren et al. 2025), indicate that this approach is frequently employed in STEM disciplines, particularly within the domain of mathematics.

Figure 1: Early research on the construction of Chain-of-Thought data within the mathematics domain.
(Image source: (Left) Ling et al. 2017; (Top right) Cobbe et al. 2021; and (Bottom right) Kojima et al. 2022.)

Figure 2: Recent examples of constructed Chain-of-Thought data.
(Image source: (Left) DeepSeek-AI 2025; (Right) Ren et al. 2025)

Chain-of-Thought Reasoning Methods

Researchers employ several strategies to elicit or guide the reasoning steps of models, thereby enhancing their reasoning performance.

One class of methods involves structuring the training data to explicitly showcase reasoning. Instead of providing only the final answers, datasets are constructed to include clear, intermediate reasoning steps and processes that lead to the solution (e.g., Wei et al. 2022). Alternatively, reasoning can be prompted at inference time. This can be achieved by incorporating specific prompts designed to significantly improve the model’s reasoning output, ranging from concise instructions like "Let’s think step by step" (Kojima et al. 2022) to more elaborate ones such as "Let's first understand the problem and devise a plan to solve the problem. Then, let's carry out the plan and solve the problem step by step" (Wang et al. 2023). Furthermore, models can be encouraged to retrieve and utilize relevant knowledge or analogous solutions, for instance, by instructing them to "recall relevant exemplars" before generating a formal response (Yasunaga et al. 2023).
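A minimal sketch of eliciting reasoning at inference time: the trigger phrases are quoted from the cited papers, while `generate` is a hypothetical placeholder for whatever text-generation call is available.

```python
ZERO_SHOT_COT = "Let's think step by step."  # Kojima et al. 2022
PLAN_AND_SOLVE = (  # Wang et al. 2023
    "Let's first understand the problem and devise a plan to solve the problem. "
    "Then, let's carry out the plan and solve the problem step by step."
)

def build_reasoning_prompt(question: str, trigger: str = ZERO_SHOT_COT) -> str:
    """Append a reasoning trigger so the model emits intermediate steps."""
    return f"Q: {question}\nA: {trigger}"

def answer(question: str, generate) -> str:
    """`generate` is a placeholder for any LLM text-generation API."""
    return generate(build_reasoning_prompt(question))
```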

Another prominent approach is knowledge distillation (Hinton et al. 2015). In this paradigm, the reasoning capabilities of highly proficient large models (teacher models) are distilled into more specialized or smaller models (student models). This might involve training student models on datasets composed of outputs, including reasoning traces, generated by teacher models (e.g., Open-R1 project with Open-R1 datasets). This distillation process can be augmented with sophisticated sampling strategies, such as generating a diverse set of CoT reasoning traces for a given problem and then employing a selection mechanism like majority voting to identify and utilize the highest-quality or most consistent reasoning paths (Wang et al. 2023).
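A sketch of the selection step used with sampled reasoning traces, in the spirit of self-consistency (Wang et al. 2023): sample several CoT completions, extract each final answer, and keep the traces whose answer wins a majority vote. `sample_cot` and `extract_answer` are hypothetical helpers.

```python
from collections import Counter

def majority_vote_traces(question, sample_cot, extract_answer, n=8):
    """Sample n CoT traces, majority-vote on final answers, keep agreeing traces.

    `sample_cot(question)` returns one reasoning trace (hypothetical helper);
    `extract_answer(trace)` parses the final answer out of a trace.
    """
    traces = [sample_cot(question) for _ in range(n)]
    answers = [extract_answer(t) for t in traces]
    best_answer, _ = Counter(answers).most_common(1)[0]
    kept = [t for t, a in zip(traces, answers) if a == best_answer]
    return best_answer, kept  # kept traces can be reused as distillation data
```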

While studies indicate that longer CoT (i.e., more ‘thinking time’ tokens) correlates positively with downstream task accuracy (Muennighoff et al. 2025), the aforementioned methods primarily enhance reasoning performance in two ways: 1) they elicit and make explicit a base model’s inherent, potentially latent, reasoning pathways (i.e., the latent variable $z$), or 2) they train the model to replicate explicit reasoning steps learned from labeled datasets or distilled from teacher models.

Figure 3: Positive correlation between the length of thinking time tokens and downstream task accuracy.
(Image source: s1 experiment from Muennighoff et al. 2025)

RL for LLM Reasoning

To cultivate reasoning capabilities with reduced reliance on human-annotated data, alternative approaches have been developed. A prominent strategy enables models to progressively refine their reasoning abilities by autonomously working through sequences of complex problems via reinforcement learning (RL), circumventing the need for direct human annotation of solution paths (e.g., DeepSeek-AI 2025; Zeng et al. 2025). Particularly within STEM domains, this self-improvement cycle can be augmented by integrating external tools and utilities for the automated verification and scoring of model-generated answers (Zhao et al. 2025). Reasoning traces validated as correct and high-quality can then be curated into new datasets, facilitating further SFT on the existing RL-trained checkpoint.
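A rough sketch of that curation step, assuming an automated checker exists for the domain: sample candidate solutions from the current checkpoint, keep only the verified ones, and add them to a new SFT dataset. All helper names here are hypothetical.

```python
def curate_verified_traces(problems, sample_solution, verify, samples_per_problem=4):
    """Collect (problem, trace) pairs whose answers pass an automated verifier.

    `sample_solution(problem)` draws one reasoning trace plus final answer from
    the current checkpoint; `verify(problem, trace)` returns True if the answer
    checks out (e.g., via a unit test or a symbolic math tool). Both are
    hypothetical stand-ins.
    """
    sft_data = []
    for problem in problems:
        for _ in range(samples_per_problem):
            trace = sample_solution(problem)
            if verify(problem, trace):
                sft_data.append({"prompt": problem, "target": trace})
    return sft_data  # used for a further round of SFT on the RL checkpoint
```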

RL training is highly contingent on the design of effective reward signals. For STEM problems, reward functions are commonly designed to score solutions on both formal correctness (e.g., adherence to a required output format) and content accuracy. More sophisticated designs involve Process Reward Models (PRMs), which evaluate each step within a reasoning chain, allowing subsequent steps to be conditioned on correctly assessed prior ones and thereby pruning or down-weighting erroneous intermediate derivations (Lightman et al. 2023; Jiang et al. 2024).
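A minimal sketch of an outcome-based reward in the spirit of the format-plus-accuracy design mentioned above; the specific tags, weights, and string-matching check are assumptions, not the cited papers' implementations.

```python
import re

def outcome_reward(completion: str, reference_answer: str) -> float:
    """Score a completion on format (reasoning wrapped in tags) and accuracy.

    Assumes the model is asked to reply as:
    <think> ... reasoning ... </think><answer> ... </answer>
    Tag names and reward weights are illustrative choices.
    """
    reward = 0.0
    match = re.fullmatch(r"<think>.*?</think>\s*<answer>(.*?)</answer>",
                         completion, flags=re.DOTALL)
    if match:
        reward += 0.2                          # format reward
        if match.group(1).strip() == reference_answer.strip():
            reward += 1.0                      # accuracy reward
    return reward
```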


Figure 4: (Left) An outcome-based reward function emphasizing format and accuracy. (Right) A PRM designed for step-wise evaluation.
(Image source: (left) DeepSeek-AI 2025; (right) Jiang et al. 2024)

A significant limitation of such RL-based training is that these reward designs are often highly domain-specific. For instance, while a PRM can be effectively trained for a particular class of mathematical problems, its applicability may not readily generalize to open-world scenarios or across diverse, unverifiable domains.

Reasoning for Unverifiable Domains

For problems situated in domains that are not easily verifiable, a common strategy is to first formalize their core elements into models amenable to logical reasoning and indirect validation. Although fields such as law, healthcare, and open-world games (e.g., Project Zomboid, Minecraft) present complex, open-ended challenges, they often possess inherent, structured logic that can be leveraged for modeling and verification.

For instance, in law, case outcomes can often be inferred by determining if all constituent elements of a relevant statute are satisfied. In healthcare, the accumulation of multiple risk factors significantly alters the probability of disease onset. Similarly, in open-world games (like Minecraft), achieving a specific objective strictly depends on the sequential completion of prerequisite operations that follow explicit recipes and a defined order (Wang et al. 2023; Hafner et al. 2023).
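As a small illustration of the structured, verifiable logic such domains expose, here is a sketch that checks whether a sequence of crafting actions respects a prerequisite graph like Minecraft's technology tree; the graph contents are simplified and illustrative.

```python
# A simplified prerequisite graph: each item lists what must already be obtained.
PREREQUISITES = {
    "wooden_pickaxe": {"planks"},
    "stone_pickaxe": {"wooden_pickaxe", "cobblestone"},
    "iron_pickaxe": {"stone_pickaxe", "iron_ingot"},
    "diamond": {"iron_pickaxe"},
}

def plan_is_valid(plan):
    """Return True if every step's prerequisites were completed earlier in the plan."""
    obtained = set()
    for step in plan:
        if not PREREQUISITES.get(step, set()) <= obtained:
            return False
        obtained.add(step)
    return True

print(plan_is_valid(["planks", "wooden_pickaxe", "cobblestone",
                     "stone_pickaxe", "iron_ingot", "iron_pickaxe", "diamond"]))  # True
```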

Figure 5: Visualization of the technology tree and associated recipe list for mining operations in Minecraft.
(Image source: Altera et al. 2024)

Unlike many STEM problems, merely modeling formal structures and direct causal relationships may not suffice to capture the broader state space inherent in these open domains. Consequently, another prevalent approach involves integrating models with external search tools. These tools can query vast information sources such as the internet (e.g., Yao et al. 2023; DeepResearch) or access local databases and user-specific data (like personal health records). Based on the retrieved information, the model can then engage in step-by-step reasoning, with the added capability to reflect upon new data, correct previous inferences, and refine its reasoning decisions.
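A rough sketch of this reason-then-search loop, loosely in the style of ReAct (Yao et al. 2023); `think`, `search`, and `is_final` are hypothetical stand-ins for the model call, the external retrieval tool, and a stopping check.

```python
def search_augmented_reasoning(question, think, search, is_final, max_steps=5):
    """Alternate between model reasoning and external retrieval.

    `think(question, notes)` returns the next thought or query (model call);
    `search(query)` returns retrieved text (web, database, or local records);
    `is_final(thought)` decides whether the thought already contains the answer.
    All three helpers are hypothetical.
    """
    notes = []
    for _ in range(max_steps):
        thought = think(question, notes)
        if is_final(thought):
            return thought                 # reasoning ends with an answer
        evidence = search(thought)         # otherwise, treat the thought as a query
        notes.append((thought, evidence))  # reflect on new data in the next step
    return think(question, notes)          # force a final answer
```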

The Future of Reasoning

By utilizing the core idea of CoT with a suite of training techniques like SFT and RL, commendable reasoning performance has been achieved in STEM domains, enabling models to generate clear, human-readable intermediate problem-solving steps. Furthermore, by modeling knowledge structures and integrating external search tools, it may be possible to establish rigorous verification conditions even in open-ended domains. This could potentially allow models to exhibit robust reasoning capabilities in such open-world scenarios (e.g., building a diamond palace from scratch in Minecraft or surviving alone for 100 days in Project Zomboid), and it is anticipated that reasoning experiments will increasingly transition from STEM fields into these broader contexts.

Concurrently, a distinct perspective posits that current AI, by learning primarily from human knowledge and data, might also inherit human limitations. To enable AI to potentially surpass human intellect, future AI should increasingly learn from its own experiences interacting with the world (e.g., Silver et al. 2025). Realizing this vision necessitates substantial and sustained improvements in model performance within open-ended domains. This involves a shift from domain-specific reasoning, which often relies on extensive scaffolding and bespoke engineering, towards a generalized reasoning ability applicable across a vast spectrum of tasks.

Another promising direction involves using the test-time scaling law (e.g., Snell et al. 2024; Wu et al. 2025; Muennighoff et al. 2025) to endow models with online learning capabilities. Through RL and sampling strategies, high-quality reasoning processes can be identified, collected, and subsequently fed back to refine the base model, iteratively enhancing specific performance aspects (e.g., Behrouz et al. 2024; DeepSeek-AI 2025).
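A minimal sketch of one such loop, assuming a scoring model is available: spend extra test-time compute by sampling several candidate traces per prompt, keep the best-scoring one, and accumulate the winners as data for a later refinement round. `sample_trace` and `score` are hypothetical helpers.

```python
def best_of_n_and_collect(prompts, sample_trace, score, n=8):
    """Best-of-n selection at test time, collecting winners for later fine-tuning.

    `sample_trace(prompt)` draws one reasoning trace from the base model and
    `score(prompt, trace)` returns a quality score (e.g., from a reward model);
    both are hypothetical stand-ins.
    """
    collected = []
    for prompt in prompts:
        candidates = [sample_trace(prompt) for _ in range(n)]
        best = max(candidates, key=lambda t: score(prompt, t))
        collected.append({"prompt": prompt, "target": best})
    return collected  # fed back to refine the base model in a later round
```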