
Microsoft Research Lab – Asia

Teaching LLMs to think: Xian Zhang on advancing mathematical reasoning in AI


Math is more than a school subject—it’s the engine behind scientific discovery, driving advances in everything from climate modeling to AI.

At Microsoft Research Asia, senior researcher Xian Zhang is leading efforts to help AI move beyond surface-level pattern recognition toward deeper, rules-based reasoning. In a recent interview, he explained how this shift could significantly expand what large language models (LLMs) are capable of.


Q: Why is mathematical reasoning important for the development of LLMs and AI in general?

Zhang: Mathematical reasoning plays a central role in AI development. As models acquire this skill, they improve in broader reasoning tasks by learning structured approaches and logical patterns.

Math helps AI manage complexity, improving performance in code optimization, common-sense reasoning, and semantic understanding—with gains in both accuracy and efficiency.

Improving LLMs’ understanding of mathematical structure is a step toward building AI that can handle the rigor and precision required in scientific and technical fields, ultimately accelerating the pace of discovery.

Q: Where does AI currently stand in terms of mathematical reasoning, and what are the main challenges?

Zhang: AI’s ability to reason mathematically still depends heavily on the breadth and quality of its training data. With rich and diverse data, LLMs can solve complex problems—even some at the level of math Olympiads—by generalizing from similar patterns. But when data is sparse or uneven, models can falter even on basic arithmetic problems.

Because LLMs work by recognizing and replicating patterns, they may hallucinate solutions or miss the underlying logic entirely when data in a given area is thin. In theory, if their training data were sufficiently comprehensive, they could rival top human problem-solvers.

We often compare this to “brute-force” versus “genius” learning. With enough practice, most people can solve difficult problems; geniuses, by contrast, grasp deep patterns quickly. Most high performers combine both: extensive exposure and rapid internalization. Today’s LLMs are still closer to the brute-force learner, needing far more training data than humans to achieve comparable results on a single task.

Q: What research has your team conducted in this field?

Zhang: We approach mathematical reasoning from a rules-based rather than a data-driven perspective. Our goal is to help LLMs learn the fundamental principles of math and independently apply them to new problems.

To achieve this, we emphasize formalization and symbolization: translating natural-language math problems into formal mathematical expressions. The process works like language translation for math. Once the model comprehends the symbolic representation, it can understand the underlying logic, performing operations as reliably as a calculator while maintaining strong generalization capabilities.
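
To make the idea concrete, here is a minimal Python sketch of symbolization using SymPy; it illustrates the general principle rather than the team’s actual pipeline, and the translation step that an LLM would perform is written by hand.

```python
# Minimal sketch of "formalization": a natural-language problem is
# translated into a symbolic expression that exact, rule-based
# machinery can manipulate. (Illustrative only; SymPy stands in for
# a formal system, and the translation step is done by hand here.)
import sympy as sp

x = sp.symbols("x", real=True)

# Formal counterpart of: "Find all real x such that x^2 - 5x + 6 = 0."
equation = sp.Eq(x**2 - 5 * x + 6, 0)

# Once the problem is symbolic, solving is exact rule application,
# like a calculator, rather than statistical pattern matching.
solutions = sp.solve(equation, x)
print(solutions)  # [2, 3]
```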

We’ve successfully applied this process to complex inequality proofs at the Olympiad level, demonstrating that models can learn and apply mathematical rules. This success establishes a foundation for extending these capabilities to broader areas, including algebra, geometry, and number theory.
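
For a flavor of what machine-checked reasoning looks like, here is a toy Lean 4 theorem, assuming Mathlib is available. It is far simpler than an Olympiad inequality, but it shows the format: every proof step is verified against formal rules, so nothing can be hallucinated.

```lean
import Mathlib

-- Toy formal inequality: for all real a and b, 2ab ≤ a² + b².
-- `nlinarith` closes the goal from the hint (a - b)² ≥ 0, and Lean's
-- kernel checks each inference step, so the proof cannot be faked.
theorem two_mul_le_sq_add_sq (a b : ℝ) : 2 * a * b ≤ a ^ 2 + b ^ 2 := by
  nlinarith [sq_nonneg (a - b)]
```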

We also developed techniques for generating synthetic math data. Through formalization, we created diverse problem-answer pairs and composite theorems, much as a teacher designs new variations of a problem to teach students. This approach increases both the volume and diversity of training data, enhancing the model’s exposure and adaptability.
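
The sketch below shows the spirit of such a generation loop in Python; the linear-equation template and helper name are hypothetical stand-ins, not the published method. A formal template is perturbed at random, and a symbolic solver supplies a correct-by-construction answer for each variant.

```python
# Toy synthetic-data generator (illustrative template, not the team's
# actual method): perturb a formal problem template and derive each
# ground-truth answer symbolically, so every problem-answer pair is
# correct by construction.
import random
import sympy as sp

x = sp.symbols("x", real=True)

def make_linear_problem(rng: random.Random) -> tuple[str, list]:
    """Instantiate 'solve a*x + b = c for x' with random integers."""
    a = rng.choice([n for n in range(-9, 10) if n != 0])
    b, c = rng.randint(-20, 20), rng.randint(-20, 20)
    statement = f"Solve {a}*x + {b} = {c} for x."
    answer = sp.solve(sp.Eq(a * x + b, c), x)  # exact symbolic answer
    return statement, answer

rng = random.Random(0)
for _ in range(3):
    text, ans = make_linear_problem(rng)
    print(text, "->", ans)
```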

However, relying solely on large-scale data and computation, the so-called “scaling law” approach, is unsustainable. Instead, we favor a structured, rules-based methodology spanning problem generation, problem understanding, and proof development. This enables LLMs to reason deeply rather than simply mimic patterns.

Q: What’s the difference between mathematical and common-sense reasoning? And why do models struggle with problems like “The city gate and the pole”?

Zhang: Mathematical reasoning relies on structured knowledge, clear rules, and precise procedures, making it highly logical. In contrast, common-sense reasoning draws on everyday human experience and intuition, requiring an understanding of physical contexts, language nuances, and practical scenarios.

To solve the “city gate and the pole” problem, one needs to understand how objects behave in space, not just how to compute. LLMs trained primarily on text lack an internal model of the physical world: they don’t truly “know” what a pole or a city gate looks like, and they have no spatial awareness. The man’s attempts stay in a 2D plane, along the x- and y-axes, while the real solution uses the third dimension: carrying the pole along the z-axis, straight through the gate. Lacking spatial awareness and an understanding of 3D, an LLM never considers that axis and gives a 2D answer.

More broadly, LLMs interpret only the surface meaning of words, missing the conceptual and spatial reasoning such problems require. Addressing this limitation is a major challenge.

Did you know? The “city gate and the pole” fable originated in China. In the state of Lu, a man tried to carry a long pole through a city gate. First he held it upright, but it was too tall to pass through. Then he turned it horizontally, but it was too long to fit. While he stood there puzzled, an old man approached and said, “I am not a sage, but I have seen many things. Why not saw the pole in half and carry it in that way?” The man followed the advice, and it worked, though at the cost of a pole now sawed in half. (The answer neither of them saw: carry the pole lengthwise, pointed through the gate.)

Q: With so many kinds of math problems, can AI develop a “universal brain” for math? What is the core of mathematical reasoning in LLMs?

Zhang: LLMs can already handle many math problems at the high school and even college level. At these stages, knowledge is relatively structured and the types of problems are predictable. With enough relevant data, models can identify patterns and apply appropriate rules.

However, cutting-edge mathematical research presents a much greater challenge. Consider Gödel’s incompleteness theorems, which show that any consistent axiomatic system rich enough to express arithmetic contains true statements that cannot be proven within that system. LLMs operate within fixed rule sets and are limited when they encounter such propositions.

What distinguishes human intelligence is its ability to transcend existing systems and invent new ones. Einstein’s theory of relativity emerged by breaking free from the boundaries of classical mechanics. Similarly, for AI to contribute to the frontiers of mathematics, it must evolve beyond rigid systems and construct new axiomatic frameworks—essentially inventing the math needed to solve previously unsolvable problems.

Q: How do you see the future of AI in mathematical reasoning, and what practical value could it create?

Zhang: Just as people use calculators and textbooks when solving math problems, future LLMs will also need the ability to use external tools. This tool-using ability will be vital not just in math but also in programming and decision-making.
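
The pattern he describes can be sketched schematically in Python as below; the dispatch table and the hard-coded “model decision” are placeholders for what an LLM would emit, not any particular product’s API.

```python
# Schematic tool-use loop: the "model" plans, but exact arithmetic is
# delegated to an external calculator tool. The tool registry and the
# hard-coded tool call are illustrative placeholders for LLM output.
from fractions import Fraction

def calculator(expression: str) -> Fraction:
    """Exact arithmetic; sidesteps the model's unreliable mental math."""
    # eval is confined to Fraction arithmetic for this toy example.
    return eval(expression, {"__builtins__": {}}, {"F": Fraction})

TOOLS = {"calculator": calculator}

# Stand-in for an LLM deciding which tool to invoke and with what input.
tool_call = {"tool": "calculator", "input": "F(1, 3) + F(1, 6)"}

result = TOOLS[tool_call["tool"]](tool_call["input"])
print(result)  # 1/2
```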

The most immediate use for improved mathematical reasoning lies in education. AI models with strong reasoning skills could support personalized learning, explain concepts clearly, and help students build a deeper understanding of math.

In industry, formalized mathematical reasoning could significantly strengthen software development, particularly when it comes to code reliability and stability. This aligns with a growing research trend toward code formalization and verification.

In mathematical research, we don’t expect AI to replace mathematicians. Instead, it could serve as a creative partner—offering new ideas or unconventional approaches that inspire breakthroughs. Several mathematicians are already exploring this kind of collaboration, using AI’s divergent “thinking” to tackle unsolved problems. This human-machine synergy could reshape the future of scientific discovery.
