
By U. Chicago
These days, large language models can handle increasingly complex tasks, writing intricate code and engaging in sophisticated reasoning.
But when it comes to four-digit multiplication, a task taught in elementary school, even state-of-the-art systems fail. Why?
A new paper by University of Chicago computer science PhD student Xiaoyan Bai and faculty codirector of the Data Science Institute’s Novel Intelligence Research Initiative Chenhao Tan finds answers by reverse-engineering failure and success.
They worked with collaborators from MIT, Harvard University, University of Waterloo and Google DeepMind to probe AI’s “jagged frontier”—a term for its capacity to excel at complex reasoning yet stumble on seemingly simple tasks.
As you may remember (or have forgotten), multiplying larger numbers requires carrying over digits, and mentally “holding on” to partial products so you can add them up to get your final total. Processes that require storing information for later use in this way are called “long-range dependencies.”
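To make the bookkeeping concrete, here is a minimal Python sketch of schoolbook multiplication. It is only an illustration of the arithmetic described above, not code from the paper, and the function name `long_multiply` is made up for this example. Each output column has to gather digit-pair products that come from different positions in the inputs, plus a carry from the column before it.

```python
def long_multiply(a: int, b: int):
    """Schoolbook multiplication on digits, least significant digit first.

    Returns the raw column sums (before carrying) and the final product.
    """
    a_digits = [int(ch) for ch in reversed(str(a))]
    b_digits = [int(ch) for ch in reversed(str(b))]

    # column_sums[k] collects every digit-pair product a_i * b_j with i + j == k.
    column_sums = [0] * (len(a_digits) + len(b_digits))
    for i, digit_a in enumerate(a_digits):
        for j, digit_b in enumerate(b_digits):
            column_sums[i + j] += digit_a * digit_b

    # Resolve carries from the lowest column upward; each output digit
    # depends on everything accumulated in the columns below it.
    result_digits = []
    carry = 0
    for total in column_sums:
        total += carry
        result_digits.append(total % 10)
        carry = total // 10

    product = int("".join(str(d) for d in reversed(result_digits)))
    return column_sums, product


column_sums, product = long_multiply(1234, 5678)
print(column_sums)  # [32, 52, 61, 60, 34, 16, 5, 0]
print(product)      # 7006652
```

The middle columns depend on four separate digit pairs at once, which is the kind of information a model must hold on to while it writes out earlier digits.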
Standard large language models work by learning to recognize patterns in the data they’re trained on. But the more complex a problem gets, the less likely a model is to have seen it specifically. So how do you teach a model to not just memorize answers but learn a process?
Models are often taught new tasks with a process known as standard fine-tuning, which relies on scaling up the training data, or adding more steps or “layers.”
But even when the research team tested models with two layers all the way up to 12 layers, they all achieved less than 1% accuracy when multiplying two four-digit numbers. The standard approaches were clearly failing, and researchers wanted to understand why.
They found that under the standard approach, models converge on a “local optimum,” a solution that looks best compared with nearby alternatives but falls short of the best one overall. But tasks like multi-digit multiplication require a model to be able to remember earlier computations while producing later digits.
Without an architecture that can store and retrieve intermediate information, a model gets stuck, unable to move beyond that local optimum—no matter how long it trains or how large it scales.
Next, the researchers identified a model trained using a different method: Implicit Chain of Thought (ICoT).
Where standard fine-tuning achieved less than 1% accuracy, the ICoT model was able to achieve 100% accuracy. To understand what this approach was doing differently, the team took both models apart to uncover some fundamental insights.
First, they saw that the ICoT model learns to remember what matters.
Unlike the standard fine-tuning model, the ICoT model learned to track those long-range dependencies, or the information it gradually put together to solve a problem. The team verified this by testing whether they could decode intermediate values, such as running sums, from the models’ internal states. In the ICoT model, they could—but in the standard model, they couldn’t.
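One common way to run a decoding test like this is a linear probe: fit a simple linear map from a model's internal states to the value of interest and check how well it predicts held-out examples. The sketch below uses NumPy and purely synthetic stand-ins for hidden states. The names `probe_error`, `encoding_states` and `unrelated_states` are hypothetical, the numbers are not results from the paper, and whether the researchers used exactly this probe design is an assumption.

```python
import numpy as np


def probe_error(hidden, targets, train_fraction=0.8):
    """Fit a least-squares linear probe and return mean absolute error on held-out rows."""
    n_train = int(len(hidden) * train_fraction)
    # Append a constant column so the probe can learn an offset.
    features = np.hstack([hidden, np.ones((len(hidden), 1))])
    weights, *_ = np.linalg.lstsq(features[:n_train], targets[:n_train], rcond=None)
    predictions = features[n_train:] @ weights
    return float(np.mean(np.abs(predictions - targets[n_train:])))


rng = np.random.default_rng(0)
running_sums = rng.integers(0, 100, size=500).astype(float)

# Synthetic "states" that carry the running sum along one direction, plus noise.
direction = rng.normal(size=16)
encoding_states = np.outer(running_sums, direction) + 0.01 * rng.normal(size=(500, 16))

# Synthetic "states" that contain no information about the running sum.
unrelated_states = rng.normal(size=(500, 16))

error_encoding = probe_error(encoding_states, running_sums)
error_unrelated = probe_error(unrelated_states, running_sums)
print(error_encoding, error_unrelated)
```

A low held-out error suggests the value is readable from the states, while an error no better than guessing the average suggests it is not.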
The ICoT method gradually removes intermediate reasoning steps during training, in a sense forcing the model to internalize the reasoning process in its hidden states rather than relying on explicit step-by-step tokens.
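As a rough illustration of that curriculum, the sketch below builds the training text for one example at different points in training. The function name, the removal rate and the choice to drop steps from the front are assumptions made for this example, not details confirmed by the article.

```python
def icot_training_text(question, cot_tokens, answer, epoch, tokens_removed_per_epoch=2):
    """Build one training string, dropping more reasoning tokens as training proceeds.

    Hypothetical schedule: remove a fixed number of chain-of-thought tokens per epoch,
    starting from the front, until only the question and answer remain.
    """
    n_removed = min(len(cot_tokens), epoch * tokens_removed_per_epoch)
    kept_tokens = cot_tokens[n_removed:]
    return " ".join([question, *kept_tokens, answer])


# 12 * 34 = 48 + 360 = 408, written as a tiny chain of thought.
question, cot, answer = "12*34=", ["48", "+", "360"], "408"
for epoch in range(3):
    print(epoch, icot_training_text(question, cot, answer, epoch))
# 0 12*34= 48 + 360 408
# 1 12*34= 360 408
# 2 12*34= 408
```

By the last stage the model sees only the question and the answer, so any intermediate work has to happen inside its hidden states.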
Next, they saw the ICoT model organizes its attention into distinct pathways across time.
Think of it like a well-organized filing system: In early layers, the model computes products of digit pairs and stores them at specific locations. In later layers, it retrieves exactly the values it needs to calculate each digit of the final answer. The result is an efficient internal structure for carrying out multiplication, one that never emerges in the standard model.
Finally, and perhaps most remarkably, the researchers found the ICoT model internally represents these operations using elegant structures. Instead of treating digits as symbols alone, the model encodes them as wave-like patterns known as Fourier bases and organizes its arithmetic in a visual, spatial way.
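For readers who want a feel for what a wave-like encoding of a digit looks like, here is a small sketch using only Python's standard library. It shows the general idea of Fourier features, not the model's actual learned representation; the function name and the particular frequencies are assumptions chosen for illustration.

```python
import math


def fourier_features(digit, frequencies=(1, 2, 5), base=10):
    """Encode a digit as cosine and sine values at a few frequencies over the base."""
    features = []
    for k in frequencies:
        angle = 2 * math.pi * k * digit / base
        features.extend([math.cos(angle), math.sin(angle)])
    return features


for digit in range(10):
    print(digit, [round(value, 2) for value in fourier_features(digit)])
```

In this encoding the frequency-5 cosine flips between 1 and -1 for even and odd digits, so parity becomes a single coordinate, and the lower frequencies place the ten digits around circles where adding digits corresponds to rotating.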
When multiplying digit pairs, the model uses a natural geometric operation called a Minkowski sum, something the researchers didn’t program but that emerged naturally during training in the ICoT model. It’s as if the successful model derived its own efficient mathematical language for arithmetic.
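A Minkowski sum combines two sets of points by adding every point in one set to every point in the other. The short sketch below shows that general definition on a toy example; it is not a reconstruction of how the model applies it internally, and the function name is chosen for this example.

```python
def minkowski_sum(set_a, set_b):
    """Return every pairwise sum of a 2D point from set_a and a 2D point from set_b."""
    return {(ax + bx, ay + by) for ax, ay in set_a for bx, by in set_b}


horizontal_edge = {(0, 0), (1, 0)}
vertical_edge = {(0, 0), (0, 1)}
corners = minkowski_sum(horizontal_edge, vertical_edge)
print(sorted(corners))  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

Adding the endpoints of a horizontal edge to those of a vertical edge yields the four corners of a square, which is the sense in which the operation builds a larger shape out of two smaller ones.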
The researchers reasoned that if the standard fine-tuning models failed because they lacked the right built-in guidance, then providing the right training signal should fix it. To test this, the team introduced a simple solution: an added training objective that teaches the model to track running sums at each step, allowing it to carry intermediate values and partial products forward.
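A minimal sketch of what such an objective could look like follows. The target definition, the squared-error form and the `weight` parameter are assumptions made for illustration; the article does not spell out the exact loss the team used, and both function names are hypothetical.

```python
def running_sum_targets(a: int, b: int):
    """For each output position, the sum of contributing digit-pair products plus the incoming carry."""
    a_digits = [int(ch) for ch in reversed(str(a))]
    b_digits = [int(ch) for ch in reversed(str(b))]
    targets = []
    carry = 0
    for k in range(len(a_digits) + len(b_digits)):
        pair_products = sum(
            a_digits[i] * b_digits[k - i]
            for i in range(len(a_digits))
            if 0 <= k - i < len(b_digits)
        )
        running = pair_products + carry
        targets.append(running)
        carry = running // 10
    return targets


def combined_loss(answer_loss, predicted_sums, target_sums, weight=1.0):
    """Add a mean-squared-error penalty on running sums to the usual answer loss."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predicted_sums, target_sums)]
    auxiliary_loss = sum(squared_errors) / len(target_sums)
    return answer_loss + weight * auxiliary_loss


targets = running_sum_targets(12, 34)
print(targets)                        # [8, 10, 4, 0]
print([t % 10 for t in targets])      # [8, 0, 4, 0], the digits of 408 from lowest to highest
print(combined_loss(0.5, [8, 9, 4, 0], targets))  # 0.75
```

The extra term gives the model a reason to keep each running sum available at the step where it is needed, instead of only being graded on the final digits.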
Adding this one objective to the two-layer model that had completely failed under standard training did the trick. The result: 99% accuracy without explicit chain-of-thought supervision.
When the researchers examined the model’s attention patterns, they found it had learned mechanisms similar to ICoT’s—structures that store and retrieve partial products as needed. The model also developed additional strategies, including a way to track multiple digit pairs at the same time.
While multiplication might seem a specific kind of task, the findings illuminate fundamental aspects of how large language models learn and “think.”
The long-range dependency problem isn’t unique to arithmetic—it appears throughout language modeling and other sequential tasks. The UChicago team’s approach asks foundational questions about the distinctions between memorization and learning, and what architectural constraints help or hinder models’ performance.
“As AI is increasingly integrated into critical decision-making, it’s essential to understand its unique ways of learning and thinking,” says Tan. “Our research is trying to chart that terrain.”
This paper’s key contribution: Architectural insights and training techniques can overcome obstacles that scaling alone cannot address. The right built-in guidance, not just more parameters or data, is key to pushing AI capabilities forward.
While the solution for the multiplication issue is task-specific, the researchers anticipate future work will develop more general approaches to improve learning on tasks requiring models to keep track of information across many steps.
Source: University of Chicago
—
Previously Published on futurity.org with Creative Commons License
