Tiny Recursive Models beat LLMs on ARC-AGI
Hi Futurists. I’m catching up on news bits, starting with a new algorithmic approach to artificial intelligence: Tiny Recursive Models. (I’ll have more news in upcoming messages.) Let’s just dive in.
AI
1. Tiny Recursive Models beat large language models on the ARC-AGI tests of intelligence.
“With only 7M parameters, TRM obtains 45% test-accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, higher than most LLMs (e.g., Deepseek R1, o3-mini, Gemini 2.5 Pro) with less than 0.01% of the parameters.”
The wording of that is very careful. The best LLM/multi-modal model on both ARC-AGI-1 and ARC-AGI-2 is a version of Grok 4 custom-trained for the ARC-AGI-1 and ARC-AGI-2 tests. It gets scores of 79.6 on ARC-AGI-1 and 29.4 on ARC-AGI-2. However, this model has 1.7 trillion parameters. Tiny Recursive Models are able to get 44.6 on ARC-AGI-1 and 7.8 on ARC-AGI-2 with only 7 million parameters. The ability to do so well with so few parameters is what’s noteworthy.
“ARC-AGI-1 and ARC-AGI-2 are geometric puzzles involving monetary prizes. Each puzzle is designed to be easy for a human, yet hard for current AI models. Each puzzle task consists of 2-3 input-output demonstration pairs and 1-2 test inputs to be solved. The final score is computed as the accuracy over all test inputs from two attempts to produce the correct output grid. The maximum grid size is 30x30. ARC-AGI-1 contains 800 tasks, while ARC-AGI-2 contains 1120 tasks. We also augment our data with the 160 tasks from the closely related ConceptARC dataset. We provide results on the public evaluation set for both ARC-AGI-1 and ARC-AGI-2.”
“While these datasets are small, heavy data-augmentation is used in order to improve generalization. ARC-AGI uses 1000 data augmentations (color permutation, dihedral-group, and translations transformations) per data example. The dihedral-group transformations consist of random 90-degree rotations, horizontal/vertical flips, and reflections.”
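To make the dihedral-group augmentation concrete, here is a minimal sketch of what it could look like for a single ARC grid, assuming grids are small numpy arrays of color indices 0-9. The function names are mine, and I’ve left out the translation augmentations:

import numpy as np

def dihedral_variants(grid):
    # Yield the 8 dihedral-group variants of a 2D grid:
    # 4 rotations, with and without a horizontal flip.
    for base in (grid, np.fliplr(grid)):
        for k in range(4):
            yield np.rot90(base, k)

def permute_colors(grid, rng):
    # Relabel the 10 ARC colors with a random permutation.
    perm = rng.permutation(10)
    return perm[grid]

rng = np.random.default_rng(0)
example = np.array([[0, 1], [2, 3]])
augmented = [permute_colors(g, rng) for g in dihedral_variants(example)]

Combining the 8 dihedral variants with color permutations and translations is how a few hundred tasks get multiplied into the roughly 1000 augmentations per example the paper describes.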
“Tiny Recursive Model with self-attention obtains 44.6% accuracy on ARC-AGI-1, and 7.8% accuracy on ARC-AGI-2 with 7M parameters. This is significantly higher than the 40.3% and 5.0% obtained by Hierarchical Reasoning Model using 4 times the number of parameters (27M).”
How does it work?
Well, the actual paper talks a lot about a previous model (which you just saw mentioned in that last quote) called Hierarchical Reasoning Model. Tiny Recursive Model was created by improving upon Hierarchical Reasoning Model.
The philosophy of Hierarchical Reasoning Model is that you actually have two models. One processes inputs at a very high frequency. The second processes outputs from the first at a low frequency. In this manner, you establish a clear hierarchy.
The Tiny Recursive Model dispenses with the explicit hierarchy in favor of “recursion”. There’s a single network. It contains a transformer “attention” system, but combines that with the input (reminiscent of residual networks), the current best answer, and a hidden latent state (reminiscent of recurrent networks -- attention-based “transformers” made recurrent networks just about completely disappear).
Hierarchical Reasoning Models require a complex inner loop with fixed parameters for controlling when the high-level network runs. The Tiny Recursive Model has a simpler inner loop, though it still has fixed parameters: one for the number of updates to the hidden latent state (6 times through the loop) and another for the number of times it does the “deep recursion” incorporating the input, current best answer, and hidden state (3 times through that loop).
The Hierarchical Reasoning Model has a complex early-stopping mechanism that, according to the creators of the Tiny Recursive Model, was both “biologically inspired” (using ideas from neuroscience) and inspired by Q-learning from reinforcement learning. It is computationally expensive to calculate whether to “halt”. The Tiny Recursive Model instead learns a halting confidence trained with simple binary cross-entropy, a commonly used loss function in machine learning: the model’s halting output goes through a sigmoid function, and if the result is more than 0.5 (potentially another fixed parameter), the model considers its answer confident enough to stop.
The Hierarchical Reasoning Model outputs its final answer only from the network at the top of the hierarchy. The Tiny Recursive Model, in contrast, maintains the “current best answer” throughout the process. It maintains latent state throughout the process as well, allowing it to continuously maintain inner “thinking” that is not part of the final answer.
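Putting all of that together, here is a schematic sketch of the control flow as I understand it from the paper. The numpy stand-ins are placeholders for the tiny learned network and its output and halting heads -- all the names here are mine, and the real training procedure is not shown:

import numpy as np

rng = np.random.default_rng(0)

def net(x, y, z):
    # Stand-in for the single tiny network: combines the input x,
    # the current best answer y, and the hidden latent state z.
    return np.tanh(x + y + z)

def output_head(y, z):
    # Stand-in for the head that refines the current best answer.
    return np.tanh(y + z)

def halt_confidence(y, z):
    # Stand-in for the halting head: a scalar squashed by a sigmoid.
    return 1.0 / (1.0 + np.exp(-(y + z).mean()))

x = rng.standard_normal(16)  # embedded input (the puzzle)
y = np.zeros(16)             # current best answer
z = np.zeros(16)             # hidden latent state

for _ in range(3):           # "deep recursion" loop (3 times)
    for _ in range(6):       # latent-state updates (6 times)
        z = net(x, y, z)
    y = output_head(y, z)    # update the current best answer
    if halt_confidence(y, z) > 0.5:  # confident enough to stop?
        break

The key design point is that z carries the inner “thinking” while y is always a complete candidate answer, so stopping early still leaves you with a usable output.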
It remains to be seen whether this approach will revolutionize the field of AI. Since these models are so small, there would seem to be tremendous headroom to scale them up and potentially crush humans on the ARC-AGI-1 and ARC-AGI-2 tests.
https://arxiv.org/abs/2510.04871
2. In the discussion between Richard Sutton, pioneer of reinforcement learning, and Dwarkesh Patel, YouTuber, the two spoke past each other because they were “speaking two different languages”, says Ksenia Se of “Turing Post”.
Words like “prediction”, “goal”, “imitate”, “world model”, and “priors”, have different meanings in the minds of Richard Sutton and Dwarkesh Patel.
Richard Sutton thinks of them in terms of reinforcement learning, and having studied part of his textbook (co-authored with Andrew Barto) -- I read about half of it and confess to not having done most of the exercises; they are quite challenging! -- I understand him very clearly. Dwarkesh Patel, meanwhile, thinks in terms of the current large language models.
To me, Dwarkesh Patel’s thinking seems limited because he’s not able to see beyond large language models and their token-oriented, self-supervised training system. That approach may be fine for language, but other techniques, which I expect to come primarily from reinforcement learning researchers, are likely in my mind to make robots competitive with humans in terms of physical dexterity in the physical world.
Here is the link to the original full discussion:
Of note (at least to me), near the end Richard Sutton says:
“I do think succession to digital intelligence or augmented humans is inevitable. I have a four-part argument. Step one is, there’s no government or organization that gives humanity a unified point of view that dominates and that can arrange... There’s no consensus about how the world should be run. Number two, we will figure out how intelligence works. The researchers will figure it out eventually. Number three, we won’t stop just with human-level intelligence. We will reach superintelligence. Number four, it’s inevitable over time that the most intelligent things around would gain resources and power.”
“Put all that together and it’s sort of inevitable. You’re going to have succession to AI or to AI-enabled, augmented humans. Those four things seem clear and sure to happen.”
3. “Today’s LLMs are the epicycles of intelligence: extraordinarily useful for navigation through language, capable of producing predictive charts of our symbolic universe -- but like their astronomical predecessors, perhaps working well without being fundamentally correct.”
“In astronomy, it took two orthogonal insights -- Copernicus’s heliocentrism and Kepler’s ellipses -- spread over seventy years to break free from epicycles, and another eighty for Newton to reveal the logic behind them. By analogy, we may still be in AI’s pre-Copernican era, using parameter-rich approximations that will eventually give way to a more compact and principled foundation.”
Is the possibility that gradient descent and backpropagation aren’t the foundations of intelligence itself keeping you up at night?
https://ashvardanian.com/posts/llm-epicycles/
4. “The AI boom’s reliance on circular deals is raising fears of a bubble.”
“Nvidia plans to invest in OpenAI, which is buying cloud computing from Oracle, which is buying chips from Nvidia, which has a stake in CoreWeave, which is providing artificial intelligence infrastructure to OpenAI.”
“If it starts to become clear that AI productivity gains -- and thus the return on investment -- may be limited or delayed, ‘a sharp correction in tech stocks, with negative knock-ons for the real economy, would be very likely,’ analysts with Oxford Economics research group wrote in a recent note.”
https://www.nbcnews.com/business/economy/openai-nvidia-amd-deals-risks-rcna234806
Billionaire tech investor Orlando Bravo says ‘valuations in AI are at a bubble’:
https://www.cnbc.com/2025/10/07/tech-vc-orlando-bravo-ai-bubble.html
Top analyst ‘very concerned’ about Nvidia fueling an AI bubble and a ‘Cisco moment’ like the dotcom crash: ‘We’re a lot closer to the seventh inning than the first or second inning’
(The “top analyst” is Morgan Stanley Wealth Management chief investment officer Lisa Shalett):
https://fortune.com/2025/10/07/ai-bubble-cisco-moment-dotcom-crash-nvidia-jensen-huang-top-analyst/
America is now one big bet on AI (paywalled, but see below):
https://www.ft.com/content/6cc87bd9-cb2f-4f82-99c5-c38748986a2e
This Yahoo Finance article has quotes from the paywalled Financial Times article:
https://finance.yahoo.com/news/ai-becoming-magic-fix-america-204155018.html
5. xAI, Elon Musk’s AI company, is making an alternative to Wikipedia called Grokipedia.
https://x.com/elonmusk/status/1972992095859433671
6. The claim is being made that at JPMorgan, the shift to agentic AI “favors those who work directly with clients -- a private banker with a roster of rich investors, traders who cater to hedge fund and pension managers, or investment bankers with relationships with Fortune 500 CEOs, for instance.”
“Those at risk of having to find new roles include operations and support staff who mainly deal in rote processes like setting up accounts, fraud detection or settling trades.”
https://www.cnbc.com/2025/09/30/jpmorgan-chase-fully-ai-connected-megabank.html
7. ImageTextEdit lets you “edit text in images with AI magic.”
Now whenever you see people with signs (protesters, for example), you won’t be able to trust that the signs actually say what they appear to say.
Not that you could trust images before, but thanks to this you can trust them even less. ;)
8. camfer (no capitalization) is an AI CAD tool that works with SolidWorks on Windows.
If you’re a SolidWorks user and give it a whirl, let me know how it goes.
9. AI GIF Generate is an AI animated GIF generator.
10. The claim is being made that the government of the Caribbean island of Anguilla now gets 47% of its income from registrations of .ai domains.
Honorable mention in the comments section: .io (Indian Ocean), .fm (Federated States of Micronesia), and .tk (Tokelau).
Cybersecurity
11. “New research by LayerX shows how a single weaponized URL, without any malicious page content, is enough to let an attacker steal any sensitive data that has been exposed in Perplexity’s Comet AI browser.”
“For example, if the user asked Comet to rewrite an email or schedule an appointment, the email content and meeting metadata can be exfiltrated to the attacker.”
“An attacker only needs to get a user to open a crafted link, which can be sent via email, an extension, or a malicious site, and sensitive Comet data can be exposed, extracted, and exfiltrated.”
It’s only been days since I found out Perplexity’s Comet AI browser exists. Comet is supposed to turn your browser into an AI agent that can take actions on the internet on your behalf.
12. It is alleged (by The Citizen Lab, at the Munk School of Global Affairs and Public Policy at the University of Toronto) that Israel is using AI to create online “influence operations” aimed at “regime change” in Iran, starting with a deepfake of IDF air strikes on Evin Prison in Tehran.
https://citizenlab.ca/2025/10/ai-enabled-io-aimed-at-overthrowing-iranian-regime/
13. A variant of the infamous Petya/NotPetya ransomware virus has been discovered that is capable of bypassing UEFI Secure Boot on outdated systems. Petya, in 2016, encrypted the hard drives of Windows (NTFS) computers and demanded Bitcoin payment. In 2017, a variant appeared that targeted a Ukrainian tax-filing program. It infected Ukrainian banks, electricity companies, and all kinds of Ukrainian companies before escaping to the rest of the world. It pretended to demand payment but actually just wiped the hard disks; there was no way to pay and actually get the data back. It is thought to have been created by the Russian GRU specifically to cyberattack Ukraine. It was named NotPetya to distinguish it from the original Petya.
The new variant is being called HybridPetya. It exploits a vulnerability in old versions of UEFI Secure Boot. UEFI stands for “Unified Extensible Firmware Interface” and the “Secure Boot” portion of the specification specifies a procedure for digitally signing the operating system (or more specifically, the OS boot loader), and not allowing the computer to boot up if the digital signature check fails.
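In spirit, the boot-time check amounts to something like the following toy sketch (using the pyca/cryptography package; the names are mine, and real UEFI firmware verifies Authenticode/PKCS#7 signatures against its built-in key databases rather than running anything like Python):

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding

def can_boot(trusted_public_key, bootloader_bytes, signature):
    # Verify the boot loader's digital signature against a trusted
    # key; refuse to hand over control if verification fails.
    try:
        trusted_public_key.verify(signature, bootloader_bytes,
                                  padding.PKCS1v15(), hashes.SHA256())
        return True   # signature checks out: boot proceeds
    except InvalidSignature:
        return False  # tampered or unsigned boot loader: halt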
Programming Languages
14. “How functional programming shaped (and twisted) frontend development.”
If it seems like ideas in React and Redux resemble ideas from the “functional languages paradigm” in languages like Haskell, it’s not your imagination.
Some choice quotes:
“There’s a strange irony at the heart of modern web development. The web was born from documents, hyperlinks, and a cascading stylesheet language. It was always messy, mutable, and gloriously side-effectful. Yet over the past decade, our most influential frontend tools have been shaped by engineers chasing functional programming purity: immutability, determinism, and the elimination of side effects.”
“The web is fundamentally side-effectful. CSS cascades globally by design. Styles defined in one place affect elements everywhere, creating emergent patterns through specificity and inheritance. The DOM is a giant mutable tree that browsers optimize obsessively; changing it directly is fast and predictable. User interactions arrive asynchronously and unpredictably: clicks, scrolls, form submissions, network requests, resize events. There’s no pure function that captures ‘user intent.’”
“This messiness is not accidental. It’s how the web scales across billions of devices, remains backwards-compatible across decades, and allows disparate systems to interoperate. The browser is an open platform with escape hatches everywhere. You can style anything, hook into any event, manipulate any node. That flexibility and that refusal to enforce rigid abstractions is the web’s superpower.”
“Functional programming revolves around a few core principles: functions should be pure (same inputs yields same outputs, no side effects), data should be immutable, and state changes should be explicit and traceable. These ideas produce code that’s easier to reason about, test, and parallelize, in the right context of course.”
“CSS was designed to be global. Styles cascade, inherit, and compose across boundaries. This enables tiny stylesheets to control huge documents, and lets teams share design systems across applications. But to functional programmers, global scope is dangerous. It creates implicit dependencies and unpredictable outcomes.”
“React introduced synthetic events to normalize browser inconsistencies and integrate events into its rendering lifecycle. Instead of attaching listeners directly to DOM nodes, React uses event delegation. It listens at the root, then routes events to handlers through its own system.”
“This feels elegant from a functional perspective. Events become data flowing through your component tree. You don’t touch the DOM directly. Everything stays inside React’s controlled universe.”
“But native browser events already work. They bubble, they capture, they’re well-specified. The browser has spent decades optimizing event dispatch.”
https://alfy.blog/2025/10/04/how-functional-programming-shaped-modern-frontend.html
15. Unbeknownst to me, there’s been an effort underway to make a version of C++ called “Safe C++”.
“The goal of this proposal is to advance a superset of C++ with a rigorously safe subset. Begin a new project, or take an existing one, and start writing safe code in C++. Code in the safe context exhibits the same strong safety guarantees as code written in Rust.”
However, the C++ Safety and Security working group voted to prioritize “Profiles”, whatever that is, over Safe C++. So it looks like “Safe C++” is never going to happen.
https://safecpp.org/draft.html
Discussions (on Reddit) on Safe C++ and “Profiles”:
https://www.reddit.com/r/cpp/comments/1lhbqua/comment/mz3u7cr/
https://www.reddit.com/r/cpp/comments/1lhbqua/any_news_on_safe_c/
Bjarne Stroustrup’s collection of links to documents on “Profiles” (I haven’t read any of these):
https://github.com/BjarneStroustrup/profiles
Social Networking
16. “User ban controversy reveals Bluesky’s decentralized aspiration isn’t reality. Bluesky’s protocol is so complicated that not even the biggest alternative network has figured out how to become independent.”
“Bluesky’s engineering team has been moving ahead with its long-promised open source efforts, breaking up its software stack into several pieces to enable a federated Authenticated Transfer Protocol (ATProto) network where anyone with the know-how and funds could run their own copy of Bluesky.”
But...
“The only completely independent implementation of ATProto is Bluesky. But that isn’t for want of trying on the part of Rudy Fraser, the creator of Blacksky.”
“Despite Fraser’s efforts to implement his own PDS, Relay, and App View, however, Blacksky still remains partially dependent upon Bluesky’s application server, largely because while the code to implement the dataplane of posts and users within an application server is released, the open-source version is slower. As a result, Blacksky is dependent on Bluesky’s application server to give users a fast experience, which also means that it is dependent on Bluesky’s labeling system and its moderation choices.”
And the government is trying to influence those moderation choices.
“Federal Communications Commission Chairman Brendan Carr’s threats against late night comedian Jimmy Kimmel led to his temporary suspension by ABC, and he was far from the only Republican to issue them. Louisiana Rep. Clay Higgins, chair of the House subcommittee on federal law enforcement, sent a menacing letter to Bluesky and other social media networks demanding that they identify and ban anyone deemed to be celebrating Charlie Kirk’s killing.”
17. MEMS lidar.
“Five years ago, Eric Aguilar was fed up.”
“He had worked on lidar and other sensors for years at Tesla and Google X, but the technology always seemed too expensive and, more importantly, unreliable. He replaced the lidar sensors when they broke -- which was all too often, and seemingly at random -- and developed complex calibration methods and maintenance routines just to keep them functioning and the cars drivable.”
“So, when he reached the end of his rope, he invented a more robust technology -- what he calls the ‘most powerful micromachine ever made.’”
“Aguilar and his team at startup Omnitron Sensors developed new micro-electro-mechanical systems (MEMS) technology that he claims can produce more force per unit area than any other.”
The claim is that by replacing conventional lidar mechanisms with this MEMS technology, the sensors will be more robust to road vibrations, thermal cycles, and rain.
https://spectrum.ieee.org/mems-lidar
Algorithmic Curiosities
18. Setsum is an order-agnostic, additive, subtractive checksum. (An algorithmic curiosity for you.) At first that sounded impossible, but it’s actually simple when you look under the hood. It takes a traditional cryptographic hash function, breaks the output into fixed-size integers (32-bit integers, for example), and then adds or subtracts those integers modulo a prime number. (They give an example with 29 as the prime number, but the largest prime number that fits in a 32-bit integer is 4,294,967,291.)
Order doesn’t matter, you can remove items, and you can combine setsums, but “setsum can tell you if states diverged, but not where. To narrow things down, you can split your data into smaller chunks and compare those. Build this into a hierarchical structure and you’re basically back to something like a Merkle tree.”
“You can remove items that never existed. This might or might not be a problem depending on your use case. Given that you’re only maintaining 256 bits of state, it’s a reasonable tradeoff.”
“There’s no history tracking. You can’t tell when or how states diverged, just that they did.”
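Here is a from-scratch toy sketch of the core idea as described above -- SHA-256 digests split into 32-bit lanes, added or subtracted lane-wise modulo a prime. This is my own illustration, not the setsum library’s actual code or state layout:

import hashlib

PRIME = 4294967291  # largest prime below 2**32
LANES = 8           # a 256-bit digest splits into 8 x 32-bit lanes

def _lanes(item: bytes):
    # Hash an item and split the 256-bit digest into 32-bit integers.
    digest = hashlib.sha256(item).digest()
    return [int.from_bytes(digest[i:i+4], "big") for i in range(0, 32, 4)]

class Setsum:
    def __init__(self):
        self.state = [0] * LANES

    def add(self, item: bytes):
        self.state = [(s + l) % PRIME
                      for s, l in zip(self.state, _lanes(item))]

    def remove(self, item: bytes):
        self.state = [(s - l) % PRIME
                      for s, l in zip(self.state, _lanes(item))]

# Order doesn't matter, and removal undoes addition:
a, b = Setsum(), Setsum()
a.add(b"x"); a.add(b"y")
b.add(b"y"); b.add(b"x")
assert a.state == b.state
a.remove(b"y"); b.remove(b"y")
assert a.state == b.state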
https://avi.im/blag/2025/setsum/
Fractals
19. “Orbiting the Hénon Attractor.”
This is from 2022 but I just encountered it today. It is a system of equations that generates fractal images, but rather than looking like a Mandelbrot or Julia set or somesuch, these remind me of the rings of Saturn or the cloud formations of Jupiter or Saturn.
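For reference, the Hénon map is just two equations iterated over and over -- x_{n+1} = 1 - a*x_n^2 + y_n and y_{n+1} = b*x_n -- with the classic parameters a = 1.4 and b = 0.3 producing the strange attractor. A minimal sketch to generate the attractor’s points (the orbit-based coloring that makes the linked images so striking is the author’s addition) looks like this:

# Iterate the Henon map and collect points on the attractor.
a, b = 1.4, 0.3
x, y = 0.0, 0.0
points = []
for i in range(10000):
    x, y = 1.0 - a * x * x + y, b * x
    if i > 100:  # discard the initial transient before the orbit settles
        points.append((x, y))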
https://observablehq.com/@yurivish/orbiting-the-henon-attractor
Aviation
20. An aerospace startup, Otto Aerospace, claims to have produced an aircraft that can reduce fuel burn by over 60% using “laminar-flow aerodynamics” (as well as other technologies like “precision all-carbon-fiber composites”). The company announced that its first fleet customer will be Flexjet, although deliveries won’t begin until 2030.
https://ottoaerospace.com/news/otto-aerospace-secures-historic-300-aircraft-order-with-flexjet/
Neuroscience
21. “A study involving more than nine million pregnancies reported that children whose mothers had gestational diabetes during pregnancy had a higher chance of developing attention deficit-hyperactivity disorder (ADHD) and autism than did children whose mothers didn’t have the condition.”
“The study, presented at the European Association for the Study of Diabetes in Vienna, is under review at a peer-reviewed journal. It is not the first to link gestational diabetes to neurodevelopmental disorders in children, but it is one of the largest. Researchers pooled results from 48 studies across 20 countries, finding that children born to people with gestational diabetes had lower IQ scores, a 36% higher risk of ADHD and a 56% higher risk of autism spectrum disorders. Estimates suggest the prevalence of autism in the general population is one in 127 people and 3-10% of children and teenagers have ADHD.”
“The latest results mirror those of another meta-analysis, published in The Lancet Diabetes & Endocrinology journal in June, which included 56 million mother-child pairs and found that all types of diabetes in pregnancy, including type 1, type 2 and gestational diabetes, increase the risk of the baby developing ADHD and autism. But none of these studies has been able to show that diabetes during pregnancy causes these conditions.”
Note: Not acetaminophen (paracetamol).
https://www.nature.com/articles/d41586-025-03024-5
Domestic Politics
22. According to an NPR/PBS News/Marist poll, a growing share of US adults agree with the statement, “Americans may have to resort to violence in order to get the country back on track.”
For self-identified Democrats, between March of 2024 and September of 2025, the percent that agree went from 12% to 28%. For self-identified Republicans, the percent that agree went from 28% to 31%. For self-identified independents, the percent that agree went from 18% to 25%. Overall, for all US adults, the percent that agree went from 20% to 30%. (For that starting overall figure, PBS said 19%, NPR said 20%. The Marist survey results document said 6% strongly agree + 13% agree, which, if I am capable of doing arithmetic, is 19%.)
The poll was conducted September 22nd through September 26th, 2025, which is after the high-profile assassination of Charlie Kirk.
How PBS reported the story:
How Marist reported the story:
https://maristpoll.marist.edu/polls/the-1st-amendment-in-the-u-s-october-2025/
The poll question cited is on page 16 of the survey results document. In the PBS story, they mention young people feel more strongly in favor of political violence. The Marist document shows 41% for the 18-29 age bracket, 41% again for the 30 to 44 age bracket, 23% for the 45 to 59 age bracket, and 21% for 60+.
They also have “generation”, which I guess is based on self-identification? But the results are similar. For “Gen Z” they have 42%, for “Millennials” they have 41%, for “Gen X” they have 23%, and for “Baby Boomers” and up they have 21%.
End Of An Era
23. “AOL’s dial-up internet service is shutting down Tuesday, ending one of the web’s first mainstream access points.”
By “Tuesday”, they mean September 30th, so it’s already shut down by the time you read this.
End of an era.
https://www.nbcnews.com/tech/tech-news/aol-dial-up-silenced-rcna234655

