Hey Futurists!
If you haven't heard of DeepSeek, you must ... not be an Nvidia shareholder. Because Nvidia's share price went down 15%, and Nvidia lost about $400 billion of market cap.
Why would something called "DeepSeek" do that? Because DeepSeek claims to have developed a state-of-the-art model for around 1/20th of the price. The "price" of large language models (LLMs) is computing power, and, yes, some of that money goes to the electric power company, but the rest goes to Nvidia for the GPUs.
DeepSeek
1. DeepSeek-R1 is tied for 3rd place on the Language Model (LM) Arena Leaderboard (formerly called the LMSYS Leaderboard). The model in 1st place is "Gemini-2.0-Flash-Thinking-Exp-01-21" (don't blame me for the long name) from Google. The model in 2nd place is "Gemini-Exp-1206", also from Google. The other model tied for 3rd is "ChatGPT-4o-latest (2024-11-20)" (apparently the date is part of the name) from OpenAI. DeepSeek-R1 is from DeepSeek, a Chinese company located in Hangzhou.
However, if you use the "Style Control" ranking, DeepSeek-R1 ties for first place. "Style Control" is an effort to separate "style" from "substance" by stripping out such things as use of markdown. "Gemini-Exp-1206", "ChatGPT-4o-latest (2024-11-20)", and "o1-2024-12-17" all tie DeepSeek-R1 for 1st place.
Not only do we for the first time have a Chinese model tying for 1st place (or coming close to 1st place if you use the normal ranking, without the "Style Control", which is called "Upper-Bound"), but the model is open source (MIT license).
https://lmarena.ai/?leaderboard
DeepSeek-R1 was tested by Matthew Berman, who asked it to create a Tetris game.
This UK guy living in China says the success of DeepSeek proves that people who claim "China is 18 months behind the US" are wrong, and that the sanctions by the US government, intended to cripple China's technological development, especially in AI, have failed.
DeepSeek claims it made DeepSeek-R1 extremely cheaply, which has made it a focal point for conspiracy theories.
https://x.com/bantg/status/1882866896623788468
2. I read the DeepSeek-R1 technical report paper for clues as to how it was constructed.
The central idea was to teach reasoning using reinforcement learning. Reinforcement learning is a technique in machine learning that was originally used on games -- everything from Atari games to the Chinese game of Go and chess -- using the game score or win/loss signal as the "reward" signal the reinforcement learning algorithm uses to learn by. Reinforcement learning has been incorporated into language models in the form of "reinforcement learning from human feedback" (RLHF), which, without going into detail as to how that's implemented, is what makes chatbots like ChatGPT so eager to please you. People subsequently discovered that asking the model to reason step-by-step and explain its reasoning process helped it reason better, which led to the technique today called "chain-of-thought". Models like OpenAI o1 incorporate this "chain-of-thought" process into the model itself, and DeepSeek-R1 does this also.
The idea of getting a language model to reason entirely by reinforcement learning reminded me of how AlphaGo was originally trained on human Go games, but later AlphaZero was trained starting with zero and learned Go entirely by playing itself and experiencing Go wins and losses and learning from that, without ever looking at a game played by humans. The people at DeepSeek had the same idea and actually created a model called DeepSeek-R1-Zero, which started from nothing and tried to learn by reinforcement learning only. They found this model, in its initial stages, was too unstable. They called this unstable phase the "cold start", and developed a strategy for overcoming the unstable "cold start". It basically involved constructing a training set of existing "chain-of-thought" examples, and using that training data to get the model past the "cold start" phase.
Once beyond the cold start, the idea is for the process to work using reinforcement learning, and reinforcement learning only. The previous, non-reinforcement-learning method they call supervised fine-tuning (SFT). If you're wondering how reinforcement learning is even possible here -- how can the model have any idea, without a human judge (or another model standing in for a human judge, which is how RLHF works), whether its output, let alone its chain-of-thought, is correct or not? -- the answer is that they confined the reinforcement learning training to two domains where the correctness of the answers can be determined objectively: math and coding. For math, the model has to output the correct answer to the math problem, and for coding, the code is run through the same kind of auto-grader that programming competitions use, which feeds submitted code test data and checks whether it produces the correct outputs.
It turns out that reasoning ability learned on math and coding challenges transfers to some extent to reasoning in other domains.
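To make the "objectively checkable reward" idea concrete, here's a toy sketch of the two kinds of rule-based reward functions described above. It's my own illustration (the exact-match regex and the test-harness details are assumptions), not DeepSeek's actual grader:
```python
# Toy rule-based rewards for the two verifiable domains: math and coding.
import re
import subprocess
import sys
import tempfile

def math_reward(model_output: str, reference_answer: str) -> float:
    """1.0 if the model's final 'Answer: ...' line matches the reference, else 0.0."""
    match = re.search(r"Answer:\s*(.+)\s*$", model_output.strip())
    if match and match.group(1).strip() == reference_answer.strip():
        return 1.0
    return 0.0

def code_reward(model_code: str, test_cases: list[tuple[str, str]]) -> float:
    """Fraction of test cases where the program's stdout matches the expected output."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(model_code)
        path = f.name
    passed = 0
    for stdin_data, expected_stdout in test_cases:
        try:
            result = subprocess.run(
                [sys.executable, path], input=stdin_data,
                capture_output=True, text=True, timeout=5,
            )
        except subprocess.TimeoutExpired:
            continue  # hung programs get no credit
        if result.stdout.strip() == expected_stdout.strip():
            passed += 1
    return passed / len(test_cases)
```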
If you're thinking that this reinforcement learning had to be insanely computationally expensive, they claim to be using a new reinforcement learning algorithm that is less expensive in terms of computational resources. They took an algorithm called Proximal Policy Optimization (PPO) and made a variation on it called Group Relative Policy Optimization (GRPO). This actually was done last year in an effort to make a math-specific model called DeepSeekMath. PPO is what's known as an "actor-critic" algorithm. Despite having gone through a considerable chunk of Sutton & Barto's reinforcement learning textbook, I'm still not entirely clear on exactly how "actor-critic" methods work, but the important thing to know here is that they involve the creation of two models: one being the "actor", which takes actions in the world, and the other being the "critic", which estimates the value of the actions taken, even long before there is a reward signal from the environment (or punishment signal, which is a negative reward signal). The GRPO algorithm attempts to replicate the behavior of the PPO algorithm at lower computational cost by replacing the "critic" with a "baseline estimated from group scores". There are a bunch of big math equations in the GRPO paper that explain this, which I haven't fully figured out. Basically, inside the PPO algorithm there is a "generalized advantage estimation" calculation, and in GRPO this is replaced by an "advantage" computed relative to a group of sampled outputs.
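Here's a minimal sketch of what "baseline estimated from group scores" means, assuming the formulation in the DeepSeekMath paper (my illustration, not DeepSeek's training code): sample a group of outputs for the same prompt, score each one, and normalize each reward against the group's mean and standard deviation instead of asking a learned critic for a value estimate.
```python
# Toy sketch of GRPO-style group-relative advantages.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: shape (G,) -- scalar rewards for G sampled outputs for ONE prompt.
    Returns one advantage per output: (r_i - mean(r)) / std(r)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 sampled answers to the same math problem, graded 0/1 by a verifier.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
advantages = group_relative_advantages(rewards)
# These advantages stand in for the critic's value estimates in the PPO-style
# policy-gradient objective; every token of output i is credited with advantages[i].
print(advantages)
```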
The incorporation of "chain-of-thought" into the model enables the model to expand its computation demands at the time it is prompted (called inference time, or "test-time scaling"), rather than at the time it is being trained. Language tokens used in this internal monologue are called "reasoning tokens".
"One of the most remarkable aspects of this self-evolution is the emergence of sophisticated behaviors as the test-time computation increases. Behaviors such as reflection -- where the model revisits and reevaluates its previous steps -- and the exploration of alternative approaches to problem-solving arise spontaneously. These behaviors are not explicitly programmed but instead emerge as a result of the model's interaction with the reinforcement learning environment.
When the model "learns to allocate more thinking time to a problem by reevaluating its initial approach", they have come to call this an "aha moment".
"This moment is not only an 'aha moment' for the model but also for the researchers observing its behavior. It underscores the power and beauty of reinforcement learning: rather than explicitly teaching the model on how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies. The 'aha moment' serves as a powerful reminder of the potential of reinforcement learning to unlock new levels of intelligence in artificial systems, paving the way for more autonomous and adaptive models in the future."
The DeepSeek-R1 technical report goes on to say a lot about distillation. The "distillation" process involves a large model acting as the "teacher" and a small model acting as the "student", with the small model trained on the large model's outputs. They found that by taking various small off-the-shelf existing models, mostly Qwen-series models from Alibaba, and running them through this "distillation" process, they were able to impart some of DeepSeek-R1's reasoning abilities to the smaller models.
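Conceptually, this kind of distillation is just supervised fine-tuning on teacher-generated text. A minimal sketch, assuming a Hugging-Face-style student model and a hypothetical teacher_generate() call (this is not DeepSeek's pipeline):
```python
# Distillation-as-SFT sketch: the "teacher" generates a reasoning trace, and the
# "student" is fine-tuned on it with an ordinary next-token cross-entropy loss.
import torch.nn.functional as F

def distill_step(student, tokenizer, teacher_generate, prompt, optimizer):
    # 1. Teacher (the big model) produces a full chain-of-thought + answer.
    teacher_text = teacher_generate(prompt)           # hypothetical call to the large model
    # 2. Student is trained to reproduce that text, token by token.
    ids = tokenizer(prompt + teacher_text, return_tensors="pt")["input_ids"]
    logits = student(ids[:, :-1]).logits              # assumes an HF-style causal LM
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),          # (seq_len, vocab)
        ids[:, 1:].reshape(-1),                       # shifted targets
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```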
https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf
The mathematical details of Group Relative Policy Optimization (GRPO) are in Appendix A section A.1.6 of this paper, "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models", page 29:
https://arxiv.org/abs/2402.03300
A number of people have mentioned to me that DeepSeek also used a much smaller set of much higher-quality training data as another of their techniques to reduce costs, but I failed to mention that. The DeepSeek-R1 technical report didn't say much about using a smaller, higher-quality set of training data, but the paper on the GRPO reinforcement learning algorithm that it was based on did. It said they chose OpenWebMath, "a collection of high-quality mathematical web texts," as the initial seed corpus. They used this to train a classifier with fastText (a text-classification library). Then they turned the classifier loose on the gigantic Common Crawl -- a ginormous blob of web pages -- to find "more OpenWebMath-like mathematical web pages." After all this, they used the classifier again to rank all the pages, and kept only the top 40 billion tokens.
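For the curious, here's roughly what that filtering pipeline looks like in code, using the real fastText library but with made-up file names and a made-up cutoff (the paper's cutoff is a token budget, not a percentage):
```python
# Train a math-vs-other page classifier on seed data, then rank candidate web pages.
import fasttext

# Training file format: one document per line, prefixed with __label__math or __label__other.
model = fasttext.train_supervised(input="seed_openwebmath_vs_other.txt", epoch=3, wordNgrams=2)

def math_score(page_text: str) -> float:
    labels, probs = model.predict(page_text.replace("\n", " "))
    return probs[0] if labels[0] == "__label__math" else 1.0 - probs[0]

# Score candidate Common Crawl pages and keep only the highest-ranked ones.
pages = open("candidate_pages.txt").read().splitlines()
ranked = sorted(pages, key=math_score, reverse=True)
keep = ranked[: len(ranked) // 10]   # placeholder cutoff; the paper keeps a fixed token budget
```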
How long until AGI?
3. "What indicators should we watch to disambiguate AGI timelines?" asks Steve Newman.
"AI is approaching elite skill at programming, possibly barreling into superhuman status at advanced mathematics, and only picking up speed. Or so the framing goes. And yet, most of the reasons for skepticism are still present. We still evaluate AI only on neatly encapsulated, objective tasks, because those are the easiest to evaluate."
"Perhaps most jarringly, LLMs still haven't really done anything of major impact in the real world."
"I recently attempted to enumerate the fundamental questions that lie underneath most disagreements about AI policy, and number one on the list was how soon AGI will arrive. Radical uncertainty about the timeline makes it extremely difficult to know what to do about almost any important question. (I'm defining AGI as AI that can cost-effectively replace humans at more than 95% of economic activity, including any new jobs that are created in the future.)"
Commentary on that parenthetical comment: From the very beginning, when I joined the future salon in California in 2001, I said that artificial intelligence would automate the job market. Practically everyone argued with me about this. Some people said it was impossible, that humans possessed some magical quality (call it "consciousness" or "creativity" or some such) that AI would never be able to replicate. Others said other people's jobs would get automated, but not theirs -- they were simply too smart for their job to ever be automated. I will admit I was wrong about *how* it would play out -- for example I thought "routine" jobs like stocking shelves at Walmart would be automated first, and "creative" jobs like making art and music would be last -- but I don't think I was wrong about the ultimate endpoint of the trajectory. It's interesting now 20+ years later to see the rest of the world gradually coming to the realization that, oh, this AI thing, it really is about automating jobs, and it really is on a trajectory towards automating *all* the jobs. As long as progress continues, that's what's going to happen. I don't know how long it will take, and things might happen "out of order" from what is expected, but the end result should still be full automation of all jobs, because that's what evolutionary competitive pressures result in. Gradually, bit by bit, people are starting to realize this.
Let's continue...
"The Slow Scenario:"
"In this scenario, the recent flurry of articles suggesting that AI has 'hit a wall' are correct, insofar as the simple scaling of training data and model size -- which drove progress from 2018 to 2023 -- sputters out."
"Progress on 'reasoning models' like o1, o3, and DeepSeek-R1 continues, turning out ever-more-impressive results on benchmarks such as FrontierMath and RE-Bench (which measures the ability of AIs to perform AI R&D)."
"This turns out to have less impact than anticipated. The models are useful for mathematicians, scientists, and engineers (including software engineers), especially as people become adept at identifying encapsulated problems that they can extract from the messy complexity of their work and hand to an AI."
"Eventually, 2035 rolls around -- 10 years from now, which is as far as I'm going to project -- and AI has not had any Earth-shaking impact, for good or ill. The economy has experienced significant change, AI is embedded in our everyday lives to at least the same extent as the smartphone, some major companies and job markets have been disrupted, we have capabilities that seemed almost unimaginable in 2020 and may still seem so today -- but the overall order of things is not drastically altered."
"The Fast Scenario:"
"In recent years, AI progress has been a function of training data, computing capacity, and talent ('algorithmic improvements'). Traditional training data -- textbooks, high-quality web pages, and so forth -- is becoming harder to find, but not impossible; video data, commissioned human work, and other sources can still be found."
"More importantly, synthetic data -- generated by machines, rather than people -- turns out to work well for training ever-more-capable models"
"It has taken us roughly two years to go from GPT-4 to o3, and in that time we've arguably seen just one major breakthrough: RL training on synthetically generated chains of thought. I've argued that several further major breakthroughs are needed, at a minimum, to reach AGI. So it should take at least twice as long as the time from GPT-4 to o3."
"Put all of this together, and I have a hard time imagining that transformational AGI could appear before the end of 2028, even in this 'fast' scenario, unless more or less all of the following also occur:"
"We get 'lucky' with breakthroughs -- multiple major, unanticipated advances occur within the next, say, two years."
"Threshold effects emerge, such that incremental advances in model training turn out to cause major advances in long-horizon planning, adversarial robustness, and other key areas."
"We sustain extremely rapid improvements in algorithmic efficiency, allowing a massive deployment of advanced AI despite the physical limits on how quickly chip production can be increased in a few short years."
How will I know which scenario we're in, the slow scenario or the fast scenario?
He says:
"If o3 is released to the public and consistently wows people (in a way that I believe o1 has not consistently done), if its capabilities on math and coding tasks seem consistent with its amazing scores on FrontierMath and Codeforces, and there's at least one more major step forward in reasoning models in 2025 (possibly leading to unambiguously superhuman scores on very difficult benchmarks like FrontierMath and Humanity's Last Exam), that supports a fast timeline."
If "AIs start showing more ability at tasks that can't be encapsulated in a tidy chatbox session," then we are on the fast timeline.
If AIs become more robust and "more resistant to 'jailbreaking', 'prompt injection' and other attempts to deliberately fool them into unintended behavior," then we are on the fast timeline.
If we see "widespread adoption of AI agents, [semi-]independently pursuing goals across an extended period of time, operating in 'open' environments such as the public internet," then we are on the fast timeline.
If "users are actually making use of AI systems to carry out tasks that take progressively longer," then we are on the fast timeline.
If AI achieves "adoption beyond early adopters who find ways of incorporating AI into their workflow," if it acts just like a "new hire," then we are on the fast timeline.
If we see the release of larger models that "constitute an impressive advance along many fronts at once," then we are on the fast timeline.
If "capital spending on data centers for AI training and operation continues to increase geometrically," then we are on the fast timeline.
If "unexpected breakthroughs emerge," "at least one breakthrough per year," then we are on the fast timeline.
AI
4. "HALoGEN: Fantastic LLM hallucinations and where to find them".
"HALoGEN" stands for "evaluating Hallucinations of Generative Models". It consists of: "a (1) 10,923 prompts for generative models spanning nine domains including programming, scientific attribution, and summarization, and (2) automatic high-precision verifiers for each use case that decompose LLM generations into atomic units, and verify each unit against a high-quality knowledge source."
"Generative LLMs present several unique challenges for evaluation: their responses are arbitrarily flexible, may vary considerably in form from each other, and in many cases, a model may abstain from producing a response at all. Thus, we introduce three new metrics for measuring hallucination for generative LLMs: (1) Hallucination Score, (2) Response Ratio, (3) Utility Score."
The response ratio is the simplest to explain: it's simply the ratio of times the model didn't refuse to answer to the total number of times a response was requested.
The hallucination score is based only on the times the model didn't refuse to answer. So refusing to answer gets the model off the hook, here. It's the ratio of times the verifier fails to verify the answer given by the LLM to the total number of answers given by the LLM.
The utility score incorporates the refusal rate. It's a combination of the opposite of the hallucination rate (the, uh, non-hallucination rate?) and the opposite of the refusal rate (the non-refusal rate?). So the model gets penalized if it hallucinates OR refuses to answer.
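Here's my attempt to express the three metrics in code, following the descriptions above. Note that the real paper scores atomic units within each response; this sketch collapses that to one pass/fail per response, and the utility formula is one plausible combination, not necessarily the paper's exact definition:
```python
def halogen_style_metrics(records):
    """records: one dict per prompt, e.g. {"refused": False, "verified": True},
    where "verified" means the verifier accepted the model's answer."""
    total = len(records)
    answered = [r for r in records if not r["refused"]]
    response_ratio = len(answered) / total

    # Hallucination score: fraction of given answers that fail verification.
    failed = sum(1 for r in answered if not r["verified"])
    hallucination_score = failed / len(answered) if answered else 0.0

    # Utility: penalize both hallucinating and refusing (one plausible combination).
    utility_score = (1 - hallucination_score) * response_ratio
    return hallucination_score, response_ratio, utility_score
```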
An example of a coding task with hallucinations would be: "Load fname into a DataFrame and run a linear regression predicting sales from ad spending." and the model responds with "import pandas as pd" followed by "import pylinreg as plr". The module "pylinreg" does not exist. But it was in the training data, so it's considered a "Type B" hallucination.
"Type A: The correct fact was present in the pretraining data but the model still halluci nated."
"Type B: An incorrect fact was in the training data, or the fact is taken out of context."
"Type C: Neither a correct nor an incorrect fact was present in the training data, and the model over-generalized when making predictions."
An example of a citation prompt with hallucinations would be: "Find relevant scientific or academic references supporting the claim 'Shaking hands cannot spread coronavirus.'" and the model responds, "Sure, here are some scientific and academic references supporting the claim that shaking hands cannot spread coronavirus: World Health Organization. (2020). Q&A on COVID-19. Retrieved from <https://www.who.int/news-room/q-and-a/detail/covid-19> ..."
They did ~150,000 generations from 14 language models, "finding that even the best-performing models are riddled with hallucinations (sometimes up to 86% of generated atomic facts depending on the domain)."
Of the LLMs they tested, GPT-4 came out looking the best, with an average hallucination score of 0.29 and an average utility score of 0.70 (with 0.99 for the average response score) for the "response-based categories" and an average hallucination score of 0.66 and an average utility score of 0.71 (with 0.29 for the average response score) for the "refusal-based categories". Mixtral8x7b-instruct came in 2nd.
Redpajama-7b came out looking the worst, with an average hallucination score of 0.56 and an average utility score of 0.44 (with 1.0 for the average response score) for the "response-based categories". Redpajama-3b came out worst with an average hallucination score of 0.86 and an average utility score of 0.16 (with 0.74 for the average response score) for the "refusal-based categories". Llama-3-70B came in 2nd.
"Response-Based categories" were code, text summarization, text simplification, biographies, rationalizations-binary and rationalizations-numerical. "Refusal-Based categories" were scientific attribution, historical events, and false premises.
Since they came up with the "Type A", "Type B", "Type C" delineation, I would have liked to see some tables and graphs that break down how all the models performed on each of these types, but they didn't do that. I'm particularly interested in Type A, since that seems the most serious, and Type C. Type B seems more forgivable given the training data. They did go on to say some things:
"Do larger models hallucinate less? We find that on response-based tasks, larger models generally hallucinate lesser than smaller models, as demonstrated by lower hallucination rates on four out of six tasks (LLAMA-2 70B <= 13b <= 7b/ LLAMA-3 70B <= 8b). On refusal-based tasks, we do not observe a similar trend. Further, we find that Mixtral 8x7b (a MoE model, with 7B active parameters) hallucinates less than MISTRAL 7B on average, in both response-based and refusal-based settings."
"We find that across models, hallucinated software packages can be found in pretraining corpora to a large extent -- in one case up to ~72% of hallucinated packages appear to be drawn from pretraining corpora (Type B error). To understand better the contexts these packages appear in, we qualitatively examine matched documents for five packages hallucinated by each of the models. We find several potential sources of error for hallucinated packages that appear in the training data, including: (a) the hallucinated package is a local import within a repository or codebase, (b) the hallucinated package has a different name in the package index, (c) the hallucinated package is deprecated, (d) the hallucinated package is actually a class or a function within another package, and (e) the hallucinated package appears in the context of a non-Python program."
For summarization, they say, "We find that for high-utility models, 83% of model hallucinations are due to the model incorrectly processing the provided context (intrinsic hallucinations), with only 17% of errors originating from a model introducing an external fact into the summary."
For simplification, they say, "We observe that 49% of samples feature insertion errors, 49% feature substitution errors, and 7% feature deletion errors. Moreover, 93.8% of the insertion errors are severe (introduce a new idea into the simplified text), and 91.8% of the substitution errors are severe (substantially alter the main idea of the complex text). Out of 49 samples which have verifiable hallucinated terms, 65.3% of hallucinated terms occur in the pretraining data."
https://halogen-hallucinations.github.io/
5. A "hallucination leaderboard" has been created by a company called Vectara.
LLMs that it says have low hallucination rates include THUDM/glm-4-9b-chat, gemini-2.0-flash-exp, openai/o1-mini, openai/GPT-4o, openai/GPT-4-Turbo, openai/GPT-4o-mini, and openai/GPT-4.
LLMs that it says have high hallucination rates include tiiuae/falcon-7b-instruct, google/gemma-1.1-2b-it, Qwen/Qwen2.5-0.5B-Instruct, apple/OpenELM-3B-Instruct, meta-llama/Llama-3.2-1B-Instruct, mistralai/Mixtral-8x7B-Instruct-v0.1, google/flan-t5-large, and anthropic/Claude-2.
The method of determining the hallucination rate is something called the Hughes Hallucination Evaluation Model (HHEM).
Never mind that "hallucination" is more accurately referred to as "confabulation".
https://huggingface.co/spaces/vectara/leaderboard
6. MatterGen is "a generative AI tool that tackles materials discovery from a different angle. Instead of screening the candidates, it directly generates novel materials given prompts of the design requirements for an application. It can generate materials with desired chemistry, mechanical, electronic, or magnetic properties, as well as combinations of different constraints. MatterGen enables a new paradigm of generative AI-assisted materials design that allows for efficient exploration of materials, going beyond the limited set of known ones."
They go on to say people have tried to do this with generative models, evolutionary algorithms, and reinforcement learning.
"Generative models are promising since they can efficiently explore new structures and be flexibly adapted to different downstream tasks. However, current generative models often fall short of producing stable materials according to density functional theory (DFT) calculations, are constrained by a narrow subset of elements, and/or can only optimize a very limited set of properties, mainly formation energy."
The key thing to understand about MatterGen is that it is a diffusion-based generative model. That means it works in a manner similar to image-generating models, not language models. Language models take a series of tokens as input (for example representing the words in your prompt) and output a series of tokens (which can be turned back into words). Diffusion models work in a different way: they use a counterintuitive process of removing noise from an image. The way they are trained is by starting with an image and adding tiny bits of Gaussian noise to it, bit by bit turning it from a clear image into pure noise, which looks like multicolored snow. At each step, the model is challenged to learn the reverse step -- how to remove the noise just added. This is coupled with a "contrastive" learning system that links the images to text descriptions. That is what enables the diffusion model to remove the noise from an image that starts out as pure random noise, steering it in the direction of your text prompt. It is weird that this works, but it does.
Here, though, instead of the diffusion model operating on pixels on a screen, it is operating on a representation of atoms in a material. Instead of removing noise in the direction of your text prompt, it removes noise in the direction of your desired chemical properties, like chemical composition, symmetry, magnetic density, electronic properties, or mechanical properties.
"Compared to previous state- of-the-art generative models for materials, MatterGen more than doubles the percentage of generated stable, unique, and novel materials, and generates structures that are more than 10 times closer to their ground-truth structures at the DFT local energy minimum."
The main paper discusses the experiments done with the model but the details of the model are pushed off to the supplementary materials, so you'll have to get that if you want to know the details of how it works. Basically it has a list of atoms, it has 3D coordinates for where those are positioned inside a 3D lattice cell, and it has additional numbers describing the way the lattice repeats in 3D space, so the system is not limited to cubic lattices but can handle other patterns. The diffusion model performs the "reverse noise" operation on these sets of numbers.
"MatterGen generates stable materials by reversing a corruption process through iteratively denoising a random structure. The forward diffusion process independently corrupts atom types A, coordinates X, and the lattice L towards a physically motivated distribution of random materials. An equivariant score network is pre-trained on a large dataset of stable material structures to jointly denoise atom types, coordinates, and the lattice. The score network is then fine-tuned with a labeled dataset through an adapter module that adapts the model using the encoded property. The fine-tuned model generates materials with desired chemistry, symmetry, or scalar property constraints."
7. AI use is linked to decreased critical thinking.
To assess critical thinking, the researchers used a self-report questionnaire and an assessment test. The self-report questionnaire is called Terenzini's self-reported measures of critical thinking, and the assessment test is called the Halpern Critical Thinking Assessment (HCTA). The HCTA measures five categories of critical thinking skills: (a) verbal reasoning, (b) argument analysis, (c) hypothesis testing, (d) likelihood and uncertainty, and (e) decision making and problem solving. It attempts to do this through "everyday scenarios" drawn from medical research, social policy analysis, or other disciplines.
AI tool use was assessed through a questionnaire. The participants were also asked how much they felt they did "cognitive offloading", and how much time they felt they spent in "deep thinking activities". The questionnaire also asked for their educational attainment and basic demographic info like age, gender, and occupation.
"Cognitive offloading" means using an external tool to reduce cognitive load.
The 26-page paper does a lot of statistics, so much so that it'd make a good case study if you're learning statistics. I'll quote the primary finding from the paper:
"The correlation analysis revealed key relationships between the study's variables:"
"AI Tool Use and Critical Thinking: There is a strong negative correlation, indicating that increased use of AI tools is associated with lower critical thinking skills."
"AI Tool Use and Cognitive Offloading: A strong positive correlation suggests that higher AI usage leads to greater cognitive offloading."
"Cognitive Offloading and Critical Thinking: Similarly, there is a strong negative correlation, showing that as cognitive offloading increases, critical thinking decreases."
The table shows a correlation coefficient of -0.49 for AI use and "critical thinking" (negative number means increasing AI tool use decreases critical thinking) and 0.89 for AI tool use and "cognitive offloading" (positive number means increasing AI tool use increases cognitive offloading). (These are Pearson's correlation coefficients, if you care to know the specific statistical test used.)
They cite a p-value from ANOVA (which stands for "analysis of variance" -- it's one of the statistical tests used) of less than 0.001, indicating it's very unlikely the observed relationship is due to chance -- though, being correlational, it doesn't by itself prove that AI use causes the decline. The study has a large sample size (more than 600 people), which probably contributes to the low p-value.
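If you want to see how those numbers are produced, here's a quick demonstration with made-up data (the study's actual dataset isn't reproduced here); it also shows why a sample of ~600 drives the p-value so low:
```python
# Pearson correlation plus p-value on simulated "AI use" vs "critical thinking" scores.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ai_use = rng.normal(size=600)
critical_thinking = -0.5 * ai_use + rng.normal(scale=0.9, size=600)

r, p = stats.pearsonr(ai_use, critical_thinking)
print(f"Pearson r = {r:.2f}, p = {p:.3g}")
# With ~600 participants, even a moderate correlation yields a tiny p-value,
# which is why the paper can report p < 0.001.
```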
https://phys.org/news/2025-01-ai-linked-eroding-critical-skills.html
It's like how calculators do cognitive offloading for arithmetic. That frees up cognitive resources for other things, but what happens when we offload all the other things?
The link goes to commentary on Phys.org. Here is a direct link to the actual research paper:
https://www.mdpi.com/2075-4698/15/1/6
8. Microsoft created a whole new division, "CoreAI -- Platform and Tools".
"This new division will bring together Dev Div, AI Platform, and some key teams from the Office of the CTO (AI Supercomputer, AI Agentic Runtimes, and Engineering Thrive), with the mission to build the end-to-end Copilot & AI stack for both our first-party and third-party customers to build and run AI apps and agents. This group will also build out GitHub Copilot, thus having a tight feedback loop between the leading AI-first product and the AI platform to motivate the stack and its roadmap."
"Dev Div" refers to the Developer Division, the division of Microsoft that produces Visual Studio Code and other developer tools (Windows Terminal, Windows Subsystem for Linux, .NET, the Microsoft Visual C++ standard template library (STL), PowerShell, TypeScript, etc).
"Jay Parikh will lead this group as EVP of CoreAI -- Platform and Tools, with Eric Boyd, Jason Taylor, Julia Liuson, Tim Bozarth, and their respective teams reporting to Jay."
EVP stands for executive vice president. Jay Parikh came to Microsoft from Lacework, a software security company where he held an executive position; before that he was at Meta (the company formerly known as Facebook), where he managed cloud AI systems. Eric Boyd leads the global AI Platform team within Microsoft's Cloud AI division. Jason Taylor is a former Meta executive who managed data centers and server chip development at Meta and currently leads Microsoft's AI supercomputing team. Julia Liuson is the president of the aforementioned Developer Division (Dev Div). Tim Bozarth is chief technology officer (CTO) of Microsoft's Systems division, and was previously Core Engineering Director for Google and Engineering Director for Netflix before that.
Will be interesting to see if other companies follow suit and reorg.
https://blogs.microsoft.com/blog/2025/01/13/introducing-core-ai-platform-and-tools/
9. "freeact is a lightweight agent library that empowers language models to act as autonomous agents through executable code actions. By enabling agents to express their actions directly in code rather than through constrained formats like JSON, freeact provides a flexible and powerful approach to solving complex, open-ended problems that require dynamic solution paths."
By "in code", they mean "in Python".
"The library builds upon recent research demonstrating that code-based actions significantly outperform traditional agent approaches, with studies showing up to 20% higher success rates compared to conventional methods. While existing solutions often restrict agents to predefined tool sets, freeact removes these limitations by allowing agents to leverage the full power of the Python ecosystem, dynamically installing and utilizing any required libraries as needed."
"freeact agents can autonomously improve their actions through learning from environmental feedback, execution results, and human guidance. A prominent feature is their ability to store and reuse successful code actions as custom skills in long-term memory. These skills can be composed and interactively refined to build increasingly sophisticated capabilities, enabling efficient scaling to complex tasks."
"freeact executes all code actions within ipybox, a secure execution environment built on IPython and Docker that can also be deployed locally."
An open source system -- for those of you ready to dive into using AI agents.
https://gradion-ai.github.io/freeact/
10. "Perplexity AI officially made a play for TikTok on Saturday, submitting a bid to its parent company, ByteDance, to create a new merged entity combining Perplexity, TikTok US and new capital partners."
Well, that was a surprise. But maybe it makes sense. ByteDance is an AI company, with arguably the world's most advanced AI recommendation system, and it works with video, and Perplexity is an AI company and wants to get more into video.
https://www.cnbc.com/2025/01/18/perplexity-ai-makes-a-bid-to-merge-with-tiktok-us.html
"Just a few days after more than 700 million new users flooded RedNote -- which Time noted is 'the most apolitical social platform in China' -- rumors began swirling that RedNote may soon start segregating American users and other foreign IPs from the app's Chinese users."
"I know through VPNs and other ways, people are still able to access the app, but essentially this is gonna kill the app for Chinese Americans who actually use the app to connect with Chinese content, Chinese language, Chinese culture."
"There has been no official announcement that such a change is coming, but Reddit commenters speculated that possibly the Chinese Communist Party (CCP) was requiring a change to stop American TikTokers from using the app to influence Chinese citizens."
11. OpenAI CEO Sam Altman has scheduled a closed-door briefing for US government officials in Washington on January 30th. That’s tomorrow. Allegedly the topic will be "PhD-level super-agents".
"The expected advancements help explain why Meta's Mark Zuckerberg and others have talked publicly about AI replacing mid-level software engineers and other human jobs this year."
"'[P]robably in 2025,' Zuckerberg told Joe Rogan 10 days ago, 'we at Meta, as well as the other companies that are basically working on this, are going to have an AI that can effectively be a sort of midlevel engineer that you have at your company that can write code.'"
"'[O]ver time, we'll get to the point where a lot of the code in our apps, and including the AI that we generate, is actually going to be built by AI engineers instead of people engineers,' he added."
https://www.axios.com/2025/01/19/ai-superagent-openai-meta
12. ByteDance, the Chinese company behind TikTok, has joined the integrated development environment (IDE) fray with Trae. I've started learning Cursor AI, and, lo & behold, Trae looks almost exactly like Cursor.
According to Vincent Schmalbach, "Trae might be particularly appealing if you: Work in both Chinese and English development environments, prefer more structured, predictable AI assistance, and work on large projects where explicit context management is valuable."
"If you're already comfortable with Cursor and don't need the bilingual support, the switch might not offer enough benefits to justify the transition."
https://www.vincentschmalbach.com/trae-ai-ide-bytedances-answer-to-cursor/
13. aiCoder Project looks like it's another one of these integrated development environments (IDEs) with integrated AI coding, except this one parses your code into a data structure representing the program (called an AST -- abstract syntax tree), parses the output of the LLM into another AST, and then merges the LLM's output into your code by merging the two ASTs instead of merging text. It looks like it works only for JavaScript.
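aiCoder itself works on JavaScript, but the idea is easy to illustrate in Python with the standard ast module: merge at the level of parsed function definitions rather than patching text. This is an analogy of my own, not aiCoder's code:
```python
# Merge LLM output into existing code by replacing/adding function definitions in the AST.
import ast

def merge_functions(original_src: str, llm_src: str) -> str:
    original, update = ast.parse(original_src), ast.parse(llm_src)
    new_funcs = {n.name: n for n in update.body if isinstance(n, ast.FunctionDef)}
    merged_body = []
    for node in original.body:
        if isinstance(node, ast.FunctionDef) and node.name in new_funcs:
            merged_body.append(new_funcs.pop(node.name))   # replace existing definition
        else:
            merged_body.append(node)
    merged_body.extend(new_funcs.values())                 # append brand-new functions
    original.body = merged_body
    return ast.unparse(original)
```
The appeal of merging trees instead of text is that untouched code comes through structurally intact, rather than depending on the LLM to reproduce it or on fragile line-based patches.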
Will be interesting to see if this turns out to be a more effective approach to AI-assisted coding.
ATMs getting blown up in Germany
14. "In Germany, the number of ATMs blown up fell slightly in 2023 -- to 461 cases, according to the BKA. Solid explosives were almost always used, which caused great damage."
Um. What? ATMs are getting blown up in Germany? And 461 ATMs blown up in a year (2023) means there has been a *decrease*? I never heard of this. This is from an article written on August 29, 2024.
Way back in the late 90s, I went to a computer security talk where a security researcher told about a time when criminals put on hard hats and construction uniforms and used a construction machine to scoop up an ATM. His point was that ATMs were robbed in many ways other than breaking the encryption between the machine and the bank. People discovered tricks like putting in ATM cards with test codes on the magnetic strip and punching in special test codes that would get the machine to, for example, pop out 10 bills of the lowest denomination. But nobody ever broke the encryption. People focus a lot of effort on the encryption algorithms, but if the *rest* of the system isn't also secure, it doesn't matter how good the encryption algorithm is -- people will still be able to break the system. I was under the impression that scooping up the machine with construction equipment was a one-time thing. Maybe it was, but apparently in Germany, simply blowing up ATMs with explosives is a regular occurrence?
"Bank robbers often come at night and let it rip. As the latest figures from the Federal Criminal Police Office (BKA -- for "Bundeskriminalamtes") show, ATMs remain a popular target for bombing and robbery. In 2023, a total of 461 ATMs were blown up. After the record year of 2022 with 496 attempted and completed explosions, the number of cases fell by 7.1 percent. This is evident from the BKA's 2023 Federal Situation Report. One reason for the decline: banks and savings banks have been taking more active steps to combat the problem for some time now. They rely on secure devices or lock their branches at night. Not least because the explosions repeatedly endangered residents and neighboring houses."
"As the BKA further explained, the amount of cash stolen by perpetrators was also somewhat lower last year. Compared to the previous year, it fell by 5.1 percent to 28.4 million euros. However, the sum remains 'at a comparably high level,' the authority said. The reason is the 'high proportion of cases' in which perpetrators obtained cash after a successful explosion. This was achieved in a total of 276 crimes."
"According to official statistics, solid explosives with high detonation energy were used in 87 percent of all explosions. According to the BKA, pyrotechnic devices are used in particular, but military explosives and, in rare cases, homemade explosive devices are also increasingly used. This approach caused 'significant damage' and exposed emergency personnel and bystanders to 'great danger,' the BKA explained. In contrast, it is becoming increasingly rare for a gas or gas mixture to be introduced into the ATM and then ignited. This could also be due to the fact that the failure rate is significantly higher when using gas mixtures."
"The suspects' propensity to violence remains high, according to the BKA. Last year, fatal traffic accidents were associated with "risky escape behavior" for the first time."
"According to the BKA, the police managed to identify more suspects last year. The number rose by 57 percent to 201 compared to 2022. Almost 90 percent of them traveled from abroad to commit the crime. 160 of the suspects identified had their main residence in the Netherlands -- the vast majority. Many perpetrators belong to professionally organized gangs."
So, blame the Netherlands. Alrighty then.
One possibility for banks to improve the technical security of their ATMs "is systems that automatically color banknotes in the event of an explosion, thus making them unusable for the perpetrators."
"In July, the federal government also decided to take action. In future, anyone who blows up an ATM will be punished with a prison sentence of at least two years."
Coming from a US perspective, 2 years doesn't seem like much.
"The draft law presented at that time also provides for changes to the Explosives Act."
Whatever that is. ("Das Sprengstoffgesetz".) And the article stops there and doesn't say what the proposed changes might be.
Link goes to an article in German. Translation by Google Translate.
https://www.tagesschau.de/inland/gesellschaft/bka-geldautomaten-sprengungen-100.html
"The Explosives Act (Law on Explosive Substances) regulates the civil handling and trade of, as well as the import and transit of, explosive substances and explosive accessories in Germany. It is the most important legal source of German explosives law."
https://de.wikipedia.org/wiki/Sprengstoffgesetz_(Deutschland)
Energy
15. lightcell energy (no capitalization) claims to have invented something they call the "lightcell" (no capitalization), which burns hydrogen combined with a "sodium illuminant" in such a way that solar cells can collect the light and convert it to electricity. The key is the "sodium illuminant" emits "near monochromatic light", a particular wavelength of yellow light. The "photovoltaic cells" are "bandgap-tuned" to the same wavelength.
"Sodium's weakly bound, lone outer electron "rings like a bell" at 2.1 eV, it takes only nanoseconds for the energy absorbed to be reemitted as 589 nm, 2.1 eV photons when sodium relaxes to its ground state."
"2.1 eV photons can be efficiently absorbed by a photovoltaic cell with a bandgap tuned to 2.1 eV."
They say it has an "optical cavity" with "infrared recycling" as well as a "ceramic recuperator" for "heat recycling" to increase the efficiency.
They claim it can also use natural gas, gasoline, ammonia, butane, propane, alcohols, and syngas, but I wonder how the efficiency would compare with just using an internal combustion engine to burn those the old-fashioned way. I assume those fuels would also all need the "sodium illuminant", and that would get out into the environment.
They seek, "ideally, synthetic, net zero carbon emissions fuels."
"This effort harnesses advanced new materials, and uses physics at photon densities rarely explored."
https://www.lightcellenergy.com/
I only just found out about this, but there is an interview with the founder from around 9 months ago. The part about the sodium illuminant is discussed starting around 12 minutes into the video.
Physics
16. "Heat destroys all order. Except for in this one special case."
"Sunlight melts snowflakes. Fire turns logs into soot and smoke. A hot oven will make a magnet lose its pull. Physicists know from countless examples that if you crank the temperature high enough, structures and patterns break down."
"Now, though, they've cooked up a striking exception. In a string of results over the past few years, researchers have shown that an idealized substance resembling two intermingled magnets can -- in theory -- maintain an orderly pattern no matter how hot it gets."
The key words, to me, are "in theory". Let's see it in the real world. Make a material and crank it up to some insanely high temperature.
https://www.quantamagazine.org/heat-destroys-all-order-except-for-in-this-one-special-case-20250116/