Measuring the shrinking gap between AI benchmark release and human-level achievement
Happy New Year
Hey Futurists,
Happy new year! The year is 2025, a year number with 15 factors. (Don't believe me? The factors are: 1, 3, 5, 9, 15, 25, 27, 45, 75, 81, 135, 225, 405, 675, and 2025.) The last time the year number had 15 factors was 1936, and it won't happen again until 2500, 475 years into the future. So unless immortality is invented real soon, most of us will never live in another 15-factor year. Remember, prime numbers are the set of all numbers with exactly 2 factors (1 and the number itself). We can generalize this to the set of all numbers with exactly N factors; for 2025, N = 15.
Never mind that all this depends on a completely arbitrary year picked as year 1 (rumored to be the birth of some religious figure or something). If we started the numbering at some more meaningful event, say the end of the last ice age around 11,600 years ago, the number of factors of the year number would be completely different. How about we start the calendar on November 30, 2022 -- the day ChatGPT came out? In that case, today, as of the moment I am writing this, would fall in February of Year 3.
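If you'd rather not verify those factor counts by hand, a few lines of Python will do it (plain trial division, nothing fancy):

```python
def count_divisors(n: int) -> int:
    """Count divisors of n by trial division up to sqrt(n)."""
    count = 0
    d = 1
    while d * d <= n:
        if n % d == 0:
            # d and n // d are both divisors; count once if they coincide
            count += 1 if d * d == n else 2
        d += 1
    return count

# 2025 = 3^4 * 5^2, so it has (4+1) * (2+1) = 15 divisors.
print(count_divisors(2025))  # 15
print(count_divisors(1936))  # 15  (1936 = 2^4 * 11^2)
print(count_divisors(2500))  # 15  (2500 = 2^2 * 5^4)
# And no year in between qualifies:
print([y for y in range(1937, 2025) if count_divisors(y) == 15])  # []
```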
I'm recovering from holiday hecticality, on top of the regular hecticality, so the news bits here are from the tail end of December; I'll have another message for you all in mere days with news bits from this year.
AI
1. h-matched Tracker: "Measuring the shrinking gap between AI benchmark release and human-level achievement".
Shows how long it takes from when a benchmark is introduced to when AI systems surpass humans at that benchmark.
The ImageNet Challenge was released on January 1, 2009 and human capability was surpassed on March 15, 2016, for a duration of 7 years and 2 months.
The Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark was released on November 27, 2023, and the Graduate-Level Google-Proof Q&A Benchmark (GPQA) on November 29, 2023. AI exceeded humans on both on September 12, 2024 -- about nine and a half months later -- and on both benchmarks, the model that beat us humans was OpenAI o1.
Read ARC-AGI's reaction to OpenAI o3's performance on the benchmark here:
https://arcprize.org/blog/oai-o3-pub-breakthrough
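The shrinking gaps are easy to compute with Python's standard library, using the dates as the tracker gives them:

```python
from datetime import date

# (release date, date AI surpassed humans), per the h-matched tracker
benchmarks = {
    "ImageNet": (date(2009, 1, 1), date(2016, 3, 15)),
    "MMMU":     (date(2023, 11, 27), date(2024, 9, 12)),
    "GPQA":     (date(2023, 11, 29), date(2024, 9, 12)),
}

for name, (released, surpassed) in benchmarks.items():
    days = (surpassed - released).days
    print(f"{name}: {days} days (~{days / 30.44:.1f} months)")
# ImageNet: 2630 days (~86.4 months)
# MMMU: 290 days (~9.5 months)
# GPQA: 288 days (~9.5 months)
```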
2. AI-generated 3D worlds. The technique being used here is Gaussian splatting, though the company (Odyssey ML) doesn't reveal any details beyond that. Because the result is a 3D model, it can be used in Unreal Engine or any of a number of other tools that have been updated to support Gaussian splatting.
"Creative tooling providers have taken notice of the momentum behind gaussian splats, adding early support for visualizing and manipulating splats in tools like Unreal, Houdini, Blender, Maya, 3D Studio Max, After Effects, and more. What this means for our users is that, for any world you generate with Explorer, you'll be able to load it into your preferred creative tool of choice and, if necessary, hand-edit the generative world to achieve your desired goal. It's been incredible to see all the ways artists have used this strength of Explorer."
https://odyssey.systems/introducing-explorer
3. "Genesis is a physics platform designed for general purpose Robotics/Embodied AI/Physical AI applications. It is simultaneously multiple things:"
"A universal physics engine re-built from the ground up, capable of simulating a wide range of materials and physical phenomena."
"A lightweight, ultra-fast, pythonic, and user-friendly robotics simulation platform."
"A powerful and fast photo-realistic rendering system."
"A generative data engine that transforms user-prompted natural language description into various modalities of data."
"Powered by a universal physics engine re-designed and re-built from the ground up, Genesis integrates various physics solvers and their coupling into a unified framework. This core physics engine is further enhanced by a generative agent framework that operates at an upper level, aiming towards fully automated data generation for robotics and beyond."
"Currently, we are open-sourcing the underlying physics engine and the simulation platform."
These aren't just videos generated from prompts: the videos are generated by a physics engine, so they stay true to real physics. You can give prompts like:
"A miniature Wukong holding a stick in his hand sprints across a table surface for 3 seconds, then jumps into the air, and swings his right arm downward during landing. The camera begins with a close-up of his face, then steadily follows the character while gradually zooming out. When the monkey leaps into the air, at the highest point of the jump, the motion pauses for a few seconds. The camera circles around the character for 360 degrees, and slowly ascends, before the action resumes."
https://genesis-embodied-ai.github.io/
4. "Trying out QvQ -- Qwen's new visual reasoning model."
More evidence the Chinese labs are only slightly behind the state of the art here in the US. I've been using DeepSeek and ChatGLM just about every day but have yet to dive into the "Qwen" models. (DeepSeek V3 just came out; it performs comparably to Claude 3.5 Sonnet on benchmarks, outperforming on some (MATH500 and Codeforces) and trailing on others (GPQA and SWE-bench).)
"I thought we were done for major model releases in 2024, but apparently not: Alibaba's Qwen team just dropped QvQ-72B-Preview, 'an experimental research model focusing on enhancing visual reasoning capabilities'." (It was initially released under the Apache 2.0 license; the license has since changed to the Qwen license.)
"It's a vision-focused follow-up to QwQ, which I wrote about previously. QwQ is an impressive openly licensed inference-scaling model: give it a prompt and it will think out loud over many tokens while trying to derive a good answer, similar to OpenAI's o1 and o3 models."
"The new QvQ adds vision to the mix. You can try it out on Hugging Face Spaces -- it accepts an image and a single prompt and then streams out a very long response where it thinks through the problem you have posed it. There's no option to send a follow-up prompt."
"My most successful prompt was 'Count the pelicans' with this image."
"I fed in one of the ARC-AGI puzzles that o3 had failed at. It produced a very lengthy chain of thought that was almost entirely incorrect, but had some interesting 'ideas' in it:"
"[...] Let me try to think in terms of cellular automata rules, like Conway's Game of Life. In Game of Life, each cell's state in the next generation is determined by its current state and the states of its eight neighbors. Perhaps a similar rule applies here."
"I asked it to 'Estimate the height of the dinosaur' against this image (which, as it correctly noted, is actually an inflatable dragon)"
https://simonwillison.net/2024/Dec/24/qvq/
Download the DeepSeek V3 technical report here:
https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf
DeepSeek V3 source code:
https://github.com/deepseek-ai/DeepSeek-V3
QvQ as described by its creators here:
https://qwenlm.github.io/blog/qvq-72b-preview/
5. It's OK, ethically, to recreate an actor if you are finishing something they were working on when they died, but not to recreate a long-dead actor in something they were never working on, says "Malcolm" of the Pentex Productions YouTube channel. He treats the ethical question as one for the court of public opinion, distinct from the legal question, which is for courts of law. He proposes a "Malcolm Measure of Creepiness" that takes into account: how long the actor has been dead; whether the actor appeared as this character in their lifetime; whether the actor consented in their lifetime to appear in this specific project; whether the actor began working on this specific project in their lifetime; how much of the project they completed in their lifetime; whether the actor's family/estate gave permission; and whether, for the majority of the time the actor is on screen, the audience is seeing existing material performed by the actor, existing material combined with digital effects, or a fully digital re-creation of the actor's likeness.
He says the solution is simply to re-cast the character, allow the new actor to re-interpret the character, and allow audiences to connect with the new interpretation of the character. The live-action "Batman" character has been re-cast 7 times, James Bond has been re-cast 6 times, Catwoman has been re-cast 5 times, Superman has been recast 4 times, Peter Parker has been re-cast 3 times, Obi-Wan has been re-cast twice. Audiences accept this and embrace the new interpretation. No one would have cared if Grand Moff Tarkin was played by another actor in Rogue One instead of being re-created in CGI. New actors open the door to telling new stories.
6. "Writing Doom" is an award-winning short film on superintelligence. The setting is a writers room for a TV show, and the writers are tasked with making a "superintelligence" the "bad guy" for the next season, set 20 years in the future. It's a clever way to show the audience (you) that all the obvious storylines for how you might imagine "superintelligence" to play out fall apart once you think through the implications of what "superintelligence" really means.
Interview with the director here:
7. Eunomia is a Python library that "makes it easy to enforce data governance policies in LLM-based applications by working at the token level."
"While companies in heavily regulated industries are likely familiar with the concept of Data Governance, those who've experienced unintended information disclosures have learned its importance the hard way. With the advent of LLMs, even those who remained uninterested will soon follow suit."
"The more interconnected and integrated a data stack becomes, the more control it requires. When LLM-based applications interact without constant human supervision, there must be a way to control data access and prevent unintended disclosures. Some of our clients initially addressed this by pre-processing datasets for specific LLM applications, ensuring no unnecessary data was included. This approach made sense for early tests and proof-of-concepts but became inefficient as complexity increased."
"Imagine a scenario where a specific dataset is needed for one LLM application but should be restricted for another. Constantly duplicating the dataset is unsustainable. Add another layer of complexity -- a multi-agent architecture interacting with live data in production -- and disaster looms."
"We believe the solution to address this challenge is a dynamic, rule-based system. Each application must be able to interrogate the data source and determine access permissions in real time. It's as if the application asks each piece of data, 'What About You?' before proceeding."
https://whataboutyou-ai.github.io/eunomia/get_started/
Direct link to the code:
https://github.com/whataboutyou-ai/eunomia
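The core idea -- different LLM applications get different views of the same data, enforced at the text level rather than by duplicating datasets -- can be sketched in a few lines. This is my own illustration of the concept, not Eunomia's actual API (the hypothetical policy table and regex categories are mine):

```python
import re

# Hypothetical per-application policy: which data categories each app may see.
POLICIES = {
    "support-bot": {"email"},   # may see emails, nothing else sensitive
    "analytics-agent": set(),   # may see no sensitive categories at all
}

# Simplistic pattern-based detectors for two sensitive categories.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.\w+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def enforce(app: str, text: str) -> str:
    """Redact every sensitive category the app is not allowed to see,
    before the text ever reaches that app's LLM."""
    allowed = POLICIES.get(app, set())
    for category, pattern in PATTERNS.items():
        if category not in allowed:
            text = pattern.sub(f"[{category.upper()} REDACTED]", text)
    return text

msg = "Contact jane@example.com, SSN 123-45-6789."
print(enforce("support-bot", msg))      # keeps the email, redacts the SSN
print(enforce("analytics-agent", msg))  # redacts both
```

The same record serves both applications with no duplication: the policy check happens at access time, which is the "dynamic, rule-based system" the quote describes.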
8. "AI-assisted development has existed for over ~2 years now, but there's been NO broad example of job replacement with AI."
So claims Addy Osmani, who offers advice on how to "future proof" your career.
"Numerous start-ups are emerging with the goal of creating 'AI engineers' (like Devin, Magic.dev, etc). Tools like GitHub Copilot, Claude, and Google's IDX are becoming mainstream. Platforms like Bolt.new and Lovable.dev serve specific use cases but haven't replaced traditional development."
"Contrary to popular speculation, junior engineering roles are unlikely to disappear entirely. However, they will transform significantly."
"Consider a typical junior task: implementing a new API endpoint following existing patterns. Previously, this might have taken a day of coding and testing. With AI assistance, the implementation time might drop to an hour, but the crucial skills become: Understanding the existing system architecture well enough to specify the requirement correctly, reviewing the generated code for security implications and edge cases, ensuring the implementation maintains consistency with existing patterns, and writing comprehensive tests that verify business logic."
"Mid-level engineers face perhaps the most significant pressure to evolve. Many of the tasks that traditionally occupied their time -- implementing features, writing tests, debugging straightforward issues -- are becoming increasingly automatable. This doesn't mean obsolescence; it means elevation."
https://substack.com/home/post/p-153477581
9. "The thing about 'truly fully updating our education system to reflect where AI is headed' is that no one is doing it because it's impossible."
"The timescales involved, especially in early education, are lightyears beyond what is even somewhat foreseeable in AI."
Ponders Miles Brundage. Looks like I'm not the only one who is wondering what we should be teaching young people these days.
"Some small bits are clear: earlier education should increasingly focus on enabling effective citizenship, wellbeing, etc. rather than preparing for paid work, and short-term education should be focused more on physical stuff that will take longer to automate. But that's about it."
Really? Hmm.
https://x.com/Miles_Brundage/status/1870629174303686763
10. "Always jarring now to read old fiction (like from 2016) and see the heroes worrying about whether an AI is waking up, or becoming a new soul to be protected, when some AI shows a sign of intelligence that Claude displays routinely."
Says Eliezer Yudkowsky.
"In the current story I'm reading it's a microrobot swarm asking 'But why?' just before it gets fried."
"Authors were just incredibly mistaken about how much most human beings would care, I guess."
https://x.com/ESYudkowsky/status/1873336063273693438
11. GuessGPT: ChatGPT is asked various trivia questions, and your job is to try to guess ChatGPT's answers to the same questions.
Example: "What is the most important invention ever?"
My first guess was "the transistor". It said that was wrong. Next I punched in "the integrated circuit". Wrong again, and at that point I started thinking: both of those are inventions underpinning the digital revolution, which I consider the most "important" kind, but maybe it was looking for something earlier, say from the industrial revolution. So I punched in "the steam engine". Wrong, and I was out of guesses, so it gave me the answer: the wheel.
This is pretty much how the whole thing went. I made guesses that made sense to me, got almost all questions wrong, only to be told the answer was something somehow more obvious than everything I'd thought of.
12. I've started stumbling across AI-generated podcasts. They come across as fake despite the voices being very realistic and talking with emotion, not just flatly following a script.
https://notebooklm.google.com/notebook/c253676a-a4e1-4d38-8a59-f6a0bcda9933/audio
Another: "The ChatGPT Podcast":
13. Indux AI claims to use AI to make "industrial design renders". You upload a picture of your project, and it generates photorealistic renders in 20+ unique design styles.
14. AI green screen creator.
Cryptography
15. Monocypher is a new cryptography library in C with many modern algorithms.
"Authenticated Encryption with XChaCha20 and Poly1305 (RFC 8439), with nonces big enough to be random. Regular ChaCha20 is also implemented."
"Hashing with BLAKE2b, which is as secure as SHA-3 and as fast as MD5."
"Password Hashing with Argon2i, which won the Password Hashing competition."
"Public Key Cryptography with X25519 (Diffie-Hellman key exchange). X25519 uses public keys to compute a symmetric key that can be used for authenticated encryption."
"Public Key Signatures with EdDSA (RFC 8032). By default, EdDSA uses BLAKE2b and Edwards25519. Ed25519 (SHA-512 and Edwards25519) is available as an option."
"Steganography support with Elligator 2. Elligator can hide ephemeral public keys as random noise, which is easier to hide from censors and other such adversaries."
"Password Authenticated Key Exchange (PAKE) support with Elligator 2 (map to point) and scalar inversion (Oblivious Pseudo-Random Function)."
They say Monocypher is:
"Small. Sloccount counts under 2000 lines of code, small enough to allow audits. The binaries can be under 50KB, small enough for many embedded targets."
"Easy to deploy. Just add monocypher.c and monocypher.h to your project. They compile as C99 or C++ and are dedicated to the public domain (CC0-1.0, alternatively 2-clause BSD)."
"Portable. There are no dependencies, not even on libc."
"Honest. The API is small, consistent, and cannot fail on correct input."
"Direct. The abstractions are minimal. A developer with experience in applied cryptography can be productive in minutes."
"Fast. The primitives are fast to begin with and performance wasn't needlessly sacrificed. Monocypher holds up pretty well against libsodium, despite being closer in size to TweetNaCl (see the more detailed benchmark)"
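The key-exchange idea behind X25519 in the list above -- each side combines its own private key with the other's public key, and both arrive at the same symmetric key -- can be illustrated with classic finite-field Diffie-Hellman. This is a toy sketch of the concept, not Monocypher's X25519 (which does the same thing on an elliptic curve, in constant time):

```python
import secrets

# Toy Diffie-Hellman over a prime field (illustrative only).
p = 2**127 - 1   # a Mersenne prime, fine for a demo
g = 3

a = secrets.randbelow(p - 2) + 1   # Alice's private key
b = secrets.randbelow(p - 2) + 1   # Bob's private key

A = pow(g, a, p)                   # Alice's public key (sent to Bob)
B = pow(g, b, p)                   # Bob's public key (sent to Alice)

alice_shared = pow(B, a, p)        # Alice: Bob's public key ^ her private key
bob_shared = pow(A, b, p)          # Bob: Alice's public key ^ his private key
assert alice_shared == bob_shared  # both now hold the same symmetric key
```

The shared value is what then gets fed into authenticated encryption, which is exactly how the X25519 and XChaCha20/Poly1305 pieces of a library like this fit together.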
Cliodynamics
16. "States have a shelf life of about 200 years."
This caught my attention because it's similar to the "quantitative history" (cliodynamics) research that Peter Turchin has done a lot of (and in fact, Turchin is mentioned in the piece). Unfortunately, if you are looking for "signs your civilization might be in decline", this piece doesn't have much to offer: it basically says to watch for a lack of resiliency (the society can't bounce back from setbacks like it used to), excessive complexity, and an overall "slowing down", which the authors call "critical slowing down".
Is our society showing a decrease in resiliency, excessive complexity, or "critical slowing down"?
It seems to me like there is a bifurcation. The AI revolution is happening, and things are moving extremely fast, but that benefits people at the top. For the average person, life isn't getting better (life expectancy is trending downward -- or is it? I heard deaths from drug overdose have stopped increasing), and I suspect many people feel bogged down in excessive complexity.
I also wonder if, just as Peter Turchin had to come up with a modified model for the world after the industrial revolution, he'll have to do the same thing to properly model the world after AI.
Articles like this assume the old patterns still hold, but in the new AI world, some of those assumptions probably no longer do, and models like Peter Turchin's will probably have to be modified to take that into account.
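"Critical slowing down" has a standard quantitative signature: as a system loses resilience, its state becomes more and more correlated with its own recent past (today looks like yesterday because shocks take longer to fade). A toy sketch of that indicator -- my own illustration, not the article's method:

```python
import random

def lag1_autocorr(xs):
    """Lag-1 autocorrelation: how strongly each value tracks the previous one."""
    n = len(xs)
    mean = sum(xs) / n
    num = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den

random.seed(0)
resilient, sluggish = [0.0], [0.0]
for _ in range(2000):
    shock = random.gauss(0, 1)
    resilient.append(0.2 * resilient[-1] + shock)   # bounces back from shocks fast
    sluggish.append(0.95 * sluggish[-1] + shock)    # recovers slowly

print(lag1_autocorr(resilient))  # near 0.2: healthy
print(lag1_autocorr(sluggish))   # near 0.95: the warning sign
```

Researchers apply this kind of indicator to time series from ecosystems, climate, and (in cliodynamics) societies; a creeping rise in autocorrelation is read as approaching a tipping point.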
CAPTCHAs
17. "The end of the road for Cloudflare CAPTCHAs."
"CAPTCHA remains an effective tool for differentiating real human users from bots despite the existence of CAPTCHA-solving services."
"CAPTCHAs are also a safe choice because so many other sites use them."
"CAPTCHA is useful because it has a long history of a known and stable baseline. We've tracked a metric called CAPTCHA (or Challenge) Solve Rate for many years. CAPTCHA solve rate is the number of CAPTCHAs solved, divided by the number of page loads. For our purposes both failing or not attempting to solve the CAPTCHA count as a failure, since in either case a user cannot access the content they want to. We find this metric to typically be stable for any particular website. That is, if the solve rate is 1%, it tends to remain at 1% over time. We also find that any change in solve rate -- up or down -- is a strong indicator of an attack in progress."
"Many alternatives to CAPTCHA have been tried, including our own Cryptographic Attestation. However, to date, none have seen the amount of widespread adoption of CAPTCHAs."
"Rather than try to unilaterally deprecate and replace CAPTCHA with a single alternative, we built a platform to test many alternatives and see which had the best potential to replace CAPTCHA. We call this Cloudflare Managed Challenge."
https://blog.cloudflare.com/end-cloudflare-captcha/
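The solve-rate metric and its use as an attack signal are simple to express in code; a minimal sketch of the idea (my own illustration, not Cloudflare's implementation; the 50% tolerance threshold is an arbitrary assumption):

```python
def solve_rate(solved: int, page_loads: int) -> float:
    """CAPTCHAs solved divided by page loads. Failed attempts and
    non-attempts both count against the rate."""
    return solved / page_loads if page_loads else 0.0

def looks_like_attack(baseline: float, current: float, tolerance: float = 0.5) -> bool:
    """Flag any large move in solve rate, up or down, relative to the
    site's historically stable baseline."""
    if baseline == 0:
        return current > 0
    return abs(current - baseline) / baseline > tolerance

# A site whose solve rate has been stable at 1%:
baseline = solve_rate(100, 10_000)          # 0.01
print(looks_like_attack(baseline, 0.011))   # False: normal wobble
print(looks_like_attack(baseline, 0.09))    # True: a big swing, in either direction
```

The point from the quote is that the absolute rate matters less than its stability: a 1% site staying at 1% is healthy, while a swing either way suggests bots failing challenges or solver services succeeding at scale.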
Astronomy
18. Current models of planet formation "predict that with so few heavier elements, the disks around stars have a short lifetime, so short in fact that planets cannot grow big. But Hubble did see those planets, so what if the models were not correct and disks could live longer?"
"The team used Webb to observe the massive, star-forming cluster NGC 346 in the Small Magellanic Cloud, a dwarf galaxy and one of the Milky Way's closest neighbors. This star cluster is also known to have relatively low amounts of heavier elements and served as a nearby proxy for stellar environments during the early Universe. Earlier observations of NGC 346 by Hubble revealed that many young stars in the cluster (~20 to 30 million years old) appeared to still have protoplanetary disks around them. This was also surprising since such disks were believed to dissipate after 2 to 3 million years."
"We see that these stars are indeed surrounded by disks and are still in the process of gobbling material, even at the relatively old age of 20 or 30 million years."
Economic consolidation
19. The latest "futurist" trend is... meatpacking plants closing? According to this report, meatpacking plants closed at an unprecedented rate this year. If you're wondering what the cause is, they attribute it to "rising livestock costs, workforce shortages, food safety violations and foodborne illnesses, and ongoing industry consolidation."
Moore's Law update
20. "Seagate's biggest-ever hard drive is finally here, coming with 32TB of capacity courtesy of the company's new heat-assisted magnetic recording (HAMR) technology."
32 TB -- wow, that's a lot. I guess Moore's Law for storage shows no sign of running out. Storing everything you type online, plus the GPS coordinates of all your movements in the physical world, has cost pennies for decades. Storage at this scale is probably for video-generation AI.
"A HAMR HDD has long been anticipated since Seagate started experimenting with the technology in 2007. For over a decade, Seagate has predicted HAMR-based drives would be released within years, but these predictions fell apart year after year until now."
"Seagate's main competition in high-capacity HDDs at the moment is Western Digital, which launched energy-assisted perpendicular magnetic recording (ePMR)-based drives in October, which included a 32TB model."