The Shape of the Thing
In October of 2023, I wrote about the “Shape of the Shadow of the Thing,” speculating on the Thing that AI might turn into in the coming years. I think we can see the Thing much more clearly now, and some of the consequences that come with it.

As I have been discussing in recent posts, we have entered a new phase of AI. After ChatGPT was introduced, human-AI work took the form of what I called co-intelligence, where humans would prompt AI back-and-forth to get help on tasks. Starting in late 2025, we entered a new era thanks to AI agents like Claude Code, OpenAI’s Codex, and OpenClaw. These are AI systems that you can just give work to, sometimes hours of human work, and get back reasonable and useful results in minutes. This is an era of managing AIs, rather than working with them.

This new approach to AI is the outcome of the rapid exponential improvement in AI abilities. That means you can’t understand where we are, and where we might be going, without understanding the increasing capability of AI. Exponential improvements are hard to visualize, so rather than charts or graphs, I want to start with otters.

If you have followed my writing on AI, you know about my Otter Test, where I challenge various AI image models to show a picture of an “otter on a plane using wifi.” As you can see below, the progress from 2022 (the year ChatGPT launched) to 2025 was rapid and remarkable.

So, what has happened in the time since that April 2025 image? With nearly perfect images, video has become the new frontier and has also seen exponential gains. To demonstrate, I gave the following prompt to the most advanced (and still unreleased in the US) AI video model from TikTok maker Bytedance:

A documentary about how otters view Ethan Mollick's "Otter Test" which judges AIs by their ability to create images of otters sitting in planes.

This is the very first result — definitely turn on your sound:

Aside from a single pronunciation mistake, this is pretty perfect, down to the fact that the otters are animated to have human-like expressions.

Of course, video models are cool, but they are not necessarily indicative of what useful agentic AI can do. So, if we look at the benchmarks of AI ability, do we see the same exponential curve? We certainly do in the most famous evaluation in AI today, the METR Long Tasks graph, which tries to measure AI progress by seeing how much human work an AI can complete autonomously with some measure of reliability. It has attracted its share of critics, and even METR has pointed out potential issues. But if you don’t like the METR graph, you will find most graphs of AI ability have that same curve.

As an example, I picked four hard and diverse AI tests and graphed progress over time in the image below. In the upper left are the scores on the Google-Proof Q&A benchmark, a test of knowledge where graduate students using Google only score 34% outside their field and 70% or so inside of it, but the best AIs now score 94%. Or look at GDPval, where industry experts judge AI versus experienced human performance on complex tasks, and where the latest AIs now reach or exceed parity with top-performing humans 82% of the time. The same pattern holds for Humanity’s Last Exam, a set of very hard problems written by college professors that require considerable expertise to answer. Or we can even use the ability of AI to solve puzzles (you can try the puzzles here; they are fun!).
Each shows a similar rapid gain in ability with few signs of slowdown, at least until they reach the top possible score on the test. Exponential graphs aside, it is important to recognize that all of these tests have their own flaws, and that AI remains jagged, capable of some tasks at a high level while messing up others. Further, despite these amazing capabilities on tests, companies are still very early in adopting AI, meaning that, as of yet, remarkably little has changed in most organizations.

But “most organizations” doesn’t mean every organization. We are already starting to see the first appearances of new approaches to organizing that take advantage of the new abilities of AI agents. A few weeks ago, a three-person team at StrongDM, a security software company focusing on access control, announced they had built a Software Factory — a way of working with AI agents that relies entirely on the AI to write, test, and ship production software without human involvement. The process includes two (quite radical) rules: “Code must not be written by humans” and “Code must not be reviewed by humans.” To power the Factory, each human engineer is expected to spend amounts equivalent to their salary on AI tokens, at least $1,000 a day.

The basic idea of the Factory is that it takes future product roadmaps, written by humans, and turns them into products. Coding agents use those roadmaps to build software while testing agents try out the software in a simulated customer environment (which the testing agents build as needed). The two sets of agents provide feedback to each other, looping back and forth until the results satisfy the AI. Then humans review the finished product, and the results are shipped to customers without anyone ever touching, or even seeing, the underlying code.
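To make that loop concrete, here is a minimal sketch of the kind of coding-agent/testing-agent back-and-forth described above. This is not StrongDM's implementation; the coding_agent, testing_agent, and software_factory functions are hypothetical stand-ins, stubbed out in plain Python so the control flow runs on its own.

```python
# Illustrative sketch of a build/test agent loop. The "agents" here are stubs,
# not real LLM calls, so the structure can be read and run in isolation.
from dataclasses import dataclass


@dataclass
class Feedback:
    passed: bool
    notes: str


def coding_agent(roadmap_item: str, prior_feedback: list[Feedback]) -> str:
    """Stand-in for a coding agent: turns a roadmap item (plus any test
    feedback from earlier rounds) into a candidate build."""
    revision = len(prior_feedback)
    return f"build of '{roadmap_item}' (revision {revision})"


def testing_agent(build: str) -> Feedback:
    """Stand-in for a testing agent: exercises the build in a (here, pretend)
    simulated customer environment and reports whether it is acceptable."""
    passed = "revision 2" in build  # toy rule: the third revision passes
    return Feedback(passed=passed, notes=f"tested {build}")


def software_factory(roadmap_item: str, max_rounds: int = 10) -> str:
    """Loop coding and testing agents back and forth until the tests are
    satisfied, then hand the finished build to humans for a product review."""
    feedback_history: list[Feedback] = []
    for _ in range(max_rounds):
        build = coding_agent(roadmap_item, feedback_history)
        feedback = testing_agent(build)
        feedback_history.append(feedback)
        if feedback.passed:
            return build  # humans review the product, never the code
    raise RuntimeError("agents never converged on a passing build")


if __name__ == "__main__":
    print(software_factory("add audit logging to access requests"))
```

The real Factory swaps these stubs for LLM-driven agents and a simulated customer environment, but the essential structure described in the post, generate, test, and loop until the tests are satisfied, is the same.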
There are obviously a lot of details here that make this approach work, and the StrongDM team has shared many of them publicly. They also invited in some smart outside observers to watch the Factory in operation and comment on what they saw, so you can read the accounts of Simon Willison and Dan Shapiro to get a better sense of the strengths and weaknesses of their approach. In many ways, however, the particular details of the Software Factory matter less than the fact that such radical experimentation with how we work is now not only possible, but likely necessary. AI is good enough to change how organizations operate, and the experimentation is just getting started, even as models continue to improve.

Practical agents, jagged exponential improvement, and the ability to radically experiment with the nature of work combine to form a sort of rolling and unpredictable environment for AI advances. As AI capability crosses thresholds, it unlocks radical new use cases that change people’s views, sometimes overnight, about what AI can do. At the same time, organizations experimenting with AI will figure out how to make it work for them, leading to sudden announcements about new strategies or large-scale shifts in which kinds of employees companies value most. Plus, as AI continues to improve, more policymakers will become interested in AI governance, creating conflicts with AI companies.

This isn’t speculation: we saw all of this happen in a single week. On February 22nd, a little-known financial firm, Citrini Research, published a fictional scenario about how AI adoption might destroy a number of established businesses by 2028. There were many elements in the piece that were clearly farfetched, but it struck a nerve on Wall Street, leading to major stock market price shifts. On February 26, financial services company Block announced 40% layoffs, implying this was due to AI. It is likely that the role of AI was greatly exaggerated, and that AI was merely used as cover for large-scale layoffs. And then, to cap off the week, on February 27 a very public conflict occurred between the Pentagon and AI company Anthropic over who should be able to control the rules for how Claude could be used by the government.

In a lot of ways, each of those cases was not what it first appeared to be. The Citrini report was a fictional scenario, the Block layoffs were not about AI, and the conflict over AI at war revolved around a number of complicated issues that are still not completely clear. But I think that single week is a good illustration of what the near future will feel like. Sudden revelations about AI capability leading to rapid market reactions. Increasingly real impacts of AI on jobs (even if there is a lot of debate over wh…