Autonomous Software Factories

Beyond the tipping point

With the launch of Claude Opus 4.5, we are undeniably beyond the tipping point for AI coding agents. The notion of "neural scaling laws", once AI community folklore, is now so commonplace that it has its own Wikipedia page. This is "The Bitter Lesson" in practice: more data, more compute, and scalable algorithms win against clever modelling tricks. As high-quality training data became a bottleneck ("there is only one internet"), AI labs turned to reinforcement learning from verifiable rewards (RLVR) to drive post-training improvements. RLVR works extremely well in verifiable domains like math and coding, and is most likely a major contributing factor to the step-change in coding ability seen in models at the Opus 4.5 level. On the other hand, the lack of verifiability in other domains (and arguably the lack of attention those domains get from major AI labs) is making LLM intelligence "jagged": LLMs can exhibit an extremely high level of intelligence on some tasks but fail miserably at others that no human would fail at. In other words, verifiability is the limiting factor, as several AI researchers have noted.
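To make "verifiable" concrete, here is a minimal sketch of what a verifiable reward for a coding task could look like: the reward is simply the fraction of test cases a candidate passes. The task, the test cases, and every name below (TEST_CASES, verifiable_reward, correct_add, buggy_add) are hypothetical and chosen for illustration only; this is not how any particular lab implements RLVR.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical test suite for an "add two integers" task:
# each entry is (arguments, expected result).
TEST_CASES: Sequence[Tuple[Tuple[int, int], int]] = [
    ((1, 2), 3),
    ((0, 0), 0),
    ((-4, 4), 0),
    ((10, -3), 7),
]


def verifiable_reward(
    candidate: Callable[[int, int], int],
    cases: Sequence[Tuple[Tuple[int, int], int]],
) -> float:
    """Return the fraction of test cases the candidate passes (0.0 to 1.0)."""
    passed = 0
    for args, expected in cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            # A crashing candidate earns no credit for this case.
            pass
    return passed / len(cases)


def correct_add(a: int, b: int) -> int:
    return a + b


def buggy_add(a: int, b: int) -> int:
    # Wrong whenever an argument is negative.
    return abs(a) + abs(b)


print(verifiable_reward(correct_add, TEST_CASES))
print(verifiable_reward(buggy_add, TEST_CASES))
```

The point of the sketch is that the score needs no human judgement: anyone can re-run the checks and get the same number. For tasks where no such check exists, there is nothing equivalent to optimise against.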

Software 3.0 and Software Factories

We are now entering a new paradigm of software development, which Andrej Karpathy calls "Software 3.0". If Software 1.0 is humans writing explicit code, and Software 2.0 is humans creating datasets, defining objectives, and programming through weights, then Software 3.0 is programming with LLMs through prompts, context, instructions, and examples. In Software 3.0, the program can improve itself by updating its Software 1.0 parts (the code), its Software 2.0 parts (the weights), or both.

Following the thread of Software 3.0, "The Bitter Lesson", and verifiability leads us towards an autonomous version of Software 3.0, which some refer to as "software factories". Human programmers are rapidly climbing the layers of abstraction, towards work that resembles architecting and coordinating a highly autonomous multi-agent factory far more than iterating with a single coding agent. In software factories, humans are not "looking at the code" but rather guiding the factory with high-level objectives. The level of autonomy of these factories is, just as with LLMs, going to be limited by verifiability. For example, we can think of three different types of software factories, in decreasing order of verifiability:

ML Engineer Software Factories: autonomous agents hill-climbing a well-defined objective function by developing machine learning models and systems.
Data Engineer Software Factories: autonomous agents creating integrations and pipelines, while maintaining "good data quality".
Full-stack Software Factories: autonomous agents researching, implementing and shipping new software product features to human and agent users.

As we move down this spectrum, verifiability decreases, feedback becomes noisier and more delayed, and achieving high levels of autonomy becomes significantly harder.

What's next?

The software engineering space is moving fast right now. We have gone from copy-pasting from StackOverflow, to copy-pasting from ChatGPT, to the LLM-embedded IDE, to coding agents, to autonomous (and human) orchestration. It's likely that the next version of coding agents developed by the major labs will have mechanisms for "closing the loop" through reward functions. This could be through reinforcement learning, beam search, evolutionary algorithms, or any other optimisation method that only needs to evaluate the reward, not differentiate it. The focus of the major labs is on obvious, verifiable problems like coding and math. The opportunity for startups is to 1) focus on niche domains not directly targeted by the labs, and 2) focus on the data and evaluation environments specific to those domains.
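As a rough illustration of "closing the loop", the sketch below runs a tiny evolutionary hill-climb against a black-box reward. Everything in it is an assumption made for illustration: the "program" is just a pair of integer coefficients, the reward is the negative total error on a hypothetical verification suite (CHECKS), and the names reward, mutate and close_the_loop are invented here. A real factory would mutate code or prompts and score them in a domain-specific evaluation environment.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical "program": coefficients (a, b) of f(x) = a * x + b.
Candidate = Tuple[int, int]

# Hypothetical verification suite: input/output pairs from the target f(x) = 3x - 2.
CHECKS = [(x, 3 * x - 2) for x in range(5)]


def reward(candidate: Candidate) -> float:
    """Black-box reward: negative total absolute error on the checks (0.0 is perfect)."""
    a, b = candidate
    return -float(sum(abs(a * x + b - y) for x, y in CHECKS))


def mutate(candidate: Candidate, rng: random.Random) -> Candidate:
    """Propose a nearby candidate by nudging each coefficient by -1, 0 or +1."""
    a, b = candidate
    return (a + rng.choice((-1, 0, 1)), b + rng.choice((-1, 0, 1)))


def close_the_loop(
    start: Candidate,
    reward_fn: Callable[[Candidate], float],
    steps: int,
    children_per_step: int,
    rng: random.Random,
) -> Tuple[Candidate, List[float]]:
    """Keep the best candidate seen so far; only reward evaluations are used, no gradients."""
    best, best_reward = start, reward_fn(start)
    history = [best_reward]
    for _ in range(steps):
        children = [mutate(best, rng) for _ in range(children_per_step)]
        for child in children:
            child_reward = reward_fn(child)
            if child_reward > best_reward:
                best, best_reward = child, child_reward
        history.append(best_reward)
    return best, history


best, history = close_the_loop(
    (0, 0), reward, steps=200, children_per_step=8, rng=random.Random(0)
)
print("best candidate:", best, "reward:", history[-1])
```

The loop never inspects how the reward is computed, which is why the quality of the evaluation environment matters so much: the search can only be as good as the signal it climbs. A simple hill-climb like this one can also stall before reaching a perfect score, which is one reason richer methods are of interest.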

In the not-so-distant future, it might be as common for a company to have its own autonomous software factory, focused on solving problems of importance to that specific company, as it is to have a company database today. The skill in high demand will be that of architecting and operating these systems while understanding them well enough to take responsibility for the output they produce. The guiding principle is becoming "you can outsource thinking but you can't outsource understanding". Machines providing the raw intelligence and humans providing the direction, agency and purpose - I think this might be what human-AI collaboration will look like for great teams solving hard problems.