Where AI Agents Actually Pay Off
Posted June 4, 2026 | Updated June 9, 2026
I am starting to get real leverage from AI agents.
Not theoretical leverage. Not "look, the chatbot wrote a function" leverage. I mean the kind where a messy voice note turns into a draft, a repo change, a test, a pull request, a live fix, a follow-up task, and a breadcrumb that gives the next agent more context.
That leverage is exciting. It is also a little cursed.
The cursed part is not that the models are secretly alive or that software engineers are all immediately obsolete. The cursed part is more boring and more important: the economics are starting to work in weird places, especially for individuals and very small teams, and they do not work everywhere. The window is small. The workflow changes are nontrivial. The token bill can get gross fast. And if you do not build the surrounding system, agents can easily become an expensive way to generate unfinishedness.
This is where I think most agent discourse gets a little too smooth. People ask "is AI faster?" as if there is one answer.
Sometimes it is slower. Sometimes the model churns. Sometimes the first answer is plausible but wrong. Sometimes the agent burns twenty minutes going in the wrong direction.
The interesting question is not whether one agent is always faster than one human on one task. The interesting question is:
What happens when a human can specify, run, review, and improve many bounded execution loops in parallel?
That is where the ROI starts showing up. It is also where the danger starts showing up.
George Hotz wrote the sharp negative version of this in "The Eternal Sloptember." His argument, as I read it, is not just "AI code bad." It is that agent output frontloads the impressive part, leaves the hard polish to the human, and produces artifacts that are broken in ways old quality proxies do not catch.
I do not fully buy the permanent claim that agents cannot program, but I do buy the organizational warning. If your feedback loops are slow and you are not carefully error-correcting the output, agents will raise the volume of mediocre work much faster than they raise the quality of good work.
The question is not "agents: yes or no?" The question is "who can absorb the leverage without degrading their own system?"
The ROI Is A System Property
The useful unit of work is not "the model."
The useful unit is the whole system:
Capability = model x harness x tools x environment x evaluator
The model matters. Obviously. A stronger model listens better, repairs better, and survives ambiguity better. GPT-5.5, in particular, has felt like a genuinely good foundational engineering model in my workflow. I can hand it a real codebase, a weird constraint, and a fuzzy product problem, and get back something I can review instead of something I have to babysit.
But different tasks want different combinations. Codex/GPT-5.5 feels deeply steerable for repo engineering, while other models feel better suited for UX exploration or visual design.
The harness and tools matter too. A model with a terminal, browser, GitHub access, docs, image inspection, and a real test suite is a different creature from the same model in a textbox. Tool access changes the shape of cognition. The agent can externalize uncertainty into the world: read the file, run the command, inspect the screenshot, check the deployed page.
The environment matters. A legible repo is agent fuel. Good scripts are agent fuel. Clear boundaries are agent fuel. Stable design primitives, typed connectors, preview/apply workflows, and boring test commands are all forms of intelligence that do not live in the model weights.
And the evaluator matters most of all. A task only becomes delegable when there is a way to tell whether it worked.
Typecheck. Test. Build. Screenshot. Read back the external system. Ask a human to review a tight diff. Run an eval. Compare against a rubric. Verify the live URL. Whatever. Without an evaluator, the agent is not really operating. It is describing completion instead of proving it.
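Here is a minimal sketch of what "proving it" can look like in practice, assuming the repo exposes its checks as shell commands. The names `CheckResult`, `run_check`, and `evaluate` are hypothetical, and the commands in the example are placeholders rather than any real project's scripts.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    evidence: str  # captured output, kept so a reviewer can read the proof


def run_check(name: str, command: list[str], timeout_s: int = 600) -> CheckResult:
    """Run one verification command and keep its output as evidence."""
    try:
        proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return CheckResult(name, False, f"timed out after {timeout_s}s")
    except FileNotFoundError as exc:
        return CheckResult(name, False, f"command not found: {exc}")
    return CheckResult(name, proc.returncode == 0, (proc.stdout + proc.stderr).strip())


def evaluate(checks: dict[str, list[str]]) -> tuple[bool, list[CheckResult]]:
    """A task counts as done only if at least one check exists and all checks pass."""
    results = [run_check(name, cmd) for name, cmd in checks.items()]
    done = bool(results) and all(r.passed for r in results)
    return done, results


if __name__ == "__main__":
    # Placeholder commands; swap in the repo's real typecheck, test, and build.
    checks = {
        "tests": [sys.executable, "-c", "print('2 passed')"],
        "build": [sys.executable, "-c", "import sys; sys.exit(1)"],
    }
    done, results = evaluate(checks)
    for r in results:
        print(f"{r.name}: {'PASS' if r.passed else 'FAIL'} {r.evidence}")
    print("done" if done else "not done")
```

The design choice worth noticing is that an empty set of checks evaluates to "not done." No evaluator means no completion, only a description of one.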
Benchmarks Measure Systems, Not Just Models
If you want proof that ROI is a system property, look at how the industry is starting to measure it.
I am increasingly suspicious of takes that collapse everything into a raw model leaderboard.
Coding benchmarks are useful, but the closer you get to real software engineering, the less you are measuring only "the model." You are measuring a model inside a scaffold, with a prompt, a tool policy, a repo, a timeout, an environment, tests, hidden verifiers, retry rules, and a definition of success.
In other words, the evaluated object is really:
model + harness + task + verifier
OpenAI recently stopped reporting SWE-bench Verified because, in its audit, many remaining failures were test or contamination problems rather than clean signals of frontier coding ability. OpenAI now recommends reporting SWE-bench Pro until better uncontaminated evals exist (OpenAI, February 2026).
SWE-bench Pro is a serious attempt to raise the bar: 1,865 tasks across 41 repositories, including public, held-out, and commercial subsets. It uses human-augmented problem statements, Docker environments, fail-to-pass and pass-to-pass tests, and a public/private split meant to reduce contamination (Scale Labs paper, methodology).
DeepSWE is interesting for a different reason. It uses 113 original long-horizon tasks across 91 repositories and five languages, with hand-written behavioral verifiers. Its public table says all models are run on mini-swe-agent for consistency, and the current June 2026 snapshot reports gpt-5.5[xhigh] at 70% +/- 4%, ahead of claude-opus-4.8[max] at 58% +/- 5% (DeepSWE, repository).
That last sentence is the whole article hiding in a benchmark note.
"All models run on mini-swe-agent" is not a minor implementation detail. It is a claim about the harness. It means the score is not "GPT-5.5 in the abstract" or "Claude in the abstract." It is a model, with a particular reasoning setting, inside a particular agent scaffold, under a particular verifier regime.
That is good. We need that kind of specificity.
But it also means benchmark scores should be read as system scores. DeepSWE itself has already drawn methodological audits around reproducibility, denominators, and verdict receipts, which is exactly the kind of pressure a serious benchmark should invite (June Kim's audit). The point is not that one benchmark's chart is valid and another's is invalid. The point is that the harness is part of the measurement.
If my real workflow uses Codex with repo memory, browser verification, local scripts, PR gates, subagents, and manual review, then a benchmark using mini-swe-agent tells me something. It does not tell me everything. Likewise, if an enterprise agent runs inside a locked-down internal platform with different tools, data boundaries, and approval gates, the model leaderboard is only a starting prior.
That also matches my anecdotal experience with GPT-5.5. I have not seriously used Opus 4.8 yet, so I do not want to overstate the comparison. But compared with the 4.6 and 4.7-era models I was using before, GPT-5.5 has felt better at foundational engineering: reading the system, preserving constraints, building the primitive, and staying steerable over a long repo task. DeepSWE is not proof of my workflow, but it is at least evidence in the same direction.
The benchmark I actually want is closer to:
model + harness + tools + context + verifier + cost + review burden
That is less elegant than a leaderboard.
It is also much closer to reality.
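One way to make that concrete is to record the whole system next to every score. The sketch below is an illustration, not any benchmark's real schema: `SystemUnderTest`, `RunResult`, and `differing_fields` are hypothetical names, and the tool and context values are placeholders loosely based on the setups described above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SystemUnderTest:
    model: str
    harness: str
    tools: tuple[str, ...]
    context: str
    verifier: str


@dataclass(frozen=True)
class RunResult:
    # A score only means something together with the system that produced it.
    system: SystemUnderTest
    pass_rate: float
    cost_usd_per_task: float
    review_minutes_per_task: float


def differing_fields(a: SystemUnderTest, b: SystemUnderTest) -> list[str]:
    """List the parts of the system that differ between two setups."""
    return [f.name for f in fields(SystemUnderTest) if getattr(a, f.name) != getattr(b, f.name)]


# Placeholder descriptions of a benchmark setup and a day-to-day workflow.
benchmark = SystemUnderTest(
    model="gpt-5.5[xhigh]",
    harness="mini-swe-agent",
    tools=("shell",),
    context="task statement only",
    verifier="hand-written behavioral verifiers",
)
workflow = SystemUnderTest(
    model="gpt-5.5[xhigh]",
    harness="Codex",
    tools=("terminal", "browser", "GitHub"),
    context="repo memory",
    verifier="PR gates and manual review",
)

print(differing_fields(benchmark, workflow))
# ['harness', 'tools', 'context', 'verifier']
```

Same model, four differences. That is roughly how far a leaderboard number can sit from a real workflow.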
The Factory Game
So how do you actually build that system?
The closest metaphor I have is an old Minecraft automation modpack. You do not start with a giant perfect machine. You start by punching trees. Then you build a tree farm, which feeds a better machine, which unlocks a better material, until the whole base produces things you used to gather manually.
Good agent work feels like playing the factory game.
People sometimes treat manual intervention as failure, as if the AI only counts if it runs entirely alone and returns with a perfect artifact. That is the wrong fantasy. The fastest path usually looks like this:
- Ask for a bounded change.
- Let the agent inspect, edit, and test.
- Manually poke the thing.
- Notice the failure.
- Turn the failure into a durable guardrail.
That last step is the compounding step.
If I catch a bug and only fix the bug, I got one fix. If I catch a bug and then add a test, a lint rule, a PR gate, a repo instruction, a skill, or an eval, I changed the future working conditions. Every later agent now has a slightly narrower path to repeat the same mistake.
For example, in one repo I added a PR compliance pattern that is almost comically literal: repository skills contain attestation words, and the agent has to include those specific words in the PR body's JSON to prove it read the instructions. The CI gate checks the JSON. If the branch changes, the head SHA has to be updated. If the agent tries to hand-wave the process, the gate fails.
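A rough sketch of what such a gate could check, under stated assumptions: the JSON lives in an HTML comment in the PR body, it has `head_sha` and `attestations` keys, and each skill maps to one attestation word. The format, the key names, and the `check_pr_body` function are all hypothetical. This is not the actual gate from that repo.

```python
import json
import re

# Hypothetical format: the PR body carries a JSON object inside an HTML comment.
COMPLIANCE_BLOCK = re.compile(r"<!--\s*compliance\s*(\{.*?\})\s*-->", re.DOTALL)


def check_pr_body(body: str, required_words: dict[str, str], head_sha: str) -> list[str]:
    """Return a list of gate failures; an empty list means the gate passes."""
    match = COMPLIANCE_BLOCK.search(body)
    if match is None:
        return ["no compliance block found in PR body"]
    try:
        claim = json.loads(match.group(1))
    except json.JSONDecodeError as exc:
        return [f"compliance block is not valid JSON: {exc}"]

    failures = []
    # The claim must name the current head commit, so a new push invalidates it.
    if claim.get("head_sha") != head_sha:
        failures.append("head_sha is stale; update it after pushing new commits")
    attested = claim.get("attestations")
    if not isinstance(attested, dict):
        attested = {}
    # Each skill file hides a word; echoing it is evidence the file was read.
    for skill, word in required_words.items():
        if attested.get(skill) != word:
            failures.append(f"missing or wrong attestation for {skill}")
    return failures


body = 'Fix nav bug.\n<!-- compliance {"head_sha": "abc123", "attestations": {"testing": "marmalade"}} -->'
print(check_pr_body(body, {"testing": "marmalade"}, "abc123"))  # []
print(check_pr_body(body, {"testing": "marmalade"}, "def456"))
```

In CI, a non-empty failure list would fail the job. The second call shows the stale-SHA case: the attestation is right, but the branch has moved.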
It is silly. It works.
You do not need the model to become careful by default. You need the environment to make the desired behavior easier to do than to skip.
You spend your time making primitives. You build scripts, evals, checklists, prompts, skills, adapters, schemas, preview flows, and review surfaces. None of those things are the final artifact. They are the production line.
This is why the highest-leverage work does not always look like feature work. Sometimes the right move is a script, a DSL, a connector, a preview/apply contract, a replayable eval, a design-system rule, a browser smoke test, a publishing script, or a tiny internal language that lets the agent express intent safely.
Mitchell Hashimoto's "Building Block Economy" gets at this from another angle: agents are extremely good at gluing together high-quality, well-documented building blocks. That matches my experience. The better the primitive, the more useful the agent.
My current rule is simple: if a foundation changes a class of work from undelegatable to delegatable, strongly consider building it.
That is the real asset: the growing list of things I can safely hand off.
Every new bullet point on that list compounds. "Can update this doc safely." "Can test this route in a browser." "Can inspect this spreadsheet and propose a diff." "Can ship this kind of UI fix if a screenshot check passes." Each bullet point means a future thought has somewhere to go.
This is why the early phase can feel so slow. You are not only doing the task. You are building the language that makes the next task delegatable.
It also means not building fake foundations. An abstraction that does not reduce future risk is just ceremony.
Parallelism And The Execution Horizon
In serial, agents are not as magical as people want them to be.
If I sit and watch one agent do one thing, I still have to wait. I still have to review. I still have to catch drift. I still have to close the loop. Sometimes I could have done the task myself faster.
The return starts to make sense when the work can run in parallel.
There is the clean version of parallelism: normal decomposition. You write a planning document, which acts as shared state. You split a project into slices. You give each slice a narrow success condition. You periodically re-ground in main.
That is often the right move.
Then there is the less clean, more honest version of parallelism: ambient multiplexing.
I am literally dictating parts of this article via voice notes while agents are fixing other things in the background. I am doing it because I am impatient, and because if I care about maximizing output, the incentive is obvious: keep useful work in flight. While one thread builds, another reviews, a third researches docs, a fourth waits on CI, and I use the dead air to think about the next thing.
This changes the latency math. If one task takes 40 minutes instead of 20 and I am watching it in serial, that is painful. If I have several bounded loops running, the slower task matters less, because the real bottleneck is no longer generation. The bottleneck becomes review.
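A toy model of that latency math, with loud assumptions: every task takes the same time to generate, every result takes the same time to review, and one human does all the reviewing. The function name and the 40 and 10 minute figures are illustrative, not measurements.

```python
def tasks_per_hour(loops: int, gen_minutes: float, review_minutes: float) -> float:
    """Finished, reviewed tasks per hour under a simple two-stage model."""
    generated = loops * 60 / gen_minutes   # what the agents can produce
    reviewable = 60 / review_minutes       # what one human can actually absorb
    return min(generated, reviewable)


gen_minutes, review_minutes = 40, 10  # illustrative numbers only
for loops in (1, 4, 8):
    rate = tasks_per_hour(loops, gen_minutes, review_minutes)
    print(f"{loops} loops: {rate:.1f} reviewed tasks/hour")
# 1 loops: 1.5 reviewed tasks/hour
# 4 loops: 6.0 reviewed tasks/hour
# 8 loops: 6.0 reviewed tasks/hour

# Review saturates once loops reach gen_minutes / review_minutes.
print(gen_minutes / review_minutes)  # 4.0
```

Past four loops in this example, extra agents add nothing but a longer review queue. Shrinking review time moves the ceiling; adding generation does not.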
Voice is a huge part of this for me.
I get far more done when I can talk through the mess. A good voice dump can carry intent, priority, frustration, constraints, taste, and emotional salience in one sloppy packet. Typing can do that too, but speech catches the thought while it is still alive.
That matters because agents are hungry for context and I am not always willing to produce a perfect written brief before starting. The workflow that works is closer to:
- Ramble.
- Let the agent structure the ramble.
- Correct the structure.
- Split it into delegatable slices.
- Run the slices.
- Review the evidence.
Input modality changes throughput.
So does reading. If agents can create more output than I can absorb, the next bottleneck is not generation. It is review bandwidth. That is why I keep playing with speed readers and faster reading interfaces. Not because reading text one word at a time is the grand future of civilization. Because the interface between "agents produced a lot of stuff" and "Rico understands what happened" is now a serious part of the system.
As your supported execution rate rises, you eventually hit the execution horizon: the point where agents can generate output faster than you can prioritize and review it.
That used to sound like a productivity problem. In an agentic workflow, it becomes infrastructure. You need to know what is running, what is done, what needs review, what is blocked, what can be delegated next, and what should be killed before it drifts.
The job becomes orchestration. You need an interface for Now, Running, Review, Blocked, and Next. Without that visible work surface, leverage quickly decays into fragmentation.
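The work surface does not need to be fancy. A minimal sketch, assuming a flat list of items and the five lanes named above; `Lane`, `WorkItem`, `submit_for_review`, and `render_board` are hypothetical names, not an existing tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Lane(Enum):
    NOW = "Now"
    RUNNING = "Running"
    REVIEW = "Review"
    BLOCKED = "Blocked"
    NEXT = "Next"


@dataclass
class WorkItem:
    title: str
    lane: Lane
    success_condition: str  # how we will know it worked
    evidence: str = ""      # proof of the success condition, filled in before review


def submit_for_review(item: WorkItem, evidence: str) -> None:
    """Only allow an item into Review if it carries proof of its success condition."""
    if not evidence.strip():
        raise ValueError(f"{item.title!r} has no evidence for: {item.success_condition}")
    item.evidence = evidence
    item.lane = Lane.REVIEW


def render_board(items: list[WorkItem]) -> str:
    """Render every lane, including empty ones, so nothing in flight is invisible."""
    by_lane = defaultdict(list)
    for item in items:
        by_lane[item.lane].append(item)
    lines = []
    for lane in Lane:
        lines.append(f"{lane.value} ({len(by_lane[lane])})")
        lines.extend(f"  - {item.title}" for item in by_lane[lane])
    return "\n".join(lines)


items = [
    WorkItem("fix nav overflow", Lane.RUNNING, "screenshot check passes"),
    WorkItem("update install docs", Lane.NEXT, "links resolve"),
]
submit_for_review(items[0], "screenshot check passed on preview build")
print(render_board(items))
```

The one rule encoded here is that an item cannot enter Review without evidence, which keeps the review lane from filling up with claims instead of proof.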
The Temporary Subsidized Window
This workflow requires taste, error correction, and executive function. That is why agents may currently be a much better deal for high-agency individuals and tiny teams than for the average large organization.
Large organizations have advantages: money, distribution, legal cover, procurement, internal data, and teams of specialists. But they also have slow feedback loops. The person prompting may not own the architecture. The person reviewing may not understand the product context. The person paying the token bill may not see the cleanup burden. The person measuring productivity may count output instead of coherence.
That is how you get the Sloptember failure mode: more code, more features, more artifacts, more surface area, and less understanding.
Small teams have a different advantage. The loop can be brutally short: feel a roadblock, decide whether the roadblock is recurring, ask an agent to build the primitive that removes it, manually test the new path, and use the improved path immediately on the next task.
That loop is hard to buy with headcount.
This is amplified by the weirdness of current token economics. The $200-ish personal AI plan is a strange object. OpenAI introduced ChatGPT Pro as a $200 monthly plan for scaled access to its best models and tools, and the larger industry has been playing with similar heavy-user tiers (OpenAI, December 2024). For an individual doing personal work, contract work, or small agency work, that can feel like access to a subsidized compute well.
This creates funny incentives.
If I am operating as an individual, I may be able to pour a lot of agentic compute into my own work. If I am inside an enterprise, I may not be allowed to use that same personal compute on company code or data. The enterprise may need API billing, compliance, data controls, admin policy, audit logs, and a vendor relationship. That can be the correct boundary, but it changes the economics.
And the scale can get strange quickly. I am not a casual user of this stuff right now. I push close to the weekly Codex budget because I keep thinking, forking, reviewing, and building. Looking at my own usage, a month of this can start to look like something on the order of 20 billion tokens. Priced as raw API-style compute, depending on the model mix, that can look like maybe $15,000 of monthly compute value.
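The arithmetic behind those two figures is worth spelling out. The token count and dollar value are the rough estimates above; the $200 plan price is an assumption that the usage sits on the $200-ish tier.

```python
tokens_per_month = 20_000_000_000  # "on the order of 20 billion tokens"
api_value_usd = 15_000             # "maybe $15,000 of monthly compute value"
plan_price_usd = 200               # assumption: the $200-ish plan tier

millions_of_tokens = tokens_per_month / 1_000_000
blended_usd_per_million = api_value_usd / millions_of_tokens
subsidy_multiple = api_value_usd / plan_price_usd

print(f"implied blended rate: ${blended_usd_per_million:.2f} per million tokens")
print(f"compute value vs plan price: {subsidy_multiple:.0f}x")
# implied blended rate: $0.75 per million tokens
# compute value vs plan price: 75x
```

Both inputs are estimates, so the outputs are only order-of-magnitude. Even at a tenth of that multiple, the gap between plan price and compute value would still be large.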
I am treating that as mostly free right now because, for me, it effectively is. That is absurd. It is also part of why the window feels temporary. If I had to pay the unsubsidized bill directly, the ROI math would get much harsher, much faster.
Eventually, some subsidies will go away or get priced more precisely. When that happens, the agent workflows that survive will be the ones where the value is measurable enough to justify the total system cost.
And total cost is not just tokens.
Total cost includes:
- model spend
- latency
- retries
- duplicated work
- review burden
- merge conflicts
- bad abstractions created too quickly
- security and data risk
- the emotional tax of tracking too many half-finished branches of intention
A cheap model that loops forever is expensive. A premium model that solves the task once and teaches you how to make the cheap route reliable may be cheap. A fast model with the right tool and verifier may beat both.
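A back-of-the-envelope way to compare routes is expected cost per accepted result. This sketch assumes each attempt succeeds independently with a fixed probability, so the expected number of attempts is one over the success rate, and that every attempt costs review time whether or not it succeeds. All names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Route:
    name: str
    model_cost_per_attempt: float      # USD of tokens per attempt
    review_minutes_per_attempt: float  # human time spent checking each attempt
    success_rate: float                # fraction of attempts that pass review


def expected_cost_per_accepted(route: Route, human_usd_per_hour: float) -> float:
    """Expected total cost (tokens plus review time) to get one accepted result."""
    if not 0 < route.success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    review_cost = route.review_minutes_per_attempt * human_usd_per_hour / 60
    per_attempt = route.model_cost_per_attempt + review_cost
    # Independent attempts: expected attempts until success is 1 / success_rate.
    return per_attempt / route.success_rate


# Illustrative numbers only.
routes = [
    Route("cheap model, loops a lot", 0.20, 10, 0.25),
    Route("premium model, lands it", 3.00, 10, 0.90),
]
for route in routes:
    cost = expected_cost_per_accepted(route, human_usd_per_hour=90)
    print(f"{route.name}: ${cost:.2f} per accepted task")
# cheap model, loops a lot: $60.80 per accepted task
# premium model, lands it: $20.00 per accepted task
```

With these made-up inputs the token price barely matters. Review time dominates, and the route that fails less wins even at fifteen times the token cost per attempt.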
This is why my model strategy has shifted away from "frontier by default" and toward cost, speed, and sufficient capability. Premium models are still important. They are teachers, judges, ambiguity resolvers, and escalation paths. But ordinary workflows should move toward the fastest route that crosses the quality bar at the lowest acceptable review burden.
That is the actual economic game.
Not "which model is smartest?"
"Which route gets this class of work done with enough correctness, taste, speed, and cost discipline that I would delegate it again?"
That matters personally because "just get a job" does not feel like the stable fallback it is supposed to be right now. Companies are reluctant to hire. Maybe because budgets are weird. Maybe because AI has made everyone unsure how many people they need. Maybe because the middle of the labor market is just having a bad time.
Whatever the reason, it changes the calculation. If the old bargain is less available, then ownership starts to look less like an idealized founder story and more like a practical survival strategy.
That is the uncomfortable capitalism part. One way to reduce dependence on someone else's allocation decision is to own more of the upside. Not everyone can take that risk. Not everyone should. But if you have an idea, a little room to be wrong, and enough taste to keep the machine pointed at something real, this is a pretty interesting window.
There is still a large gap between "AI features" and AI-native products. Most software is still shaped like old software with a chat box taped to the side. Whole experiences can be rethought around voice, ambient context, agent-readable state, reviewable diffs, previews, receipts, and interfaces where chat is one input mode instead of the entire product. Product taste matters a lot here, because the winning interaction pattern is probably not "same app, but with more tokens."
That gap will not stay open forever. The patterns will get copied. The market will saturate. Tokens may get priced more honestly. Frontier intelligence may stay expensive even if everyday intelligence gets cheaper. So the question becomes: what can a small team build while the leverage is temporarily this weird?
The Human Moves Upstream And Downstream
The optimistic version of agents is not that humans disappear.
The optimistic version is that humans move.
Upstream, into intent, taste, architecture, decomposition, and deciding what is worth doing.
Downstream, into review, verification, synthesis, and closure.
The middle gets more delegable. Not all of it. Not uniformly. Not safely by default. But enough of it that the shape of work changes.
This is also why I am wary, and why I do not want this to read like a simple argument for acceleration.
I am not exactly happy about all of this. I am trying to understand it because it is where the work seems to be going, and because I need a livelihood. I would like to keep some version of middle-class security. I would like to afford health care. I would like housing and assets to feel less out of reach than they do.
So yes, I am learning the game. I am trying to get good at the strange new workflow. I am building the primitives and the boards and the review loops. But that is not the same as believing the social direction is cleanly good.
I would not die for this shit. If, collectively, society looked at the tradeoffs and decided "eh, maybe not," I would agree with a lot of the hesitation.
Individually, though, in the world we are actually in, I do not think I can afford to pretend the leverage is not real. The same thing that helps indie builders and tiny agencies can also help already-powerful actors with capital, distribution, compute, and data. If compute access becomes more capital-gated over time, the leverage gap can widen.
So I do not want to make this sound clean.
Agents are not free leverage. They are leverage with a control problem.
The control problem is not only alignment in the sci-fi sense. It is much more ordinary:
- What did I ask for?
- What is running?
- What changed?
- How do I know it worked?
- What should become a reusable primitive?
- What should be thrown away?
- What did this cost?
- Would I delegate this again?
Those are managerial questions, philosophical questions, and software questions all at once.
For coding and engineering work, the current sweet spot looks something like this:
Agents pay off when the task is bounded, the repo is legible, the tools are available, the success condition is verifiable, and the result can be reviewed in a tight loop.
They pay off more when tasks can run in parallel without coordination chaos.
They pay off even more when each failure becomes a durable improvement to the environment: a test, a script, an eval, a clearer tool, a skill, a better primitive, a sharper instruction.
They pay off most when you stop treating them like magic employees, and start treating them like an execution substrate that compounds with your taste.
That is less glamorous than the usual pitch.
It is also more actionable.
Do not just burn tokens.
Build great foundations.