Agents Are a Product, Not a Sum

I have swapped the model behind my personal assistant maybe a dozen times in the past year. Each upgrade was measurably smarter. And each time, the day-to-day experience barely moved.

That gap bothered me until I wrote down what an agent actually is.

An agent's value is context × intelligence × abilities — what it can perceive, how well it can reason, and what it can do. It is a product, not a sum, and the weakest factor caps the whole thing.

Once you see the multiplication sign, the last two years of the industry look strange. Almost all the money, the benchmarks, and the discourse went into the middle factor. The other two are starving, and they are exactly where the leverage is now.

The formula

Write it out:

agent  =  context  ×  intelligence  ×  abilities
          (perceive)   (reason)        (act)

Figure: the formula as a diagram, with context, intelligence, and abilities labeled perceive, reason, and act.

The structure matters more than the terms. In a sum, you can compensate: pile enough IQ on top and a blind, handless agent still scores well. In a product, you cannot. If context is near zero — the agent knows nothing about your situation — it does not matter how brilliant the reasoning is; brilliance applied to nothing is nothing. If abilities are near zero — it can only emit text — then every insight still dies in your clipboard.

This is not a metaphor I picked for rhetorical effect. It matches the felt experience. A model upgrade improves the one factor that was already at nine out of ten and leaves the factors near zero where they were. The product barely moves, and that is exactly why my dozen model swaps changed so little.
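
To make that concrete, here is a minimal sketch in Python. The 0-to-10 scale and the starting scores are assumptions I picked for illustration, not measurements of any real agent.

```python
from math import prod


def agent_value(context: float, intelligence: float, abilities: float) -> float:
    """Multiplicative model from the formula: context x intelligence x abilities."""
    return prod((context, intelligence, abilities))


# Hypothetical starting point: strong reasoning, weak perception and action.
baseline = agent_value(context=1, intelligence=9, abilities=1)

# A model upgrade nudges the factor that was already high.
smarter = agent_value(context=1, intelligence=9.2, abilities=1)

# The same agent given one more point of context instead.
better_context = agent_value(context=2, intelligence=9, abilities=1)

# Relative improvement of each option over the baseline.
relative_gain_smarter = smarter / baseline - 1
relative_gain_context = better_context / baseline - 1

print(f"smarter model:  {relative_gain_smarter:+.1%}")
print(f"better context: {relative_gain_context:+.1%}")
```

Under these assumed numbers the upgrade is worth about two percent, while a single point of context doubles the product.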

Everyone is funding the middle term

Reasoning is where the capital went because reasoning is where the benchmarks are. MMLU, SWE-bench, math olympiads — all of them isolate the middle factor and measure it in a vacuum, with context pre-packaged into a prompt and actions reduced to a diff.

I am not saying intelligence does not matter. It matters enormously, and for a while it was the binding constraint. But it is no longer the scarce factor. Frontier reasoning is strong, widely available, and getting cheap fast. The marginal point of IQ is the most expensive and least felt improvement you can buy.

Figure: bar chart comparing two agents scored on context, intelligence, and abilities. One scores 1, 10, 1, for a product of 10; the other scores 4, 4, 4, for a product of 64.

Give each agent twelve points to spread across the three factors. Spend them as 1, 10, 1 and you get the benchmark hero: a genius in a text box, product of ten. Spend them as 4, 4, 4 and you get something that feels alive, product of sixty-four. Same budget. The product does not care where the genius is.
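
The budget comparison can be checked by brute force. This sketch assumes whole-number scores of at least 1 per factor and a fixed budget of twelve points; it enumerates every split and keeps the one with the largest product.

```python
from itertools import product as cartesian

BUDGET = 12  # total points to spread across the three factors


def value(scores: tuple[int, int, int]) -> int:
    """Product of the three factor scores."""
    context, intelligence, abilities = scores
    return context * intelligence * abilities


# Every way to split the budget into three whole-number scores of at least 1.
splits = [
    (c, i, BUDGET - c - i)
    for c, i in cartesian(range(1, BUDGET), repeat=2)
    if BUDGET - c - i >= 1
]

best = max(splits, key=value)
benchmark_hero = (1, 10, 1)

print(best, value(best))                      # (4, 4, 4) 64
print(benchmark_hero, value(benchmark_hero))  # (1, 10, 1) 10
```

For a fixed sum, the product peaks at the even split, which is why the balanced agent wins on the same budget.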

What users feel next is not the increment from 9.0 to 9.2 on reasoning. It is whether the agent knows their situation and can do the thing. Those are the starved factors.

Perception is starving

Look at what a typical agent actually perceives: the text you type into a chat box. That is the entire sensory apparatus. Everything else about your situation — and almost everything relevant is "everything else" — has to be manually carried in by you, the human peripheral.

Meanwhile the real context sits in plain sight. The stack trace on your screen right now. The meeting you just finished. The three documents you had open while you were thinking about the problem. The fact that you have been bouncing between an editor and a dashboard for an hour, which says more about your current task than anything you would bother to type.

You cannot reason about what you cannot see. So the first place to spend is on senses:

Screen as context. Periodically capture the screen, deduplicate frames, run OCR on-device. Now "what was that error I saw this morning" is a query, not an archaeology project — and the agent can know what you are looking at the moment you ask for help.
A capture at the moment of intent. A global hotkey that attaches a screenshot of whatever you are looking at turns "let me explain my situation" into zero keystrokes.
Voice as a first-class input. Speaking is several times faster than typing and far less lossy than the compressed summaries we type into chat boxes. Push-to-talk from anywhere means context arrives at speaking bandwidth, not typing bandwidth.
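
The deduplication step in the screen pipeline is small enough to sketch. This is a minimal Python version under loud assumptions: frames arrive as raw bytes from some capture mechanism I do not show, and only exact consecutive duplicates are dropped. A real pipeline would probably want a perceptual hash to catch near-duplicates, and would run OCR on the frames that survive.

```python
import hashlib
from typing import Iterable, Iterator


def dedupe_frames(frames: Iterable[bytes]) -> Iterator[bytes]:
    """Yield a frame only when its bytes differ from the last frame kept."""
    last_digest = None
    for frame in frames:
        digest = hashlib.sha256(frame).digest()
        if digest != last_digest:
            last_digest = digest
            yield frame


# Stand-in byte strings for captured frames (hypothetical data).
captured = [b"editor", b"editor", b"dashboard", b"dashboard", b"editor"]
kept = list(dedupe_frames(captured))

print(kept)  # [b'editor', b'dashboard', b'editor']
```

Comparing only against the previous kept frame preserves the order of what you looked at, so the editor-to-dashboard bouncing stays visible in the record.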

None of this is exotic research. It is sensors, plumbing, and care about privacy. Which is partly why it gets skipped — it does not benchmark well.

Action is starving too

The other end of the formula is just as neglected. Most agents "act" by generating text, and then the last mile is you: copy the command, click through the flow, paste the draft into the email client, hit send. We built autonomous reasoning and kept manual execution. That demotes the agent to a drafting machine, and it caps the product just as hard as blindness does.

Abilities are concrete and enumerable:

Drive the browser and the desktop. Not by guessing at raw pixels — by reading the accessibility tree and operating on numbered elements, the way a screen reader does. The agent fills the form; you stop reviewing drafts of forms.
Touch the file system. Organize, rename, build, run the tests, open the result.
Connect to services. Calendars, mail, issue trackers — acting where the work actually lives instead of describing the work.
Run while you are away. A scheduled agent that wakes up at 9:00, does the research, and leaves a finished summary is acting on your behalf in the most literal sense.
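
Operating on numbered elements can be illustrated with a toy tree. The sketch below is an assumption-heavy stand-in: the `Node` type, the role names, and the set of interactive roles are hypothetical, and a real agent would read the tree from the platform's accessibility API rather than build it by hand.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy accessibility node (hypothetical shape, not a platform API)."""
    role: str
    name: str = ""
    children: list["Node"] = field(default_factory=list)


# Assumed set of roles the agent is allowed to operate on.
INTERACTIVE_ROLES = {"button", "textbox", "link", "checkbox"}


def number_elements(root: Node) -> dict[int, Node]:
    """Walk the tree in document order and number each interactive element."""
    numbered: dict[int, Node] = {}
    stack = [root]
    while stack:
        node = stack.pop()
        if node.role in INTERACTIVE_ROLES:
            numbered[len(numbered) + 1] = node
        # Reverse so the leftmost child is popped first.
        stack.extend(reversed(node.children))
    return numbered


def describe(numbered: dict[int, Node]) -> list[str]:
    """Render the numbered elements as lines a model can refer to by number."""
    return [f'[{n}] {node.role} "{node.name}"' for n, node in numbered.items()]


page = Node("form", "Login", [
    Node("heading", "Sign in to continue"),
    Node("textbox", "Email"),
    Node("textbox", "Password"),
    Node("button", "Sign in"),
])

lines = describe(number_elements(page))
print("\n".join(lines))
```

The model then answers with something like "type into 1, click 3", and the agent resolves the number back to a concrete element instead of guessing at coordinates.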

The gap between advice and outcome is exactly the abilities factor. An agent that tells you what to do is a consultant. An agent that does it is an employee.

Why now: intelligence got cheap

Here is the part that makes this a 2026 thesis and not a 2024 wish list.

Perception and action were not ignored because nobody thought of them. They were ignored because they were economically impossible. Continuously feeding screen captures through a frontier model at frontier prices is a luxury nobody sane would ship. Fanning out five parallel attempts at a task and keeping the best one — same. Letting an agent run all night on a maybe — same.

Then token prices fell by an order of magnitude, and things that were absurd became routine:

continuous screen understanding instead of one screenshot per question,
best-of-N attempts instead of one shot and a prayer,
background agents that wake on a schedule and work while you sleep,
an agent that orchestrates sub-agents for a single request.
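
Best-of-N is the simplest of these to sketch. In the Python below, `attempt` and `score` are hypothetical placeholders: a real attempt would be a full agent run and a real score would be a test suite or a grader. The attempts run one after another here to keep the sketch short; fanning them out in parallel changes the plumbing, not the idea.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


def best_of_n(
    attempt: Callable[[random.Random], T],
    score: Callable[[T], float],
    n: int = 5,
    seed: int = 0,
) -> T:
    """Run n independent attempts and keep the one with the highest score."""
    rng = random.Random(seed)
    candidates = [attempt(rng) for _ in range(n)]
    return max(candidates, key=score)


def attempt(rng: random.Random) -> int:
    """Placeholder attempt: a noisy guess standing in for one agent run."""
    return rng.randint(0, 100)


def score(guess: int) -> float:
    """Placeholder grader: closer to the (assumed) target of 42 is better."""
    return -abs(guess - 42)


best = best_of_n(attempt, score, n=5)
print(best, score(best))
```

The cost scales linearly with N, which is exactly why this only became routine once each attempt got cheap.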

The right response to cheap intelligence is not "the same chat, but cheaper." It is to take the budget that frontier pricing used to consume and spend it on the starved factors. You are not trading reasoning away — it stays strong. You are reinvesting the savings where the multiplication actually pays.

The trust tax

There is an honest objection here: maybe perception and action are starved for a reason. An agent that watches your screen is a privacy hazard; an agent that clicks buttons can click the wrong one. True, and worth taking seriously rather than waving away.

But the answer is design, not abstinence:

Sensitive perception is off by default and opt-in per capability.
Screen understanding runs on-device — frames and OCR text never need to leave the machine.
Consequential actions — sending, deleting, purchasing, submitting — require explicit confirmation, with a stop control always within reach.
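
Those defaults are easy to express as a policy object. The sketch below is one possible shape in Python, not a description of any shipping product: the `Policy` class, the capability names, and the action names are all assumptions for illustration.

```python
from dataclasses import dataclass, field

# Actions that must never run without an explicit yes (assumed list).
CONSEQUENTIAL = {"send", "delete", "purchase", "submit"}


@dataclass
class Policy:
    """Trust defaults: perception is opt-in, risky actions need confirmation."""
    enabled_capabilities: set[str] = field(default_factory=set)  # empty = all off
    stopped: bool = False  # the stop control

    def enable(self, capability: str) -> None:
        """Opt in to a single perception capability."""
        self.enabled_capabilities.add(capability)

    def may_perceive(self, capability: str) -> bool:
        return not self.stopped and capability in self.enabled_capabilities

    def may_act(self, action: str, confirmed: bool = False) -> bool:
        if self.stopped:
            return False
        if action in CONSEQUENTIAL:
            return confirmed
        return True


policy = Policy()
print(policy.may_perceive("screen"))         # False: off by default
policy.enable("screen")
print(policy.may_perceive("screen"))         # True: opted in
print(policy.may_act("send"))                # False: needs confirmation
print(policy.may_act("send", confirmed=True))  # True
```

Keeping the checks in one place means the safe answer is the default one: anything not explicitly enabled or confirmed is refused.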

Notice what this implies. The hard part of perception and action was never the model. It was always the trust engineering around the model. That is precisely why these factors are defensible in a way that raw intelligence no longer is: everyone rents the same reasoning, but earned permission to see and to act is built one careful default at a time.

The loop that compounds

The formula describes a single moment. What makes an agent yours is a fourth thing the formula hides: memory — context the agent writes back to itself.

Figure: the loop as a diagram. Perceive flows to reason, reason flows to act, act writes back into memory, and memory is injected into perceive on the next turn.

Every turn, what the agent perceived and what its actions accomplished can be distilled into durable notes — preferences, projects, the way you like things done — and injected as context on the next turn. Perception feeds memory; memory upgrades perception; better context makes every future multiplication start from a higher base. One-shot tasks become accumulated understanding instead of starting from zero every session. I have written before about wanting memory I can read and edit rather than a black box, and about treating context as the product — this loop is where both ideas cash out.
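
The loop itself fits in a few lines. This Python sketch assumes memory is a plain list of readable notes and that `distill` is some function that turns a turn's observations into a durable note; both are placeholders for illustration, and the reasoning and acting steps are left out.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Memory:
    """Durable notes kept as plain strings so a human can read and edit them."""
    notes: list[str] = field(default_factory=list)

    def write_back(self, note: str) -> None:
        """Store a note once; repeated observations do not pile up."""
        if note not in self.notes:
            self.notes.append(note)

    def inject(self, perceived: str) -> str:
        """Build this turn's context: remembered notes first, then fresh input."""
        return "\n".join([*self.notes, perceived])


def run_turn(memory: Memory, perceived: str, distill: Callable[[str], str]) -> str:
    """One pass of the loop: inject memory, (reason and act), write back."""
    context = memory.inject(perceived)
    # Reasoning and acting on `context` would happen here.
    memory.write_back(distill(perceived))
    return context


def distill(perceived: str) -> str:
    """Placeholder distiller: a real one would summarize, not just prefix."""
    return f"seen: {perceived}"


memory = Memory()
first_context = run_turn(memory, "user prefers tabs", distill)
second_context = run_turn(memory, "open file: main.py", distill)

print(second_context)
```

The second turn starts with what the first turn learned, which is the whole point: the base of the multiplication rises with every pass.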

The per-moment factors are being commoditized for everyone at the same rate. The compounding loop is the part that is personal, and the part that is defensible.

Closing thought

Stop asking how smart your agent is. Ask what it can see and what it can touch.

In a product of three factors, the genius locked in a text box still multiplies out to almost nothing.