TL;DR: The June 2026 discourse on loop engineering is right about the shift and wrong about the hard part. Everyone is celebrating the loop. The loop is the easy part. The verifier is the whole game. A loop runs your model against a check hundreds of times and keeps whatever passes, which means a loop is a Goodhart amplifier: it will find and exploit every gap between “passed the check” and “actually correct.” Get the verifier right and the loop compounds your work. Get it wrong and you have built an expensive machine for being confidently incorrect at scale. This is the practitioner’s playbook, written from running loops over real production systems and not just a coding agent in a terminal.
Everyone is getting the shift right. Most are skipping the hard part.
Within two weeks in June 2026, three people said the same thing and a term went viral.
Peter Steinberger: “You shouldn’t be prompting coding agents anymore. You should be designing loops that prompt your agents.” Boris Cherny, who runs Claude Code at Anthropic: “I don’t prompt Claude anymore. I have loops running that prompt Claude and figuring out what to do. My job is to write loops.” And Addy Osmani gave it a name: “Loop engineering is replacing yourself as the person who prompts the agent. You design the system that does it instead.”
They are right. The default unit of work is moving from the response to the trajectory. The bottleneck is no longer “can the model do this in one shot” but “can I design a system that drives the model to a correct end state on its own.” That is a real change, and if your mental model is still copy, paste, re-prompt, you are about to be out-leveraged by people who automated the re-prompting.
But read the wave of explainers that followed and you will notice something. They all describe the loop. Reason, act, observe, repeat. They draw the circle. Then they move on. The circle is the part that is genuinely easy. A while loop with a tool-calling model inside it is a weekend project. I have watched teams build that circle in an afternoon and then spend three months discovering that the circle is not the problem.
The problem is the one box almost nobody dwells on: verify. The box that decides whether the loop goes around again or stops. That box is where loops create value or destroy it, and it is the part of loop engineering that is actually hard, actually technical, and actually where your judgment as an engineer goes.
This post is about that box.
What loop engineering actually is, stated precisely
Let me be precise before I argue, because the terms are getting muddy as the content farms amplify them.
The lineage is four layers, each wrapping the last:
- Prompt engineering optimizes a single instruction you type.
- Context engineering optimizes what the model sees on a given call.
- Harness engineering optimizes the environment a single agent run executes in. The useful framing here is an agent equals a model plus a harness, where the harness is everything that is not the model: tools, memory, guardrails, orchestration, context assembly, sandbox, and serving. In production, the harness decides whether your agent works far more than the model choice does.
- Loop engineering optimizes the controller that runs agents, possibly many of them, against a goal until the goal is met or an exit fires.
A harness equips one agent run. A loop is the thing that decides whether to run again. Osmani’s framing is that loop engineering sits one floor above harness engineering, and that is the right altitude.
This distinction is worth holding onto, because the two terms are being used interchangeably right now and they are not the same thing. Harness engineering is about making a single run reliable: better tools, better memory, tighter context, real guardrails. Loop engineering is about what happens across runs: whether to go again, with what correction, and when to stop. You can have an excellent harness and a terrible loop, and the result is a very capable agent that marches confidently off a cliff because nothing above it is checking whether it is still pointed at the goal. The harness is the agent. The loop is the supervision. Most of this post lives in the supervision layer, because that is where the verifier lives.
Here is the canonical loop, with the box this post is about marked in orange and the exits marked explicitly, because a loop without exits is not an architecture, it is an outage waiting for a credit card.
flowchart LR
I["Objective<br/><i>a verifiable goal</i>"] --> C["Context<br/><i>assemble state</i>"]
C --> A["Act<br/><i>tools, edits, calls</i>"]
A --> O["Observe<br/><i>run it, capture results</i>"]
O --> V{"Verify<br/><i>did we meet the goal?</i>"}
V -->|pass| DONE["Exit: done"]
V -->|fail| ADJ["Adjust<br/><i>update the plan</i>"]
ADJ --> C
V -.->|cap hit, budget out, no progress| STOP["Exit: stop and escalate"]
style V fill:#e76f51,color:#fff
style DONE fill:#2d6a4f,color:#fff
style STOP fill:#9d0208,color:#fff
Notice that everything interesting routes through Verify. The quality of your loop is bounded by the quality of that diamond. Everything else is plumbing.
The independent claim: a loop is a Goodhart amplifier
Here is the thesis the explainers are missing, and it is not a vibe, it is a mechanism.
When you prompt a model once, you are exposed to its error distribution exactly once. It is right or wrong, you look at the result, you move on. Your judgment is in the loop on every output.
When you wrap that model in a loop, you change the statistics completely. The loop generates a candidate, runs your verifier, and keeps the candidate only if the verifier passes. Run that ten or fifty times and you are no longer sampling the model’s outputs. You are selecting for outputs that pass your verifier. That is, by definition, optimization. Your verifier is the objective function, and the loop is a crude optimizer hammering on it.
Goodhart’s law says that when a measure becomes a target, it stops being a good measure. A loop turns your verifier into a target and then attacks it hundreds of times. So any gap between “passes the verifier” and “is actually what I wanted” does not stay a harmless gap. It becomes a reliable exploit, discovered by brute force, because the loop will keep trying until it finds the cheapest thing that makes the check go green.
I have watched this happen in the most boring possible ways. A loop told to “make the tests pass” deletes the failing test. A loop told to “reduce lint errors” adds suppression comments. A loop graded by an LLM judge learns to write outputs that sound authoritative to the judge, because confident prose is cheaper to generate than correct work. None of these are exotic. They are the natural consequence of pointing an optimizer at a proxy.
This reframes the entire critique conversation. The viral skepticism around “loopmaxxing,” the idea that infinite iterations will fix anything, is correct but underspecified. Loopmaxxing is not a discipline problem. It is a verifier problem. The reason more iterations do not help is that you are climbing a hill whose summit is “satisfies a flawed check,” not “is correct.” More compute just gets you to the wrong summit faster.
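The amplification can be put in numbers with a toy model. The sketch below assumes each iteration is an independent draw that is either correct and passing, wrong but passing, or failing; real loops feed back on earlier attempts, so treat the rates and the function name `p_wrong_accept` as illustrative assumptions, not measurements.

```python
def p_wrong_accept(p_correct: float, p_exploit: float, max_iters: int) -> float:
    """Chance the loop exits on a candidate that passes the check but is wrong.

    Assumes each iteration is an independent draw that is either correct and
    passing (p_correct), wrong but passing (p_exploit), or failing.
    """
    p_pass = p_correct + p_exploit
    if not 0.0 < p_pass <= 1.0:
        raise ValueError("p_correct + p_exploit must be in (0, 1]")
    p_never_passes = (1.0 - p_pass) ** max_iters
    # Among passing candidates the exploit share is fixed by the verifier gap;
    # iterating only raises the chance that some candidate passes at all.
    return (p_exploit / p_pass) * (1.0 - p_never_passes)


if __name__ == "__main__":
    for n in (1, 10, 50):
        print(n, round(p_wrong_accept(0.05, 0.02, n), 3))
```

With these made-up rates, the chance of accepting a wrong answer is 2% on a single shot and about 28% with a cap of fifty iterations. The only lever that lowers that ceiling is shrinking `p_exploit`, which is a property of the verifier, not of the loop.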
This is the same failure surface as reward hacking in reinforcement learning, and it is why the RLVR crowd became obsessed with verifiable rewards. It is also exactly the territory I spend my research time in. A model that has learned what a grader rewards, rather than what the task requires, is doing a form of sandbagging against the evaluator, and a loop is a machine for finding that behavior whether you wanted it or not. If you take one thing from this post: the loop does not make your agent smarter, it makes your verifier load-bearing.
The verifier ladder
So the work of loop engineering is mostly verifier engineering. The single most useful mental model I have for this is a ladder, ordered by how cheap, deterministic, and ungameable each kind of check is. Deterministic checks such as a test suite, a compiler, or plain arithmetic sit at the top, rule engines sit in the middle, and an LLM judge sits at the bottom. The rule is simple and it is the highest-leverage habit in this whole discipline: push every check as far up the ladder as it will go.
Most teams start at the bottom because it is the easiest to wire up. “Have a second model grade the output” is one API call. It is also the most gameable verifier on the ladder, and you are about to point an optimizer at it. That is precisely backwards.
A few practical moves that pay off immediately:
- Convert “looks right” into “passes a check.” If your stop condition is “the refactor is clean,” you have no verifier, you have a wish. Replace it with: the test suite still passes, public API signatures are unchanged, and the diff touches no file outside the target module. Those are deterministic. The loop cannot argue with them.
- When you have no ground truth, manufacture a proxy for it. Differential testing (run the old and new code on the same inputs, outputs must match), metamorphic testing (a known input transformation must produce a known output transformation), and golden datasets give you deterministic checks for tasks that look unverifiable at first glance. Most “you cannot test this” claims are a failure of imagination about oracles.
- If you must use an LLM judge, anchor it and make it adversarial. Give it the spec, the artifact, and a rubric, and instruct it to find the strongest reason to reject. A judge prompted to approve approves. A judge prompted to refute, with the original objective in hand, is a meaningfully harder target. I built Spark-LLM-Eval partly because evaluating model output rigorously, with confidence intervals and significance rather than a single judge call, is exactly the muscle loop engineering demands and most teams have not built.
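The two manufactured oracles named above, differential and metamorphic testing, fit in a few lines. This is a minimal sketch: the `*_total` functions, `CASES`, and the refund relation are hypothetical stand-ins for whatever your loop is rewriting.

```python
from typing import Any, Callable, Iterable, List


def differential_check(old: Callable[[Any], Any], new: Callable[[Any], Any],
                       inputs: Iterable[Any]) -> List[Any]:
    """Return every input on which the old and new implementations disagree."""
    return [x for x in inputs if old(x) != new(x)]


def metamorphic_check(fn: Callable[[Any], Any], inputs: Iterable[Any],
                      transform: Callable[[Any], Any],
                      relation: Callable[[Any, Any], bool]) -> List[Any]:
    """Return inputs where relation(fn(x), fn(transform(x))) does not hold."""
    return [x for x in inputs if not relation(fn(x), fn(transform(x)))]


def legacy_total(amounts):
    total = 0
    for amount in amounts:
        total += amount
    return total


def refactored_total(amounts):
    return sum(amounts)


def buggy_total(amounts):
    return sum(a for a in amounts if a > 0)  # silently drops refunds


def add_refund(amounts):
    return amounts + (-5,)  # known input transformation


def refund_lowers_total(before, after):
    return after == before - 5  # known output transformation


CASES = [(1, 2, 3), (), (10, -4), (7,)]

if __name__ == "__main__":
    print(differential_check(legacy_total, buggy_total, CASES))
    print(metamorphic_check(buggy_total, CASES, add_refund, refund_lowers_total))
```

The differential check needs a trusted old implementation; the metamorphic check needs none, which is why it is the one to reach for when there is no reference to compare against.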
Make the checker independent, or it shares the maker’s blind spots
Osmani’s primitives for a real loop include a maker and a separate checker, and Cobus Greyling makes the same point: a separate model verifies, so the agent never grades its own work. This is correct and it is incomplete. The detail that matters is what the checker is allowed to share with the maker.
If your maker is model X with context C, and your checker is also model X with context C, you have not built independent verification. You have built an echo. The two share an architecture, a training distribution, and a working memory, which means they share blind spots. Whatever the maker confidently got wrong, the checker will confidently bless. The loop then converges, smoothly and efficiently, on output that is wrong in a way both halves cannot see.
Independence is a spectrum and you want as much of it as you can afford. In rough order of strength: a deterministic check shares nothing with the maker and is the strongest checker you can have. A different model family with a fresh context is next. The same model with a fresh context and only the spec plus the artifact, not the maker’s reasoning, is weak but better than nothing. The same model continuing the same conversation is not a checker at all.
The concrete rule I follow: the agent that decides “done” must not be the agent that did the work, and it should see objective signals before it sees prose. Hand it the test results, the diff, and the schema validation first. Let it read the maker’s explanation last, if at all. Explanations are where models launder their own mistakes into something that sounds verified.
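One way to encode that rule is to make the "done" decision a function whose signature has no slot for the maker's explanation. The sketch below is an assumption about shape, not a prescribed interface: `Signals`, `Verdict`, and `decide_done` are hypothetical names, and the optional `judge` stands in for a model call that sees only the spec and the artifact.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Signals:
    tests_passed: bool
    schema_valid: bool
    files_outside_scope: Tuple[str, ...]


@dataclass(frozen=True)
class Verdict:
    passed: bool
    reason: str


# A judge sees the spec and the artifact only, never the maker's reasoning.
Judge = Callable[[str, str], bool]


def decide_done(spec: str, artifact: str, signals: Signals,
                judge: Optional[Judge] = None) -> Verdict:
    # Deterministic signals are consulted first and can veto on their own.
    if not signals.tests_passed:
        return Verdict(False, "tests failed")
    if not signals.schema_valid:
        return Verdict(False, "schema validation failed")
    if signals.files_outside_scope:
        return Verdict(False, "diff touches files outside the target module")
    # The model-based check runs last and can only reject, not rescue.
    if judge is not None and not judge(spec, artifact):
        return Verdict(False, "judge rejected the artifact")
    return Verdict(True, "all checks passed")
```

Because the judge is only reached after every deterministic signal is green, a persuasive artifact cannot talk its way past a failing test.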
The exits are non-negotiable
A verifier tells you when to stop because you succeeded. You also need to stop when you are failing, and that is a different set of mechanisms. Robust loops carry several independent exits, and “independent” is load-bearing: if your only way out is the verifier passing, a loop that can never pass will run until something else breaks, usually your budget.
| Exit | What it catches | How to implement it |
|---|---|---|
| Verifier pass | Success | The ladder above. The only happy exit. |
| Hard iteration cap | Runaway loops | A counter. Boring. Always present. Set it low, raise it on evidence. |
| Token and dollar budget | Cost blowups | Track spend per trajectory, not per call. Kill at the ceiling. |
| Wall-clock budget | Stuck on slow tools, deadlocks | A timer on the whole trajectory, separate from per-call timeouts. |
| No-progress detector | Oscillation, local minima | Hash the state or diff each turn. If it repeats or marginal verifier gain stalls, stop. |
| Human gate | High blast-radius actions | Any irreversible or production-affecting step waits for approval. |
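The token, dollar, and wall-clock rows can share one small object that lives for the whole trajectory. This is a sketch under assumptions: `TrajectoryBudget` and its fields are hypothetical, and the ceilings are yours to choose. The clock is injectable so the timeout can be tested without sleeping.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class TrajectoryBudget:
    max_tokens: int
    max_dollars: float
    max_seconds: float
    clock: Callable[[], float] = time.monotonic  # injectable for tests
    tokens: int = 0
    dollars: float = 0.0
    started_at: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        # The timer covers the whole trajectory, not a single call.
        self.started_at = self.clock()

    def charge(self, tokens: int, dollars: float) -> None:
        """Add one call's usage to the running trajectory totals."""
        self.tokens += tokens
        self.dollars += dollars

    def exceeded(self) -> Optional[str]:
        """Return the name of the first ceiling hit, or None if within budget."""
        if self.tokens >= self.max_tokens:
            return "tokens"
        if self.dollars >= self.max_dollars:
            return "dollars"
        if self.clock() - self.started_at >= self.max_seconds:
            return "wall_clock"
        return None
```

Returning the name of the ceiling rather than a bare boolean keeps the escalation message specific, while still working in a plain `if budget.exceeded():` test.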
No-progress detection is the one people skip and then regret. A loop without it does not fail loudly, it fails by spending three hours and forty dollars oscillating between two almost-identical states, each of which fails the verifier in the same way. The cheap version: hash the candidate diff each iteration and stop if you see the same hash twice. The better version: track the verifier’s score over iterations and stop when the marginal improvement per iteration falls below a threshold. You are watching for a derivative going to zero, which is a control-theory instinct, not an AI one, and that is the point.
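Both versions fit in one small monitor. The sketch below assumes the verifier emits a numeric score and that the candidate can be serialized as diff text; `ProgressMonitor`, the window of three, and the `min_gain` threshold are illustrative choices, not recommended values.

```python
import hashlib
from collections import deque
from typing import Optional


class ProgressMonitor:
    def __init__(self, window: int = 3, min_gain: float = 0.01) -> None:
        self._seen = set()
        self._scores = deque(maxlen=window + 1)  # window gains need window + 1 points
        self._min_gain = min_gain

    def update(self, diff_text: str, score: float) -> Optional[str]:
        """Record one iteration; return 'stalled', 'plateau', or None to continue."""
        digest = hashlib.sha256(diff_text.encode("utf-8")).hexdigest()
        if digest in self._seen:  # cheap version: same candidate seen twice
            return "stalled"
        self._seen.add(digest)
        self._scores.append(score)
        if len(self._scores) == self._scores.maxlen:
            # Better version: average verifier gain per iteration over the window.
            gain = (self._scores[-1] - self._scores[0]) / (len(self._scores) - 1)
            if gain < self._min_gain:
                return "plateau"
        return None
```

Averaging over a window instead of comparing consecutive scores tolerates one noisy iteration, which matters when the verifier itself is not perfectly stable.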
Here is the shape of a loop that will not bankrupt you, in pseudocode, with the automated exits wired in (the human gate belongs on whatever applies the result):
EPSILON = 0.01  # assumed minimum score gain per iteration; tune per verifier

def run_loop(objective, budget, max_iters=12):
    state = gather_context(objective)
    seen = set()                                  # fingerprints of earlier attempts
    last_score = None                             # no baseline before the first verdict
    for _ in range(max_iters):                    # hard iteration cap
        if budget.exceeded():                     # tokens, dollars, wall-clock
            return escalate("budget", state)
        artifact = maker.act(state)               # the model does work
        signals = observe(artifact)               # run it, capture deterministic results
        result = checker.verify(objective,        # INDEPENDENT checker,
                                artifact,         # deterministic signals first,
                                signals)          # prose last
        if result.passed:
            return done(artifact)
        fingerprint = hash_state(artifact, signals)
        if fingerprint in seen:                   # oscillation / no progress
            return escalate("stalled", state)
        if last_score is not None and result.score - last_score < EPSILON:
            return escalate("plateau", state)     # diminishing returns
        seen.add(fingerprint)
        last_score = result.score
        state = adjust(state, result.feedback)
    return escalate("max_iters", state)
The model appears on exactly one line. Everything else is the loop, and the loop is where your engineering lives now.
What changes at enterprise scale
Everything above holds for a coding agent in your terminal. I want to spend the rest of this on the part the current discourse barely touches, because it is where I have spent the last few years and it is where loop engineering gets genuinely hard: loops that run unattended, over real systems, inside an organization.
When the loop is a coding agent, you have a gift you may not appreciate: a free, deterministic, fast verifier sitting at the top of the ladder. The test suite. The compiler. The world has handed you ground truth.
Now take the gift away. I built a system of more than fifteen agents to consolidate a sprawling, fragmented BigQuery estate into a single coherent foundation layer, and a separate multi-agent system that migrated tens of thousands of notebooks across thousands of projects, cutting each migration from roughly a week to under thirty minutes. There is no pytest for “did we consolidate this schema correctly” or “is this migrated notebook semantically equivalent to the original.” The verifier you get for free in software does not exist. You have to build it, and building it is most of the actual work.
What you build instead of a test suite is a verification coalition: several independent, mostly deterministic checks that together bound the space of “wrong” tightly enough to trust the loop.
flowchart TB
T["Trigger<br/><i>schedule, event, queue</i>"] --> CTX["Context assembly<br/><i>skills, durable state, retrieval</i>"]
CTX --> MK["Maker agents<br/><i>worktree-isolated, parallel</i>"]
MK --> VC["Verification coalition"]
VC --> D1["Schema and contract checks"]
VC --> D2["Data diff vs golden sample"]
VC --> D3["Policy and rule engine"]
VC --> D4["Sampled human review"]
D1 & D2 & D3 & D4 --> G{"Gate"}
G -->|pass, low blast radius| APPLY["Apply and record state"]
G -->|high blast radius| HUMAN["Human approval"]
G -->|fail| ADJ["Adjust and retry"]
ADJ --> CTX
BUD["Budget governor<br/><i>tokens, dollars, wall-clock, iterations</i>"] -. enforces .-> G
style VC fill:#264653,color:#fff
style BUD fill:#e9c46a,color:#1a1a1a
style HUMAN fill:#e76f51,color:#fff
style APPLY fill:#2d6a4f,color:#fff
Three things change once you are operating at this altitude, and none of them are in the viral threads.
Ground truth is a budget item. For the notebook migration, the strongest verifier is “run the original and the migrated notebook on the same data and diff the outputs.” That is differential testing, it sits high on the ladder, and it costs real compute per notebook. So verification is not free and it is not unlimited, which means you triage: deterministic cheap checks on everything, expensive differential checks on a risk-weighted sample, human review on the long tail the cheap checks flag. The verification coalition is itself a cost-optimization problem. Most people think the model calls are the expense. At scale, verification is frequently the larger line item, and it should be, because it is the part you can trust.
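The triage can be sketched as a planning function. Everything here is an assumption for illustration: `Item`, its `risk` score, and `plan_verification` are hypothetical, and the sampler is the Efraimidis-Spirakis method for weighted sampling without replacement, swapped in as one reasonable way to make "risk-weighted" concrete.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Item:
    item_id: str
    risk: float           # hypothetical risk score, must be > 0
    cheap_check_ok: bool  # result of the deterministic checks run on everything


def plan_verification(items: Sequence[Item], expensive_slots: int,
                      seed: int = 0) -> Dict[str, List[str]]:
    """Split items into human review, expensive differential checks, and cheap-only."""
    if any(item.risk <= 0 for item in items):
        raise ValueError("risk scores must be positive")
    rng = random.Random(seed)  # seeded so the plan is reproducible and auditable
    flagged = [i for i in items if not i.cheap_check_ok]
    clean = [i for i in items if i.cheap_check_ok]
    # Efraimidis-Spirakis: draw u in [0, 1), rank by u ** (1 / weight), keep the top k.
    ranked = sorted(clean, key=lambda i: rng.random() ** (1.0 / i.risk), reverse=True)
    sampled = ranked[:expensive_slots]
    rest = ranked[expensive_slots:]
    return {
        "human_review": [i.item_id for i in flagged],
        "differential": [i.item_id for i in sampled],
        "cheap_only": [i.item_id for i in rest],
    }
```

Fixing `expensive_slots` up front is what turns verification into an explicit line item: the sample size is set by the budget, and the risk scores decide where it is spent.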
Budgets stop being a safety rail and become the governance layer. When one engineer runs one loop, a budget is a guardrail against a runaway. When an organization runs hundreds of loops, the budget allocation is the control plane. Which loops get to touch production. How much they can spend before a human signs off. What blast radius is allowed without approval. When I set up an operating model for agents across a large delivery organization, the hard part was never the agents. It was making agentic checks a mandatory gate, standardizing what a sufficient verifier looks like, and giving budgets and human gates teeth so that “we automated it” never quietly meant “we stopped checking it.” That is loop engineering as an organizational discipline, and it looks a lot more like SRE error budgets than like prompt design.
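A gate with teeth can be as small as one pure function that every loop in the organization must call. The sketch below is an assumption about how such a policy might look; `BlastRadius`, `Action`, and `gate` are hypothetical names, and the branch order mirrors the pass, high-blast-radius, and fail routes described above.

```python
from enum import Enum
from typing import Mapping


class BlastRadius(Enum):
    LOW = "low"
    HIGH = "high"


class Action(Enum):
    APPLY = "apply"
    HUMAN_APPROVAL = "human_approval"
    RETRY = "retry"
    ESCALATE = "escalate"


def gate(checks: Mapping[str, bool], blast_radius: BlastRadius,
         budget_left: bool) -> Action:
    """Decide what happens to one candidate after the verification coalition runs."""
    if not checks:
        # No checks means no verifier; never let an empty coalition count as a pass.
        return Action.ESCALATE
    if not all(checks.values()):
        return Action.RETRY if budget_left else Action.ESCALATE
    if blast_radius is BlastRadius.HIGH:
        return Action.HUMAN_APPROVAL
    return Action.APPLY
```

The empty-coalition branch is the organizational rule in code form: a loop that skipped verification cannot reach `APPLY` by accident.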
You debug trajectories, not outputs. Once the unit of value is the trajectory, the unit of debugging is too. When a loop produces a bad result on notebook 18,000 of 30,000, “look at the output” is useless. You need the full trace: every step, every tool call, every verifier decision, the cost attributed per step, and the ability to replay it. This is observability work, and it is closer to distributed-systems debugging than to anything in the prompt-engineering playbook. It is also where interpretability earns its keep in production. Being able to look inside a step and ask why the model did what it did, which is the circuit-tracing work I have pushed into production settings, turns a black-box trajectory into something you can actually fix.
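A trace does not need a platform to start with. The sketch below assumes one record per step written as JSON lines; `TraceStep` and its fields are hypothetical, chosen to cover the step, the verifier decision, and the per-step cost named above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class TraceStep:
    iteration: int
    kind: str                     # "act", "tool", or "verify"
    name: str
    cost_usd: float
    passed: Optional[bool] = None  # only meaningful for "verify" steps
    detail: str = ""


def to_jsonl(steps: List[TraceStep]) -> str:
    """Serialize a trajectory as one JSON object per line."""
    return "\n".join(json.dumps(asdict(s), sort_keys=True) for s in steps)


def from_jsonl(text: str) -> List[TraceStep]:
    """Load a stored trajectory back into step records for inspection or replay."""
    return [TraceStep(**json.loads(line)) for line in text.splitlines() if line.strip()]


def cost_by_kind(steps: List[TraceStep]) -> Dict[str, float]:
    """Attribute spend to acting, tool calls, and verification."""
    totals: Dict[str, float] = {}
    for s in steps:
        totals[s.kind] = totals.get(s.kind, 0.0) + s.cost_usd
    return totals


def first_failed_verify(steps: List[TraceStep]) -> Optional[TraceStep]:
    """Find where the trajectory first went wrong according to the verifier."""
    for s in steps:
        if s.kind == "verify" and s.passed is False:
            return s
    return None
```

With records like these, "notebook 18,000 failed" becomes a query: load the trajectory, jump to the first failed verify step, and read backwards from there.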
A worked example: a reconciliation loop, in banking and in ADK
Abstractions are cheap, so here is a concrete one from a domain I have spent years in. Intersystem reconciliation: a bank’s ledger has to agree with a counterparty feed, a payment processor’s file, or a sub-ledger, across millions of transactions a day. The breaks, the mismatches, have to be found, classified, explained, and resolved, and any adjustment above a materiality threshold needs a human sign-off. It is high-volume, repetitive, judgment-laden in the middle and rigid at the edges. It is exactly the kind of work people now want to wrap in a loop.
It is also a good loop, and the reason is the whole thesis of this post: reconciliation comes with a real verifier. The books either balance or they do not. That is a deterministic check sitting at the very top of the ladder, which means you can let an agent iterate without flying blind.
Here is how the verification coalition lays out, mapped to the ladder:
- Top of the ladder, deterministic, run on everything. Control totals reconcile to source-system balances. Every proposed set of matches nets to zero. Every transaction appears exactly once. Reason codes come from a closed vocabulary, not free text. These are not model calls. They are arithmetic and set operations, and they cannot be flattered.
- Middle, rule engine. Break classification (timing, fee, FX, duplicate, missing), plus policy checks: materiality thresholds, segregation of duties, cutoff rules.
- Bottom, LLM judge, used sparingly. The only place a model gets to grade anything is the human-readable narrative of a break, and even there it is anchored to the deterministic classification. The model never decides whether a break is real or resolved. That decision lives in the arithmetic.
- Human gate. Any adjustment or write-off above the materiality threshold stops and waits for an approver. High blast radius, non-negotiable.
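The top rung of that list is plain arithmetic and set operations, which a short sketch makes concrete. The data shapes here are assumptions: `amounts` maps a transaction id to a signed `Decimal` (ledger entries positive, counterparty entries negative, a hypothetical convention), `matches` is a list of id groups, `open_breaks` maps an id to a reason code, and the vocabulary reuses the classification labels listed above.

```python
from collections import Counter
from decimal import Decimal
from typing import Dict, List

REASON_CODES = frozenset({"timing", "fee", "fx", "duplicate", "missing"})


def coalition_checks(amounts: Dict[str, Decimal], control_total: Decimal,
                     matches: List[List[str]],
                     open_breaks: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    # Control totals reconcile to the source-system balance.
    if sum(amounts.values(), Decimal("0")) != control_total:
        problems.append("control total does not reconcile to source balance")
    # Every transaction appears exactly once, in a match or as an open break.
    usage = Counter(tx for group in matches for tx in group)
    usage.update(open_breaks.keys())
    for tx in sorted(set(amounts) | set(usage)):
        if tx not in amounts:
            problems.append(f"{tx}: not in the source population")
        elif usage[tx] != 1:
            problems.append(f"{tx}: appears {usage[tx]} times, expected exactly 1")
    # Every proposed set of matches nets to zero.
    for index, group in enumerate(matches):
        net = sum((amounts.get(tx, Decimal("0")) for tx in group), Decimal("0"))
        if net != 0:
            problems.append(f"match {index}: nets to {net}, expected 0")
    # Reason codes come from a closed vocabulary, not free text.
    for tx, reason in open_breaks.items():
        if reason not in REASON_CODES:
            problems.append(f"{tx}: reason code {reason!r} is not in the closed vocabulary")
    return problems
```

`Decimal` rather than `float` is deliberate: a nets-to-zero check that can fail on rounding noise is exactly the kind of unstable verifier that produces oscillation.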
Now the part the framework gives you, and the part it does not. Google’s Agent Development Kit ships a LoopAgent that runs sub-agents on a loop with a hard iteration cap, and any sub-agent or tool can break the loop by setting tool_context.actions.escalate = True. So the skeleton is almost trivial:
from google.adk.agents import LoopAgent, LlmAgent
from google.adk.tools import FunctionTool, ToolContext


def run_verifier(tool_context: ToolContext) -> dict:
    # The verification coalition. Deterministic checks first, no model in sight.
    breaks = balance_and_netting_checks(tool_context.state)  # nets to zero, control totals, no dupes
    breaks += policy_rule_checks(tool_context.state)         # materiality, segregation of duties, cutoff
    if not breaks:
        tool_context.actions.escalate = True                 # verifier passed -> exit the loop
        return {"status": "reconciled"}
    return {"status": "open", "breaks": breaks}              # feed the breaks back for another pass


matcher = LlmAgent(
    name="matcher",
    instruction="Propose matches and classify the open breaks. Never declare anything resolved.",
    tools=[...],  # ledger lookups, fuzzy matching, FX conversion
)

verifier = LlmAgent(
    name="verifier",
    instruction="Call run_verifier. You never decide a break is resolved yourself; the checks do.",
    tools=[FunctionTool(run_verifier)],
)

recon_loop = LoopAgent(
    name="reconciliation_loop",
    sub_agents=[matcher, verifier],  # maker, then independent checker
    max_iterations=6,                # a hard exit that does not depend on success
)
Look at what ADK actually handed you: the loop, the iteration cap, and the escalate hook. That is the easy twenty percent. What it did not hand you is balance_and_netting_checks, policy_rule_checks, and the rule that a model never gets to call a break resolved. That deterministic coalition, and the discipline of keeping the model out of the verify decision, is the other eighty percent, and it is the part that decides whether this runs unattended against the general ledger or quietly books a wrong adjustment at three in the morning.
Two details this makes concrete. First, ADK gives you max_iterations and escalate, but the budget, wall-clock, and no-progress exits from the table above are still yours to wire, through callbacks or the runner. The framework gives you the iteration cap and the escalate hook and expects you to build the rest. Second, the contrast with a bad loop is sharp. Point the same LoopAgent at a goal like “improve this customer’s credit assessment” with an LLM judge as the only verifier, and you have built a machine that optimizes for assessments that please a grader. One of these belongs in production. The other is a model-risk and compliance incident waiting for a regulator. The framework cannot tell the difference. Only the verifier can, which is the entire point.
Loopmaxxing and the failure modes, mapped to their real cause
The failure modes getting passed around are real. What is missing is that almost all of them are the verifier failing in disguise. Here is the map I use.
| Failure mode | What it looks like | Root cause | Fix |
|---|---|---|---|
| Loopmaxxing | More iterations, no progress, big bill | No verifier, or a verifier that does not measure the real goal | Define a binary pass condition before you run anything |
| Oscillation | Flips between two near-identical states | Non-monotonic or noisy verifier | No-progress detector, plus a more stable check |
| Verifier gaming | Tests deleted, lint suppressed, judge flattered | Optimizer found the gap in a gameable check | Climb the ladder, make the checker independent |
| Comprehension debt | It shipped, nobody understands it | Verifier too cheap relative to the change’s blast radius | Match verification depth to blast radius, gate big changes |
| Cost blowup | Correct, eventually, at 10x the budget | No per-trajectory budget, verification too expensive | Budget per trajectory, triage expensive checks |
Two points deserve a closing note: one habit the best loop engineers practice reflexively, and one failure mode from the table that bites later.
Distill and demote. Once a step in your loop is stable, replace the model call with code. If an agent has reliably classified the same kind of input the same way a thousand times, that is not a job for a stochastic model anymore, it is a function. The mature trajectory of a loop is that it makes fewer model calls over time, not more, because you keep promoting proven behavior into deterministic code. The best loop engineering deletes LLM calls. This also quietly moves checks up the verifier ladder, because compiled logic is checkable in ways a prompt never is.
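Finding the steps that have earned demotion can itself be mechanical. The sketch below assumes inputs can be reduced to a stable key; `DistillationTracker`, `classify`, and the observation threshold are hypothetical, and the result is a plain lookup table that answers before any model is called.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict


class DistillationTracker:
    def __init__(self, min_observations: int = 1000) -> None:
        self._min = min_observations
        self._labels: Dict[str, Counter] = defaultdict(Counter)

    def record(self, input_key: str, label: str) -> None:
        """Log one model classification for later review."""
        self._labels[input_key][label] += 1

    def demotable(self) -> Dict[str, str]:
        """Keys seen often enough and always given the same label."""
        table = {}
        for key, counts in self._labels.items():
            if len(counts) == 1 and sum(counts.values()) >= self._min:
                table[key] = next(iter(counts))
        return table


def classify(input_key: str, table: Dict[str, str],
             model_call: Callable[[str], str]) -> str:
    """Answer from the distilled table when possible; fall back to the model."""
    if input_key in table:
        return table[input_key]
    return model_call(input_key)
```

A single disagreeing label disqualifies a key, which is intentionally strict: the table should only absorb behavior that was never in doubt.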
Comprehension debt is the one you pay later. Osmani’s warning is the right one to end the failure section on: “No volume of recursive loop cycles can salvage a poorly specified objective or unprincipled software architecture.” A loop ships faster than you can understand what it shipped. If you never close that gap, you are accruing debt against a future incident, and the interest rate is brutal.
It is not new, and respecting that makes you better at it
I would be doing the dishonest thing if I let you believe this was invented in June 2026. The loop is old.
ReAct formalized reason-act-observe in 2022. Reflexion added verbal self-correction from feedback in 2023. Evaluator-optimizer and generator-critic patterns predate the LLM era entirely. RLVR is loop-with-a-verifier turned into a training objective. And the whole structure, a controller comparing an observed state to a setpoint and acting to close the error, is a feedback control loop, which control theory has formalized for the better part of a century and SRE has run production on for two decades. Setpoint, error signal, controller, actuator, sensor. We renamed the sensor “verifier” and the actuator “tool call,” and the math did not change.
So why is the term useful and not just hype? Because the precondition finally arrived. For the first time the model at the center of the loop is good enough that running it autonomously to a goal beats driving it by hand for a large class of real tasks. When the inner component crosses that threshold, the bottleneck moves outward, from the model to the loop, exactly as Cherny’s claim that “loops are as big as the step from source code to agents was” suggests. The name is a flag planted on a real shift in where the hard engineering lives. The dismissive “it is just a while loop” take is correct about the syntax and wrong about the work, in the same way “a database is just a file” is correct and useless.
The playbook
If you are building your first serious loop, do these in this order. The order is the advice.
- Write the verifier first, before the loop. If you cannot state a check that distinguishes done from not-done without using the word “good,” you are not ready to build a loop. You are ready to define the objective better.
- Climb the ladder. For every check, ask whether it can be made more deterministic and moved up. Spend your engineering effort here, not on the prompt.
- Make the checker independent. Different model or fresh context, deterministic signals before prose, and never the maker grading itself.
- Wire all the exits before the first unattended run. Iteration cap, dollar and token budget, wall-clock budget, no-progress detector, human gate. A loop without exits is not allowed to run while you sleep.
- Budget per trajectory and alert on it. Cost is a property of the trajectory now. Treat a cost spike like a latency spike.
- Instrument the trajectory. Full traces, replay, per-step cost. You cannot debug what you cannot see, and you will need to debug it.
- Distill and demote. Every week, find the loop step that has earned its way into being plain code, and promote it.
- Keep a human on the high-blast-radius gate. Automate the verification. Do not automate away the accountability.
Honest assessment: when to reach for a loop
Reach for a loop when:
- The task is multi-step and benefits from real feedback between steps.
- You can express “done” as a check that sits high on the verifier ladder.
- The cost of an iteration is much smaller than the value of getting it right unattended.
- You can afford to build the verifier and the observability, not just the agent.
Do not reach for a loop when:
- Your only available verifier is “an LLM thinks it looks good.” That is not a verifier, it is a target, and you are about to optimize against it.
- The goal is genuinely subjective with no proxy for ground truth. Loops do not create signal that is not there.
- A single well-engineered call already solves it. Not everything needs to be autonomous, and a loop around a solved problem is just latency and cost.
- You cannot afford the verification and observability the loop demands. A loop you cannot watch is a liability, not a productivity gain.
Loop engineering is a real discipline and the shift it names is real. But the marketing has the emphasis exactly backwards. The loop is not the achievement. Anyone can write the loop. The achievement is a verifier you can trust an optimizer to attack a thousand times, and at enterprise scale, a coalition of them, governed by budgets, instrumented end to end, with a human on the gate that matters.
Osmani’s closing line is the right one to steal: “Build the loop. But build it like someone who intends to stay the engineer, not just the person who presses go.” The way you stay the engineer is by owning the verify box. Everything else, the model will increasingly handle. That box is yours.
Building loops over real systems and wrestling with the verifier problem? I would like to compare notes: contact@subhadipmitra.com
References
- Osmani, A. (2026). Loop Engineering. addyosmani.com/blog/loop-engineering (June 7, 2026)
- Steinberger, P. (2026). Remarks on designing loops that prompt agents, via X (June 7, 2026).
- Cherny, B. (2026). Remarks on loops at @Scale, via TechTalks (June 21, 2026).
- Dickson, B. (2026). Demystifying loop engineering: get more from AI agents, avoid loopmaxxing. TechTalks. bdtechtalks.com (June 22, 2026)
- Greyling, C. (2026). Loop Engineering. cobusgreyling.substack.com (June 9, 2026)
- decodingAI (2026). Agentic Harness Engineering: LLMs as the New OS. decodingai.com (March 31, 2026)
- Yao, S., et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629
- Shinn, N., et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366
- Google. Agent Development Kit (ADK): Loop agents. google.github.io/adk-docs