Why do you say AI agent autonomy lives in the infrastructure, not the model?

Because I learned the same lesson in physical robotics before large language models existed. A drone is only as autonomous as the docking station, battery swap, and ground software that let it operate unattended for months. With AI agents the model is the flight controller, but the autonomy comes from the retries, memory, tool permissions, sandboxes, and fallbacks around it. Swap in a better model and a fragile system is still fragile.

What can AI agents reliably do today versus what is still a demo?

Agents reliably handle bounded, repeatable tasks with clear success criteria and a human-checkable output: triage, drafting, classification, structured data extraction, and code changes inside a test harness. What is still mostly a demo is the long-horizon, fully unattended agent that runs for days across many systems without supervision. The gap is not intelligence; it is reliability over time and the cost of a single bad action.

How do you decide when something is worth handing to an agent?

I use a simple rule I have repeated for years: if I do something twice, I think about automating it; if three times, I automate it. Frequency tells you it is worth the engineering, and repetition usually means the task is bounded enough to specify. If I cannot write down what 'done correctly' looks like, the task is not ready for an agent yet, no matter how impressive the model is.

Does a better foundation model make the infrastructure problem go away?

No. A stronger model raises the ceiling on what a single step can do, but the failure modes that kill production agents are not about raw capability. They are about state, permissions, partial failures, and recovery, and those are systems problems. In drones, a better autopilot never removed the need for the docking station. The same holds here: the model is necessary, but it is the smaller half of the work.

What is the most common mistake builders make with agents right now?

Skipping the unglamorous parts. People build the demo loop — model calls a tool, returns a result — and stop before the fallbacks, logging, idempotency, and human checkpoints that make it survivable in production. The demo is the easy 20 percent. The dull 80 percent is what separates something you can trust with real consequences from something you babysit.

Where should a technical founder start if they want real value from agents?

Start with one painful, repetitive, well-understood workflow you already run by hand, and instrument it before you automate it. Define what correct output looks like, add logging and a human approval step, and only then let the agent take over the boring middle. Narrow and reliable beats broad and flaky. You can always widen the scope once the boring version earns trust.

Where AI Agents Are Actually Going (And Where Hype Is Wrong)

Everyone is selling the model. I keep watching people forget that the model is the easy part. The hard, unglamorous, autonomy-deciding work happens in the system around it — and that is exactly where the current AI agent hype goes quiet.

My thesis is simple, and I have held a version of it for years: an agent's autonomy is not a property of the intelligence. It is a property of the infrastructure you wrap around the intelligence. Get that wrong and the smartest model on earth gives you an impressive demo and an unreliable product.

Why I think autonomy lives in the infrastructure, not the model

I did not come to this from large language models. I came to it from drones. A flying robot is the visible, exciting part. But a drone that needs a human to swap its battery, carry it to a site, and babysit the flight is not autonomous in any meaningful sense. The autonomy is the docking station, the battery swap, the charging, the weatherproofing, and the software that decides when to launch and when to stay home.

That is the whole point of how I describe what we build at Dronehub: we don't really make drones so much as the infrastructure that lets a drone operate on its own. Think of the docking station as a gas station for drones. The aircraft is a near-commodity endpoint. The value, and the actual autonomy, is the ground system that lets that endpoint run unattended for months at a time, inspecting power lines, refineries, and railways without anyone standing in a field.

Now transpose that onto AI agents, because the mapping is almost one-to-one. The model is the aircraft. It is the flashy, improving, benchmark-chasing endpoint. The autonomy — the thing that lets an agent run without a human hovering over it — lives in everything else: the orchestration, the memory, the tool permissions, the retries, the sandboxes, the monitoring, the fallbacks. Swap a better model into a fragile system and you still have a fragile system. Build the right infrastructure around a merely good model and you get something you can actually trust with real work.

What "docking stations and fallbacks" mean for software agents

When I say agents need docking stations, I mean it as a working analogy, not a cute metaphor. A drone-in-a-box without the box is just a drone you have to chase around a field. An AI agent without its equivalent of a box is just a chat window you have to supervise.

Here is what the "box" is for a software agent:

A place to land and recover. Idempotent operations, checkpoints, and clean state, so that when a step fails — and steps fail constantly — the agent can resume instead of corrupting something or starting a half-finished job twice.
A fuel and power system. Token budgets, rate-limit handling, and graceful degradation when an API is slow or down. The unglamorous plumbing that keeps the thing alive across thousands of runs, not one good demo run.
Permissions and physical limits. What can this agent touch? In drones, you geofence. In software, you scope tools, sandbox file and network access, and require approval for destructive actions. The agent should be incapable of certain mistakes, not merely instructed to avoid them.
Fallbacks and a human checkpoint. When confidence is low or the situation is out of distribution, the system should stop and ask, not confidently barrel forward. A drone that hits unexpected wind returns to base. An agent that hits an ambiguous instruction should escalate, not invent a decision.
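
To make the "place to land and recover" item concrete, here is a minimal sketch of a checkpointed step runner. The function name `run_with_checkpoints`, the JSON file layout, and the idea that each step takes and returns a plain dictionary are all assumptions made for illustration, not a description of any particular agent framework.

```python
import json
from pathlib import Path
from typing import Callable

Step = Callable[[dict], dict]


def run_with_checkpoints(steps: list[tuple[str, Step]], checkpoint_path: Path) -> dict:
    """Run named steps in order, skipping any already recorded as done.

    Hypothetical helper: state must be JSON-serializable for this sketch.
    """
    # Load progress from an earlier, interrupted run if there is one.
    if checkpoint_path.exists():
        saved = json.loads(checkpoint_path.read_text())
    else:
        saved = {"done": [], "state": {}}

    for name, step in steps:
        if name in saved["done"]:
            continue  # finished in an earlier run, so do not repeat it
        saved["state"] = step(saved["state"])
        saved["done"].append(name)
        # Persist after every step so a crash resumes from this point.
        checkpoint_path.write_text(json.dumps(saved))
    return saved["state"]
```

If a step raises, the checkpoint still holds everything that completed before it, so calling the function again with the same path resumes at the failed step instead of starting the job over.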

None of that is in the model. All of it determines whether you have an agent or a parlor trick. And every bit of it is ordinary, somewhat boring systems engineering — the kind nobody puts in a launch video.

How I separate what's shipping from what's a demo

I teach tens of thousands of entrepreneurs to actually build with AI — through VADYM.AI in Ukrainian and KIERUNEK.AI in Polish, and as a trainer at Instytut Kryptografii in Poland. That is not commentary from the sidelines; it means I watch what people ship and where it breaks, every week. So here is my honest split.

What is genuinely shipping today: agents that handle bounded, repeatable tasks with a clear definition of done and a human-checkable output. Drafting and triage. Classification and routing. Pulling structured data out of messy documents. Writing and modifying code inside a test harness that catches its mistakes. Customer-facing flows where a wrong answer is annoying, not catastrophic, and a human can review before anything irreversible happens. These work because the task is narrow enough to specify and the blast radius of a single error is contained.

What is still mostly a demo: the fully autonomous, long-horizon agent that runs for days, across a dozen systems, making consequential decisions with nobody watching. You see these in beautifully edited videos. You rarely see them running unsupervised in a business where a bad action costs real money. The gap is not the model's raw intelligence; models are plenty smart for most of these tasks. The gap is reliability compounded over many steps. If each step is 95 percent reliable and failures are independent, twenty steps in a row succeed only about 36 percent of the time, which is worse than a coin flip. That math is brutal, and no benchmark headline fixes it.
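
The compounding can be checked with a few lines of arithmetic. This sketch assumes every step succeeds independently with the same probability, which is a simplification; the function name `chain_success` is made up for the example.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


# How quickly a 95-percent-reliable step decays over a longer chain.
for n in (5, 10, 14, 20):
    print(f"{n:>2} steps: {chain_success(0.95, n):.3f}")
```

Under these assumptions the chain drops below even odds at around fourteen steps and sits near 0.36 at twenty.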

The tell is almost always the same. Ask any agent demo one question: what happens on step seven when the third tool call returns garbage? If the answer is a confident, specific recovery story, it is real. If the answer is a shrug or "the model handles it," it is a demo. The model does not handle it. The infrastructure handles it, and if the infrastructure was not built, nothing does.
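
One shape a "confident, specific recovery story" can take is: validate every tool result, retry a bounded number of times, then hand off to a human. The sketch below is an illustration under assumptions. `call_with_recovery`, `Escalate`, and the `is_valid` callback are hypothetical names, and real systems will want richer validation than a single predicate.

```python
import time
from typing import Callable


class Escalate(Exception):
    """Signals that a human should take over this step (hypothetical)."""


def call_with_recovery(
    tool: Callable[[], object],
    is_valid: Callable[[object], bool],
    max_attempts: int = 3,
    base_delay: float = 0.5,
) -> object:
    """Call a tool, reject garbage output, retry with backoff, then escalate."""
    last_problem = "no attempts made"
    for attempt in range(max_attempts):
        try:
            result = tool()
        except Exception as exc:  # the tool call itself failed
            last_problem = f"tool raised {exc!r}"
        else:
            if is_valid(result):
                return result
            last_problem = f"invalid output {result!r}"
        # Exponential backoff, but no pointless wait after the final attempt.
        if attempt < max_attempts - 1:
            time.sleep(base_delay * 2 ** attempt)
    raise Escalate(f"gave up after {max_attempts} attempts; {last_problem}")
```

The point of the design is that garbage never flows downstream silently: the step either returns something that passed a check or stops with a reason a person can read.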

The heuristic I use to decide what's worth automating

People expect a founder in this space to tell them to automate everything. I tell them the opposite, because most things should not be automated yet, and chasing the wrong ones wastes the team's best months.

My rule is older than the current wave and has never failed me: if I do something twice, I think about automating it. If three times — I automate it. It looks trivial. It is actually a real filter.

Frequency is the first gate. If a task happens three times, it will happen three hundred, and the engineering pays for itself. A one-off does not, no matter how tedious it was. Repetition is the second, quieter gate: a task you have done three times the same way is a task you can specify. You know the inputs, the steps, and what a correct result looks like. That specification is the actual prerequisite for an agent. The model is not the hard part — knowing precisely what "done correctly" means is the hard part. If I cannot write that down, the task is not ready, and dropping a smarter model on it will not change that. It will just fail more fluently.
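
Writing down what "done correctly" means can be as plain as a list of named checks. The example below invents an invoice-extraction task purely for illustration; the field names (`invoice_number`, `total`, `line_items`, `amount`) and the one-cent tolerance are assumptions, not anything taken from a real workflow.

```python
from dataclasses import dataclass


@dataclass
class DoneCheck:
    name: str
    passed: bool


def check_invoice_extraction(output: dict) -> list[DoneCheck]:
    """Hypothetical definition of done for an invoice-extraction task."""
    total = output.get("total")
    items = output.get("line_items", [])
    total_ok = isinstance(total, (int, float)) and total > 0
    items_sum = sum(item.get("amount", 0) for item in items)
    return [
        DoneCheck("has invoice number", bool(output.get("invoice_number"))),
        DoneCheck("total is a positive number", total_ok),
        # Only compare against the total when the total itself is usable.
        DoneCheck("line items sum to total", total_ok and abs(items_sum - total) < 0.01),
    ]


def is_done(output: dict) -> bool:
    """True only when every written-down check passes."""
    return all(check.passed for check in check_invoice_extraction(output))
```

If a list like this cannot be written for a task, that is a sign the task is not yet specified well enough to hand over.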

This is the same discipline as the drone work. We did not try to automate every inspection on day one. We picked the dull, repetitive, well-defined ones (the routine perimeter flight, the standard power-line check) and built reliable infrastructure for exactly those. Boring and bounded beats broad and brittle. That is not a limitation I apologize for; it is the whole reason the system can be trusted to run on its own. I have written more about why the boring, repetitive parts are precisely the right target in "Autonomy and the Future of Work."

Where the hype is wrong (and why that matters)

The dominant hype story is that we are one or two model releases away from general-purpose agents that replace whole job functions on their own. I think that framing is wrong in a specific and useful way: it puts all the attention on the model and almost none on the system, which means it points builders at the wrong work.

Three corrections I would make.

First, capability is not the bottleneck — reliability is. The models are already good enough for an enormous amount of real work. What is missing is the engineering that turns a capable-on-average model into a dependable-every-time system. That engineering is not glamorous and does not trend, which is exactly why it is undervalued — and therefore where the real edge is.

Second, a better model does not erase the systems problem. A stronger autopilot never removed the need for the docking station. A stronger model will not remove the need for state management, permissions, idempotency, and recovery. Those are properties of the world the agent acts in, not of the agent. Anyone waiting for the next model to make their architecture unnecessary is going to wait forever.
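
Permissions are a good example of a property that sits outside the model. A minimal sketch, assuming a hypothetical `ToolBox` class that the agent must go through for every tool call: safe tools run freely, destructive ones need explicit approval, and anything unregistered is simply out of reach.

```python
from dataclasses import dataclass, field
from typing import Callable


class ApprovalRequired(Exception):
    """Raised when a tool call must wait for a human decision (hypothetical)."""


@dataclass
class ToolBox:
    # Tools the agent may call freely.
    safe: dict[str, Callable[..., object]] = field(default_factory=dict)
    # Tools that change or destroy something and need sign-off first.
    destructive: dict[str, Callable[..., object]] = field(default_factory=dict)

    def call(self, name: str, *args: object, approved: bool = False) -> object:
        if name in self.safe:
            return self.safe[name](*args)
        if name in self.destructive:
            if not approved:
                raise ApprovalRequired(name)
            return self.destructive[name](*args)
        # Anything not registered does not exist as far as the agent is concerned.
        raise PermissionError(f"tool not in scope: {name}")
```

The limit is enforced by the code path rather than by the prompt, so a smarter or a more confused model hits the same wall.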

Third, the demo is the easy 20 percent. The loop where a model calls a tool and returns a result is genuinely easy to build now — a weekend, maybe less. The other 80 percent — fallbacks, logging, idempotency, human checkpoints, the handling of every ugly edge case — is what separates something you can trust with consequences from something you have to watch. The hype celebrates the 20 percent and quietly skips the 80. Production lives entirely in the 80.
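
Idempotency is one of the pieces in that 80 percent, and the core idea fits in a few lines. This is a sketch under assumptions: `IdempotentExecutor` is a hypothetical name, the key is a hash of the action and its payload, and results live in an in-memory dictionary where a production system would need durable storage.

```python
import hashlib
import json
from typing import Callable


class IdempotentExecutor:
    """Run each distinct (action, payload) pair at most once (illustrative)."""

    def __init__(self) -> None:
        self._results: dict[str, object] = {}

    @staticmethod
    def key_for(action: str, payload: dict) -> str:
        # Sorted keys give the same hash regardless of dictionary ordering.
        canonical = json.dumps({"action": action, "payload": payload}, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def run(self, action: str, payload: dict, effect: Callable[[dict], object]) -> object:
        key = self.key_for(action, payload)
        if key in self._results:
            # A retry of the same request returns the earlier result
            # instead of repeating the side effect.
            return self._results[key]
        result = effect(payload)
        self._results[key] = result
        return result
```

With this in place, an agent that retries a step after a timeout cannot send the same email or issue the same refund twice, as long as the retry carries the same payload.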

I hold all of this with what I'd call cautious optimism, not cynicism. I am building a new company, Oswin AI, at exactly this intersection of AI and robotics, because I think the direction is right. I just refuse to pretend the infrastructure work is already done. Believing the hype makes you build the demo. Believing the infrastructure makes you build the product. I'd rather be the person who finishes the boring part: closer to Tony Stark in the workshop than to a man on a stage, a distinction I get into in "Tony Stark, Not Elon Musk."

What this means and where I'd start

If you are trying to cut signal from noise on the agent wave, here is the compressed version. Autonomy is an infrastructure achievement, not a model achievement. The model sets the ceiling on a single step; the system around it decides whether you have a product or a video. What ships today is narrow, bounded, and human-checkable. What is still a demo is broad, long-horizon, and unsupervised.

Where I'd start, concretely: take one painful, repetitive workflow you already run by hand — one that clears the "third time" bar. Before you automate anything, instrument it: write down what correct output looks like, add logging, add a human approval step. Then let the agent take over only the boring, well-defined middle, with a fallback for everything it is not sure about. Get that narrow version reliable enough that you stop checking it. Only then widen the scope.
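
The starting recipe above can be sketched as a single wrapper around the agent step. Everything here is an assumption for illustration: `run_instrumented` and its three callbacks (`agent_step`, `looks_correct`, `approve`) are hypothetical names, and the human approval is modelled as a function that returns True or False.

```python
import json
import logging
from typing import Callable, Optional

log = logging.getLogger("workflow")


def run_instrumented(
    item: dict,
    agent_step: Callable[[dict], dict],
    looks_correct: Callable[[dict], bool],
    approve: Callable[[dict, dict], bool],
) -> Optional[dict]:
    """Run one item through the agent, log it, and gate it on a human."""
    draft = agent_step(item)
    # Log input and draft together so failures can be inspected later.
    log.info("draft %s", json.dumps({"input": item, "draft": draft}))
    if not looks_correct(draft):
        log.warning("draft failed the written-down checks; falling back to a human")
        return None  # fallback: a person handles this item by hand
    if not approve(item, draft):
        log.info("reviewer rejected the draft")
        return None
    return draft
```

Returning `None` is the fallback path in this sketch: the item goes back to the manual process, and the log shows why. Once the approval step stops catching anything for long enough, that is the evidence for widening the scope.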

Narrow and reliable compounds. Broad and flaky collapses. The model is the part everyone is selling you. The infrastructure is the part that actually decides whether your agent is going anywhere — and it is the part worth your best engineering. If you want to talk through where to draw that line in your own stack, reach out.

Key facts

Vadym Melnyk argues that an AI agent's autonomy is a property of the infrastructure around the model — orchestration, memory, tool permissions, retries, sandboxes, and fallbacks — not of the model itself.
Source · Vadym Melnyk
Melnyk draws the agent thesis from autonomous drones: a drone is only as autonomous as its docking station, battery swap, and ground software, not its flight controller.
Source · Vadym Melnyk
Melnyk's automation heuristic is: 'If I do something twice, I think about automating it. If three times — I automate it.'
Source · Vadym Melnyk, stated motto
Melnyk teaches tens of thousands of entrepreneurs to build with AI through VADYM.AI (Ukrainian) and KIERUNEK.AI (Polish), and is a trainer at Instytut Kryptografii in Poland.
Source · vadmelnyk.com site config (site.ts)
Dronehub builds autonomous drone-in-a-box systems — drones plus docking stations with battery swap and AI software — for inspecting power lines, refineries, and railways; it was founded in 2015 as Cervi Robotics and rebranded to Dronehub in 2020.
Source · vadmelnyk.com /ventures
Dronehub is a European R&D leader, with projects run through the European Space Agency, the European Defence Agency, and Horizon Europe, including the Horizon 2020 HUUVER project and the ESA/EDA AUDROS project.
Source · vadmelnyk.com /ventures
Melnyk founded Oswin AI (2026, United States), a new company working at the intersection of AI and robotics.
Source · vadmelnyk.com /ventures
