Skip to content

How Malicious Routers Can Hijack AI Agents

Estimated time to read: 6 minutes

Most discussions about LLM security still start at the prompt. That is no longer enough.

We have spent a lot of time studying prompt injection, jailbreaks, social engineering, roleplay abuse, encoding evasion, multilingual attacks, and all the other ways an attacker can manipulate what a model sees and how it reasons. Those attack families matter.

But they are not the whole picture.

A new paper, Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain, pushes the conversation in a more uncomfortable direction, the problem is not always the model, and it is not always the user. Sometimes the weakest point sits in the infrastructure between them.

Nothing new for security experts who can analyse the attack surface. However, that is where things get interesting, and dangerous.

Modern agents increasingly rely on routers, relays, MCP and proxy layers to reach upstream models. These intermediaries promise convenience, model fallback, aggregation across providers, lower cost, unified APIs, and easier deployment.

In practice, they also create a powerful and often underappreciated trust boundary.

If your agent sends requests through a third-party router, that router can often see the full request in plaintext. It can inspect prompts, tool definitions, API keys, tool outputs, and returned tool calls.

More importantly, it may be able to rewrite what comes back before the client ever sees it.

That changes the security model completely.

At the prompt layer, an attacker tries to influence what the model decides. At the router layer, an attacker may not need to persuade the model at all.

They may simply alter the tool-call payload after the model has already generated it.

That distinction matters.

It means an agent can appear to behave normally while executing something fundamentally different from what the upstream provider actually produced. A clean-looking response can become a malicious shell command.

A legitimate package install can become dependency substitution. A harmless workflow can become credential theft.

The compromise happens below the level where most people are still looking.

This is why the paper’s framing is so strong, LLM routers should be treated as part of the AI supply chain, not as transparent plumbing and once you see them that way, many of the attack patterns from the broader taxonomy start to connect.

Prompt injection, social engineering, structural manipulation, and subtle elicitation are all still relevant. But now there is another layer: an intermediary with application-layer control over requests and responses.

In that setting, the attacker does not need to win the model’s reasoning battle. It can target the transport, the orchestration layer, or the tool-execution path instead.

That is a very different threat model from the one many teams still assume.

The paper formalises this around four behaviours, payload injection, secret exfiltration, dependency-targeted injection, and conditional delivery. In simple terms, that means a router can rewrite what the agent executes, quietly harvest credentials from traffic in transit, swap apparently legitimate dependencies for attacker-controlled ones, or stay dormant until the right session appears.

That last one is especially important.

Conditional delivery means the system can look clean during shallow testing and only become malicious under the right circumstances, after enough warm-up requests, only in autonomous sessions, or only for specific project types. In other words, basic spot-checking can easily produce false confidence.

That finding should make a lot of defenders uncomfortable.

Because it suggests the real adversary is not just a malicious prompt. It is a malicious or compromised intermediary embedded in the same path your agent already trusts.

The paper’s measurements make this even harder to dismiss. The authors report active malicious behaviour in a real router ecosystem, including routers that injected malicious code into returned tool calls, routers that touched researcher-owned canary credentials, and poisoning studies showing that even apparently benign relays could be pulled into the same attack surface through leaked keys and weak upstream chains.

That is the key shift.

A router does not have to be malicious on day one to become part of the problem. It only needs to sit inside a weak enough chain.

This is where the article connects directly to the taxonomy work many of us have been building around LLM and agent vulnerabilities.

We often talk about trust boundaries as if they begin and end inside the prompt. But in real systems, trust is distributed across prompts, tools, memory, retrievers, wrappers, relays, APIs, gateways, model providers, and execution environments.

The router layer is especially dangerous because it sits at a point where language turns into action.

That is exactly where an attacker wants to be.

For a standalone chatbot, that may lead to misleading outputs, hidden instruction leakage, or data exposure.

For an agent, it can mean something far more serious, arbitrary command execution, dependency poisoning, credential theft, workflow hijacking, and real-world operational abuse.

That question applies whether the intermediary is a third-party router, a gray-market relay, an internal proxy, a managed compatibility layer, or any other service that terminates and re-originates model traffic.

The paper does not claim that everything is broken beyond repair. It evaluates some practical defenses too, including fail-closed policy gates for high-risk tool calls, response-side anomaly screening, and append-only transparency logging.

Those controls can reduce exposure now, especially for obvious high-risk workflows.

But the long-term message is more structural, if the client cannot verify that the tool call it executes is the tool call the provider actually produced, then the provenance gap remains open.

That is the deeper lesson.

Agent security is not just about resisting prompt attacks. It is about preserving integrity across the full chain from user intent to model output to tool execution and if we keep treating intermediary layers as invisible, we will keep underestimating where the real attack surface actually lives.

The supply chain has entered the agent era and we should stop thinking that the prompt is the only place where trust can fail.

For those who have been in security and engineering long enough, none of this should feel entirely new. We ran into similar issues when mobile apps, API gateways, and intermediary service layers first became common.

Different technology, different era, but many of the same trust, visibility, and integrity problems. That is why this moment feels both new and familiar.

Why do we keep relearning the same lessons as if each wave of technology were the first?

And as usual, the hardest question is not technical but strategic, are we compromising security again in exchange for speed, convenience, and rapid adoption?