Top 3 AI Agent Security Papers from CAIS 2026 (Out of 12 Reviewed)

CAIS 2026 — ACM's first conference on AI and agentic systems — was about a question that didn't exist when LLMs were chatbots: once an LLM is wired up to act, what does the system around it have to do? The Security & Privacy track was one of the field's first peer-reviewed attempts at an answer. We build Agyn — an open-source runtime for shipping AI agents in production — so this is the conference we'd been waiting for. Here are three papers we'd like to highlight.

How we picked

Two criteria, in order:

Is the result genuinely new, or a tighter version of something we already knew?
How strong is the evidence?

Papers that scored high on both made the cut.

#1 — Malice in Agentland: Down the Rabbit Hole of Backdoors in the AI Supply Chain

Authors: Boisvert et al. — ServiceNow Research, Mila, Polytechnique Montréal, UCLouvain Save it: arXiv 2510.05159 · CAIS page

Authors' stated contribution

The authors formalize three threat models across distinct supply-chain layers:

Direct poisoning of finetuning data. Your data provider hands you training samples that contain the backdoor trigger. You fine-tune on it, the trigger gets baked in.
Pre-backdoored base models. You download an open-weights model that was already trained with a latent backdoor. Your fine-tuning doesn't remove it.
Environment poisoning. Same goal as #1, but the attacker can't reach the data directly. They poison the environments (websites, tools, simulators) that a teacher model uses to generate training traces. The teacher gets manipulated into malicious actions, and those actions get logged into the dataset and baked into the agent.

Empirical claim, verbatim: "poisoning only a small number of demonstrations is sufficient to embed a backdoor that causes an agent to leak confidential user information with over 80% success."

Our take

The authors deliver a strong empirical result. They measure the actual impact of these attacks and show that a small number of poisoned samples can make them quite successful. But the attack is sophisticated in practice. It needs intel on the victim's data pipeline and fine-tuning method. (The paper studies supervised fine-tuning only; extension to RL or preference-tuning remains untested.) It needs control of a large pool of legitimate-looking websites. It needs to evade training-data filtering. Given that complexity, the realistic attacker has the resources of a large company or research lab. The downstream exposure is broader, though: any team using a pre-trained model — whether self-hosted from a public hub like HuggingFace or accessed via API — can inherit a backdoor someone else baked in upstream.

Authors: Aman Priyanshu, Supriti Vijay (Foundation-AI), Esha Pahwa (Corvic AI) Save it: CAIS papers page · ACM DOI 10.1145/3786335.3813173

Privacy violations more than double when LLMs move from single-turn evaluation into a multi-agent social setting. Worse, the leakage is contagious. Agents become roughly 8× more likely to disclose sensitive information after watching a peer do it.

In the authors' framing: "LLM safety degrades substantially in persistent multi-agent social environments compared to single-turn evaluation: privacy violations nearly double when shifting from isolated to social multi-turn settings, and leakage is socially contagious — spreading across agent communities through interaction."

How the experiment actually works

This is not an attacker-vs-defender setup. There is no designated red-team agent probing for secrets.

The benchmark: CIMemories, a Contextual Integrity benchmark from Facebook Research. Each agent is given a synthetic user profile with 100+ attributes (medical info, salary, location, relationships). Each attribute is essential to share in some task contexts and inappropriate in others. The failure mode is over-sharing, not extraction.
The setting: a multi-agent simulation platform with thousands of LLM agents acting as users on a social platform, interacting freely across communities over a simulated month-long period.
The measurement: how often agents leak attributes that are contextually inappropriate for the conversation. Single-turn baseline vs. the multi-agent social condition.
The contagion test: after one agent inappropriately discloses, do peers become more likely to inappropriately disclose their own attributes? Yes, by a factor of around 8.

The numbers (on OpenAI models)

Single-turn baseline: ~19.95% privacy violation rate
Multi-agent social setting: ~45.30% (privacy violations more than double)
Social contagion: agents become ~8× more likely to disclose sensitive attributes after observing a peer over-share
Even with explicit privacy instructions, leakage remains above ~37.8%

Authors' takeaway

"Static chat-based safety benchmarks systematically underestimate risks in agentic deployment." "Explicit privacy instructions reduce but do not eliminate this effect."

Our take

An interesting experiment that surfaces a new class of vulnerability specific to automated multi-agent setups. The leakage doesn't need an attacker. It emerges from agents being agents, drifting toward whatever norm the community settles on.

#3 — Context Matters: Repository-Aware Security Analysis of the Agent Skill Ecosystem

Authors: Holzbauer, Schmidt, Gegenhuber, Schrittwieser, Ullrich — IT:U & SBA Research Save it: arXiv 2603.16572 (preprint title: "Malicious Or Not: Adding Repository Context to Agent Skill Classification") · SBA announcement Best Paper, AgentSkills Workshop at CAIS 2026

What they did

The authors ran the largest empirical security analysis of the agent skill ecosystem to date — 238,180 unique skills collected from three major distribution platforms and GitHub. Instead of evaluating each skill in isolation, they look at the skill in the context of the GitHub repository that ships it — its commit history, its surrounding code, the activity of its maintainers.

The headline finding

Individual marketplace scanners classify up to 46.8% of skills as malicious. After repository-aware analysis, only 0.52% remain genuinely suspicious — a ~98.9% reduction in false positives.

The corollary is also load-bearing: the authors identify a previously undocumented real-world attack vector — the hijacking of skills hosted in abandoned GitHub repositories, where an attacker takes over the repo and silently pushes a malicious update that propagates through every downstream marketplace listing.

Our take

The agent skill marketplace is about to become the new app store, and the new app store will need new security tooling. The current scanners produce so many false positives (46.8%) that the signal is effectively unusable for any operator at scale. Tools remain one of the most direct ways an agent picks up malicious instructions, and the marketplace around them clearly needs real security infrastructure. This paper is a concrete step in that direction.

Where Agyn fits

The thread connecting these three papers is the same: agents create a new attack surface, and defending it requires new systems. Agyn is a zero-trust runtime for AI agents — cryptographic identity, sandboxed execution, scoped credentials, and per-agent capability isolation.

Get new agent engineering posts in your inbox