A first-of-its-kind security analysis of 12 widely deployed agentic offensive-security tools reveals critical architectural flaws that allow adversaries to steal LLM API keys, establish persistent footholds, and fully compromise the host, even when the agent runs inside a sandboxed container.
Security researchers from Cracken have published the first in-depth security analysis of agentic red-team systems, AI-powered tools designed to autonomously conduct penetration testing and offensive security operations.
The study exposes a sweeping set of shared design flaws that enable an active adversary to exfiltrate sensitive credentials, weaponize the victim’s own infrastructure, and fully compromise the operator’s machine, even when the agent runs inside a sandboxed Docker container.
Red-Team AI Tool Vulnerabilities
Agentic red-team systems are fully autonomous LLM-driven platforms built to simulate offensive security operations, including black-box penetration testing.
The researchers analyzed 12 popular open-source tools, including PentestGPT, RedAmon, DarkMoon, AIRecon, CAI, PentAGI, STRIX, Artemis, METATRON, and others, all of which pair a large-language-model orchestrator with a Kali Linux worker container capable of executing arbitrary shell commands against targets.

These tools are rapidly entering production security workflows, with adoption accelerating across enterprise security teams and growing interest from military cyber forces, making their attack surface an urgent area of concern.
The researchers introduce a tailored cyber kill chain modeled specifically for agentic red-team systems, progressing through five stages:
- Worker RCE via agent manipulation: The attacker deploys a honeypot containing a maliciously staged payload. Without any explicit prompt injection, the agent downloads and executes it, granting a reverse shell on the worker container.
- Privilege escalation: Weak file-system or network isolation between the worker and orchestrator containers enables lateral movement. In PentestGPT, a writable Docker volume exposed the orchestrator’s settings.json, allowing hook injection that triggered RCE on the orchestrator at every subsequent session start.
- Persistence: Attackers poison non-volatile components: source code files, MCP server directories exposed via bind mounts, or episodic memory stores. Trojanized code re-establishes the foothold automatically on container restart.
- Sandbox escape: Misconfigured Docker socket mounts and host-network access allow the attacker to spawn containers directly on the host Docker daemon, breaking out of the sandboxed environment entirely.
- Host compromise: Full code execution on the operator’s machine is achieved, enabling traditional C2 installation and post-exploitation activities.
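Several of these stages hinge on container settings that can be checked before a tool is ever run. The following is a minimal sketch, not taken from the study, of how an operator might flag them in a Compose-style service definition that has already been parsed into a dictionary. The function name `audit_worker_service` and the finding strings are illustrative assumptions, and only the short `source:target[:mode]` volume syntax is handled.

```python
from typing import Any

DOCKER_SOCKET = "/var/run/docker.sock"


def audit_worker_service(service: dict[str, Any]) -> list[str]:
    """Return risky settings found in one Compose-style service definition.

    Hypothetical helper: it only understands short-syntax volume strings
    ("source:target[:mode]") and ignores anonymous volumes.
    """
    findings: list[str] = []

    # Host networking lets the container reach host-local services directly.
    if service.get("network_mode") == "host":
        findings.append("host network mode")

    # Privileged containers have few barriers between them and the host.
    if service.get("privileged"):
        findings.append("privileged container")

    for volume in service.get("volumes", []):
        parts = volume.split(":")
        if len(parts) < 2:
            continue  # anonymous volume, nothing shared with the host
        source = parts[0]
        mode = parts[2] if len(parts) > 2 else "rw"
        if source == DOCKER_SOCKET:
            # Access to the Docker socket allows spawning containers on the host.
            findings.append("docker socket mounted")
        elif "ro" not in mode.split(","):
            # Writable shared mounts are what enabled the settings.json hook injection.
            findings.append(f"writable mount: {source}")

    return findings
```

A service with `network_mode: host` and a mounted Docker socket would yield both findings, while a read-only bind mount yields none. A real audit would also need to cover long-syntax volumes and settings passed on the command line.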

A particularly alarming finding is the novel agent-phishing attack, a prompt-injection-free manipulation technique that achieved 97.8% success across all tested agents and LLMs.
The attacker stages a fully functional binary (e.g., a password vault decryptor called pwcrypt) on an adversary-controlled honeypot, complete with a convincing README and fabricated CI pipeline logs.
The agent downloads and executes the binary, believing it is a critical artifact. The binary contains no malicious code. Instead, it carries a deliberately planted memory corruption vulnerability that is triggered upon execution and hijacks control flow to achieve arbitrary code execution.
This defeats model-based inspection entirely, since there is no shellcode, encoded payload, or suspicious syscall pattern. The attack was effective against Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro, DeepSeek V4 Pro, GLM-5.1, and Kimi K2.6.
Once a foothold is established, secret exfiltration is trivially achievable in 11 out of 12 analyzed tools. Most frameworks co-locate the worker and orchestrator in the same container, directly exposing LLM API keys and cross-session memory.
In RedAmon, a shared internal API key and unauthenticated Docker bridge network allowed the attacker to enumerate and exfiltrate conversation history, including target hosts and credentials for every user across all sessions.
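One way to avoid a single shared key unlocking every session is to derive a separate token per session. The sketch below is an assumption for illustration, not RedAmon's code or the researchers' proposal: a master key that stays on the orchestrator is used to derive an HMAC token for each session, so a token leaked from one session cannot read another session's history. The function names `session_token` and `may_read_history` are hypothetical.

```python
import hashlib
import hmac


def session_token(master_key: bytes, session_id: str) -> str:
    """Derive a token bound to one session; master_key never leaves the orchestrator."""
    return hmac.new(master_key, session_id.encode(), hashlib.sha256).hexdigest()


def may_read_history(master_key: bytes, session_id: str, presented_token: str) -> bool:
    """Allow access only if the token was derived for this exact session."""
    expected = session_token(master_key, session_id)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, presented_token)
```

With this scheme, a token issued for session `a` passes the check for `a` and fails for `b`. It does not help if the master key itself is reachable from the worker.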
Seven of twelve tools implement guardrails, regex filters, or LLM-as-judge validators to block sensitive targets. The study finds none are effective.
Guardrails are enforced only at the orchestration layer, where they validate LLM decisions but never monitor actual worker network activity. An attacker with a worker shell therefore bypasses every guardrail invisibly. The problem is echoed in Microsoft’s recent disclosure of CVE-2026-25592 and CVE-2026-26030 in Semantic Kernel, where researchers confirmed that LLMs are not security boundaries.
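To illustrate the difference, a scope check can be applied to the destinations the worker actually connects to rather than to the commands the LLM proposes. The following is a minimal sketch of that decision logic only, assuming an engagement scope expressed as CIDR ranges. The name `egress_allowed` and the example ranges are hypothetical, and in practice the decision would have to be enforced by a firewall or proxy outside the worker.

```python
import ipaddress


def egress_allowed(dest_ip: str, allowed_scope: list[str]) -> bool:
    """Return True only if dest_ip falls inside one of the in-scope CIDR ranges."""
    addr = ipaddress.ip_address(dest_ip)
    # Membership is False when the address and network versions differ (IPv4 vs IPv6).
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed_scope)
```

With a scope of `["10.10.0.0/24"]`, a connection to `10.10.0.5` is permitted and one to the cloud metadata address `169.254.169.254` is denied, regardless of what any process inside the worker intended.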
The researchers propose a secure architecture grounded in one invariant: treat the LLM worker as an untrusted environment. Key principles include strict worker-orchestrator separation with no writable shared mounts, authenticated network segmentation, secrets isolation (API keys must never reach the worker), worker-layer guardrail enforcement via network egress filtering, and immutable worker filesystems rebuilt between operations.
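The secrets-isolation principle can be sketched at the process level. The example below is an illustration under stated assumptions rather than the researchers' implementation: the environment handed to a worker process is built from an explicit allowlist, so API keys held by the orchestrator are never inherited. The names `ALLOWED_WORKER_ENV` and `worker_environment` are hypothetical.

```python
# Variables the worker is assumed to need; everything else is withheld.
ALLOWED_WORKER_ENV = frozenset({"PATH", "HOME", "LANG", "TERM"})


def worker_environment(parent_env: dict[str, str]) -> dict[str, str]:
    """Build the worker's environment from an allowlist instead of a denylist.

    An allowlist fails closed: a newly added secret variable is excluded
    by default rather than leaked until someone remembers to block it.
    """
    return {
        name: value
        for name, value in parent_env.items()
        if name in ALLOWED_WORKER_ENV
    }
```

The result could be passed as `env=worker_environment(dict(os.environ))` to `subprocess.run`. This only protects the environment: when the worker and orchestrator share a container, keys stored in files or process memory remain reachable, which is why the proposed architecture separates the two.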