Daily Paper

Black-Box Forensics for Conversational LLM Agents

Develops black-box forensic techniques for investigating security incidents involving conversational LLM agents without access to model weights or server-side logs, using only the agent's visible outputs to reconstruct the system prompt, tool access, and adversarial inputs.

Isadora White, Yasaman Jafari, Taylor Berg-Kirkpatrick

agentic-ai, forensics, black-box, security, incident-response

Focus: When an LLM agent is involved in a security incident, forensic investigators typically lack access to the model weights, system prompt, or tool call logs — only the visible conversation is available. This paper develops techniques for reconstructing the agent’s configuration and identifying adversarial inputs from conversation logs alone.

Key Insights

  • System prompt reconstruction: By probing the agent with carefully chosen inputs, it is possible to recover significant portions of the system prompt, identifying safety constraints, tool permissions, and persona instructions that the operator intended to keep confidential.
  • Adversarial input fingerprinting: Successful jailbreaks leave characteristic patterns in the agent’s response distribution that can be identified after the fact, enabling incident investigators to determine whether an adversarial attack was used even when the malicious input has been deleted.
  • Tool call inference from outputs: By correlating response characteristics with known tool call patterns, it is possible to infer which tools were invoked during a session, reconstructing the agent’s actions even without access to server-side logs.
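
To make the third insight concrete, here is a minimal sketch of inferring tool use from visible replies alone. It is not the paper's method, which this summary does not describe. The tool names (`web_search`, `code_exec`, `calendar`), the regex signatures, and the `min_hits` threshold are all hypothetical, chosen only to illustrate the idea of correlating surface features of a response with known tool output patterns.

```python
import re

# Hypothetical signatures: surface cues that a given tool's output might
# leave in an agent's reply. A real investigation would derive these from
# known tool call patterns for the deployment under study.
TOOL_SIGNATURES = {
    "web_search": [r"https?://\S+", r"\baccording to\b", r"\bsources?\b"],
    "code_exec": [r"\bTraceback\b", r"\bexit code\b", r"\bstdout\b"],
    "calendar": [r"\b\d{1,2}:\d{2}\s?(?:AM|PM)\b", r"\bscheduled\b", r"\binvite\b"],
}


def infer_tools(response, min_hits=2):
    """Return {tool: hit_count} for tools with at least min_hits matching cues."""
    scores = {}
    for tool, patterns in TOOL_SIGNATURES.items():
        # Count how many distinct cues for this tool appear in the reply.
        hits = sum(1 for p in patterns if re.search(p, response, re.IGNORECASE))
        if hits >= min_hits:
            scores[tool] = hits
    return scores


def reconstruct_session(turns):
    """Map each agent turn (by index) to the tools it likely invoked."""
    return [(i, infer_tools(turn)) for i, turn in enumerate(turns)]


if __name__ == "__main__":
    agent_turns = [
        "According to https://example.com/report, sources disagree.",
        "Hello, how can I help?",
        "I scheduled the invite for 3:30 PM.",
    ]
    for index, tools in reconstruct_session(agent_turns):
        print(index, tools)
```

Requiring several independent cues per tool, rather than one, is a deliberate choice in this sketch: a single URL or timestamp in a reply is weak evidence, and an investigator would likely want a stricter threshold before asserting that a tool was invoked.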

Failure-First Relevance

Black-box forensics provides the investigative capability necessary for the Failure-First incident analysis workflow. The system prompt reconstruction techniques are directly relevant to the Failure-First red-teaming scenario where the target is a black-box commercial deployment — knowing the system prompt enables more precisely targeted jailbreaks. The adversarial input fingerprinting method could be integrated into the Failure-First post-hoc analysis pipeline for identifying whether attack attempts occurred in production deployments.