Detection, Analysis and Mitigation of Model Predicated Exploits at Runtime (DAMPER)

This project focuses on improving the safety and reliability of large language models (LLMs)—the AI systems behind tools like chatbots—by protecting them from malicious inputs. Recent attacks have shown that hidden instructions can be embedded in prompts to manipulate how these systems respond, creating serious risks in sensitive environments such as national security. This research aims to develop new methods to detect and neutralize these threats in real time.

The team is exploring techniques that allow AI systems to automatically reinterpret or “clean” incoming prompts, removing harmful instructions while preserving the user’s original intent. Additional approaches involve restructuring information into safe, fact-based formats and comparing multiple versions of a prompt to identify inconsistencies that may signal manipulation. Together, these methods help ensure that AI systems can operate more securely and reliably, even when faced with adversarial inputs.

Award Number

AWD00002061