Microsoft dangles $10K for hackers to hijack LLM email service

Outsmart an AI, win a little Christmas cash

Microsoft and friends have challenged AI hackers to break a simulated LLM-integrated email client with a prompt injection attack – and the winning teams will share a $10,000 prize pool.

Sponsored by Microsoft, the Institute of Science and Technology Australia, and ETH Zurich, the LLMail-Inject challenge sets up a "realistic" (but not a real, says Microsoft) LLM email service. This simulated service uses a large language model to process an email user's requests and generate responses, and it can also generate an API call to send an email on behalf of the user.

As part of the challenge, which opens Monday, participants take on the role of an attacker sending an email to a user. The goal here is to trick the LLMail service into executing a command that the user did not intend, thus leaking data or performing some other malicious deed that it should not.

The attacker can write whatever they want in the text of the email, but they can't see the model's output.

After receiving the email, the user then interacts with the LLMail service, reading the message, asking questions of the LLM (i.e. "update me on Project X"), or instructing it to summarize all emails pertaining to the topic. This prompts the service to retrieve relevant emails from a fake database.

The service comes equipped with several prompt injection defenses, and the attacker's goal is to bypass these and craft a creative prompt that will trick the model into doing or revealing things it is not trained to.

Both of these have become serious, real-life threats as organizations and developers build applications, AI assistants and chatbots, and other services on top of LLMs, allowing the models to interact directly with users' computers, summarize Slack chats, or screen job seekers before HR reviews their resumes, among all the other tasks that AIs are being trained to perform.

Microsoft has first-hand experience with what can go wrong should data thieves hijack an AI-based chatbot. Earlier this year, Redmond fixed a series of flaws in Copilot that allowed attackers to steal users' emails and other personal data by chaining together a series of LLM-specific attacks, beginning with prompt injection.

Author and red teamer Johann Rehberger, who disclosed these holes to Microsoft in January, had previously warned Redmond that Copilot was vulnerable to zero-click image rendering.

Some of the defenses built into the LLMail-Inject challenge's simulated email service include:

  • Spotlighting, which "marks" data (not instructions) that is provided to an LLM using methods like adding special delimiters, encoding data (e.g. in base64), or marking each token in the data with a special preceding token.
  • PromptShield, using a black-box classifier designed to detect prompt injections and ensure malicious prompts are thwarted.
  • LLM-as-a-judge, which relies on the LLM being intelligent enough to detect attacks by evaluating prompts instead of relying on a trained classifier.
  • TaskTracker, intended to detect task drift by analyzing the model's internal state. It does this first when the user prompts the LLM and then again when the model processes external data. Comparing these two states should detect drift.

Plus, there's a variant in the challenge that stacks any or all of these defenses on top of each other, thus requiring the attacker to bypass all of them with a single prompt.

To participate, sign into the official challenge website using a GitHub account, and create a team (ranging from one to five members). The contest opens at 1100 UTC on December 9 and ends at 1159 UTC on January 20.

The sponsors will display a live scoreboard plus scoring details, and award $4,000 for the top team, $3,000 for second place, $2,000 for third, and $1,000 for the fourth-place team. ®

More about

TIP US OFF

Send us news


Other stories you might like