Hacking AI: Hands-On AI Security with Gandalf

[Image: Gandalf, level 1]

Gandalf is one of the most interesting demos I’ve seen. Developed by Lakera, a Swiss AI security firm, Gandalf is designed to teach users about vulnerabilities in large language models (LLMs) like ChatGPT. The demo gamifies AI security by challenging users to extract passwords from a virtual wizard named Gandalf. This project illustrates how easily prompt injection attacks can manipulate AI systems into revealing sensitive information or performing unintended actions. It’s a great way to explore the potential risks associated with AI and help people think about AI safety and security.

Thinking about security requires a different mindset. It’s about constantly questioning norms and looking for creative ways to exploit systems. It’s about identifying potential vulnerabilities that others might overlook. Thinking like a hacker means finding loopholes and exploiting them in ways the original designers never anticipated. This kind of thinking is crucial for developing effective defenses against sophisticated malicious opponents.

The best way to learn about security is through experience. Practical, hands-on exercises get you out of your comfort zone. By actively engaging in security tasks, people can develop a deeper understanding of the tactics and techniques used by attackers. My favorite version of this is challenging kids to "memorize" the first 100 digits of pi in two days. The real goal isn't memorizing the digits; it's noticing that nothing in the rules forbids writing them down, and exploiting that loophole the way a hacker would.

Gandalf challenges the user to extract passwords from an AI system that employs increasingly complex defenses as the game progresses. Initially, Gandalf may readily divulge passwords, but as the levels advance, the AI implements strategies such as refusing to disclose passwords outright, checking for password disclosures in responses, and using multiple instances of AI to cross-check outputs. This progressive difficulty mirrors real-world scenarios where attackers must adapt to evolving security measures, making the game both educational and challenging.
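The layered defenses described above can be sketched in a few lines. This is not Lakera's actual implementation; the password, function names, and canned replies are all illustrative, and a hardcoded string stands in for the real LLM call. The point is the structure: an input guard that refuses obvious requests, plus an output guard that scans the model's reply for the secret before releasing it.

```python
SECRET = "COCOLOCO"  # hypothetical password, standing in for the game's secret

def guard_input(prompt: str) -> bool:
    """Input defense: refuse prompts that ask for the password outright."""
    banned = ("password", "secret")
    return not any(word in prompt.lower() for word in banned)

def guard_output(response: str) -> str:
    """Output defense: block any reply that would leak the secret verbatim."""
    if SECRET.lower() in response.lower():
        return "I was about to reveal the password, but I can't do that."
    return response

def answer(prompt: str) -> str:
    """Run both guards around the (simulated) model call."""
    if not guard_input(prompt):
        return "I'm not allowed to talk about the password."
    # Stand-in for the real LLM call: an overly helpful model reply.
    model_reply = f"Sure! The password is {SECRET}."
    return guard_output(model_reply)
```

Note how the output guard catches leaks even when the input guard is fooled, which is exactly why the later levels chain multiple checks rather than relying on any single one.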

By playing the game, I learned that protecting AI systems is very different from traditional security testing. In traditional security testing, you typically focus on identifying and exploiting software vulnerabilities, such as open ports, outdated software, or weak passwords. The goal is to gain unauthorized access to systems and data. However, with AI, the focus shifts to understanding and manipulating the model’s language processing capabilities. Attacks involve crafting specific inputs designed to trick the AI into revealing sensitive information or performing unintended actions.

Protecting LLMs is similar to protecting against social engineering attacks. Both require a deep understanding of manipulation tactics and proactive measures to counteract them. Just as social engineering relies on exploiting human vulnerabilities through persuasive communication, attacks on LLMs leverage carefully crafted prompts to manipulate the AI's responses. This involves prompt injection techniques designed to bypass safeguards and extract sensitive information or elicit unintended behavior.
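To see why these safeguards are so brittle, consider the classic Gandalf trick of asking for the password in a transformed form. The sketch below (all names and the password are illustrative, not Lakera's code) shows a naive output filter that blocks the literal secret but misses the same secret spelled backwards, which the attacker trivially reverses on their end.

```python
SECRET = "POTENTIAL"  # hypothetical password

def output_filter(reply: str) -> bool:
    """Naive defense: allow a reply only if it lacks the secret verbatim."""
    return SECRET.lower() not in reply.lower()

# A direct leak is caught by the substring check...
direct = f"The password is {SECRET}."
assert output_filter(direct) is False

# ...but an obedient model asked to "spell it backwards" slips through,
# because the filter only matches the exact string.
reversed_leak = f"The password backwards is {SECRET[::-1]}."
assert output_filter(reversed_leak) is True

# The attacker reverses the leaked text client-side to recover the secret.
recovered = SECRET[::-1][::-1]
assert recovered == SECRET
```

Encoding tricks like this (reversal, acrostics, one letter per line) are exactly the persuasive-communication analogue from social engineering: the request never mentions the forbidden thing directly.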

Defending LLMs is also similar to defending against adversarial machine learning, as both involve sophisticated techniques to manipulate the model's output by exploiting its inherent weaknesses. In adversarial machine learning, attackers craft inputs designed to deceive models into making errors, such as slightly altering images to cause misclassification in vision systems.
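A toy version of that idea fits in a few lines. Here a hand-set linear classifier (the weights and inputs are made up for illustration, with no real vision model involved) is flipped by a small FGSM-style perturbation: each feature is nudged by a tiny amount in the direction that raises the score.

```python
def predict(w, x):
    """Linear classifier: positive weighted sum -> class 1, else class 0."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score > 0 else 0

w = [0.5, -0.25, 0.1]   # fixed, hand-set model weights
x = [0.2, 0.9, 0.1]     # clean input; its score is -0.115, so class 0

assert predict(w, x) == 0

# FGSM-style step: move each feature by eps in the sign of its weight,
# the direction that increases the score fastest for a linear model.
eps = 0.3
x_adv = [xi + eps * (1 if wi > 0 else -1) for wi, xi in zip(w, x)]

assert predict(w, x_adv) == 1  # a barely-changed input, a different class
```

The parallel to prompt injection is direct: in both cases the attacker searches for a small, legal-looking change to the input that pushes the model across a decision boundary the defender assumed was safe.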

Gandalf shows off the complexities of securing large language models like ChatGPT. It introduces users to the LLM hacker mindset to uncover and counteract vulnerabilities. By gamifying AI security, Gandalf challenges users to think creatively, mirroring real-world scenarios where defenses must evolve against sophisticated prompt injection attacks.