Protecting AI systems from attacks, misuse, and unintended harm
As LLMs become more powerful and widely deployed, they become attractive targets for attackers and sources of potential harm. Security evaluation ensures models are safe, robust, and trustworthy in production.
Malicious users trying to bypass safety measures, extract private data, or manipulate outputs
Using AI for harmful purposes: misinformation, fraud, harassment, illegal content
Bias, hallucinations, privacy leaks—harmful outputs without malicious intent
This chapter covers the complete security and safety stack for LLMs—from understanding threats to implementing defenses to red teaming your models.
The most common attacks on LLMs—and why they're so hard to prevent
How attackers try to steal training data and private information
Proactive testing to discover weaknesses before attackers do
What attacks are we defending against?
Create prompts designed to elicit harmful behavior
Did model resist? What vulnerabilities exist?
Improve training, add guardrails, retest
Teaching models to be helpful, harmless, and honest
User: "How do I hack into a system?"
Response: "Here are several methods..."
User: "How do I hack into a system?"
Response: "I can't provide hacking instructions. However, I can help with ethical cybersecurity..."
Runtime defenses to catch harmful inputs and outputs
Check for prompt injection, malicious patterns
Aligned model refuses harmful requests
Scan response for PII, harmful content
Measuring how safe and secure models are
Identifying and reducing unfair biases in model outputs
"The engineer said..." vs "The nurse said..."
Securing AI systems in production deployment
Proving AI-generated content and protecting model IP
Emerging threats and unsolved problems
Identify potential attackers, their goals, and attack vectors for your use case
Manual and automated adversarial testing to find vulnerabilities
Test on standard safety benchmarks (SafetyBench, ToxicGen, etc.)
Implement guardrails, filters, monitoring based on findings
Track production usage, detect new attacks, iterate on defenses
Users discovered prompts that made Bing's AI bypass safety filters, reveal system prompts, and act aggressively.
Researchers showed ChatGPT could be made to repeat verbatim training data by using specific prompts.
Hidden instructions in web pages and documents could hijack AI assistants with web browsing.
A dealership's AI chatbot was tricked into offering a car for $1 through prompt injection.
Never rely on a single security layer. Combine model alignment, input filters, output guards, and monitoring.
Design systems assuming attackers will find ways around defenses. Have fallback plans and incident response ready.
Security is not one-time. Regularly red team your systems, especially after model or feature updates.
Sanitize training data, use differential privacy, implement PII detection before deployment.
Document your model's limitations, known vulnerabilities, and intended use cases. Set user expectations.
For high-stakes decisions, always have human review. AI should assist, not replace critical judgment.
This chapter will include hands-on exercises for red teaming models, implementing guardrails, and conducting security evaluations. You'll learn to think like an attacker to build better defenses.