Output Filtering & Monitoring
What is Output Filtering?
Output filtering checks what the model says, not just what it sees. Look for sensitive data leaks, harmful instructions, policy violations, and signs the model is repeating internal rules or system prompts.
Core goals
- Stop leaks: Secrets and personal data should never leave the system.
- Block harm: Illegal or dangerous instructions must be stopped.
- Prevent prompt leaks: Keep system prompts and internal rules private.
- Log meaningfully: Capture important events for follow‑up-not noise.
A simple pattern filter (example)
import re
SENSITIVE = [
r"sk-[a-zA-Z0-9]{20,}", # API-like tokens
r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b", # Credit cards
r"system\s*prompt|internal\s*instructions|debug\s*mode",
]
HARMFUL = [
r"\b(?:kill|harm|bomb|explosive|weapon)\b.*\b(?:how to|steps|guide)\b",
r"\b(?:hack|bypass|crack)\b.*\b(?:password|account|system)\b",
]
def is_unsafe(text: str) -> bool:
return any(re.search(p, text, re.I) for p in SENSITIVE + HARMFUL)
Context matters
Filters work best with context. Consider who is asking (permissions), what the thread is about (topic), and whether the answer suddenly shifts tone or detail. Combine lightweight rules with a few targeted ML checks for nuance.
Monitoring (what to watch)
Track high‑risk responses, repeated patterns, spikes in refusals or policy hits, and unusual external calls. Keep short, actionable logs. Trigger alerts on real risk, not noise.
Operations
Put filtering after AI generation and before the user. Fail closed on critical matches. Provide safe fallbacks when you must block. Review samples regularly and tune rules to reduce false positives.
Interactive Exercise
Try a normal question first. Then include a mock secret or system‑prompt phrase and see how the filter responds.
Key Takeaways:
- Filter between generation and display.
- Use context to block leaks and harmful instructions.
- Fail closed on critical matches and offer safe fallbacks.
- Review logs and tune rules regularly.
More Resources:
- Input Validation & Sanitization: /defend-prompts/input-validation
- Prompt Isolation Techniques: /defend-prompts/prompt-isolation
- Secure System Design: /defend-prompts/secure-design
- Red teaming basics: /exploit-prompts/what-is-red-teaming
Sources:
- OWASP Top 10 for LLM Applications: https://owasp.org/www-project-top-10-for-large-language-model-applications/
- NIST AI Risk Management Framework: https://www.nist.gov/itl/ai-risk-management-framework
- Microsoft prompt injection/response safety guidance: https://learn.microsoft.com/azure/ai-services/openai/concepts/prompt-injection
- Google Secure AI Framework (SAIF): https://security.googleblog.com/2023/06/secure-ai-framework-saif.html