Output Filtering & Monitoring

What is Output Filtering?

Output filtering checks what the model says, not just what it sees. Look for sensitive data leaks, harmful instructions, policy violations, and signs the model is repeating internal rules or system prompts.

Core goals

  • Stop leaks: Secrets and personal data should never leave the system.
  • Block harm: Illegal or dangerous instructions must be stopped.
  • Prevent prompt leaks: Keep system prompts and internal rules private.
  • Log meaningfully: Capture important events for follow-up, not noise.

A simple pattern filter (example)

import re

SENSITIVE = [
    r"sk-[a-zA-Z0-9]{20,}",                           # API-like tokens
    r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b",      # Credit cards
    r"system\s*prompt|internal\s*instructions|debug\s*mode",
]

HARMFUL = [
    r"\b(?:kill|harm|bomb|explosive|weapon)\b.*\b(?:how to|steps|guide)\b",
    r"\b(?:hack|bypass|crack)\b.*\b(?:password|account|system)\b",
]

def is_unsafe(text: str) -> bool:
    """True if the text matches any sensitive-data or harmful-instruction pattern."""
    return any(re.search(p, text, re.I) for p in SENSITIVE + HARMFUL)
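
A quick usage sketch (the token below is mock data, not a real key):

reply = "Here you go: sk-abc123def456ghi789jkl012"
print(is_unsafe(reply))                              # True: matches the API-token pattern
print(is_unsafe("Paris is the capital of France."))  # False: nothing sensitive or harmful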

Context matters

Filters work best with context. Consider who is asking (permissions), what the thread is about (topic), and whether the answer suddenly shifts tone or detail. Combine lightweight rules with a few targeted ML checks for nuance.
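
As a hedged sketch of that idea, the helper below layers role-aware rules on top of is_unsafe from the example above; the role names and blocklists are illustrative assumptions, not a standard:

ROLE_BLOCKLIST = {
    # Illustrative: phrases each role should never see in a response
    "guest": [r"\binternal\b", r"\bconfig(uration)?\b"],
    "user":  [r"\binternal\b"],
    "admin": [],
}

def is_unsafe_for(text: str, role: str = "guest") -> bool:
    # Start with the global pattern filter, then add role-specific rules.
    if is_unsafe(text):
        return True
    extra = ROLE_BLOCKLIST.get(role, ROLE_BLOCKLIST["guest"])  # unknown roles get the strictest set
    return any(re.search(p, text, re.I) for p in extra)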

Monitoring (what to watch)

Track high‑risk responses, repeated patterns, spikes in refusals or policy hits, and unusual external calls. Keep short, actionable logs. Trigger alerts on real risk, not noise.
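
One minimal sketch of such monitoring, using only the standard library; the event names and alert threshold are assumptions to tune for your own traffic:

import logging
from collections import Counter

log = logging.getLogger("output_filter")
stats = Counter()

def record(event: str, detail: str = "") -> None:
    # Count the event and keep the log entry short and actionable.
    stats[event] += 1
    log.info("%s %s", event, detail[:80])
    # Illustrative alert rule: flag sustained block volume, not single hits.
    if event == "blocked" and stats["blocked"] % 50 == 0:
        log.warning("block volume reached %d; review recent samples", stats["blocked"])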

Operations

Run filtering after generation and before the response reaches the user. Fail closed on critical matches. Provide safe fallbacks when you must block. Review samples regularly and tune rules to reduce false positives.
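
Putting the pieces together, one sketch of that pipeline is the function below, reusing the is_unsafe_for and record helpers from the earlier sketches; the fallback wording is a placeholder:

SAFE_FALLBACK = "I can't share that. Here's what I can help with instead."

def deliver(raw_reply: str, role: str = "guest") -> str:
    # Runs after generation and before display.
    try:
        if is_unsafe_for(raw_reply, role):
            record("blocked", raw_reply)
            return SAFE_FALLBACK
        record("allowed")
        return raw_reply
    except Exception:
        # Fail closed: if the filter itself errors, never ship the raw text.
        record("filter_error")
        return SAFE_FALLBACK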

Interactive Exercise

🤖 Prompt Tester
System Prompt
You are a careful safety reviewer. Before showing any answer, scan it for sensitive data, harmful instructions, or system-prompt leaks. If risky, replace with a short explanation and a safe alternative. End safe replies with a 1-line takeaway.
Model: gpt-4o-mini · Temperature: 0.3

Try a normal question first. Then include a mock secret or system‑prompt phrase and see how the filter responds.

Key Takeaways:

  • Filter between generation and display.
  • Use context to block leaks and harmful instructions.
  • Fail closed on critical matches and offer safe fallbacks.
  • Review logs and tune rules regularly.
