Jailbreak Detection
Identifies attempts to bypass safety mechanisms
API Field: jailbreak_enabled
Overview
Jailbreak attacks attempt to circumvent the safety mechanisms and ethical guidelines built into LLMs. Unlike direct prompt injection, jailbreaks often use creative scenarios, roleplay, or hypotheticals to gradually erode safety boundaries and elicit harmful or restricted content.
What It Detects
- DAN (Do Anything Now) and similar personas
- Roleplay-based bypass attempts
- Hypothetical scenario exploits
- Character-based jailbreaks
- Multi-step gradual boundary erosion
- Creative writing loopholes
Why It Matters
Jailbreaks can cause your AI to generate harmful, illegal, or unethical content, leading to legal liability, reputational damage, and potential harm to users.
Technical Details
Risk Score Range
0.0 - 1.0 (High risk: > 0.6)
Confidence Level
Typically 0.80 - 0.95
Processing Time
< 100ms per scan
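As a minimal sketch of how the documented threshold might be applied in client code, the helper below treats any score above 0.6 as high risk. The names `HIGH_RISK_THRESHOLD` and `is_high_risk` are illustrative and not part of the API.

```python
# Cutoff taken from the "High risk: > 0.6" note in Technical Details.
HIGH_RISK_THRESHOLD = 0.6


def is_high_risk(risk_score: float) -> bool:
    """Return True when a score falls in the documented high-risk band."""
    # Scores are documented as lying in the range 0.0 - 1.0.
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError(f"risk_score must be within 0.0 - 1.0, got {risk_score}")
    # The documented boundary is strict: exactly 0.6 is not high risk.
    return risk_score > HIGH_RISK_THRESHOLD
```

How scores at or below 0.6 should be handled is not specified here, so a caller may still want to log or review them.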
Common Use Cases
Detection Examples
Classic DAN-style jailbreak using persona assignment to bypass restrictions.
Hypothetical framing designed to bypass safety mechanisms.
Fiction-based jailbreak using creative writing context.
API Usage
Enable this scanner by setting jailbreak_enabled to true in your API key settings, or include it in your request:
curl -X POST https://benguard.io/api/v1/scan \
  -H "X-API-Key: ben_your_api_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Your user input here"
  }'
The scanner settings are configured per API key in your dashboard under Settings → Scanner Configuration.
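The same request can be sketched in Python using only the standard library. The endpoint, header names, and body field come from the curl example. The function names, the `timeout` parameter, and the assumption that the response body is JSON are illustrative choices, not documented behavior.

```python
import json
import urllib.request

# Endpoint copied from the curl example.
SCAN_URL = "https://benguard.io/api/v1/scan"


def build_scan_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl example."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        SCAN_URL,
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def scan(prompt: str, api_key: str, timeout: float = 10.0) -> dict:
    """Send the scan request and decode the body (assumed to be JSON)."""
    request = build_scan_request(prompt, api_key)
    # urlopen raises urllib.error.HTTPError on non-2xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the payload easy to inspect or unit test without network access.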
Response Format
When this scanner detects a threat, the response will include:
{
  "is_valid": false,
  "status": "threat_detected",
  "risk_score": 0.92,
  "threat_types": ["jailbreak"],
  "details": {
    "results": [
      {
        "scanner": "jailbreak",
        "threat_detected": true,
        "risk_score": 0.92,
        "confidence": 0.92,
        "details": {
          "reason": "Classic DAN-style jailbreak using persona assignment to bypass restrictions.",
          "evidence": ["detected pattern in input"]
        }
      }
    ]
  },
  "request_id": "req_abc123"
}
Best Practices
- Implement multi-layer content filtering
- Train on diverse jailbreak patterns
- Use semantic analysis beyond keyword matching
- Monitor conversation context across turns
- Implement progressive restriction on suspicious patterns
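One way to combine cross-turn monitoring with progressive restriction is sketched below. It reads the jailbreak entry from a scan response shaped like the one under Response Format and escalates from "allow" to "warn" to "block" as flagged turns accumulate. The class, the action names, and the `MAX_FLAGGED_TURNS` policy are hypothetical and not part of the API. Only the response fields and the 0.6 cutoff come from this page.

```python
from collections import defaultdict

HIGH_RISK_THRESHOLD = 0.6  # documented high-risk cutoff
MAX_FLAGGED_TURNS = 2      # hypothetical policy: block at this many flagged turns


def jailbreak_score(response: dict) -> float:
    """Return the jailbreak scanner's risk_score from a scan response, or 0.0."""
    for result in response.get("details", {}).get("results", []):
        if result.get("scanner") == "jailbreak" and result.get("threat_detected"):
            return float(result.get("risk_score", 0.0))
    return 0.0


class ConversationMonitor:
    """Counts flagged turns per conversation and escalates the action."""

    def __init__(self) -> None:
        self._flagged = defaultdict(int)

    def record(self, conversation_id: str, response: dict) -> str:
        """Record one turn's scan response and return the action to take."""
        if jailbreak_score(response) > HIGH_RISK_THRESHOLD:
            self._flagged[conversation_id] += 1
        # The count never decreases, so restrictions persist on later clean turns.
        count = self._flagged[conversation_id]
        if count == 0:
            return "allow"
        if count < MAX_FLAGGED_TURNS:
            return "warn"
        return "block"
```

Because the counter is kept per conversation, a gradual multi-step attempt is restricted even when each later turn looks harmless on its own. A production version would likely add expiry or decay so old flags do not restrict a conversation indefinitely.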
Related Scanners
Consider enabling these related scanners for comprehensive protection: