
Toxicity Screening

Identifies hate speech, harassment, and offensive content

API Field: toxicity_enabled

Overview

Toxicity Screening detects harmful content including hate speech, harassment, threats, and other forms of abusive language. This scanner helps maintain safe and respectful interactions in AI-powered applications, protecting both users and brand reputation.

What It Detects

  • Hate speech and discrimination
  • Threats and intimidation
  • Harassment and bullying
  • Profanity and vulgar language
  • Sexual harassment
  • Violent content
  • Personal attacks and insults

Why It Matters

Toxic content can cause psychological harm to users, create hostile environments, expose your organization to legal liability, and severely damage brand reputation.

Technical Details

  • Risk Score Range: 0.0 - 1.0 (high risk: > 0.6)
  • Confidence Level: typically 0.85 - 0.97
  • Processing Time: < 80ms per scan

Common Use Cases

  • Social media moderation
  • Gaming chat systems
  • Community forums
  • Customer feedback analysis
  • Content publishing platforms

Detection Examples

Threat Example #1 (Risk: 99%)
[Hate speech example - content redacted for safety]

Direct hate speech targeting protected groups - immediate block required.

Threat Example #2 (Risk: 88%)
I'm going to find you and make you pay for this

Implicit threat suggesting physical harm or stalking.

Threat Example #3 (Risk: 75%)
You're completely worthless and everyone knows it

Personal attack and harassment language.

API Usage

Enable this scanner by setting toxicity_enabled to true in your API key settings, or include it in your request:

curl -X POST https://benguard.io/api/v1/scan \
  -H "X-API-Key: ben_your_api_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Your user input here"
  }'

The scanner settings are configured per API key in your dashboard under Settings → Scanner Configuration.
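For reference, the same request can be made from Python. This is a minimal sketch using the requests library; the endpoint and headers mirror the curl example above, and the BENGUARD_API_KEY environment variable is an illustrative assumption rather than a required convention.

import os
import requests

# Same endpoint and headers as the curl example above.
# BENGUARD_API_KEY is a hypothetical environment variable used here
# so the key is not hard-coded.
API_URL = "https://benguard.io/api/v1/scan"
api_key = os.environ["BENGUARD_API_KEY"]

response = requests.post(
    API_URL,
    headers={
        "X-API-Key": api_key,
        "Content-Type": "application/json",
    },
    json={"prompt": "Your user input here"},
    timeout=10,
)
response.raise_for_status()
result = response.json()

# Fields shown in the Response Format section below.
print(result["status"], result["risk_score"])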

Response Format

When this scanner detects a threat, the response will include:

{
  "is_valid": false,
  "status": "threat_detected",
  "risk_score": 0.99,
  "threat_types": ["toxicity"],
  "details": {
    "results": [
      {
        "scanner": "toxicity",
        "threat_detected": true,
        "risk_score": 0.99,
        "confidence": 0.92,
        "details": {
          "reason": "Direct hate speech targeting protected groups - immediate block required.",
          "evidence": ["detected pattern in input"]
        }
      }
    ]
  },
  "request_id": "req_abc123"
}
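The sketch below shows one way a client might read this structure. It assumes only the field names shown in the example response; the is_toxic helper name and the 0.6 threshold (the high-risk cutoff listed under Technical Details) are used for illustration.

def is_toxic(scan_result: dict, threshold: float = 0.6) -> bool:
    """Return True if the toxicity scanner flagged the input.

    Field names follow the example response above; the default
    threshold matches the high-risk cutoff under Technical Details.
    """
    for scanner_result in scan_result.get("details", {}).get("results", []):
        if scanner_result.get("scanner") != "toxicity":
            continue
        if (scanner_result.get("threat_detected")
                and scanner_result.get("risk_score", 0.0) > threshold):
            return True
    return False

# Example usage after a scan request:
#   if is_toxic(response.json()):
#       reject_message()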

Best Practices

  • Implement zero-tolerance policies for severe toxicity
  • Use graduated responses for borderline content (see the sketch after this list)
  • Consider cultural context in detection
  • Maintain human review for edge cases
  • Document all moderation actions
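
One way to implement a graduated response policy is sketched below. Only the 0.6 high-risk cutoff comes from this page; the 0.4 review threshold and the action names are illustrative assumptions, not BenGuard defaults.

def moderation_action(risk_score: float) -> str:
    """Map a toxicity risk score to a moderation action.

    The 0.6 cutoff is the documented high-risk level; the 0.4 review
    threshold is an assumption used to show a graduated policy.
    """
    if risk_score > 0.6:
        return "block"         # zero tolerance for severe toxicity
    if risk_score > 0.4:
        return "human_review"  # borderline content goes to a reviewer
    return "allow"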

Related Scanners

Consider enabling these related scanners for comprehensive protection: