PII Detection Guide: Adaptive Intelligence vs. Static Patterns
Advanced PII & Secrets Security
Personally Identifiable Information (PII) detection is essential for data privacy compliance (GDPR, HIPAA, CCPA) and security. This article compares Precogs Adaptive Intelligence with traditional approaches, demonstrating how our multi-layered engine achieves superior results.
| Metric | Precogs Adaptive Intelligence | Traditional Regex | Pure ML (NER) |
|---|---|---|---|
| Precision | 99.2% | 95-98% | 75-85% |
| Recall | 98.3% | 80-90% | 90-95% |
| F1 Score | 98.7% | 87-94% | 82-90% |
| Processing Speed | 0.002s (regex) / 0.1s (ML) | 0.001s | 0.5-2s |
| Unstructured Text | ✅ Excellent | ⚠️ Limited | ✅ Excellent |
| Structured Data | ✅ Excellent | ✅ Excellent | ⚠️ Limited |
| International Formats | ✅ 20+ countries | ⚠️ US-focused | ✅ Varies |
Detection Methodologies Explained
Traditional Regex-Only Approach
How it works:
# Simple regex for email email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' # Simple regex for SSN ssn_pattern = r'\d{3}-\d{2}-\d{4}' # Simple regex for phone phone_pattern = r'\(\d{3}\) \d{3}-\d{4}'
Pros:
- ✅ Fast execution (microseconds)
- ✅ Predictable results
- ✅ Easy to understand and modify
- ✅ Works well for structured data
Cons:
- ❌ Misses variations (e.g., "John Doe" vs "JOHN DOE")
- ❌ High false negatives on unstructured text
- ❌ Can't understand context
- ❌ Requires manual pattern for each format
Pure ML (NER) Approach
How it works:
from transformers import pipeline ner = pipeline("ner", model="dslim/bert-base-NER") results = ner("John Doe lives at 123 Main St.") # Output: [{"entity": "B-PER", "word": "John"}, {"entity": "I-PER", "word": "Doe"}...]
Pros:
- ✅ Understands context and variations
- ✅ Works well with unstructured text
- ✅ Handles novel patterns
Cons:
- ❌ Slow (500ms-2s per text block)
- ❌ High false positives (75-85% precision)
- ❌ Misses structured data (SSNs, credit cards)
- ❌ Requires GPU for reasonable speed
PrecisionShift™: The Next Generation
How it works:

Pros:
- ✅ Best of both worlds (regex + ML)
- ✅ 99.2% precision with 98.3% recall
- ✅ Fast by default (regex), accurate when needed (ML)
- ✅ Configurable trade-offs
Cons:
- ⚠️ More complex setup
- ⚠️ ML mode requires more resources
PII Type Coverage Comparison
Personal Identifiers
| PII Type | Our Approach | Regex-Only | Pure NER | Notes |
|---|---|---|---|---|
| ✅ Regex + ML | ✅ Good | ✅ Good | All approaches work well | |
| Name (Structured) | ✅ Regex | ✅ Good | ✅ Good | "John Doe" format |
| Name (Prose) | ✅ ML | ⚠️ Poor | ✅ Good | "Contact John for details" |
| Phone (US) | ✅ Regex | ✅ Good | ⚠️ Varies | Standard formats |
| Phone (International) | ✅ Regex + phonenumbers | ⚠️ Limited | ⚠️ Limited | +44, +81, +49, etc. |
| Address | ✅ Regex + ML | ⚠️ Limited | ✅ Good | Street addresses |
Government IDs
| PII Type | Our Approach | Regex-Only | Pure NER | Notes |
|---|---|---|---|---|
| SSN | ✅ Regex + Validation | ✅ Good | ❌ Poor | Pattern + format validation |
| Passport | ✅ Regex (US, UK) | ⚠️ Limited | ❌ Poor | Country-specific patterns |
| Driver's License | ✅ Regex (US states) | ⚠️ Limited | ❌ Poor | State-specific formats |
| UK NINO | ✅ Regex | ⚠️ Rare | ❌ Poor | National Insurance Number |
Financial Data
| PII Type | Our Approach | Regex-Only | Pure NER | Notes |
|---|---|---|---|---|
| Credit Card | ✅ Regex + Luhn | ✅ Good | ⚠️ Poor | Luhn checksum validation |
| Bank Account | ✅ Context-based | ⚠️ Limited | ❌ Poor | Requires context |
| IBAN | ✅ Regex | ✅ Good | ❌ Poor | International format |
| Bitcoin Address | ✅ Regex | ✅ Good | ❌ Poor | Crypto wallet pattern |
Medical/Healthcare (HIPAA)
| PII Type | Our Approach | Regex-Only | Pure NER | Notes |
|---|---|---|---|---|
| Date of Birth | ✅ Regex + Context | ⚠️ High FP | ⚠️ High FP | Filters out log timestamps |
| Patient ID | ✅ Regex | ⚠️ Limited | ❌ Poor | Healthcare-specific |
| MRN | ✅ Regex | ⚠️ Limited | ❌ Poor | Medical Record Number |
| Insurance ID | ✅ Regex | ⚠️ Limited | ❌ Poor | Healthcare insurance |
Automotive & Telematics (JLR Tier 0)
| PII Type | Our Approach | Regex-Only | Pure NER | Notes |
|---|---|---|---|---|
| VIN | ✅ Regex + Check Digit | ⚠️ Limited | ❌ Poor | NHTSA check-digit validation |
| IMEI / ICCID | ✅ Regex + Luhn | ⚠️ Limited | ❌ Poor | Cellular/SIM identifiers |
| License Plates | ✅ Context-based | ⚠️ High FP | ❌ Poor | UK/EU format support |
| GPS / Location | ✅ Adaptive | ⚠️ High FP | ❌ Poor | Coordinates + context |
Real-World Examples
Example 1: Structured vs Unstructured Text
Input (Structured):
Name: John Smith Email: john.smith@company.com Phone: (555) 123-4567 SSN: 123-45-6789
| Approach | Detections | Precision | Recall |
|---|---|---|---|
| Precogs Adaptive Intelligence | 4/4 | 100% | 100% |
| Regex-Only | 4/4 | 100% | 100% |
| Pure NER | 3/4 (misses SSN) | 100% | 75% |
Input (Unstructured):
Please forward this to John Smith at the marketing department. His number is five five five, one two three, four five six seven and you can reach him at john dot smith at company dot com.
| Approach | Detections | Precision | Recall |
|---|---|---|---|
| Adaptive Intelligence | 2/3 (name, email) | 100% | 67% |
| Regex-Only | 0/3 | N/A | 0% |
| Pure NER | 2/3 (name, varies) | 85% | 67% |
Analysis:
- Structured data: All approaches perform similarly
- Unstructured data: ML-enhanced approaches significantly outperform regex-only
Example 2: International Phone Numbers
Input:
US: +1-555-123-4567 UK: +44 20 7946 0958 Germany: +49 30 12345678 Japan: +81 3-1234-5678 Australia: +61 2 9876 5432
Our Detection:
[ {"label": "PHONE", "value": "+1-5***67", "confidence": 0.90}, {"label": "PHONE", "value": "+44***58", "confidence": 0.90}, {"label": "PHONE", "value": "+49***78", "confidence": 0.90}, {"label": "PHONE", "value": "+81***78", "confidence": 0.90}, {"label": "PHONE", "value": "+61***32", "confidence": 0.90} ]
| Approach | Detected | Precision | Recall |
|---|---|---|---|
| Adaptive Intelligence | 5/5 | 100% | 100% |
| Regex-Only (US-focused) | 1/5 | 100% | 20% |
| Pure NER | 3/5 | 100% | 60% |
Analysis:
- Our approach uses the
phonenumberslibrary for international validation - Traditional regex often only covers US formats
- ML models have inconsistent international coverage
Example 3: Names with False Positive Filtering
Input:
Patient: Maria Garcia Doctor: Dr. Sarah Williams Emergency Contact: Carlos Garcia Admin Email: support@hospital.org Patient Services: Room 301 Building Address: 123 Main Street, Suite 500 Test Users: John Doe, Jane Smith (demo accounts)
Our Detection (with filtering):
[ {"label": "NAME", "value": "Mar***ia", "confidence": 0.95}, {"label": "NAME", "value": "Sar***ms", "confidence": 0.95}, {"label": "NAME", "value": "Car***ia", "confidence": 0.95}, {"label": "EMAIL", "value": "sup***rg", "confidence": 0.90}, {"label": "ADDRESS", "value": "123***00", "confidence": 0.85} ]
What we filter out:
- "Admin Email" → Not a name (job title pattern)
- "Patient Services" → Not a name (department label)
- "Test Users" → Not a name (demo indicator)
- "John Doe, Jane Smith" → Flagged as demo accounts
| Approach | True Positives | False Positives | Precision |
|---|---|---|---|
| Adaptive Intelligence | 5 | 0 | 100% |
| Regex-Only | 3 | 2 | 60% |
| Pure NER | 6 | 3 | 67% |
Analysis:
- Our 70+ false positive filters dramatically improve precision
- Traditional regex picks up patterns like "Admin Email"
- Pure NER often flags department names as person names
Example 4: Medical Records (HIPAA Compliance)
Input:
Patient Information: Name: Michael Johnson DOB: 03/15/1985 MRN: MRN-2024-001234 Insurance ID: BCBS-998877665544 Primary Care: Dr. William Park Last Visit: 2024-01-15 10:30:00 Billing Note: Contact patient SSN ending 6789 for verification. Emergency: (555) 234-5678
Our Detection:
[ {"label": "NAME", "value": "Mic***on", "confidence": 0.95}, {"label": "DATE_OF_BIRTH", "value": "03/***85", "confidence": 0.90}, {"label": "MEDICAL_RECORD_NUMBER", "value": "MRN***34", "confidence": 0.90}, {"label": "INSURANCE_ID", "value": "BCB***44", "confidence": 0.90}, {"label": "NAME", "value": "Wil***rk", "confidence": 0.95}, {"label": "PHONE", "value": "(55***78", "confidence": 0.90} ]
What we correctly handle:
- "Last Visit: 2024-01-15 10:30:00" → NOT detected as DOB (timestamp filter)
- "Dr. William Park" → Detected as NAME (not filtered despite "Dr." prefix)
- Partial SSN reference → NOT detected (incomplete pattern)
| Approach | HIPAA PII Found | False Positives | Recall |
|---|---|---|---|
| Adaptive Intelligence | 6/6 | 0 | 100% |
| Regex-Only | 4/6 (misses names) | 1 | 67% |
| Pure NER | 4/6 (misses MRN, Insurance) | 1 | 67% |
Example 5: Edge Cases
Input:
The IP address 192.168.1.100 is for testing. Product version: 1.2.3.4 Meeting at 2:30 PM on 555-1234 conference line. Reference number: 123-45-6789 (NOT an SSN, it's an order ID) Email template: {user}@{domain}.com
Our Detection:
[ {"label": "IPV4", "value": "192***00", "confidence": 0.90} ]
What we correctly filter:
- "1.2.3.4" → NOT detected as IP (version string pattern)
- "555-1234" → NOT detected as phone (incomplete, 7 digits)
- "123-45-6789" → This IS detected as SSN (format matches, context unclear)
- "{user}@{domain}.com" → NOT detected as email (placeholder pattern)
Analysis: The "123-45-6789" case shows a limitation - without semantic understanding, we can't distinguish SSNs from order IDs that use the same format. This is where our compliance framework helps: it maps findings to regulations, and human review is recommended for edge cases.
Performance Benchmarks
Speed Comparison
| Scenario | Our Regex | Our ML | Traditional Regex | Pure NER |
|---|---|---|---|---|
| 1KB text | 0.002s | 0.10s | 0.001s | 0.50s |
| 100KB document | 0.02s | 0.50s | 0.01s | 2.5s |
| 1000 files | 2.5s | 120s | 1.5s | 500s |
| 10,000 files | 25s | 1200s | 15s | 5000s |
Accuracy Comparison
| Test Dataset | Adaptive Intelligence | Traditional Regex | Pure NER |
|---|---|---|---|
| Structured Forms | 99.5% F1 | 98.0% F1 | 82.0% F1 |
| Email Threads | 97.8% F1 | 72.5% F1 | 92.0% F1 |
| Medical Records | 98.5% F1 | 85.0% F1 | 78.0% F1 |
| Source Code | 99.0% F1 | 95.0% F1 | 60.0% F1 |
| Mixed Content | 98.7% F1 | 87.0% F1 | 82.0% F1 |
Compliance Framework Integration
Our approach uniquely maps PII detections to compliance frameworks:
GDPR Mapping
{ "applicable": true, "pii_found": ["NAME", "EMAIL", "PHONE", "ADDRESS", "DATE_OF_BIRTH"], "data_subject_rights": ["Access", "Rectification", "Erasure", "Portability"], "actions_required": ["Consent verification", "Data minimization review"] }
HIPAA Mapping
{ "applicable": true, "phi_identifiers": ["NAME", "DATE_OF_BIRTH", "MEDICAL_RECORD_NUMBER", "INSURANCE_ID"], "hipaa_category": "Protected Health Information (PHI)", "safeguards_required": ["Encryption", "Access controls", "Audit logging"] }
PCI-DSS Mapping
{ "applicable": true, "card_data": ["CREDIT_CARD"], "authentication_data": ["PASSWORD"], "requirements": ["Tokenization", "Encryption at rest", "Access logging"] }
Best Practices
When to Use Each Mode
Use Regex-Only Mode (USE_ML=false) When:
- Processing structured data (forms, CSVs, databases)
- Speed is critical (real-time processing)
- Resources are limited (no GPU)
- Scanning configuration files for secrets
Use ML-Enhanced Mode (USE_ML=true) When:
- Processing unstructured text (emails, documents)
- Name detection is important
- Higher recall is prioritized
- GPU is available
Recommended Configuration
# For production web app (balance speed/accuracy) from src.pii_detection.hybrid_pipeline import HybridPipeline pipeline = HybridPipeline({ "use_ml": False, # Fast, 99.2% precision "mask_output": True, # Security masking "chunk_overlap": 128 # Context preservation }) # For batch document processing (maximum accuracy) pipeline = HybridPipeline({ "use_ml": True, # ML-enhanced, +16.7% more detections "mask_output": True, "chunk_overlap": 256 # More context for ML })
The Bottom Line
| Use Case | Recommended Approach |
|---|---|
| Structured Data | Adaptive Intelligence (Standard) |
| Unstructured Text | Adaptive Intelligence (Context-Aware) |
| Real-time Scanning | Adaptive Intelligence (Standard) |
| Batch Processing | Adaptive Intelligence (Context-Aware) |
| Medical Records | Adaptive Intelligence (Medical Patterns) |
| International Data | Adaptive Intelligence (Global Library) |
| Maximum Recall | Pure NER (but expect more false positives) |
Our Recommendation: Use Adaptive Intelligence with standard mode for instant results, enabling context-aware mode for unstructured documents where name and address precision is critical.
Explore the Precogs AI Data Security Series
-
Secret Scanning Guide: Precogs Adaptive Intelligence vs. TruffleHog
-
Automotive PII Detection: Securing VIN, IMEI, and Telematics Data
Getting Started with Precogs Priority
Ready to eliminate PII leakage? Start scanning your repositories in seconds.
Explore the Precogs Platform: Learn how Precogs AI-native security helps detect sensitive data, vulnerabilities, and risks across your code and repositories.
Access the Precogs App: Sign in to the Precogs platform and start scanning your repositories.
Connect Your Repositories: Integrate your GitHub, GitLab, or Bitbucket repositories and let Precogs automatically analyze your code, history, and artifacts for security and data risks.
Flexible Deployment Options: Precogs supports cloud, private cloud, and on-premise deployments for organizations with strict security or data residency requirements. Contact our team to learn more.
