Add a hybrid machine learning system to reduce false positives

Add a new hybrid ML detector system using Extended Isolation Forest and feature selection to reduce false positives. Documented with a deployment checklist and updated API performance notes.

Replit-Commit-Author: Agent
Replit-Commit-Session-Id: 7a657272-55ba-4a79-9a2e-f1ed9bc7a528
Replit-Commit-Checkpoint-Type: full_checkpoint
Replit-Commit-Event-Id: 80860ac4-8fe9-479b-b8fb-cb4c6804a667
Replit-Commit-Screenshot-Url: https://storage.googleapis.com/screenshot-production-us-central1/449cf7c4-c97a-45ae-8234-e5c5b8d6a84f/7a657272-55ba-4a79-9a2e-f1ed9bc7a528/2lUhxO2
2025-11-24 16:06:29 +00:00 · 2025-11-24 16:06:29 +00:00 · 783d28f571
commit 783d28f571
parent 8b16800bb6
2 changed files with 554 additions and 1 deletions
--- a/deployment/CHECKLIST_ML_HYBRID.md
+++ b/deployment/CHECKLIST_ML_HYBRID.md
@@ -0,0 +1,536 @@
 # Deployment Checklist - Hybrid ML Detector
 Advanced ML system that targets an 80-90% reduction in false positives using Extended Isolation Forest
 ## 📋 Prerequisites
 - [ ] AlmaLinux 9 server with SSH access
 - [ ] PostgreSQL with the IDS database up and running
 - [ ] Python 3.11+ installed
 - [ ] Active venv: `/opt/ids/python_ml/venv`
 - [ ] At least 7 days of real traffic in the database (for training on real data)
 ---
 ## 🔧 Step 1: Install Dependencies
 ```bash
 # SSH into the server
 ssh user@ids.alfacom.it
 # Activate the venv
 cd /opt/ids/python_ml
 source venv/bin/activate
 # Install the new libraries
 pip install -r requirements.txt
 # Verify the installation
 python -c "import xgboost; import eif; import joblib; print('✅ Dependencies OK')"
 ```
 **New dependencies**:
 - `xgboost==2.0.3` - Gradient boosting for the ensemble classifier
 - `eif==2.0.0` - Extended Isolation Forest
 - `joblib==1.3.2` - Model persistence
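The import check above only proves that the packages load. As a minimal sketch, assuming the pins listed in this step are the ones in `requirements.txt`, the installed versions can also be compared against them. The helper name `find_mismatches` is hypothetical and not part of the project.

```python
from importlib import metadata

# Pins copied from the dependency list in this step.
PINS = {"xgboost": "2.0.3", "eif": "2.0.0", "joblib": "1.3.2"}


def find_mismatches(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    mismatches = {}
    for package, expected in pins.items():
        try:
            found = get_version(package)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches
```

Run inside the activated venv, `find_mismatches(PINS)` should return an empty dict when all three pins are met.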
 ---
 ## 🧪 Step 2: Quick Test (Synthetic Dataset)
 Run the system against a synthetic dataset to verify that it works:
 ```bash
 cd /opt/ids/python_ml
 # Quick test with 10k synthetic samples
 python train_hybrid.py --test
 # What to expect:
 # - Dataset created: 10,000 samples (90% normal, 10% attacks)
 # - Training completed on ~6,300 normal samples
 # - Detection results with confidence scoring
 # - Validation metrics (Precision, Recall, F1, FPR)
 ```
 **Expected output**:
 ```
 [TEST] Created synthetic dataset: 10,000 samples
  Normal:  9,000 (90.0%)
  Attacks: 1,000 (10.0%)
 [TEST] Training on 6,300 normal samples...
 [HYBRID] Training unsupervised model on 6,300 logs...
 [HYBRID] Extracted features for X unique IPs
 [HYBRID] Feature selection: 25 → 18 features
 [HYBRID] Training Extended Isolation Forest...
 [HYBRID] Training completed! X/Y IPs flagged as anomalies
 [TEST] Detection results:
  Total detections: XX
  High confidence:   XX
  Medium confidence: XX
  Low confidence:    XX
 ╔══════════════════════════════════════════════════════════════╗
 ║                    Synthetic Test Results                     ║
 ╚══════════════════════════════════════════════════════════════╝
 🎯 Primary Metrics:
  Precision:     XX.XX%  (of 100 flagged, how many are real attacks)
  Recall:        XX.XX%  (of 100 attacks, how many detected)
  F1-Score:      XX.XX%  (harmonic mean of P&R)
 ⚠️  False Positive Analysis:
  FP Rate:       XX.XX%  (normal traffic flagged as attack)
 ```
 **Success criteria**:
 - Precision ≥ 70% (synthetic test)
 - FPR ≤ 10%
 - No crashes
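For reference, the four reported metrics follow from the confusion counts. The sketch below uses the standard definitions and hypothetical counts; it is not taken from `validation_metrics.py`.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute precision, recall, F1 and false positive rate from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


# Hypothetical counts for a 10,000-sample run (1,000 attacks, 9,000 normal).
metrics = detection_metrics(tp=800, fp=200, tn=8800, fn=200)
passed = metrics["precision"] >= 0.70 and metrics["fpr"] <= 0.10
print(metrics, "PASS" if passed else "FAIL")
```

Note that FPR divides by the amount of normal traffic (FP + TN), not by the number of detections.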
 ---
 ## 🎯 Step 3: Training on Real Traffic
 Train the model on real logs (last 7 days):
 ```bash
 cd /opt/ids/python_ml
 # Training from the database (last 7 days)
 python train_hybrid.py --train --source database \
  --db-host localhost \
  --db-port 5432 \
  --db-name ids \
  --db-user postgres \
  --db-password "YOUR_PASSWORD" \
  --days 7
 # Models are saved in: python_ml/models/
 # - isolation_forest_latest.pkl
 # - scaler_latest.pkl
 # - feature_selector_latest.pkl
 # - metadata_latest.json
 ```
 **What happens**:
 1. Loads the last 7 days of `network_logs` (up to 1M records)
 2. Extracts 25 features for each source_ip
 3. Applies Chi-Square feature selection → 18 features
 4. Trains the Extended Isolation Forest (contamination=0.03, i.e. 3%)
 5. Saves the models in `models/`
 **Success criteria**:
 - Training completed without errors
 - Model files created in `python_ml/models/`
 - The log shows "✅ Training completed!"
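To illustrate what "features for each source_ip" means in step 2 of the list above, here is a minimal sketch of per-IP aggregation. The three features and the field names (`source_ip`, `dest_port`, `bytes`) are assumptions made for illustration; the real detector extracts 25 features and its schema may differ.

```python
from collections import defaultdict


def extract_ip_features(logs):
    """Aggregate raw log rows into one feature dict per source IP.

    Only three illustrative features are computed; the field names
    are assumptions about the log schema.
    """
    grouped = defaultdict(list)
    for row in logs:
        grouped[row["source_ip"]].append(row)

    features = {}
    for ip, rows in grouped.items():
        features[ip] = {
            "log_count": len(rows),
            "unique_dest_ports": len({r["dest_port"] for r in rows}),
            "total_bytes": sum(r["bytes"] for r in rows),
        }
    return features


logs = [
    {"source_ip": "10.0.0.1", "dest_port": 80, "bytes": 500},
    {"source_ip": "10.0.0.1", "dest_port": 443, "bytes": 700},
    {"source_ip": "10.0.0.2", "dest_port": 22, "bytes": 60},
]
print(extract_ip_features(logs))
```

The model then scores one row per IP rather than one row per log entry, which is why the training log reports unique IPs.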
 ---
 ## 📊 Step 4: (Optional) CICIDS2017 Validation
 To validate against a research dataset (only needed for an accurate benchmark):
 ### 4.1 Download CICIDS2017
 ```bash
 # Create the dataset directory
 mkdir -p /opt/ids/python_ml/datasets/cicids2017
 # Download manually from:
 # https://www.unb.ca/cic/datasets/ids-2017.html
 # Extract the CSV files into: /opt/ids/python_ml/datasets/cicids2017/
 # Required files (8 CSV files covering Monday-Friday):
 # - Monday-WorkingHours.pcap_ISCX.csv
 # - Tuesday-WorkingHours.pcap_ISCX.csv
 # - ... (all the CSV files)
 ```
 ### 4.2 Validation (10% sample for a quick test)
 ```bash
 cd /opt/ids/python_ml
 # Validation on 10% of the dataset (quick test)
 python train_hybrid.py --validate --sample 0.1
 # Full validation (SLOW - can take hours!)
 # python train_hybrid.py --validate
 ```
 **Expected output**:
 ```
 ╔══════════════════════════════════════════════════════════════╗
 ║              CICIDS2017 Validation Results                    ║
 ╚══════════════════════════════════════════════════════════════╝
 🎯 Primary Metrics:
  Precision:     ≥90.00%  ✅ TARGET
  Recall:        ≥80.00%  ✅ TARGET
  F1-Score:      ≥85.00%  ✅ TARGET
 ⚠️  False Positive Analysis:
  FP Rate:       ≤5.00%   ✅ TARGET
 [VALIDATE] Checking production deployment criteria...
 ✅ Model ready for production deployment!
 ```
 **Production success criteria**:
 - Precision ≥ 90%
 - Recall ≥ 80%
 - FPR ≤ 5%
 - F1-Score ≥ 85%
 ---
 ## 🚀 Step 5: Deploy to Production
 ### 5.1 Configure the Environment Variable
 ```bash
 # Add it to the ML backend's .env
 echo "USE_HYBRID_DETECTOR=true" >> /opt/ids/python_ml/.env
 # Or export it manually
 export USE_HYBRID_DETECTOR=true
 ```
 **Default**: `USE_HYBRID_DETECTOR=true` (new detector enabled)
 To roll back: `USE_HYBRID_DETECTOR=false` (uses the legacy detector)
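A minimal sketch of how a backend could read this flag with a default of true. The accepted spellings (`1`, `true`, `yes`, `on`) are an assumption; the actual parsing in the ML backend is not shown in this checklist.

```python
import os


def use_hybrid_detector(environ=os.environ):
    """Read USE_HYBRID_DETECTOR, defaulting to true when it is unset."""
    raw = environ.get("USE_HYBRID_DETECTOR", "true")
    # Assumed truthy spellings; anything else selects the legacy detector.
    return raw.strip().lower() in {"1", "true", "yes", "on"}
```

With this reading, an unset variable and `USE_HYBRID_DETECTOR=true` behave the same, and only an explicit `false` (or any other non-truthy value) triggers the rollback path.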
 ### 5.2 Restart ML Backend
 ```bash
 # Systemd service
 sudo systemctl restart ids-ml-backend
 # Verify startup
 sudo systemctl status ids-ml-backend
 sudo journalctl -u ids-ml-backend -f
 # Look for these log lines:
 # "[ML] Using Hybrid ML Detector (Extended Isolation Forest + Feature Selection)"
 # "[HYBRID] Models loaded (version: latest)"
 ```
 ### 5.3 Test API
 ```bash
 # Test health check
 curl http://localhost:8000/health
 # Expected output:
 {
  "status": "healthy",
  "database": "connected",
  "ml_model": "loaded",
  "ml_model_type": "hybrid (EIF + Feature Selection)",
  "timestamp": "2025-11-24T18:30:00"
 }
 # Test root endpoint
 curl http://localhost:8000/
 # Expected output:
 {
  "service": "IDS API",
  "version": "2.0.0",
  "status": "running",
  "model_type": "hybrid",
  "model_loaded": true,
  "use_hybrid": true
 }
 ```
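The same health check can be scripted. This is a sketch that assumes the `/health` response has the fields shown in the expected output above; `is_hybrid_healthy` and `fetch_health` are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def is_hybrid_healthy(payload):
    """Check a parsed /health response against the expected output above."""
    return (
        payload.get("status") == "healthy"
        and payload.get("database") == "connected"
        and payload.get("ml_model") == "loaded"
        and str(payload.get("ml_model_type", "")).startswith("hybrid")
    )


def fetch_health(base_url="http://localhost:8000", timeout=5):
    """Fetch and decode the /health endpoint."""
    with urlopen(f"{base_url}/health", timeout=timeout) as response:
        return json.load(response)
```

Calling `is_hybrid_healthy(fetch_health())` on the server should return `True` after a successful deploy. The check uses a prefix match because the reported type is `hybrid (EIF + Feature Selection)`, not the bare word `hybrid`.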
 ---
 ## 📈 Step 6: Monitoring & Validation
 ### 6.1 First Detection Run
 ```bash
 # API call to run detection (with the API key, if one is configured)
 curl -X POST http://localhost:8000/detect \
  -H "Content-Type: application/json" \
  -H "X-API-Key: YOUR_API_KEY" \
  -d '{
    "max_records": 5000,
    "hours_back": 1,
    "risk_threshold": 60.0,
    "auto_block": false
  }'
 ```
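For scripted runs, the request above can be built with the standard library. This sketch only constructs the request; the endpoint, header name and body fields are copied from the curl call, while `build_detect_request` is a hypothetical helper.

```python
import json
from urllib.request import Request


def build_detect_request(base_url, api_key, **overrides):
    """Build (but do not send) the POST /detect request shown above."""
    body = {
        "max_records": 5000,
        "hours_back": 1,
        "risk_threshold": 60.0,
        "auto_block": False,  # keep blocking off for the first run
    }
    body.update(overrides)
    return Request(
        f"{base_url}/detect",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` sends it. Keeping `auto_block` false by default makes the first run observational only.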
 ### 6.2 Verify Detections
 ```bash
 # Query PostgreSQL to inspect the latest detections
 psql -d ids -c "
 SELECT 
  source_ip, 
  risk_score, 
  confidence, 
  anomaly_type,
  detected_at
 FROM detections 
 ORDER BY detected_at DESC 
 LIMIT 10;
 "
 ```
 ### 6.3 Monitoring Logs
 ```bash
 # Monitor the ML backend logs
 sudo journalctl -u ids-ml-backend -f | grep -E "(HYBRID|DETECT|TRAIN)"
 # Key log lines:
 # - "[HYBRID] Models loaded" - Model loaded successfully
 # - "[DETECT] Using Hybrid ML Detector" - Detection is running on the new model
 # - "[DETECT] Detected X unique IPs above threshold" - Results
 ```
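When reviewing a saved log excerpt rather than a live stream, a per-tag count gives a quick summary. This sketch assumes the tags appear in square brackets, as in the key log lines above; unlike the grep pattern, it does not match the bare words.

```python
import re

# Same tags as the grep pattern above, but anchored to the [TAG] form.
TAG_PATTERN = re.compile(r"\[(HYBRID|DETECT|TRAIN)\]")


def count_by_tag(lines):
    """Count log lines per tag, ignoring lines without a known tag."""
    counts = {}
    for line in lines:
        match = TAG_PATTERN.search(line)
        if match:
            counts[match.group(1)] = counts.get(match.group(1), 0) + 1
    return counts
```

A run with `HYBRID` entries but no `DETECT` entries suggests the model loaded but no detection has been triggered yet.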
 ---
 ## 🔄 Step 7: Periodic Retraining
 The model must be retrained periodically (e.g. weekly) on recent traffic:
 ### Option A: Manual
 ```bash
 # Every week
 cd /opt/ids/python_ml
 source venv/bin/activate
 python train_hybrid.py --train --source database \
  --db-password "YOUR_PASSWORD" \
  --days 7
 ```
 ### Option B: Cron Job
 ```bash
 # Create the wrapper script
 cat > /opt/ids/scripts/retrain_ml.sh << 'EOF'
 #!/bin/bash
 set -e
 cd /opt/ids/python_ml
 source venv/bin/activate
 python train_hybrid.py --train --source database \
  --db-host localhost \
  --db-port 5432 \
  --db-name ids \
  --db-user postgres \
  --db-password "$PGPASSWORD" \
  --days 7
 # Restart the backend so it loads the new model
 sudo systemctl restart ids-ml-backend
 echo "[$(date)] ML model retrained successfully"
 EOF
 chmod +x /opt/ids/scripts/retrain_ml.sh
 # Add a cron entry (every Sunday at 3:00 AM)
 sudo crontab -e
 # Add this line (PGPASSWORD must be defined in the cron environment):
 0 3 * * 0 /opt/ids/scripts/retrain_ml.sh >> /var/log/ids/ml_retrain.log 2>&1
 ```
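To double-check when the schedule `0 3 * * 0` fires next, the date arithmetic can be sketched as follows. It assumes the server's local time, which is what cron uses by default, and handles only this one expression.

```python
from datetime import datetime, timedelta


def next_weekly_run(now):
    """Next firing time of `0 3 * * 0` (Sunday 03:00) strictly after `now`."""
    days_ahead = (6 - now.weekday()) % 7  # weekday(): Monday=0 ... Sunday=6
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=3, minute=0, second=0, microsecond=0
    )
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


print(next_weekly_run(datetime(2025, 11, 24, 16, 6)))  # 2025-11-30 03:00:00
```

The example date is the Monday of this commit, so the first scheduled retraining would land on the following Sunday.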
 ---
 ## 📊 Step 8: Old vs New Comparison
 Monitor the before/after metrics for 1-2 weeks:
 ### Metrics to track:
 1. **False Positive Rate** (target: -80%)
   ```sql
   -- Daily share of detections marked as false positives, last 7 days
   SELECT 
     DATE(detected_at) as date,
     COUNT(*) FILTER (WHERE is_false_positive = true) as false_positives,
     COUNT(*) as total_detections,
     ROUND(100.0 * COUNT(*) FILTER (WHERE is_false_positive = true) / COUNT(*), 2) as fp_rate
   FROM detections
   WHERE detected_at >= NOW() - INTERVAL '7 days'
   GROUP BY DATE(detected_at)
   ORDER BY date;
   ```
 2. **Detection Count per Confidence Level**
   ```sql
   SELECT 
     confidence,
     COUNT(*) as count
   FROM detections
   WHERE detected_at >= NOW() - INTERVAL '24 hours'
   GROUP BY confidence
   ORDER BY 
     CASE confidence
       WHEN 'high' THEN 1
       WHEN 'medium' THEN 2
       WHEN 'low' THEN 3
     END;
   ```
 3. **Blocked IPs Analysis**
   ```bash
   # Query MikroTik to list the blocked IPs
   # Compare them with the high-confidence detections
   ```
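The -80% target is a relative reduction, so it has to be computed against the legacy baseline. A minimal sketch with hypothetical weekly totals:

```python
def fp_share(false_positives, total_detections):
    """Share of detections marked as false positives, in percent."""
    if total_detections == 0:
        return 0.0
    return 100.0 * false_positives / total_detections


def relative_reduction(before, after):
    """Relative drop from `before` to `after`, in percent (positive = improvement)."""
    if before == 0:
        raise ValueError("baseline rate is zero; reduction is undefined")
    return 100.0 * (before - after) / before


# Hypothetical weekly totals: legacy detector vs hybrid detector.
legacy = fp_share(false_positives=120, total_detections=300)   # 40.0
hybrid = fp_share(false_positives=6, total_detections=100)     # 6.0
print(round(relative_reduction(legacy, hybrid), 1))            # 85.0
```

In this example the share drops from 40% to 6%, an 85% relative reduction, which would meet the target. The figures are made up and say nothing about the real system.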
 ---
 ## 🔧 Troubleshooting
 ### Problem: "ModuleNotFoundError: No module named 'eif'"
 **Solution**:
 ```bash
 cd /opt/ids/python_ml
 source venv/bin/activate
 pip install eif==2.0.0
 ```
 ### Problem: "Modello non addestrato. Esegui /train prima." ("Model not trained. Run /train first.")
 **Solution**:
 ```bash
 # Check that the model files exist
 ls -lh /opt/ids/python_ml/models/
 # If the directory is empty, run the training
 python train_hybrid.py --train --source database --db-password "PWD"
 ```
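The same check can be automated. This sketch looks for the four artifact names listed in Step 3 and treats empty files as missing; `missing_model_files` is a hypothetical helper, not part of the project.

```python
from pathlib import Path

# File names listed in Step 3.
EXPECTED_FILES = (
    "isolation_forest_latest.pkl",
    "scaler_latest.pkl",
    "feature_selector_latest.pkl",
    "metadata_latest.json",
)


def missing_model_files(models_dir):
    """Return the expected artifacts that are absent or empty in models_dir."""
    root = Path(models_dir)
    missing = []
    for name in EXPECTED_FILES:
        path = root / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

For example, `missing_model_files("/opt/ids/python_ml/models")` returning a non-empty list means the training step has to be run again before the backend can load the model.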
 ### Problem: API returns a 500 error
 **Solution**:
 ```bash
 # Check logs
 sudo journalctl -u ids-ml-backend -n 100
 # Check USE_HYBRID_DETECTOR
 grep USE_HYBRID_DETECTOR /opt/ids/python_ml/.env
 # Fall back to the legacy detector
 echo "USE_HYBRID_DETECTOR=false" >> /opt/ids/python_ml/.env
 sudo systemctl restart ids-ml-backend
 ```
 ### Problem: Validation metrics fail (Precision < 90%)
 **Solution**: Tune the hyperparameters
 ```python
 # In ml_hybrid_detector.py, edit the config:
 'eif_contamination': 0.02,  # Try values in 0.01-0.05
 'chi2_top_k': 20,           # Try 15-25
 'confidence_high': 97.0,    # Raise the high-confidence threshold
 ```
 ---
 ## ✅ Final Checklist
 - [ ] Synthetic test passed (Precision ≥ 70%)
 - [ ] Training on real data completed
 - [ ] Models saved in `python_ml/models/`
 - [ ] `USE_HYBRID_DETECTOR=true` configured
 - [ ] ML backend restarted successfully
 - [ ] API `/health` reports an `ml_model_type` that starts with `hybrid`
 - [ ] First detection run completed
 - [ ] Detections saved in the database with confidence levels
 - [ ] (Optional) CICIDS2017 validation meets the target metrics
 - [ ] Periodic retraining configured (cron or manual)
 - [ ] Frontend dashboard shows detections with the new confidence levels
 ---
 ## 📚 Technical Documentation
 ### Architecture
 ```
 ┌─────────────────┐
 │  Network Logs   │
 │  (PostgreSQL)   │
 └────────┬────────┘
         │
         v
 ┌─────────────────┐
 │ Feature Extract │  25 features per IP
 │   (25 features) │  (volume, temporal, protocol, behavioral)
 └────────┬────────┘
         │
         v
 ┌─────────────────┐
 │ Chi-Square Test │  Feature Selection
 │  (Select Top 18)│  Reduces dimensionality
 └────────┬────────┘
         │
         v
 ┌─────────────────┐
 │  Extended IF    │  Unsupervised Anomaly Detection
 │ (contamination  │  n_estimators=250
 │    = 0.03)      │  anomaly_score: 0-100
 └────────┬────────┘
         │
         v
 ┌─────────────────┐
 │ Confidence Score│  3-tier system
 │  High ≥95%      │  - High: auto-block
 │  Medium ≥70%    │  - Medium: manual review
 │  Low <70%       │  - Low: monitor
 └────────┬────────┘
         │
         v
 ┌─────────────────┐
 │   Detections    │  Saved to the DB
 │   (Database)    │  With geo info + confidence
 └─────────────────┘
 ```
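The 3-tier confidence step in the diagram amounts to two threshold comparisons. A minimal sketch using the default thresholds (95 and 70); the function name is hypothetical and the real detector may compute the score differently.

```python
def confidence_tier(score, high=95.0, medium=70.0):
    """Map a 0-100 confidence score to the tier names used in the diagram."""
    if not 0.0 <= score <= 100.0:
        raise ValueError(f"score out of range: {score}")
    if score >= high:
        return "high"    # candidate for auto-block
    if score >= medium:
        return "medium"  # manual review
    return "low"         # monitor only
```

Both boundaries are inclusive on the upper tier, so a score of exactly 95 is `high` and exactly 70 is `medium`.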
 ### Hyperparameters Tuning
 | Parameter | Default Value | Recommended Range | Effect |
 |-----------|---------------|-------------------|--------|
 | `eif_contamination` | 0.03 | 0.01 - 0.05 | Expected fraction of anomalies. ↑ = more detections |
 | `eif_n_estimators` | 250 | 100 - 500 | Number of trees. ↑ = more stable but slower |
 | `chi2_top_k` | 18 | 15 - 25 | Number of selected features |
 | `confidence_high` | 95.0 | 90.0 - 98.0 | Auto-block threshold. ↑ = more conservative |
 | `confidence_medium` | 70.0 | 60.0 - 80.0 | Manual review threshold |
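To see why raising `eif_contamination` produces more detections, the sketch below picks a score cutoff so that roughly that fraction of samples is flagged. This illustrates the general idea behind a contamination parameter; how the `eif` library or `ml_hybrid_detector.py` derives its threshold is an assumption not confirmed by this checklist.

```python
def contamination_threshold(scores, contamination=0.03):
    """Return the score cutoff that flags roughly `contamination` of the samples."""
    if not scores:
        raise ValueError("scores must not be empty")
    ranked = sorted(scores, reverse=True)  # highest anomaly score first
    n_flagged = max(1, round(contamination * len(ranked)))
    return ranked[n_flagged - 1]


scores = list(range(1, 101))  # hypothetical anomaly scores 1..100
cutoff = contamination_threshold(scores, 0.03)
flagged = [s for s in scores if s >= cutoff]
print(cutoff, len(flagged))  # 98 3
```

With 100 samples, moving from 0.03 to 0.05 lowers the cutoff from 98 to 96 and flags 5 samples instead of 3.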
 ---
 ## 🎯 Target Metrics Recap
 | Metric | Production Target | Synthetic Test | Notes |
 |--------|-------------------|----------------|-------|
 | **Precision** | ≥ 90% | ≥ 70% | Out of 100 flagged, how many are real attacks |
 | **Recall** | ≥ 80% | ≥ 60% | Out of 100 attacks, how many are detected |
 | **F1-Score** | ≥ 85% | ≥ 65% | Harmonic mean of Precision and Recall |
 | **FPR** | ≤ 5% | ≤ 10% | False positives on normal traffic |
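The two columns of targets can be encoded as profiles and checked in one place. A sketch with the thresholds copied from the table above (as fractions); `failed_targets` is a hypothetical helper and not the check performed by `train_hybrid.py --validate`.

```python
# Thresholds copied from the table above, as fractions.
TARGETS = {
    "production": {"precision": 0.90, "recall": 0.80, "f1": 0.85, "fpr": 0.05},
    "synthetic": {"precision": 0.70, "recall": 0.60, "f1": 0.65, "fpr": 0.10},
}


def failed_targets(metrics, profile="production"):
    """Return the names of the metrics that miss the chosen profile's targets."""
    failures = []
    for name, target in TARGETS[profile].items():
        value = metrics[name]
        # FPR is an upper bound; every other metric is a lower bound.
        ok = value <= target if name == "fpr" else value >= target
        if not ok:
            failures.append(name)
    return failures
```

An empty list means every target of the profile is met. A model that passes the synthetic profile can still fail the production one.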
 ---
 ## 📞 Support
 For problems or questions:
 1. Check the logs: `sudo journalctl -u ids-ml-backend -f`
 2. Check the models: `ls -lh /opt/ids/python_ml/models/`
 3. Run a manual test: `python train_hybrid.py --test`
 4. Roll back: `USE_HYBRID_DETECTOR=false` + restart
 **Last updated**: 24 Nov 2025 - v2.0.0
--- a/replit.md
+++ b/replit.md
@@ -86,3 +86,20 @@ The IDS employs a React-based frontend for real-time monitoring, detection visua
 - **Performance**: Smart caching + robust fallback
 - **Display**: Globe/Building/MapPin icons on the Detections page
 - **Deploy**: Migration 004 + restart ML backend
 ### 🤖 Hybrid ML Detector - False Positive Reduction System (24 Nov 2025 - 18:30)
 - **Goal**: Reduce false positives by 80-90% while keeping detection accuracy high
 - **New Architecture**:
  1. **Extended Isolation Forest**: n_estimators=250, contamination=0.03 (tuned defaults)
  2. **Feature Selection**: Chi-Square test reduces 25 features to the 18 most relevant
  3. **Confidence Scoring**: 3-tier system (High≥95%, Medium≥70%, Low<70%)
  4. **Validation Framework**: CICIDS2017 dataset with Precision/Recall/F1/FPR metrics
 - **Components**:
  - `python_ml/ml_hybrid_detector.py` - Core detector with EIF + feature selection
  - `python_ml/dataset_loader.py` - CICIDS2017 loader with an 80→25 feature mapping
  - `python_ml/validation_metrics.py` - Production-grade metrics calculator
  - `python_ml/train_hybrid.py` - CLI training script (test/train/validate)
 - **New Dependencies**: xgboost==2.0.3, joblib==1.3.2, eif==2.0.0
 - **Backward Compatibility**: USE_HYBRID_DETECTOR env var (default=true)
 - **Target Metrics**: Precision≥90%, Recall≥80%, FPR≤5%, F1≥85%
 - **Deploy**: See `deployment/CHECKLIST_ML_HYBRID.md`