
Deployment Checklist - Hybrid ML Detector

Advanced ML system for an 80-90% false positive reduction, based on Extended Isolation Forest

📋 Prerequisites

  • AlmaLinux 9 server with SSH access
  • PostgreSQL with the IDS database up and running
  • Python 3.11+ installed
  • Active venv: /opt/ids/python_ml/venv
  • At least 7 days of real traffic in the database, for training on real data (a quick check sketch follows this list)
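A minimal sketch for the last prerequisite, assuming a network_logs table with a timestamp column (the column name is an assumption; adjust it to your schema) and the connection parameters used later in this checklist:

import psycopg2

# Verify the database holds at least 7 days of traffic before training.
conn = psycopg2.connect(host="localhost", port=5432, dbname="ids",
                        user="postgres", password="YOUR_PASSWORD")
with conn, conn.cursor() as cur:
    cur.execute("SELECT NOW() - MIN(timestamp) FROM network_logs;")
    span = cur.fetchone()[0]
    if span is None or span.days < 7:
        print(f"WARNING: only {span} of traffic - training data may be insufficient")
    else:
        print(f"OK: {span.days} days of traffic available")
conn.close()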

🔧 Step 1: Install Dependencies

# SSH into the server
ssh user@ids.alfacom.it

# Activate the venv
cd /opt/ids/python_ml
source venv/bin/activate

# Install the new libraries
pip install -r requirements.txt

# Verify the installation
python -c "import xgboost; import eif; import joblib; print('✅ Dipendenze OK')"

New dependencies:

  • xgboost==2.0.3 - gradient boosting for the ensemble classifier
  • eif==2.0.0 - Extended Isolation Forest
  • joblib==1.3.2 - Model persistence

🧪 Step 2: Quick Test (Synthetic Dataset)

Test the system on a synthetic dataset to verify it works:

cd /opt/ids/python_ml

# Quick test with 10k synthetic samples
python train_hybrid.py --test

# What to expect:
# - Dataset created: 10,000 samples (90% normal, 10% attacks)
# - Training completed on ~6,300 normal samples (training split)
# - Detection results with confidence scoring
# - Validation metrics (Precision, Recall, F1, FPR)

Expected output:

[TEST] Created synthetic dataset: 10,000 samples
  Normal:  9,000 (90.0%)
  Attacks: 1,000 (10.0%)

[TEST] Training on 6,300 normal samples...
[HYBRID] Training unsupervised model on 6,300 logs...
[HYBRID] Extracted features for X unique IPs
[HYBRID] Feature selection: 25 → 18 features
[HYBRID] Training Extended Isolation Forest...
[HYBRID] Training completed! X/Y IPs flagged as anomalies

[TEST] Detection results:
  Total detections: XX
  High confidence:   XX
  Medium confidence: XX
  Low confidence:    XX

╔══════════════════════════════════════════════════════════════╗
║                    Synthetic Test Results                     ║
╚══════════════════════════════════════════════════════════════╝

🎯 Primary Metrics:
  Precision:     XX.XX%  (of 100 flagged, how many are real attacks)
  Recall:        XX.XX%  (of 100 attacks, how many detected)
  F1-Score:      XX.XX%  (harmonic mean of P&R)
  
⚠️  False Positive Analysis:
  FP Rate:       XX.XX%  (normal traffic flagged as attack)

Success criteria:

  • Precision ≥ 70% (synthetic test)
  • FPR ≤ 10%
  • No crashes
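The metrics reported above follow directly from the confusion matrix. A minimal sketch with hypothetical counts (the real numbers come from train_hybrid.py --test):

# Hypothetical confusion-matrix counts for a 10k-sample synthetic run
tp, fp, fn, tn = 850, 90, 150, 8910

precision = tp / (tp + fp)           # of the flagged IPs, how many are real attacks
recall    = tp / (tp + fn)           # of the real attacks, how many were flagged
f1        = 2 * precision * recall / (precision + recall)
fpr       = fp / (fp + tn)           # normal traffic incorrectly flagged

print(f"Precision: {precision:.2%}  Recall: {recall:.2%}  F1: {f1:.2%}  FPR: {fpr:.2%}")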

🎯 Step 3: Training on Real Traffic

Train the model on the real logs (last 7 days):

cd /opt/ids/python_ml

# Training from the database (last 7 days)
python train_hybrid.py --train --source database \
  --db-host localhost \
  --db-port 5432 \
  --db-name ids \
  --db-user postgres \
  --db-password "YOUR_PASSWORD" \
  --days 7

# Models are saved to: python_ml/models/
# - isolation_forest_latest.pkl
# - scaler_latest.pkl
# - feature_selector_latest.pkl
# - metadata_latest.json

What happens (see the sketch after this list):

  1. Loads the last 7 days of network_logs (up to 1M records)
  2. Extracts 25 features for each source_ip
  3. Applies Chi-Square feature selection → 18 features
  4. Trains the Extended Isolation Forest (contamination=3%)
  5. Saves the models to models/
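A minimal sketch of that pipeline, using scikit-learn's IsolationForest as a stand-in for the eif Extended Isolation Forest and random placeholder data; it illustrates the stages, not the actual ml_hybrid_detector.py code:

import os
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import IsolationForest

X = np.random.rand(5000, 25)           # 25 features per source_ip (placeholder data)
y = np.random.randint(0, 2, 5000)      # pseudo-labels, only needed to score chi2

# Chi-Square feature selection: 25 -> 18 features (chi2 needs non-negative input)
selector = SelectKBest(chi2, k=18).fit(MinMaxScaler().fit_transform(X), y)
X_sel = selector.transform(X)

# Scale, then train the isolation forest on (mostly) normal traffic
scaler = StandardScaler().fit(X_sel)
model = IsolationForest(n_estimators=250, contamination=0.03, random_state=42)
model.fit(scaler.transform(X_sel))

# Persist artefacts under the file names the checklist expects
os.makedirs("models", exist_ok=True)
joblib.dump(model, "models/isolation_forest_latest.pkl")
joblib.dump(scaler, "models/scaler_latest.pkl")
joblib.dump(selector, "models/feature_selector_latest.pkl")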

Success criteria:

  • Training completed without errors
  • Model files created in python_ml/models/
  • Log shows "[HYBRID] Training completed!"

📊 Step 4: (Optional) CICIDS2017 Validation

To validate against a published research dataset (only if an accurate benchmark is needed):

4.1 Download CICIDS2017

# Create the dataset directory
mkdir -p /opt/ids/python_ml/datasets/cicids2017

# Download manually from:
# https://www.unb.ca/cic/datasets/ids-2017.html
# Extract the CSV files to: /opt/ids/python_ml/datasets/cicids2017/

# Required files (8 CSVs covering Monday-Friday):
# - Monday-WorkingHours.pcap_ISCX.csv
# - Tuesday-WorkingHours.pcap_ISCX.csv
# - ... (all the CSV files)

4.2 Validation (10% sample for a quick test)

cd /opt/ids/python_ml

# Validation on 10% of the dataset (quick test)
python train_hybrid.py --validate --sample 0.1

# Full validation (SLOW - can take hours!)
# python train_hybrid.py --validate

Expected output:

╔══════════════════════════════════════════════════════════════╗
║              CICIDS2017 Validation Results                    ║
╚══════════════════════════════════════════════════════════════╝

🎯 Primary Metrics:
  Precision:     ≥90.00%  ✅ TARGET
  Recall:        ≥80.00%  ✅ TARGET
  F1-Score:      ≥85.00%  ✅ TARGET
  
⚠️  False Positive Analysis:
  FP Rate:       ≤5.00%   ✅ TARGET

[VALIDATE] Checking production deployment criteria...
✅ Model ready for production deployment!

Production success criteria:

  • Precision ≥ 90%
  • Recall ≥ 80%
  • FPR ≤ 5%
  • F1-Score ≥ 85%
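For a first look at the data before running the validation, a hypothetical sketch that loads the CSVs and draws a 10% per-label sample, roughly what --sample 0.1 implies (this is not the train_hybrid.py internals; the Label column name is CICIDS2017's own, with stray header spaces stripped):

import glob
import pandas as pd

frames = []
for path in glob.glob("/opt/ids/python_ml/datasets/cicids2017/*.csv"):
    df = pd.read_csv(path, low_memory=False)
    df.columns = df.columns.str.strip()   # CICIDS2017 headers carry stray spaces
    frames.append(df)

full = pd.concat(frames, ignore_index=True)
sample = (full.groupby("Label", group_keys=False)
              .apply(lambda g: g.sample(frac=0.1, random_state=42)))
print(sample["Label"].value_counts())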

🚀 Step 5: Production Deployment

5.1 Configure the Environment Variable

# Add to the ML backend .env
echo "USE_HYBRID_DETECTOR=true" >> /opt/ids/python_ml/.env

# Or export it manually
export USE_HYBRID_DETECTOR=true

Default: USE_HYBRID_DETECTOR=true (new detector enabled)

For rollback: USE_HYBRID_DETECTOR=false (uses the legacy detector)
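A sketch of how the flag might be read at startup, treating anything other than an explicit false as enabled since true is the documented default (illustrative; the actual wiring in the ML backend may differ):

import os

def use_hybrid_detector() -> bool:
    # true is the documented default, so only an explicit "false" disables it
    return os.getenv("USE_HYBRID_DETECTOR", "true").strip().lower() != "false"

if use_hybrid_detector():
    print("[ML] Using Hybrid ML Detector (Extended Isolation Forest + Feature Selection)")
else:
    print("[ML] Using legacy detector")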

5.2 Restart ML Backend

# Systemd service
sudo systemctl restart ids-ml-backend

# Verify startup
sudo systemctl status ids-ml-backend
sudo journalctl -u ids-ml-backend -f

# Look for these log lines:
# "[ML] Using Hybrid ML Detector (Extended Isolation Forest + Feature Selection)"
# "[HYBRID] Models loaded (version: latest)"

5.3 Test API

# Test health check
curl http://localhost:8000/health

# Expected output:
{
  "status": "healthy",
  "database": "connected",
  "ml_model": "loaded",
  "ml_model_type": "hybrid (EIF + Feature Selection)",
  "timestamp": "2025-11-24T18:30:00"
}

# Test root endpoint
curl http://localhost:8000/

# Expected output:
{
  "service": "IDS API",
  "version": "2.0.0",
  "status": "running",
  "model_type": "hybrid",
  "model_loaded": true,
  "use_hybrid": true
}

📈 Step 6: Monitoring & Validation

6.1 First Detection Run

# API call for detection (with API key if configured)
curl -X POST http://localhost:8000/detect \
  -H "Content-Type: application/json" \
  -H "X-API-Key: YOUR_API_KEY" \
  -d '{
    "max_records": 5000,
    "hours_back": 1,
    "risk_threshold": 60.0,
    "auto_block": false
  }'
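The same call via Python requests, useful for scripting the first run (endpoint, payload fields and X-API-Key header are the ones shown above):

import requests

resp = requests.post(
    "http://localhost:8000/detect",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "max_records": 5000,
        "hours_back": 1,
        "risk_threshold": 60.0,
        "auto_block": False,   # keep auto-block off for the first run
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json())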

6.2 Verify Detections

# Query PostgreSQL to view detections
psql -d ids -c "
SELECT 
  source_ip, 
  risk_score, 
  confidence, 
  anomaly_type,
  detected_at
FROM detections 
ORDER BY detected_at DESC 
LIMIT 10;
"

6.3 Monitoring Logs

# Monitor the ML backend logs
sudo journalctl -u ids-ml-backend -f | grep -E "(HYBRID|DETECT|TRAIN)"

# Key log lines:
# - "[HYBRID] Models loaded" - model loaded OK
# - "[DETECT] Using Hybrid ML Detector" - detection running on the new model
# - "[DETECT] Detected X unique IPs above threshold" - results

🔄 Step 7: Periodic Re-training

The model should be retrained periodically (e.g. weekly) on recent traffic:

Option A: Manual

# Every week
cd /opt/ids/python_ml
source venv/bin/activate

python train_hybrid.py --train --source database \
  --db-password "YOUR_PASSWORD" \
  --days 7

Option B: Cron Job

# Create a wrapper script
cat > /opt/ids/scripts/retrain_ml.sh << 'EOF'
#!/bin/bash
set -e

cd /opt/ids/python_ml
source venv/bin/activate

python train_hybrid.py --train --source database \
  --db-host localhost \
  --db-port 5432 \
  --db-name ids \
  --db-user postgres \
  --db-password "$PGPASSWORD" \
  --days 7

# Restart the backend to load the new model
sudo systemctl restart ids-ml-backend

echo "[$(date)] ML model retrained successfully"
EOF

chmod +x /opt/ids/scripts/retrain_ml.sh

# Add a cron entry (every Sunday at 3:00 AM)
sudo crontab -e

# Add this line:
0 3 * * 0 /opt/ids/scripts/retrain_ml.sh >> /var/log/ids/ml_retrain.log 2>&1

📊 Step 8: Old vs New Comparison

Monitor before/after metrics for 1-2 weeks:

Metrics to track:

  1. False Positive Rate (target: -80%)

    -- Weekly FP rate query
    SELECT 
      DATE(detected_at) as date,
      COUNT(*) FILTER (WHERE is_false_positive = true) as false_positives,
      COUNT(*) as total_detections,
      ROUND(100.0 * COUNT(*) FILTER (WHERE is_false_positive = true) / COUNT(*), 2) as fp_rate
    FROM detections
    WHERE detected_at >= NOW() - INTERVAL '7 days'
    GROUP BY DATE(detected_at)
    ORDER BY date;
    
  2. Detection Count per Confidence Level

    SELECT 
      confidence,
      COUNT(*) as count
    FROM detections
    WHERE detected_at >= NOW() - INTERVAL '24 hours'
    GROUP BY confidence
    ORDER BY 
      CASE confidence
        WHEN 'high' THEN 1
        WHEN 'medium' THEN 2
        WHEN 'low' THEN 3
      END;
    
  3. Blocked IPs Analysis

    # Query the MikroTik to list blocked IPs
    # Compare against high-confidence detections (a hypothetical sketch follows this list)
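A hypothetical sketch of that comparison using the librouteros package; the router address, credentials, address-list name (ids_blocklist) and the detections schema are assumptions to adapt to your setup:

import psycopg2
from librouteros import connect

# IPs currently in the MikroTik address list used for blocking (assumed name)
api = connect(host="192.168.1.1", username="admin", password="ROUTER_PASSWORD")
blocked = {row["address"]
           for row in api.path("ip", "firewall", "address-list")
           if row.get("list") == "ids_blocklist"}

# High-confidence detections from the last 24 hours
conn = psycopg2.connect(host="localhost", dbname="ids", user="postgres",
                        password="YOUR_PASSWORD")
with conn, conn.cursor() as cur:
    cur.execute("""SELECT DISTINCT source_ip FROM detections
                   WHERE confidence = 'high'
                     AND detected_at >= NOW() - INTERVAL '24 hours';""")
    high_conf = {ip for (ip,) in cur.fetchall()}
conn.close()

print("High-confidence but not blocked:", sorted(high_conf - blocked))
print("Blocked but not flagged high-confidence:", sorted(blocked - high_conf))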
    

🔧 Troubleshooting

Problema: "ModuleNotFoundError: No module named 'eif'"

Soluzione:

cd /opt/ids/python_ml
source venv/bin/activate
pip install eif==2.0.0

Problema: "Modello non addestrato. Esegui /train prima."

Soluzione:

# Verifica modelli esistano
ls -lh /opt/ids/python_ml/models/

# Se vuoti, esegui training
python train_hybrid.py --train --source database --db-password "PWD"

Problem: the API returns a 500 error

Solution:

# Check the logs
sudo journalctl -u ids-ml-backend -n 100

# Check USE_HYBRID_DETECTOR
grep USE_HYBRID_DETECTOR /opt/ids/python_ml/.env

# Fall back to the legacy detector
echo "USE_HYBRID_DETECTOR=false" >> /opt/ids/python_ml/.env
sudo systemctl restart ids-ml-backend

Problem: validation metrics do not pass (Precision < 90%)

Solution: tune the hyperparameters

# In ml_hybrid_detector.py, adjust the config:
'eif_contamination': 0.02,  # try values in 0.01-0.05
'chi2_top_k': 20,           # try 15-25
'confidence_high': 97.0,    # raise the confidence threshold

Final Checklist

  • Synthetic test passed (Precision ≥ 70%)
  • Training on real data completed
  • Models saved in python_ml/models/
  • USE_HYBRID_DETECTOR=true configured
  • ML backend restarted successfully
  • API /health shows "ml_model_type": "hybrid"
  • First detection run completed
  • Detections saved to the database with confidence levels
  • (Optional) CICIDS2017 validation with target metrics met
  • Periodic re-training configured (cron or manual)
  • Frontend dashboard shows detections with the new confidence levels

📚 Technical Documentation

Architecture

┌─────────────────┐
│  Network Logs   │
│  (PostgreSQL)   │
└────────┬────────┘
         │
         v
┌─────────────────┐
│ Feature Extract │  25 features per IP
│   (25 features) │  (volume, temporal, protocol, behavioral)
└────────┬────────┘
         │
         v
┌─────────────────┐
│ Chi-Square Test │  Feature Selection
│  (Select Top 18)│  Reduces dimensionality
└────────┬────────┘
         │
         v
┌─────────────────┐
│  Extended IF    │  Unsupervised Anomaly Detection
│ (contamination  │  n_estimators=250
│    = 0.03)      │  anomaly_score: 0-100
└────────┬────────┘
         │
         v
┌─────────────────┐
│ Confidence Score│  3-tier system
│  High ≥95%      │  - High: auto-block
│  Medium ≥70%    │  - Medium: manual review
│  Low <70%       │  - Low: monitor
└────────┬────────┘
         │
         v
┌─────────────────┐
│   Detections    │  Saved to DB
│   (Database)    │  With geo info + confidence
└─────────────────┘
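A minimal sketch of the 3-tier confidence mapping in the last stage, using the default thresholds from this document (the real scoring lives in ml_hybrid_detector.py):

def confidence_tier(anomaly_score: float) -> str:
    """Map a 0-100 anomaly score to an action tier (defaults from this checklist)."""
    if anomaly_score >= 95.0:    # confidence_high -> candidate for auto-block
        return "high"
    if anomaly_score >= 70.0:    # confidence_medium -> manual review
        return "medium"
    return "low"                 # below both thresholds -> monitor only

for score in (99.2, 82.5, 41.0):
    print(score, "->", confidence_tier(score))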

Hyperparameter Tuning

| Parameter         | Default | Recommended Range | Effect                                           |
|-------------------|---------|-------------------|--------------------------------------------------|
| eif_contamination | 0.03    | 0.01 - 0.05       | Expected share of anomalies. ↑ = more detections |
| eif_n_estimators  | 250     | 100 - 500         | Number of trees. ↑ = more stable but slower      |
| chi2_top_k        | 18      | 15 - 25           | Number of selected features                      |
| confidence_high   | 95.0    | 90.0 - 98.0       | Auto-block threshold. ↑ = more conservative      |
| confidence_medium | 70.0    | 60.0 - 80.0       | Manual review threshold                          |
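The same defaults as a config dict, mirroring the keys shown in the Troubleshooting section (the exact structure inside ml_hybrid_detector.py may differ):

HYBRID_DETECTOR_DEFAULTS = {
    "eif_contamination": 0.03,   # expected share of anomalies
    "eif_n_estimators": 250,     # trees in the Extended Isolation Forest
    "chi2_top_k": 18,            # features kept after Chi-Square selection
    "confidence_high": 95.0,     # auto-block threshold
    "confidence_medium": 70.0,   # manual review threshold
}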

🎯 Target Metrics Recap

| Metric    | Production Target | Synthetic Test | Notes                                     |
|-----------|-------------------|----------------|-------------------------------------------|
| Precision | ≥ 90%             | ≥ 70%          | Of 100 flagged, how many are real attacks |
| Recall    | ≥ 80%             | ≥ 60%          | Of 100 attacks, how many are detected     |
| F1-Score  | ≥ 85%             | ≥ 65%          | Harmonic mean of Precision/Recall         |
| FPR       | ≤ 5%              | ≤ 10%          | False positives on normal traffic         |
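A small helper mirroring the production column of the table, handy when checking a validation run by hand (values as fractions, not percentages):

def meets_production_targets(precision: float, recall: float,
                             f1: float, fpr: float) -> bool:
    """Production deployment criteria from the table above."""
    return precision >= 0.90 and recall >= 0.80 and f1 >= 0.85 and fpr <= 0.05

print(meets_production_targets(precision=0.93, recall=0.84, f1=0.88, fpr=0.03))  # True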

📞 Support

For problems or questions:

  1. Check logs: sudo journalctl -u ids-ml-backend -f
  2. Check the models: ls -lh /opt/ids/python_ml/models/
  3. Manual test: python train_hybrid.py --test
  4. Rollback: USE_HYBRID_DETECTOR=false + restart

Last updated: 24 Nov 2025 - v2.0.0