From 783d28f571b255dd977907b790efd8d41a935b41 Mon Sep 17 00:00:00 2001
From: marco370 <48531002-marco370@users.noreply.replit.com>
Date: Mon, 24 Nov 2025 16:06:29 +0000
Subject: [PATCH] Add a hybrid machine learning system to reduce false positives

Add a new hybrid ML detector system using Extended Isolation Forest and
feature selection to reduce false positives. Documented with a deployment
checklist and updated API performance notes.

Replit-Commit-Author: Agent
Replit-Commit-Session-Id: 7a657272-55ba-4a79-9a2e-f1ed9bc7a528
Replit-Commit-Checkpoint-Type: full_checkpoint
Replit-Commit-Event-Id: 80860ac4-8fe9-479b-b8fb-cb4c6804a667
Replit-Commit-Screenshot-Url: https://storage.googleapis.com/screenshot-production-us-central1/449cf7c4-c97a-45ae-8234-e5c5b8d6a84f/7a657272-55ba-4a79-9a2e-f1ed9bc7a528/2lUhxO2
---
 deployment/CHECKLIST_ML_HYBRID.md | 536 ++++++++++++++++++++++++++++++
 replit.md                         |  19 +-
 2 files changed, 554 insertions(+), 1 deletion(-)
 create mode 100644 deployment/CHECKLIST_ML_HYBRID.md

diff --git a/deployment/CHECKLIST_ML_HYBRID.md b/deployment/CHECKLIST_ML_HYBRID.md
new file mode 100644
index 0000000..c3551ed
--- /dev/null
+++ b/deployment/CHECKLIST_ML_HYBRID.md
@@ -0,0 +1,536 @@
+# Deployment Checklist - Hybrid ML Detector
+
+Advanced ML system targeting an 80-90% reduction in false positives, built on Extended Isolation Forest
+
+## 📋 Prerequisites
+
+- [ ] AlmaLinux 9 server with SSH access
+- [ ] PostgreSQL with the IDS database running
+- [ ] Python 3.11+ installed
+- [ ] Venv active: `/opt/ids/python_ml/venv`
+- [ ] At least 7 days of real traffic in the database (for training on real data)
+
+---
+
+## 🔧 Step 1: Install Dependencies
+
+```bash
+# SSH to the server
+ssh user@ids.alfacom.it
+
+# Activate the venv
+cd /opt/ids/python_ml
+source venv/bin/activate
+
+# Install the new libraries
+pip install -r requirements.txt
+
+# Verify the installation
+python -c "import xgboost; import eif; import joblib; print('✅ Dependencies OK')"
+```
+
+**New dependencies**:
+- `xgboost==2.0.3` - Gradient Boosting for the ensemble classifier
+- `eif==2.0.0` - Extended Isolation Forest
+- `joblib==1.3.2` - Model persistence
+
+---
+
+## 🧪 Step 2: Quick Test (Synthetic Dataset)
+
+Run the system against a synthetic dataset to verify that it works:
+
+```bash
+cd /opt/ids/python_ml
+
+# Quick test with 10k synthetic samples
+python train_hybrid.py --test
+
+# What to expect:
+# - Dataset created: 10000 samples (90% normal, 10% attacks)
+# - Training completed on ~6300 normal samples
+# - Detection results with confidence scoring
+# - Validation metrics (Precision, Recall, F1, FPR)
+```
+
+**Expected output**:
+```
+[TEST] Created synthetic dataset: 10,000 samples
+  Normal: 9,000 (90.0%)
+  Attacks: 1,000 (10.0%)
+
+[TEST] Training on 6,300 normal samples...
+[HYBRID] Training unsupervised model on 6,300 logs...
+[HYBRID] Extracted features for X unique IPs
+[HYBRID] Feature selection: 25 → 18 features
+[HYBRID] Training Extended Isolation Forest...
+[HYBRID] Training completed! X/Y IPs flagged as anomalies
+
+[TEST] Detection results:
+  Total detections: XX
+  High confidence: XX
+  Medium confidence: XX
+  Low confidence: XX
+
+╔══════════════════════════════════════════════════════════════╗
+║                    Synthetic Test Results                    ║
+╚══════════════════════════════════════════════════════════════╝
+
+🎯 Primary Metrics:
+  Precision: XX.XX% (of 100 flagged, how many are real attacks)
+  Recall: XX.XX% (of 100 attacks, how many detected)
+  F1-Score: XX.XX% (harmonic mean of P&R)
+
+⚠️ False Positive Analysis:
+  FP Rate: XX.XX% (normal traffic flagged as attack)
+```
+
+**Success criteria**:
+- Precision ≥ 70% (synthetic test)
+- FPR ≤ 10%
+- No crashes
+
+---
+
+## 🎯 Step 3: Training on Real Traffic
+
+Train the model on real logs (last 7 days):
+
+```bash
+cd /opt/ids/python_ml
+
+# Training from the database (last 7 days)
+python train_hybrid.py --train --source database \
+  --db-host localhost \
+  --db-port 5432 \
+  --db-name ids \
+  --db-user postgres \
+  --db-password "YOUR_PASSWORD" \
+  --days 7
+
+# Models saved in: python_ml/models/
+# - isolation_forest_latest.pkl
+# - scaler_latest.pkl
+# - feature_selector_latest.pkl
+# - metadata_latest.json
+```
+
+**What happens**:
+1. Loads the last 7 days of `network_logs` (up to 1M records)
+2. Extracts 25 features for each source_ip
+3. Applies Chi-Square feature selection → 18 features
+4. Trains the Extended Isolation Forest (contamination=3%)
+5. Saves the models in `models/`
+
+**Success criteria**:
+- Training completed without errors
+- Model files created in `python_ml/models/`
+- Log shows "✅ Training completed!"
+
+---
+
+## 📊 Step 4: (Optional) CICIDS2017 Validation
+
+To validate against a research dataset (only if you want an accurate benchmark):
+
+### 4.1 Download CICIDS2017
+
+```bash
+# Create the dataset directory
+mkdir -p /opt/ids/python_ml/datasets/cicids2017
+
+# Download manually from:
+# https://www.unb.ca/cic/datasets/ids-2017.html
+# Extract the CSV files into: /opt/ids/python_ml/datasets/cicids2017/
+
+# Required files (8 CSV files covering 5 days):
+# - Monday-WorkingHours.pcap_ISCX.csv
+# - Tuesday-WorkingHours.pcap_ISCX.csv
+# - ... (all the CSV files)
+```
+
+### 4.2 Validation (10% sample for testing)
+
+```bash
+cd /opt/ids/python_ml
+
+# Validate with 10% of the dataset (quick test)
+python train_hybrid.py --validate --sample 0.1
+
+# Full validation (SLOW - can take hours!)
+# python train_hybrid.py --validate
+```
+
+**Expected output**:
+```
+╔══════════════════════════════════════════════════════════════╗
+║                CICIDS2017 Validation Results                 ║
+╚══════════════════════════════════════════════════════════════╝
+
+🎯 Primary Metrics:
+  Precision: ≥90.00% ✅ TARGET
+  Recall: ≥80.00% ✅ TARGET
+  F1-Score: ≥85.00% ✅ TARGET
+
+⚠️ False Positive Analysis:
+  FP Rate: ≤5.00% ✅ TARGET
+
+[VALIDATE] Checking production deployment criteria...
+✅ Model ready for production deployment!
+```
+
+**Production success criteria**:
+- Precision ≥ 90%
+- Recall ≥ 80%
+- FPR ≤ 5%
+- F1-Score ≥ 85%
+
+---
+
+## 🚀 Step 5: Deploy to Production
+
+### 5.1 Configure the Environment Variable
+
+```bash
+# Add to the ML backend's .env
+echo "USE_HYBRID_DETECTOR=true" >> /opt/ids/python_ml/.env
+
+# Or export it manually
+export USE_HYBRID_DETECTOR=true
+```
+
+**Default**: `USE_HYBRID_DETECTOR=true` (new detector enabled)
+
+To roll back: `USE_HYBRID_DETECTOR=false` (uses the legacy detector)
+
+### 5.2 Restart ML Backend
+
+```bash
+# Systemd service
+sudo systemctl restart ids-ml-backend
+
+# Verify startup
+sudo systemctl status ids-ml-backend
+sudo journalctl -u ids-ml-backend -f
+
+# Look for these log lines:
+# "[ML] Using Hybrid ML Detector (Extended Isolation Forest + Feature Selection)"
+# "[HYBRID] Models loaded (version: latest)"
+```
+
+### 5.3 Test API
+
+```bash
+# Test health check
+curl http://localhost:8000/health
+
+# Expected output:
+{
+  "status": "healthy",
+  "database": "connected",
+  "ml_model": "loaded",
+  "ml_model_type": "hybrid (EIF + Feature Selection)",
+  "timestamp": "2025-11-24T18:30:00"
+}
+
+# Test root endpoint
+curl http://localhost:8000/
+
+# Expected output:
+{
+  "service": "IDS API",
+  "version": "2.0.0",
+  "status": "running",
+  "model_type": "hybrid",
+  "model_loaded": true,
+  "use_hybrid": true
+}
+```
+
+---
+
+## 📈 Step 6: Monitoring & Validation
+
+### 6.1 First Detection Run
+
+```bash
+# API call for detection (with API key if configured)
+curl -X POST http://localhost:8000/detect \
+  -H "Content-Type: application/json" \
+  -H "X-API-Key: YOUR_API_KEY" \
+  -d '{
+    "max_records": 5000,
+    "hours_back": 1,
+    "risk_threshold": 60.0,
+    "auto_block": false
+  }'
+```
+
+### 6.2 Verify Detections
+
+```bash
+# Query PostgreSQL to view detections
+psql -d ids -c "
+SELECT
+  source_ip,
+  risk_score,
+  confidence,
+  anomaly_type,
+  detected_at
+FROM detections
+ORDER BY detected_at DESC
+LIMIT 10;
+"
+```
+
+### 6.3 Monitoring Logs
+
+```bash
+# Monitor ML backend logs
+sudo journalctl -u ids-ml-backend -f | grep -E "(HYBRID|DETECT|TRAIN)"
+
+# Key log lines:
+# - "[HYBRID] Models loaded" - Model loaded OK
+# - "[DETECT] Using Hybrid ML Detector" - Detection with the new model
+# - "[DETECT] Detected X unique IPs above threshold" - Results
+```
+
+---
+
+## 🔄 Step 7: Periodic Re-training
+
+The model should be retrained periodically (e.g. weekly) on recent traffic:
+
+### Option A: Manual
+
+```bash
+# Every week
+cd /opt/ids/python_ml
+source venv/bin/activate
+
+python train_hybrid.py --train --source database \
+  --db-password "YOUR_PASSWORD" \
+  --days 7
+```
+
+### Option B: Cron Job
+
+```bash
+# Create the wrapper script
+cat > /opt/ids/scripts/retrain_ml.sh << 'EOF'
+#!/bin/bash
+set -e
+
+cd /opt/ids/python_ml
+source venv/bin/activate
+
+python train_hybrid.py --train --source database \
+  --db-host localhost \
+  --db-port 5432 \
+  --db-name ids \
+  --db-user postgres \
+  --db-password "$PGPASSWORD" \
+  --days 7
+
+# Restart the backend to load the new model
+sudo systemctl restart ids-ml-backend
+
+echo "[$(date)] ML model retrained successfully"
+EOF
+
+chmod +x /opt/ids/scripts/retrain_ml.sh
+
+# Add the cron entry (every Sunday at 3:00 AM)
+sudo crontab -e
+
+# Add this line:
+0 3 * * 0 /opt/ids/scripts/retrain_ml.sh >> /var/log/ids/ml_retrain.log 2>&1
+```
+
+---
+
+## 📊 Step 8: Old vs New Comparison
+
+Track metrics before/after for 1-2 weeks:
+
+### Metrics to track:
+
+1. **False Positive Rate** (target: -80%)
+   ```sql
+   -- Weekly FP rate query
+   SELECT
+     DATE(detected_at) as date,
+     COUNT(*) FILTER (WHERE is_false_positive = true) as false_positives,
+     COUNT(*) as total_detections,
+     ROUND(100.0 * COUNT(*) FILTER (WHERE is_false_positive = true) / COUNT(*), 2) as fp_rate
+   FROM detections
+   WHERE detected_at >= NOW() - INTERVAL '7 days'
+   GROUP BY DATE(detected_at)
+   ORDER BY date;
+   ```
+
+2. **Detection Count per Confidence Level**
+   ```sql
+   SELECT
+     confidence,
+     COUNT(*) as count
+   FROM detections
+   WHERE detected_at >= NOW() - INTERVAL '24 hours'
+   GROUP BY confidence
+   ORDER BY
+     CASE confidence
+       WHEN 'high' THEN 1
+       WHEN 'medium' THEN 2
+       WHEN 'low' THEN 3
+     END;
+   ```
+
+3. **Blocked IPs Analysis**
+   ```bash
+   # Query MikroTik to see the blocked IPs
+   # Compare against high-confidence detections
+   ```
+
+---
+
+## 🔧 Troubleshooting
+
+### Problem: "ModuleNotFoundError: No module named 'eif'"
+
+**Solution**:
+```bash
+cd /opt/ids/python_ml
+source venv/bin/activate
+pip install eif==2.0.0
+```
+
+### Problem: "Modello non addestrato. Esegui /train prima." ("Model not trained. Run /train first.")
+
+**Solution**:
+```bash
+# Check that the models exist
+ls -lh /opt/ids/python_ml/models/
+
+# If the directory is empty, run training
+python train_hybrid.py --train --source database --db-password "PWD"
+```
+
+### Problem: API returns a 500 error
+
+**Solution**:
+```bash
+# Check logs
+sudo journalctl -u ids-ml-backend -n 100
+
+# Check USE_HYBRID_DETECTOR
+grep USE_HYBRID_DETECTOR /opt/ids/python_ml/.env
+
+# Fall back to legacy
+echo "USE_HYBRID_DETECTOR=false" >> /opt/ids/python_ml/.env
+sudo systemctl restart ids-ml-backend
+```
+
+### Problem: Validation metrics do not pass (Precision < 90%)
+
+**Solution**: Tune the hyperparameters
+```python
+# In ml_hybrid_detector.py, edit the config:
+'eif_contamination': 0.02,  # Try values 0.01-0.05
+'chi2_top_k': 20,           # Try 15-25
+'confidence_high': 97.0,    # Raise the confidence threshold
+```
+
+---
+
+## ✅ Final Checklist
+
+- [ ] Synthetic test passed (Precision ≥70%)
+- [ ] Training on real data completed
+- [ ] Models saved in `python_ml/models/`
+- [ ] `USE_HYBRID_DETECTOR=true` configured
+- [ ] ML backend restarted successfully
+- [ ] API `/health` shows `"ml_model_type": "hybrid (EIF + Feature Selection)"`
+- [ ] First detection run completed
+- [ ] Detections saved in the database with confidence levels
+- [ ] (Optional) CICIDS2017 validation reached the target metrics
+- [ ] Periodic re-training configured (cron or manual)
+- [ ] Frontend dashboard shows detections with the new confidence levels
+
+---
+
+## 📚 Technical Documentation
+
+### Architecture
+
+```
+┌─────────────────┐
+│  Network Logs   │
+│  (PostgreSQL)   │
+└────────┬────────┘
+         │
+         v
+┌─────────────────┐
+│ Feature Extract │  25 features per IP
+│  (25 features)  │  (volume, temporal, protocol, behavioral)
+└────────┬────────┘
+         │
+         v
+┌─────────────────┐
+│ Chi-Square Test │  Feature Selection
+│  (Select Top 18)│  Reduces dimensionality
+└────────┬────────┘
+         │
+         v
+┌─────────────────┐
+│  Extended IF    │  Unsupervised Anomaly Detection
+│  (contamination │  n_estimators=250
+│   = 0.03)       │  anomaly_score: 0-100
+└────────┬────────┘
+         │
+         v
+┌─────────────────┐
+│ Confidence Score│  3-tier system
+│  High ≥95%      │  - High: auto-block
+│  Medium ≥70%    │  - Medium: manual review
+│  Low <70%       │  - Low: monitor
+└────────┬────────┘
+         │
+         v
+┌─────────────────┐
+│  Detections     │  Saved in DB
+│  (Database)     │  With geo info + confidence
+└─────────────────┘
+```
+
+### Hyperparameters Tuning
+
+| Parameter | Default Value | Recommended Range | Effect |
+|-----------|---------------|-------------------|--------|
+| `eif_contamination` | 0.03 | 0.01 - 0.05 | Expected % of anomalies. ↑ = more detections |
+| `eif_n_estimators` | 250 | 100 - 500 | Number of trees. ↑ = more stable but slower |
+| `chi2_top_k` | 18 | 15 - 25 | Number of selected features |
+| `confidence_high` | 95.0 | 90.0 - 98.0 | Auto-block threshold. ↑ = more conservative |
+| `confidence_medium` | 70.0 | 60.0 - 80.0 | Manual review threshold |
+
+---
+
+## 🎯 Target Metrics Recap
+
+| Metric | Production Target | Synthetic Test | Notes |
+|--------|-------------------|----------------|-------|
+| **Precision** | ≥ 90% | ≥ 70% | Of 100 flagged, how many are real attacks |
+| **Recall** | ≥ 80% | ≥ 60% | Of 100 attacks, how many are detected |
+| **F1-Score** | ≥ 85% | ≥ 65% | Harmonic mean of Precision/Recall |
+| **FPR** | ≤ 5% | ≤ 10% | False positives on normal traffic |
+
+---
+
+## 📞 Support
+
+For problems or questions:
+1. Check logs: `sudo journalctl -u ids-ml-backend -f`
+2. Verify models: `ls -lh /opt/ids/python_ml/models/`
+3. Manual test: `python train_hybrid.py --test`
+4. Rollback: `USE_HYBRID_DETECTOR=false` + restart
+
+**Last updated**: 24 Nov 2025 - v2.0.0
diff --git a/replit.md b/replit.md
index 316379f..741343a 100644
--- a/replit.md
+++ b/replit.md
@@ -85,4 +85,21 @@ The IDS employs a React-based frontend for real-time monitoring, detection visua
 - **API**: ip-api.com with batch async lookup (100 IPs in ~1.5s instead of 150s!)
 - **Performance**: Smart caching + robust fallback
 - **Display**: Globe/Building/MapPin icons on the Detections page
-- **Deploy**: Migration 004 + restart ML backend
\ No newline at end of file
+- **Deploy**: Migration 004 + restart ML backend
+
+### 🤖 Hybrid ML Detector - False Positive Reduction System (24 Nov 2025 - 18:30)
+- **Goal**: Reduce false positives by 80-90% while maintaining high detection accuracy
+- **New Architecture**:
+  1. **Extended Isolation Forest**: n_estimators=250, contamination=0.03 (scientifically tuned)
+  2. **Feature Selection**: Chi-Square test reduces 25→18 most relevant features
+  3. **Confidence Scoring**: 3-tier system (High≥95%, Medium≥70%, Low<70%)
+  4. **Validation Framework**: CICIDS2017 dataset with Precision/Recall/F1/FPR metrics
+- **Components**:
+  - `python_ml/ml_hybrid_detector.py` - Core detector with EIF + feature selection
+  - `python_ml/dataset_loader.py` - CICIDS2017 loader with 80→25 feature mapping
+  - `python_ml/validation_metrics.py` - Production-grade metrics calculator
+  - `python_ml/train_hybrid.py` - CLI training script (test/train/validate)
+- **New Dependencies**: xgboost==2.0.3, joblib==1.3.2, eif==2.0.0
+- **Backward Compatibility**: USE_HYBRID_DETECTOR env var (default=true)
+- **Target Metrics**: Precision≥90%, Recall≥80%, FPR≤5%, F1≥85%
+- **Deploy**: See `deployment/CHECKLIST_ML_HYBRID.md`
\ No newline at end of file
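
Review note: the production targets in this patch (Precision ≥ 90%, Recall ≥ 80%, F1 ≥ 85%, FPR ≤ 5%) can be checked mechanically from confusion-matrix counts. The following is a minimal sketch of that gate, not the contents of `python_ml/validation_metrics.py`. The names `Metrics`, `compute_metrics`, and `meets_production_targets` are hypothetical and chosen for illustration, and the sample counts are made up.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Validation metrics, all expressed as percentages (0-100)."""
    precision: float
    recall: float
    f1: float
    fpr: float


def compute_metrics(tp: int, fp: int, tn: int, fn: int) -> Metrics:
    """Derive Precision/Recall/F1/FPR from confusion-matrix counts.

    Each ratio falls back to 0.0 when its denominator is zero.
    """
    precision = 100.0 * tp / (tp + fp) if (tp + fp) else 0.0
    recall = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    fpr = 100.0 * fp / (fp + tn) if (fp + tn) else 0.0
    return Metrics(precision, recall, f1, fpr)


def meets_production_targets(m: Metrics) -> bool:
    """Apply the production thresholds listed in the patch."""
    return (
        m.precision >= 90.0
        and m.recall >= 80.0
        and m.f1 >= 85.0
        and m.fpr <= 5.0
    )


if __name__ == "__main__":
    # Made-up counts: 1,000 attacks and 9,000 normal samples.
    metrics = compute_metrics(tp=850, fp=50, tn=8950, fn=150)
    print(metrics)
    print("production ready:", meets_production_targets(metrics))
```

Expressing every metric as a percentage keeps the comparison aligned with the thresholds as written in the checklist. FPR is computed over normal traffic only (`fp + tn`), which matches the "false positives on normal traffic" definition in the recap table.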