# Deployment Checklist - Hybrid ML Detector

Advanced ML system targeting an 80-90% reduction in false positives, built on Extended Isolation Forest.

## 📋 Prerequisites

- [ ] AlmaLinux 9 server with SSH access
- [ ] PostgreSQL with the IDS database up and running
- [ ] Python 3.11+ installed
- [ ] Active venv: `/opt/ids/python_ml/venv`
- [ ] At least 7 days of real traffic in the database (for training on real data)

---

## 🔧 Step 1: Install Dependencies

```bash
# SSH into the server
ssh user@ids.alfacom.it

# Activate the venv
cd /opt/ids/python_ml
source venv/bin/activate

# Install the new libraries
pip install -r requirements.txt

# Verify the installation
python -c "import xgboost; import eif; import joblib; print('✅ Dependencies OK')"
```

**New dependencies**:
- `xgboost==2.0.3` - gradient boosting for the ensemble classifier
- `eif==2.0.0` - Extended Isolation Forest
- `joblib==1.3.2` - model persistence
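
If you also want to confirm the pinned versions (not just that the imports succeed), a minimal sketch using only the standard library's `importlib.metadata` is shown below; the pins mirror the list above and the script name is illustrative.

```python
# check_ml_deps.py - verify that the pinned ML dependencies are installed.
# Sketch only: the pins below mirror the dependency list in this checklist.
from importlib.metadata import version, PackageNotFoundError

PINNED = {"xgboost": "2.0.3", "eif": "2.0.0", "joblib": "1.3.2"}

def main() -> int:
    ok = True
    for pkg, expected in PINNED.items():
        try:
            installed = version(pkg)
        except PackageNotFoundError:
            print(f"❌ {pkg}: not installed")
            ok = False
            continue
        if installed != expected:
            ok = False
        status = "✅" if installed == expected else "⚠️"
        print(f"{status} {pkg}: installed {installed}, expected {expected}")
    return 0 if ok else 1

if __name__ == "__main__":
    raise SystemExit(main())
```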

---

## 🧪 Step 2: Quick Test (Synthetic Dataset)

Test the system against a synthetic dataset to verify it works end to end:

```bash
cd /opt/ids/python_ml

# Quick test with 10k synthetic samples
python train_hybrid.py --test

# What to expect:
# - Dataset created: 10,000 samples (90% normal, 10% attacks)
# - Training completed on a subset (~6,300) of the normal samples
# - Detection results with confidence scoring
# - Validation metrics (Precision, Recall, F1, FPR)
```

**Expected output**:
```
[TEST] Created synthetic dataset: 10,000 samples
       Normal: 9,000 (90.0%)
       Attacks: 1,000 (10.0%)

[TEST] Training on 6,300 normal samples...
[HYBRID] Training unsupervised model on 6,300 logs...
[HYBRID] Extracted features for X unique IPs
[HYBRID] Feature selection: 25 → 18 features
[HYBRID] Training Extended Isolation Forest...
[HYBRID] Training completed! X/Y IPs flagged as anomalies

[TEST] Detection results:
       Total detections: XX
       High confidence: XX
       Medium confidence: XX
       Low confidence: XX

╔══════════════════════════════════════════════════════════════╗
║                   Synthetic Test Results                      ║
╚══════════════════════════════════════════════════════════════╝

🎯 Primary Metrics:
   Precision: XX.XX% (of 100 flagged, how many are real attacks)
   Recall:    XX.XX% (of 100 attacks, how many detected)
   F1-Score:  XX.XX% (harmonic mean of P&R)

⚠️ False Positive Analysis:
   FP Rate: XX.XX% (normal traffic flagged as attack)
```
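
For reference, the four metrics in the report above follow directly from the confusion matrix. The sketch below is generic (it is not the code inside `train_hybrid.py`) and assumes scikit-learn is available in the venv, which is not one of the three new dependencies listed in Step 1.

```python
# metrics_sketch.py - how Precision / Recall / F1 / FPR relate to the
# confusion matrix. Illustrative only.
import numpy as np
from sklearn.metrics import confusion_matrix

# y_true: 1 = attack, 0 = normal; y_pred: 1 = flagged as anomaly
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)          # of the flagged IPs, how many are real attacks
recall    = tp / (tp + fn)          # of the real attacks, how many were flagged
f1        = 2 * precision * recall / (precision + recall)
fpr       = fp / (fp + tn)          # normal traffic wrongly flagged

print(f"Precision: {precision:.2%}  Recall: {recall:.2%}  "
      f"F1: {f1:.2%}  FPR: {fpr:.2%}")
```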

**Success criteria**:
- Precision ≥ 70% (synthetic test)
- FPR ≤ 10%
- No crashes

---

## 🎯 Step 3: Training on Real Traffic

Train the model on real logs (last 7 days):

```bash
cd /opt/ids/python_ml

# Train from the database (last 7 days)
python train_hybrid.py --train --source database \
    --db-host localhost \
    --db-port 5432 \
    --db-name ids \
    --db-user postgres \
    --db-password "YOUR_PASSWORD" \
    --days 7

# Models are saved to: python_ml/models/
# - isolation_forest_latest.pkl
# - scaler_latest.pkl
# - feature_selector_latest.pkl
# - metadata_latest.json
```

**What happens** (see the condensed sketch below):
1. Loads the last 7 days of `network_logs` (up to 1M records)
2. Extracts 25 features per source_ip
3. Applies Chi-Square feature selection → 18 features
4. Trains the Extended Isolation Forest (contamination = 3%)
5. Saves the models to `models/`
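
A condensed sketch of those five stages is below. It is not the real `train_hybrid.py`: scikit-learn's `IsolationForest` stands in for the Extended Isolation Forest from the `eif` package, a random matrix stands in for the 25 per-IP features, and the labels used by the chi-square step are synthetic, since chi-square scoring needs a target.

```python
# pipeline_sketch.py - condensed view of the training stages listed above.
import os
import numpy as np
import joblib
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.random((1000, 25))              # 1 row per source_ip, 25 features (stand-in)
y = rng.integers(0, 2, size=1000)       # labels only for the chi-square step (stand-in)

# 1) Scale to [0, 1] (chi2 requires non-negative inputs)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# 2) Chi-square feature selection: 25 -> 18 features
selector = SelectKBest(chi2, k=18)
X_sel = selector.fit_transform(X_scaled, y)

# 3) Isolation-forest training, contamination = 3% as documented above
model = IsolationForest(n_estimators=250, contamination=0.03, random_state=0)
model.fit(X_sel)

# 4) Persist the artefacts under the same file names used by the project
os.makedirs("models", exist_ok=True)
joblib.dump(model, "models/isolation_forest_latest.pkl")
joblib.dump(scaler, "models/scaler_latest.pkl")
joblib.dump(selector, "models/feature_selector_latest.pkl")
```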

**Success criteria**:
- Training completes without errors
- Model files created in `python_ml/models/`
- Log shows "✅ Training completed!"

---

## 📊 Step 4: (Optional) CICIDS2017 Validation

To validate against a scientific dataset (only if you want an accurate benchmark):

### 4.1 Download CICIDS2017

```bash
# Create the dataset directory
mkdir -p /opt/ids/python_ml/datasets/cicids2017

# Download manually from:
# https://www.unb.ca/cic/datasets/ids-2017.html
# Extract the CSV files into: /opt/ids/python_ml/datasets/cicids2017/

# Required files (8 CSV files):
# - Monday-WorkingHours.pcap_ISCX.csv
# - Tuesday-WorkingHours.pcap_ISCX.csv
# - ... (all the CSV files)
```
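
As a rough idea of what the validation step has to do with these files, the sketch below loads the CSVs with pandas and takes the same 10% sample used in 4.2. Column names vary slightly between releases of the dataset, so the label handling here is an assumption (a `Label` column, sometimes written with a leading space, where `BENIGN` marks normal flows).

```python
# cicids_load_sketch.py - load the CICIDS2017 CSVs and take a 10% sample,
# roughly what `--validate --sample 0.1` works with. Illustrative only.
from pathlib import Path
import pandas as pd

DATASET_DIR = Path("/opt/ids/python_ml/datasets/cicids2017")

frames = []
for csv_file in sorted(DATASET_DIR.glob("*.csv")):
    df = pd.read_csv(csv_file, low_memory=False)
    df.columns = df.columns.str.strip()      # drop stray leading spaces in headers
    frames.append(df)

data = pd.concat(frames, ignore_index=True)
sample = data.sample(frac=0.1, random_state=42)   # 10% sample for a quick run

# Binary ground truth: BENIGN -> 0 (normal), everything else -> 1 (attack)
y_true = (sample["Label"].str.upper() != "BENIGN").astype(int)
print(f"{len(sample):,} flows sampled, {y_true.mean():.1%} attacks")
```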

### 4.2 Validation (10% sample for a quick test)

```bash
cd /opt/ids/python_ml

# Validate on 10% of the dataset (quick test)
python train_hybrid.py --validate --sample 0.1

# Full validation (SLOW - can take hours!)
# python train_hybrid.py --validate
```

**Expected output**:
```
╔══════════════════════════════════════════════════════════════╗
║                 CICIDS2017 Validation Results                 ║
╚══════════════════════════════════════════════════════════════╝

🎯 Primary Metrics:
   Precision: ≥90.00% ✅ TARGET
   Recall:    ≥80.00% ✅ TARGET
   F1-Score:  ≥85.00% ✅ TARGET

⚠️ False Positive Analysis:
   FP Rate: ≤5.00% ✅ TARGET

[VALIDATE] Checking production deployment criteria...
✅ Model ready for production deployment!
```

**Production success criteria**:
- Precision ≥ 90%
- Recall ≥ 80%
- FPR ≤ 5%
- F1-Score ≥ 85%

---

## 🚀 Step 5: Deploy to Production

### 5.1 Configure the Environment Variable

```bash
# Add to the ML backend's .env
echo "USE_HYBRID_DETECTOR=true" >> /opt/ids/python_ml/.env

# Or export manually
export USE_HYBRID_DETECTOR=true
```

**Default**: `USE_HYBRID_DETECTOR=true` (new detector active)

For a rollback: `USE_HYBRID_DETECTOR=false` (uses the legacy detector)
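
For context, a backend typically consumes a flag like this with a small helper along these lines; this is a hedged sketch, not the actual selection logic in the ML backend, and only the variable name `USE_HYBRID_DETECTOR` comes from this checklist.

```python
# detector_toggle_sketch.py - how a backend can switch detectors based on the
# USE_HYBRID_DETECTOR flag. Module and function names are illustrative.
import os

def use_hybrid_detector() -> bool:
    """Treat '1', 'true', 'yes' (any case) as enabled; default is enabled."""
    return os.getenv("USE_HYBRID_DETECTOR", "true").strip().lower() in ("1", "true", "yes")

if use_hybrid_detector():
    print("[ML] Using Hybrid ML Detector (Extended Isolation Forest + Feature Selection)")
else:
    print("[ML] Using legacy detector")
```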

### 5.2 Restart the ML Backend

```bash
# Systemd service
sudo systemctl restart ids-ml-backend

# Verify startup
sudo systemctl status ids-ml-backend
sudo journalctl -u ids-ml-backend -f

# Look for these log lines:
# "[ML] Using Hybrid ML Detector (Extended Isolation Forest + Feature Selection)"
# "[HYBRID] Models loaded (version: latest)"
```

### 5.3 Test the API

```bash
# Health check
curl http://localhost:8000/health

# Expected output:
{
  "status": "healthy",
  "database": "connected",
  "ml_model": "loaded",
  "ml_model_type": "hybrid (EIF + Feature Selection)",
  "timestamp": "2025-11-24T18:30:00"
}

# Root endpoint
curl http://localhost:8000/

# Expected output:
{
  "service": "IDS API",
  "version": "2.0.0",
  "status": "running",
  "model_type": "hybrid",
  "model_loaded": true,
  "use_hybrid": true
}
```
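
The same two checks can be scripted as a quick smoke test; the sketch below assumes the `requests` package is available and asserts the fields shown in the expected output above.

```python
# api_smoke_test.py - same checks as the curl calls above, in Python.
import requests

BASE_URL = "http://localhost:8000"

health = requests.get(f"{BASE_URL}/health", timeout=10).json()
assert health["status"] == "healthy", health
assert health["ml_model"] == "loaded", health
assert health["ml_model_type"].startswith("hybrid"), health

root = requests.get(f"{BASE_URL}/", timeout=10).json()
assert root["model_type"] == "hybrid" and root["use_hybrid"] is True, root

print("✅ Hybrid detector is live:", health["ml_model_type"])
```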

---

## 📈 Step 6: Monitoring & Validation

### 6.1 First Detection Run

```bash
# API call for detection (with API key, if configured)
curl -X POST http://localhost:8000/detect \
  -H "Content-Type: application/json" \
  -H "X-API-Key: YOUR_API_KEY" \
  -d '{
    "max_records": 5000,
    "hours_back": 1,
    "risk_threshold": 60.0,
    "auto_block": false
  }'
```
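
The same call can be issued from Python, which makes it easier to inspect the response; the request body mirrors the curl call above, while the response handling is an assumption (it only relies on a possible list of detections carrying a `confidence` field, as the database rows in Step 6.2 do).

```python
# detect_run_sketch.py - issue the first detection run from Python instead of curl.
from collections import Counter
import requests

resp = requests.post(
    "http://localhost:8000/detect",
    headers={"Content-Type": "application/json", "X-API-Key": "YOUR_API_KEY"},
    json={"max_records": 5000, "hours_back": 1, "risk_threshold": 60.0, "auto_block": False},
    timeout=120,
)
resp.raise_for_status()
result = resp.json()
print(result)

# If the response carries a list of detections, summarize them by confidence level.
detections = result.get("detections", []) if isinstance(result, dict) else []
if detections:
    print(Counter(d.get("confidence", "unknown") for d in detections))
```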

### 6.2 Verify Detections

```bash
# Query PostgreSQL to inspect detections
psql -d ids -c "
  SELECT
    source_ip,
    risk_score,
    confidence,
    anomaly_type,
    detected_at
  FROM detections
  ORDER BY detected_at DESC
  LIMIT 10;
"
```

### 6.3 Monitoring Logs

```bash
# Watch the ML backend logs
sudo journalctl -u ids-ml-backend -f | grep -E "(HYBRID|DETECT|TRAIN)"

# Key log lines:
# - "[HYBRID] Models loaded"                         - model loaded OK
# - "[DETECT] Using Hybrid ML Detector"              - detection run with the new model
# - "[DETECT] Detected X unique IPs above threshold" - results
```

---

## 🔄 Step 7: Periodic Re-training

The model should be re-trained periodically (e.g. weekly) on recent traffic:

### Option A: Manual

```bash
# Every week
cd /opt/ids/python_ml
source venv/bin/activate

python train_hybrid.py --train --source database \
    --db-password "YOUR_PASSWORD" \
    --days 7
```

### Option B: Cron Job

```bash
# Create a wrapper script
cat > /opt/ids/scripts/retrain_ml.sh << 'EOF'
#!/bin/bash
set -e

cd /opt/ids/python_ml
source venv/bin/activate

# Note: PGPASSWORD must be available in the cron environment
python train_hybrid.py --train --source database \
    --db-host localhost \
    --db-port 5432 \
    --db-name ids \
    --db-user postgres \
    --db-password "$PGPASSWORD" \
    --days 7

# Restart the backend to load the new model
sudo systemctl restart ids-ml-backend

echo "[$(date)] ML model retrained successfully"
EOF

chmod +x /opt/ids/scripts/retrain_ml.sh

# Add a cron entry (every Sunday at 3:00 AM)
sudo crontab -e

# Add this line:
0 3 * * 0 /opt/ids/scripts/retrain_ml.sh >> /var/log/ids/ml_retrain.log 2>&1
```

---

## 📊 Step 8: Old vs New Comparison

Monitor before/after metrics for 1-2 weeks:

### Metrics to track:

1. **False Positive Rate** (goal: -80%)
   ```sql
   -- Weekly FP rate query
   SELECT
     DATE(detected_at) as date,
     COUNT(*) FILTER (WHERE is_false_positive = true) as false_positives,
     COUNT(*) as total_detections,
     ROUND(100.0 * COUNT(*) FILTER (WHERE is_false_positive = true) / COUNT(*), 2) as fp_rate
   FROM detections
   WHERE detected_at >= NOW() - INTERVAL '7 days'
   GROUP BY DATE(detected_at)
   ORDER BY date;
   ```

2. **Detection Count per Confidence Level**
   ```sql
   SELECT
     confidence,
     COUNT(*) as count
   FROM detections
   WHERE detected_at >= NOW() - INTERVAL '24 hours'
   GROUP BY confidence
   ORDER BY
     CASE confidence
       WHEN 'high' THEN 1
       WHEN 'medium' THEN 2
       WHEN 'low' THEN 3
     END;
   ```

3. **Blocked IPs Analysis**
   ```bash
   # Query the MikroTik to list blocked IPs
   # Compare them against high-confidence detections (see the sketch below)
   ```
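
A minimal sketch of that comparison is below. It assumes the blocked IPs have already been exported from the MikroTik to a plain-text file (one IP per line; the path is illustrative) and that `psycopg2` is available; the detections query uses the columns shown in Step 6.2.

```python
# blocked_vs_detected_sketch.py - compare IPs blocked on the router with
# high-confidence detections from the last 24 hours. Illustrative only.
from pathlib import Path
import psycopg2

blocked_ips = {
    line.strip()
    for line in Path("/tmp/mikrotik_blocked_ips.txt").read_text().splitlines()
    if line.strip()
}

with psycopg2.connect(dbname="ids", user="postgres",
                      password="YOUR_PASSWORD", host="localhost") as conn:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT DISTINCT source_ip FROM detections "
            "WHERE confidence = 'high' AND detected_at >= NOW() - INTERVAL '24 hours'"
        )
        high_conf = {row[0] for row in cur.fetchall()}

print("High-confidence detections not blocked yet:", sorted(high_conf - blocked_ips))
print("Blocked IPs never flagged high-confidence:", sorted(blocked_ips - high_conf))
```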

---

## 🔧 Troubleshooting

### Problem: "ModuleNotFoundError: No module named 'eif'"

**Solution**:
```bash
cd /opt/ids/python_ml
source venv/bin/activate
pip install eif==2.0.0
```

### Problem: "Modello non addestrato. Esegui /train prima." ("Model not trained. Run /train first.")

**Solution**:
```bash
# Check that the model files exist
ls -lh /opt/ids/python_ml/models/

# If the directory is empty, run training
python train_hybrid.py --train --source database --db-password "PWD"
```

### Problem: The API returns a 500 error

**Solution**:
```bash
# Check the logs
sudo journalctl -u ids-ml-backend -n 100

# Verify USE_HYBRID_DETECTOR
grep USE_HYBRID_DETECTOR /opt/ids/python_ml/.env

# Fall back to the legacy detector
echo "USE_HYBRID_DETECTOR=false" >> /opt/ids/python_ml/.env
sudo systemctl restart ids-ml-backend
```

### Problem: Validation metrics don't pass (Precision < 90%)

**Solution**: tune the hyperparameters
```python
# In ml_hybrid_detector.py, adjust the config entries:
'eif_contamination': 0.02,   # try values in 0.01-0.05
'chi2_top_k': 20,            # try 15-25
'confidence_high': 97.0,     # raise the confidence threshold
```

---

## ✅ Final Checklist

- [ ] Synthetic test passed (Precision ≥ 70%)
- [ ] Training on real data completed
- [ ] Models saved in `python_ml/models/`
- [ ] `USE_HYBRID_DETECTOR=true` configured
- [ ] ML backend restarted successfully
- [ ] API `/health` shows `"ml_model_type": "hybrid"`
- [ ] First detection run completed
- [ ] Detections saved to the database with confidence levels
- [ ] (Optional) CICIDS2017 validation with target metrics reached
- [ ] Periodic re-training configured (cron or manual)
- [ ] Frontend dashboard shows detections with the new confidence levels

---

## 📚 Technical Documentation

### Architecture

```
┌─────────────────┐
│  Network Logs   │
│  (PostgreSQL)   │
└────────┬────────┘
         │
         v
┌─────────────────┐
│ Feature Extract │  25 features per IP
│  (25 features)  │  (volume, temporal, protocol, behavioral)
└────────┬────────┘
         │
         v
┌─────────────────┐
│ Chi-Square Test │  Feature selection
│ (Select Top 18) │  Reduces dimensionality
└────────┬────────┘
         │
         v
┌─────────────────┐
│   Extended IF   │  Unsupervised anomaly detection
│ (contamination  │  n_estimators=250
│    = 0.03)      │  anomaly_score: 0-100
└────────┬────────┘
         │
         v
┌─────────────────┐
│Confidence Score │  3-tier system
│   High ≥95%     │  - High: auto-block
│  Medium ≥70%    │  - Medium: manual review
│   Low <70%      │  - Low: monitor
└────────┬────────┘
         │
         v
┌─────────────────┐
│   Detections    │  Saved to DB
│   (Database)    │  With geo info + confidence
└─────────────────┘
```
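
The 3-tier mapping in the confidence-score box can be written as a small function; the thresholds below are the defaults from the hyperparameter table that follows, and the function name is illustrative.

```python
# confidence_tiers_sketch.py - the 3-tier confidence mapping from the diagram,
# using the default thresholds from the hyperparameter table below.
CONFIG = {"confidence_high": 95.0, "confidence_medium": 70.0}

def confidence_tier(anomaly_score: float) -> str:
    """Map a 0-100 anomaly score to high (auto-block) / medium (review) / low (monitor)."""
    if anomaly_score >= CONFIG["confidence_high"]:
        return "high"
    if anomaly_score >= CONFIG["confidence_medium"]:
        return "medium"
    return "low"

for score in (99.2, 83.5, 41.0):
    print(score, "->", confidence_tier(score))
```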

### Hyperparameter Tuning

| Parameter | Default | Recommended Range | Effect |
|-----------|---------|-------------------|--------|
| `eif_contamination` | 0.03 | 0.01 - 0.05 | Expected share of anomalies. ↑ = more detections |
| `eif_n_estimators` | 250 | 100 - 500 | Number of trees. ↑ = more stable but slower |
| `chi2_top_k` | 18 | 15 - 25 | Number of selected features |
| `confidence_high` | 95.0 | 90.0 - 98.0 | Auto-block threshold. ↑ = more conservative |
| `confidence_medium` | 70.0 | 60.0 - 80.0 | Manual-review threshold |
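
Gathered in one place, the defaults above look like the dict below; the key names are the ones used in the table and in the Troubleshooting section, while the range check is just an illustrative helper.

```python
# hybrid_config_sketch.py - defaults from the table above, plus a check that
# any override stays inside the recommended range.
DEFAULTS = {
    "eif_contamination": 0.03,
    "eif_n_estimators": 250,
    "chi2_top_k": 18,
    "confidence_high": 95.0,
    "confidence_medium": 70.0,
}

RECOMMENDED_RANGES = {
    "eif_contamination": (0.01, 0.05),
    "eif_n_estimators": (100, 500),
    "chi2_top_k": (15, 25),
    "confidence_high": (90.0, 98.0),
    "confidence_medium": (60.0, 80.0),
}

def validate(config: dict) -> None:
    for key, value in config.items():
        low, high = RECOMMENDED_RANGES[key]
        if not (low <= value <= high):
            raise ValueError(f"{key}={value} is outside the recommended range {low}-{high}")

validate(DEFAULTS)
print("Config within recommended ranges:", DEFAULTS)
```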

---

## 🎯 Target Metrics Recap

| Metric | Production Target | Synthetic Test | Notes |
|--------|-------------------|----------------|-------|
| **Precision** | ≥ 90% | ≥ 70% | Of 100 flagged IPs, how many are real attacks |
| **Recall** | ≥ 80% | ≥ 60% | Of 100 attacks, how many are detected |
| **F1-Score** | ≥ 85% | ≥ 65% | Harmonic mean of Precision and Recall |
| **FPR** | ≤ 5% | ≤ 10% | False positives on normal traffic |

---

## 📞 Support

For problems or questions:
1. Check the logs: `sudo journalctl -u ids-ml-backend -f`
2. Verify the models: `ls -lh /opt/ids/python_ml/models/`
3. Manual test: `python train_hybrid.py --test`
4. Rollback: `USE_HYBRID_DETECTOR=false` + restart

**Last updated**: 24 Nov 2025 - v2.0.0