Deployment Checklist - Hybrid ML Detector

Advanced ML system targeting an 80-90% reduction in false positives, built on Extended Isolation Forest

📋 Prerequisites

AlmaLinux 9 server with SSH access
PostgreSQL with the IDS database up and running
Python 3.11+ installed
Virtual environment set up at /opt/ids/python_ml/venv
At least 7 days of real traffic in the database (for training on real data)

🔧 Step 1: Install Dependencies

⚠️ IMPORTANT: Use the dedicated script, which activates the venv and handles the build dependencies.

# SSH into the server
ssh user@ids.alfacom.it

# Run the ML dependencies installation script
cd /opt/ids
chmod +x deployment/install_ml_deps.sh
./deployment/install_ml_deps.sh

# Expected output:
# 🔧 Attivazione virtual environment...
# 📍 Python in uso: /opt/ids/python_ml/venv/bin/python
# ✅ Cython installato con successo
# ✅ numpy 1.26.2 già installato
# ✅ Dipendenze ML installate con successo
# ✅ eif importato correttamente
# ✅ TUTTO OK! Hybrid ML Detector pronto per l'uso

New dependencies:

Cython==3.0.5 - Build dependency for eif (installed in script phase 1)
numpy==1.26.2 - Build dependency for eif (verified in script phase 2)
xgboost==2.0.3 - Gradient boosting for the ensemble classifier (script phase 3)
eif==2.0.2 - Extended Isolation Forest (script phase 3)
joblib==1.3.2 - Model persistence (script phase 3)

Why does the script run in 3 phases?

eif is compiled from source and requires Cython while setup.py runs
The eif setup.py imports numpy, so numpy must already be installed
The script activates the venv and installs sequentially: Cython → numpy check → eif
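
The ordering constraint above can be checked from Python before attempting the eif build. This is a minimal sketch, not part of install_ml_deps.sh: the helper name `first_missing` and the injectable `is_installed` callback are hypothetical, and the module names are the import names of the packages listed above.

```python
from importlib.util import find_spec

# Import names of the packages from the checklist, in installation order.
BUILD_ORDER = ["Cython", "numpy", "eif"]


def first_missing(modules=BUILD_ORDER, is_installed=None):
    """Return the first module (in install order) that cannot be imported, or None."""
    if is_installed is None:
        # find_spec returns None for a top-level module that is not installed.
        is_installed = lambda name: find_spec(name) is not None
    for name in modules:
        if not is_installed(name):
            return name
    return None
```

Run inside the activated venv, `first_missing()` returning "Cython" or "numpy" suggests the build prerequisites are not in place yet, while "eif" suggests only the final install step is outstanding.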

🧪 Step 2: Quick Test (Synthetic Dataset)

Test the system on a synthetic dataset to verify that it works:

cd /opt/ids/python_ml

# Quick test with 10k synthetic samples
python train_hybrid.py --test

# What to expect:
# - Dataset created: 10000 samples (90% normal, 10% attacks)
# - Training completed on ~6,300 normal samples
# - Detection results with confidence scoring
# - Validation metrics (Precision, Recall, F1, FPR)

Expected output:

[TEST] Created synthetic dataset: 10,000 samples
  Normal:  9,000 (90.0%)
  Attacks: 1,000 (10.0%)

[TEST] Training on 6,300 normal samples...
[HYBRID] Training unsupervised model on 6,300 logs...
[HYBRID] Extracted features for X unique IPs
[HYBRID] Feature selection: 25 → 18 features
[HYBRID] Training Extended Isolation Forest...
[HYBRID] Training completed! X/Y IPs flagged as anomalies

[TEST] Detection results:
  Total detections: XX
  High confidence:   XX
  Medium confidence: XX
  Low confidence:    XX

╔══════════════════════════════════════════════════════════════╗
║                    Synthetic Test Results                     ║
╚══════════════════════════════════════════════════════════════╝

🎯 Primary Metrics:
  Precision:     XX.XX%  (of 100 flagged, how many are real attacks)
  Recall:        XX.XX%  (of 100 attacks, how many detected)
  F1-Score:      XX.XX%  (harmonic mean of P&R)
  
⚠️  False Positive Analysis:
  FP Rate:       XX.XX%  (normal traffic flagged as attack)

Success criteria:

Precision ≥ 70% (synthetic test)
FPR ≤ 10%
No crashes
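
For reference, the four reported metrics follow from the confusion-matrix counts using the standard definitions. The sketch below is illustrative only: the function name `detection_metrics` is hypothetical and the counts in the usage note are made-up numbers, not results from train_hybrid.py.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute precision, recall, F1 and false positive rate from raw counts.

    tp: attacks flagged, fp: normal traffic flagged,
    tn: normal traffic not flagged, fn: attacks missed.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}
```

For example, `detection_metrics(tp=800, fp=200, tn=8800, fn=200)` gives a precision of 0.8 and an FPR of about 0.022, which would satisfy both synthetic-test thresholds.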

🎯 Step 3: Training on Real Traffic

Train the model on real logs (last 7 days):

cd /opt/ids/python_ml

# Training from the database (last 7 days)
python train_hybrid.py --train --source database \
  --db-host localhost \
  --db-port 5432 \
  --db-name ids \
  --db-user postgres \
  --db-password "YOUR_PASSWORD" \
  --days 7

# Models are saved in: python_ml/models/
# - isolation_forest_latest.pkl
# - scaler_latest.pkl
# - feature_selector_latest.pkl
# - metadata_latest.json

What happens:

Loads the last 7 days of network_logs (up to 1M records)
Extracts 25 features for each source_ip
Applies Chi-Square feature selection → 18 features
Trains the Extended Isolation Forest (contamination=0.03, i.e. 3%)
Saves the models in models/
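
To make the contamination setting concrete: it is the fraction of training IPs expected to be anomalous, so roughly that share ends up flagged. The sketch below illustrates the idea with plain Python. It assumes a higher score means more anomalous and uses a hypothetical helper `flag_by_contamination`; how ml_hybrid_detector.py actually derives its threshold is not shown in this checklist.

```python
def flag_by_contamination(scores, contamination=0.03):
    """Return the set of IPs whose anomaly score falls in the top `contamination` fraction.

    scores: dict mapping source_ip -> anomaly score (assumed: higher = more anomalous).
    """
    if not scores:
        return set()
    n_flag = int(round(contamination * len(scores)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n_flag])
```

With 100 IPs and contamination=0.03, the three highest-scoring IPs are flagged, which matches the "X/Y IPs flagged as anomalies" line printed at the end of training.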

Success criteria:

Training completes without errors
Model files are created in python_ml/models/
The log shows "✅ Training completed!"

📊 Step 4: (Optional) CICIDS2017 Validation

To validate against a research benchmark dataset (only needed if you want an accurate benchmark):

4.1 Download CICIDS2017

# Create the dataset directory
mkdir -p /opt/ids/python_ml/datasets/cicids2017

# Download manually from:
# https://www.unb.ca/cic/datasets/ids-2017.html
# Extract the CSV files into: /opt/ids/python_ml/datasets/cicids2017/

# Required files (8 CSV files, Monday to Friday):
# - Monday-WorkingHours.pcap_ISCX.csv
# - Tuesday-WorkingHours.pcap_ISCX.csv
# - ... (all the CSV files)

4.2 Validation (10% sample for testing)

cd /opt/ids/python_ml

# Validation with 10% of the dataset (quick test)
python train_hybrid.py --validate --sample 0.1

# Full validation (SLOW - can take hours!)
# python train_hybrid.py --validate

Expected output:

╔══════════════════════════════════════════════════════════════╗
║              CICIDS2017 Validation Results                    ║
╚══════════════════════════════════════════════════════════════╝

🎯 Primary Metrics:
  Precision:     ≥90.00%  ✅ TARGET
  Recall:        ≥80.00%  ✅ TARGET
  F1-Score:      ≥85.00%  ✅ TARGET
  
⚠️  False Positive Analysis:
  FP Rate:       ≤5.00%   ✅ TARGET

[VALIDATE] Checking production deployment criteria...
✅ Model ready for production deployment!

Production success criteria:

Precision ≥ 90%
Recall ≥ 80%
FPR ≤ 5%
F1-Score ≥ 85%

🚀 Step 5: Deploy to Production

5.1 Configure the Environment Variable

# Add to the ML backend .env
echo "USE_HYBRID_DETECTOR=true" >> /opt/ids/python_ml/.env

# Or export it manually
export USE_HYBRID_DETECTOR=true

Default: USE_HYBRID_DETECTOR=true (new detector enabled)

For rollback: USE_HYBRID_DETECTOR=false (uses the legacy detector)
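
A flag with a true default is typically read along these lines. This is only a sketch of the behaviour described above: the function name and the set of accepted spellings ("1", "true", "yes", "on") are assumptions, since the backend's actual parsing code is not part of this checklist.

```python
import os


def use_hybrid_detector(environ=None):
    """Return True unless USE_HYBRID_DETECTOR is set to a non-truthy value.

    The variable defaults to "true" when unset, matching the documented default.
    """
    env = os.environ if environ is None else environ
    raw = env.get("USE_HYBRID_DETECTOR", "true")
    # Assumed truthy spellings; anything else selects the legacy detector.
    return raw.strip().lower() in {"1", "true", "yes", "on"}
```

Passing a plain dict, for example `use_hybrid_detector({"USE_HYBRID_DETECTOR": "false"})`, makes the rollback path easy to check without touching the real environment.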

5.2 Restart ML Backend

# Systemd service
sudo systemctl restart ids-ml-backend

# Verify startup
sudo systemctl status ids-ml-backend
sudo journalctl -u ids-ml-backend -f

# Look for these log lines:
# "[ML] Using Hybrid ML Detector (Extended Isolation Forest + Feature Selection)"
# "[HYBRID] Models loaded (version: latest)"

5.3 Test API

# Test health check
curl http://localhost:8000/health

# Expected output:
{
  "status": "healthy",
  "database": "connected",
  "ml_model": "loaded",
  "ml_model_type": "hybrid (EIF + Feature Selection)",
  "timestamp": "2025-11-24T18:30:00"
}

# Test root endpoint
curl http://localhost:8000/

# Expected output:
{
  "service": "IDS API",
  "version": "2.0.0",
  "status": "running",
  "model_type": "hybrid",
  "model_loaded": true,
  "use_hybrid": true
}
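
The health response can also be checked programmatically, for example in a post-deploy script. The sketch below only inspects the fields shown in the expected /health output above; the function name `hybrid_is_active` is hypothetical.

```python
import json


def hybrid_is_active(health_body):
    """Return True if a /health response body reports a loaded hybrid model."""
    data = json.loads(health_body)
    return (
        data.get("status") == "healthy"
        and data.get("ml_model") == "loaded"
        and str(data.get("ml_model_type", "")).startswith("hybrid")
    )
```

Feeding it the output of `curl -s http://localhost:8000/health` would return False if the backend fell back to the legacy detector or the model failed to load.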

📈 Step 6: Monitoring & Validation

6.1 First Detection Run

# API call for detection (with the API key, if one is configured)
curl -X POST http://localhost:8000/detect \
  -H "Content-Type: application/json" \
  -H "X-API-Key: YOUR_API_KEY" \
  -d '{
    "max_records": 5000,
    "hours_back": 1,
    "risk_threshold": 60.0,
    "auto_block": false
  }'
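
The same request can be prepared from Python using only the standard library. This sketch mirrors the curl call above (same endpoint, headers and JSON fields) and does not send anything by itself; the helper name `build_detect_request` is hypothetical.

```python
import json
import urllib.request


def build_detect_request(base_url="http://localhost:8000", api_key=None,
                         max_records=5000, hours_back=1,
                         risk_threshold=60.0, auto_block=False):
    """Build (but do not send) the POST /detect request shown in the curl example."""
    payload = {
        "max_records": max_records,
        "hours_back": hours_back,
        "risk_threshold": risk_threshold,
        "auto_block": auto_block,
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only sent when an API key is configured on the backend.
        headers["X-API-Key"] = api_key
    return urllib.request.Request(
        f"{base_url}/detect",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

The request can then be sent with `urllib.request.urlopen(req, timeout=30)`. Keeping `auto_block` at False for the first run means detections are recorded without blocking any IP.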

6.2 Verify Detections

# Query PostgreSQL to view the detections
psql -d ids -c "
SELECT 
  source_ip, 
  risk_score, 
  confidence, 
  anomaly_type,
  detected_at
FROM detections 
ORDER BY detected_at DESC 
LIMIT 10;
"

6.3 Monitoring Logs

# Follow the ML backend logs
sudo journalctl -u ids-ml-backend -f | grep -E "(HYBRID|DETECT|TRAIN)"

# Key log lines:
# - "[HYBRID] Models loaded" - Model loaded OK
# - "[DETECT] Using Hybrid ML Detector" - Detection with the new model
# - "[DETECT] Detected X unique IPs above threshold" - Results

🔄 Step 7: Periodic Re-training

The model should be re-trained periodically (e.g. weekly) on recent traffic:

Option A: Manual

# Every week
cd /opt/ids/python_ml
source venv/bin/activate

python train_hybrid.py --train --source database \
  --db-password "YOUR_PASSWORD" \
  --days 7

Option B: Cron Job

# Create the wrapper script
cat > /opt/ids/scripts/retrain_ml.sh << 'EOF'
#!/bin/bash
set -e

cd /opt/ids/python_ml
source venv/bin/activate

python train_hybrid.py --train --source database \
  --db-host localhost \
  --db-port 5432 \
  --db-name ids \
  --db-user postgres \
  --db-password "$PGPASSWORD" \
  --days 7

# Restart the backend to load the new model
sudo systemctl restart ids-ml-backend

echo "[$(date)] ML model retrained successfully"
EOF

chmod +x /opt/ids/scripts/retrain_ml.sh

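# Add the cron entry (every Sunday at 3:00 AM)
sudo crontab -e

# Add this line (cron does not inherit your shell environment, so PGPASSWORD must be
# defined in the crontab or in the script, and /var/log/ids/ must already exist):
0 3 * * 0 /opt/ids/scripts/retrain_ml.sh >> /var/log/ids/ml_retrain.log 2>&1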

📊 Step 8: Old vs New Comparison

Monitor the metrics before and after the switch for 1-2 weeks:

Metrics to track:

False Positive Rate (target: -80%)

-- Daily FP rate over the last 7 days
SELECT 
  DATE(detected_at) as date,
  COUNT(*) FILTER (WHERE is_false_positive = true) as false_positives,
  COUNT(*) as total_detections,
  ROUND(100.0 * COUNT(*) FILTER (WHERE is_false_positive = true) / COUNT(*), 2) as fp_rate
FROM detections
WHERE detected_at >= NOW() - INTERVAL '7 days'
GROUP BY DATE(detected_at)
ORDER BY date;

Detection Count per Confidence Level

SELECT 
  confidence,
  COUNT(*) as count
FROM detections
WHERE detected_at >= NOW() - INTERVAL '24 hours'
GROUP BY confidence
ORDER BY 
  CASE confidence
    WHEN 'high' THEN 1
    WHEN 'medium' THEN 2
    WHEN 'low' THEN 3
  END;

Blocked IPs Analysis

# Query MikroTik to list the blocked IPs
# Compare them with the high-confidence detections
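
The -80% false-positive target is a relative change between the legacy and hybrid periods, not an absolute rate. A small sketch of that arithmetic follows; the function name and the example rates are illustrative, not measured values.

```python
def relative_change_pct(before, after):
    """Relative change from `before` to `after`, in percent (negative = reduction)."""
    if before == 0:
        raise ValueError("baseline rate must be non-zero")
    return 100.0 * (after - before) / before


def meets_fp_target(before, after, target_pct=-80.0):
    """True if the FP rate dropped by at least the target percentage."""
    return relative_change_pct(before, after) <= target_pct
```

For instance, going from a weekly FP rate of 25% with the legacy detector to 5% with the hybrid one is a change of -80%, which just meets the target, while 25% to 10% (-60%) does not.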

🔧 Troubleshooting

Problem: "ModuleNotFoundError: No module named 'eif'"

Solution:

cd /opt/ids/python_ml
source venv/bin/activate
# eif builds from source: install its build dependencies first
pip install Cython==3.0.5 numpy==1.26.2
pip install eif==2.0.2

Problem: "Modello non addestrato. Esegui /train prima." (Model not trained. Run /train first.)

Solution:

# Check that the model files exist
ls -lh /opt/ids/python_ml/models/

# If the directory is empty, run the training
python train_hybrid.py --train --source database --db-password "PWD"

Problem: API returns error 500

Solution:

# Check logs
sudo journalctl -u ids-ml-backend -n 100

# Check USE_HYBRID_DETECTOR
grep USE_HYBRID_DETECTOR /opt/ids/python_ml/.env

# Fall back to the legacy detector
echo "USE_HYBRID_DETECTOR=false" >> /opt/ids/python_ml/.env
sudo systemctl restart ids-ml-backend

Problem: Validation metrics do not pass (Precision < 90%)

Solution: Tune the hyperparameters

# In ml_hybrid_detector.py, edit the config:
'eif_contamination': 0.02,  # Try values 0.01-0.05
'chi2_top_k': 20,           # Try 15-25
'confidence_high': 97.0,    # Raise the confidence threshold

✅ Final Checklist

Synthetic test passed (Precision ≥70%)
Training on real data completed
Models saved in python_ml/models/
USE_HYBRID_DETECTOR=true configured
ML backend restarted successfully
API /health shows "ml_model_type": "hybrid (EIF + Feature Selection)"
First detection run completed
Detections saved in the database with confidence levels
(Optional) CICIDS2017 validation reached the target metrics
Periodic re-training configured (cron or manual)
Frontend dashboard shows detections with the new confidence levels

📚 Technical Documentation

Architecture

┌─────────────────┐
│  Network Logs   │
│  (PostgreSQL)   │
└────────┬────────┘
         │
         v
┌─────────────────┐
│ Feature Extract │  25 features per IP
│   (25 features) │  (volume, temporal, protocol, behavioral)
└────────┬────────┘
         │
         v
┌─────────────────┐
│ Chi-Square Test │  Feature Selection
│  (Select Top 18)│  Reduces dimensionality
└────────┬────────┘
         │
         v
┌─────────────────┐
│  Extended IF    │  Unsupervised Anomaly Detection
│ (contamination  │  n_estimators=250
│    = 0.03)      │  anomaly_score: 0-100
└────────┬────────┘
         │
         v
┌─────────────────┐
│ Confidence Score│  3-tier system
│  High ≥95%      │  - High: auto-block
│  Medium ≥70%    │  - Medium: manual review
│  Low <70%       │  - Low: monitor
└────────┬────────┘
         │
         v
┌─────────────────┐
│   Detections    │  Saved in DB
│   (Database)    │  With geo info + confidence
└─────────────────┘
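
The 3-tier confidence stage of the diagram can be written as a small mapping. This is a sketch using the default thresholds from the diagram (High ≥95, Medium ≥70); the names `confidence_tier` and `TIER_ACTIONS` are hypothetical and not taken from ml_hybrid_detector.py.

```python
# Action associated with each tier, as listed in the architecture diagram.
TIER_ACTIONS = {"high": "auto-block", "medium": "manual review", "low": "monitor"}


def confidence_tier(score, high=95.0, medium=70.0):
    """Map a 0-100 confidence score to 'high', 'medium' or 'low'."""
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"
```

For example, `TIER_ACTIONS[confidence_tier(96.5)]` yields "auto-block", while a score of 82 falls into manual review.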

Hyperparameter Tuning

Parameter	Default Value	Recommended Range	Effect
`eif_contamination`	0.03	0.01 - 0.05	Expected fraction of anomalies. ↑ = more detections
`eif_n_estimators`	250	100 - 500	Number of trees. ↑ = more stable but slower
`chi2_top_k`	18	15 - 25	Number of selected features
`confidence_high`	95.0	90.0 - 98.0	Auto-block threshold. ↑ = more conservative
`confidence_medium`	70.0	60.0 - 80.0	Manual review threshold
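
When experimenting with these values, a quick range check can catch typos before a long training run. The sketch below hard-codes the defaults and recommended ranges from the table above; the helper `out_of_range` is hypothetical and the ranges are recommendations, not hard limits enforced by the detector.

```python
# Defaults and recommended ranges copied from the tuning table.
DEFAULTS = {
    "eif_contamination": 0.03,
    "eif_n_estimators": 250,
    "chi2_top_k": 18,
    "confidence_high": 95.0,
    "confidence_medium": 70.0,
}
RECOMMENDED = {
    "eif_contamination": (0.01, 0.05),
    "eif_n_estimators": (100, 500),
    "chi2_top_k": (15, 25),
    "confidence_high": (90.0, 98.0),
    "confidence_medium": (60.0, 80.0),
}


def out_of_range(config):
    """Return the sorted names of parameters outside their recommended range."""
    return sorted(
        name
        for name, (low, high) in RECOMMENDED.items()
        if name in config and not low <= config[name] <= high
    )
```

An empty list from `out_of_range(DEFAULTS)` confirms the defaults sit inside the recommended ranges, whereas `out_of_range({"eif_contamination": 0.1})` reports the offending parameter.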

🎯 Target Metrics Recap

Metric	Production Target	Synthetic Test	Notes
Precision	≥ 90%	≥ 70%	Of 100 flagged IPs, how many are real attacks
Recall	≥ 80%	≥ 60%	Of 100 attacks, how many are detected
F1-Score	≥ 85%	≥ 65%	Harmonic mean of Precision and Recall
FPR	≤ 5%	≤ 10%	False positives on normal traffic
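
The two target columns can be treated as named profiles and checked mechanically. A minimal sketch follows, with thresholds copied from the table and expressed as fractions; the names `TARGETS` and `failed_targets` are hypothetical, and the metric values in the usage note are illustrative.

```python
# Thresholds from the recap table, as fractions (0.90 = 90%).
TARGETS = {
    "production": {"precision": 0.90, "recall": 0.80, "f1": 0.85, "fpr": 0.05},
    "synthetic": {"precision": 0.70, "recall": 0.60, "f1": 0.65, "fpr": 0.10},
}


def failed_targets(metrics, profile="production"):
    """Return the metric names that miss the chosen profile's targets."""
    target = TARGETS[profile]
    # Precision, recall and F1 are lower bounds; FPR is an upper bound.
    failed = [name for name in ("precision", "recall", "f1")
              if metrics[name] < target[name]]
    if metrics["fpr"] > target["fpr"]:
        failed.append("fpr")
    return failed
```

A run with precision, recall and F1 all at 0.80 and an FPR of 0.02 passes the synthetic profile but misses precision and F1 for production, which is the situation the tuning advice in Troubleshooting addresses.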

📞 Support

For problems or questions:

Check logs: sudo journalctl -u ids-ml-backend -f
Check models: ls -lh /opt/ids/python_ml/models/
Manual test: python train_hybrid.py --test
Rollback: USE_HYBRID_DETECTOR=false + restart

Last updated: 24 Nov 2025 - v2.0.0
