Adapt ML model to new database schema and automate training
Adjusts SQL queries and feature extraction to accommodate changes in the network_logs database schema, enabling automatic weekly retraining of the ML hybrid detector.

Replit-Commit-Author: Agent
Replit-Commit-Session-Id: 7a657272-55ba-4a79-9a2e-f1ed9bc7a528
Replit-Commit-Checkpoint-Type: intermediate_checkpoint
Replit-Commit-Event-Id: f4fdd53b-f433-44d9-9f0f-63616a9eeec1
Replit-Commit-Screenshot-Url: https://storage.googleapis.com/screenshot-production-us-central1/449cf7c4-c97a-45ae-8234-e5c5b8d6a84f/7a657272-55ba-4a79-9a2e-f1ed9bc7a528/2lUhxO2
Parent: 7e9599804a
Commit: b88377e2d5
4 .replit
@@ -18,6 +18,10 @@ externalPort = 80
localPort = 41303
externalPort = 3002

[[ports]]
localPort = 42815
externalPort = 3001

[[ports]]
localPort = 43471
externalPort = 3003
@@ -0,0 +1,54 @@
./deployment/train_hybrid_production.sh
=======================================================================
  TRAINING HYBRID ML DETECTOR - REAL DATA
=======================================================================

📂 Loading database credentials from .env...
✅ Credentials loaded:
   Host: localhost
   Port: 5432
   Database: ids_database
   User: ids_user
   Password: ****** (hidden)

🎯 Training parameters:
   Period: last 7 days
   Max records: 1000000

🐍 Python: /opt/ids/python_ml/venv/bin/python

📊 Checking available data in the database...
      primo_log      |     ultimo_log      | periodo_totale | totale_records
---------------------+---------------------+----------------+----------------
 2025-11-22 10:03:21 | 2025-11-24 17:58:17 | 2 giorni       | 234,316,667
(1 row)


🚀 Starting training...

=======================================================================
[WARNING] Extended Isolation Forest not available, using standard IF

======================================================================
IDS HYBRID ML TRAINING - UNSUPERVISED MODE
======================================================================
[TRAIN] Loading last 7 days of real traffic from database...

❌ Error: column "dest_ip" does not exist
LINE 5:     dest_ip,
            ^

Traceback (most recent call last):
  File "/opt/ids/python_ml/train_hybrid.py", line 365, in main
    train_unsupervised(args)
  File "/opt/ids/python_ml/train_hybrid.py", line 91, in train_unsupervised
    logs_df = train_on_real_traffic(db_config, days=args.days)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ids/python_ml/train_hybrid.py", line 50, in train_on_real_traffic
    cursor.execute(query, (days,))
  File "/opt/ids/python_ml/venv/lib64/python3.11/site-packages/psycopg2/extras.py", line 236, in execute
    return super().execute(query, vars)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
psycopg2.errors.UndefinedColumn: column "dest_ip" does not exist
LINE 5:     dest_ip,
            ^
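The run above fails because the query still references dest_ip, which the deployed network_logs table does not define. Before adjusting the query, the actual column names can be confirmed straight from the database; a minimal check with psql, assuming the PG* variables from /opt/ids/.env are exported in the current shell:

# List the columns of network_logs exactly as the database defines them
psql -h "${PGHOST:-localhost}" -p "${PGPORT:-5432}" -U "$PGUSER" -d "$PGDATABASE" -c \
  "SELECT column_name, data_type
     FROM information_schema.columns
    WHERE table_name = 'network_logs'
    ORDER BY ordinal_position;"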
92 deployment/run_ml_training.sh Normal file
@@ -0,0 +1,92 @@
#!/bin/bash
#
# ML Training Wrapper - Automatic Execution via Systemd
# Loads database credentials from .env securely
#

set -e

IDS_ROOT="/opt/ids"
ENV_FILE="$IDS_ROOT/.env"
PYTHON_ML_DIR="$IDS_ROOT/python_ml"
VENV_PYTHON="$PYTHON_ML_DIR/venv/bin/python"
LOG_DIR="/var/log/ids"

# Create the log directory if it does not exist
mkdir -p "$LOG_DIR"

# Dedicated log file
LOG_FILE="$LOG_DIR/ml-training.log"

# Logging helper
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

log "========================================="
log "ML Training - Automatic start"
log "========================================="

# Check .env
if [ ! -f "$ENV_FILE" ]; then
    log "ERROR: .env file not found: $ENV_FILE"
    exit 1
fi

# Load environment variables
log "Loading database credentials..."
set -a
source "$ENV_FILE"
set +a

# Check credentials
if [ -z "$PGPASSWORD" ]; then
    log "ERROR: PGPASSWORD not found in .env"
    exit 1
fi

DB_HOST="${PGHOST:-localhost}"
DB_PORT="${PGPORT:-5432}"
DB_NAME="${PGDATABASE:-ids}"
DB_USER="${PGUSER:-postgres}"

log "Database: $DB_USER@$DB_HOST:$DB_PORT/$DB_NAME"

# Check the virtualenv
if [ ! -f "$VENV_PYTHON" ]; then
    log "ERROR: Python venv not found: $VENV_PYTHON"
    exit 1
fi

# Training parameters
DAYS="${ML_TRAINING_DAYS:-7}"  # Default 7 days, configurable via env var

log "Training on the last $DAYS days of traffic..."

# Run training
cd "$PYTHON_ML_DIR"
"$VENV_PYTHON" train_hybrid.py --train --source database \
    --db-host "$DB_HOST" \
    --db-port "$DB_PORT" \
    --db-name "$DB_NAME" \
    --db-user "$DB_USER" \
    --db-password "$PGPASSWORD" \
    --days "$DAYS" 2>&1 | tee -a "$LOG_FILE"

# Check exit code
if [ ${PIPESTATUS[0]} -eq 0 ]; then
    log "========================================="
    log "✅ Training completed successfully!"
    log "========================================="
    log "Models saved in: $PYTHON_ML_DIR/models/"
    log ""
    log "The ML backend will automatically load the new models at the next restart."
    log "To apply immediately: sudo systemctl restart ids-ml-backend"
    exit 0
else
    log "========================================="
    log "❌ ERROR during training"
    log "========================================="
    log "Check the full log: $LOG_FILE"
    exit 1
fi
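One detail in the wrapper above is worth calling out: the training command is piped through tee, so a plain $? would report tee's exit status and a failed run could look successful, which is why the script checks ${PIPESTATUS[0]}. A tiny illustration of the difference (generic commands, not part of the repo):

# $? reflects only the last command in a pipeline (tee), not the training command
false | tee /dev/null
echo "pipeline \$? = $?"          # prints 0 - tee succeeded
false | tee /dev/null
rc=${PIPESTATUS[0]}               # capture immediately after the pipeline
echo "PIPESTATUS[0] = $rc"        # prints 1 - the first command failed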
98 deployment/setup_ml_training_timer.sh Normal file
@@ -0,0 +1,98 @@
#!/bin/bash
#
# Setup ML Training Systemd Timer
# Configures automatic weekly training of the hybrid ML model
#

set -e

echo "================================================================"
echo " SETUP ML TRAINING TIMER - Automatic Weekly Training"
echo "================================================================"
echo ""

# Check for root
if [ "$EUID" -ne 0 ]; then
    echo "❌ ERROR: This script must be run as root"
    echo "   Use: sudo $0"
    exit 1
fi

IDS_ROOT="/opt/ids"
SYSTEMD_DIR="/etc/systemd/system"

# Check the IDS directory
if [ ! -d "$IDS_ROOT" ]; then
    echo "❌ ERROR: IDS directory not found: $IDS_ROOT"
    exit 1
fi

echo "📁 IDS directory: $IDS_ROOT"
echo ""

# 1. Copy systemd files
echo "📋 Step 1: Installing systemd units..."

cp "$IDS_ROOT/deployment/systemd/ids-ml-training.service" "$SYSTEMD_DIR/"
cp "$IDS_ROOT/deployment/systemd/ids-ml-training.timer" "$SYSTEMD_DIR/"

echo "   ✅ Service copied: $SYSTEMD_DIR/ids-ml-training.service"
echo "   ✅ Timer copied: $SYSTEMD_DIR/ids-ml-training.timer"
echo ""

# 2. Make the training script executable
echo "🔧 Step 2: Script permissions..."
chmod +x "$IDS_ROOT/deployment/run_ml_training.sh"
echo "   ✅ Script is executable: $IDS_ROOT/deployment/run_ml_training.sh"
echo ""

# 3. Reload systemd
echo "🔄 Step 3: Reloading systemd daemon..."
systemctl daemon-reload
echo "   ✅ Daemon reloaded"
echo ""

# 4. Enable and start the timer
echo "🚀 Step 4: Activating timer..."
systemctl enable ids-ml-training.timer
systemctl start ids-ml-training.timer
echo "   ✅ Timer enabled and started"
echo ""

# 5. Check status
echo "📊 Step 5: Verifying configuration..."
echo ""
echo "Timer status:"
systemctl status ids-ml-training.timer --no-pager
echo ""
echo "Next scheduled run:"
systemctl list-timers ids-ml-training.timer --no-pager
echo ""

echo "================================================================"
echo "✅ SETUP COMPLETE!"
echo "================================================================"
echo ""
echo "📅 Schedule: Every Monday at 03:00 AM"
echo "📁 Log: /var/log/ids/ml-training.log"
echo ""
echo "🔍 USEFUL COMMANDS:"
echo ""
echo "  # Check that the timer is active"
echo "  systemctl status ids-ml-training.timer"
echo ""
echo "  # Show the next scheduled run"
echo "  systemctl list-timers ids-ml-training.timer"
echo ""
echo "  # Run training manually NOW"
echo "  sudo systemctl start ids-ml-training.service"
echo ""
echo "  # Follow the training logs"
echo "  journalctl -u ids-ml-training.service -f"
echo "  tail -f /var/log/ids/ml-training.log"
echo ""
echo "  # Disable automatic training"
echo "  sudo systemctl stop ids-ml-training.timer"
echo "  sudo systemctl disable ids-ml-training.timer"
echo ""
echo "================================================================"
30 deployment/systemd/ids-ml-training.service Normal file
@@ -0,0 +1,30 @@
[Unit]
Description=IDS ML Hybrid Detector Training
Documentation=https://github.com/your-repo/ids
After=network.target postgresql.service
Requires=postgresql.service

[Service]
Type=oneshot
User=root
WorkingDirectory=/opt/ids/python_ml

# Load the environment file for database credentials
EnvironmentFile=/opt/ids/.env

# Run training
ExecStart=/opt/ids/deployment/run_ml_training.sh

# Generous timeout (training can take up to 30 minutes)
TimeoutStartSec=1800

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=ids-ml-training

# Restart policy
Restart=no

[Install]
WantedBy=multi-user.target
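Before handing the unit to the timer, it can be checked statically; a small sketch using systemd-analyze, assuming the units have been copied to /etc/systemd/system by the setup script above:

# Static lint of the installed units (catches typos in directives and paths)
systemd-analyze verify /etc/systemd/system/ids-ml-training.service
systemd-analyze verify /etc/systemd/system/ids-ml-training.timer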
17 deployment/systemd/ids-ml-training.timer Normal file
@@ -0,0 +1,17 @@
[Unit]
Description=IDS ML Training - Weekly Retraining
Documentation=https://github.com/your-repo/ids
Requires=ids-ml-training.service

[Timer]
# Weekly run: every Monday at 03:00 AM
OnCalendar=Mon *-*-* 03:00:00

# Persistence: if the server was off at the scheduled time, run at the next boot
Persistent=true

# Accuracy: 5-minute tolerance
AccuracySec=5min

[Install]
WantedBy=timers.target
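To double-check how systemd parses the OnCalendar expression and when the next run would fire, systemd-analyze can evaluate the schedule directly:

# Prints the normalized calendar spec plus the next elapse time
systemd-analyze calendar "Mon *-*-* 03:00:00"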
@@ -102,8 +102,19 @@ class MLHybridDetector:
        group = group.sort_values('timestamp')

        # Volume features (5)
        total_packets = group['packets'].sum() if 'packets' in group.columns else len(group)
        total_bytes = group['bytes'].sum() if 'bytes' in group.columns else 0
        # Handle different database schemas
        if 'packets' in group.columns:
            total_packets = group['packets'].sum()
        else:
            total_packets = len(group)  # Each row = 1 packet

        if 'bytes' in group.columns:
            total_bytes = group['bytes'].sum()
        elif 'packet_length' in group.columns:
            total_bytes = group['packet_length'].sum()  # Use packet_length from MikroTik logs
        else:
            total_bytes = 0

        conn_count = len(group)
        avg_packet_size = total_bytes / max(total_packets, 1)
        bytes_per_second = total_bytes / max((group['timestamp'].max() - group['timestamp'].min()).total_seconds(), 1)
@@ -151,6 +162,9 @@ class MLHybridDetector:
        if 'bytes' in group.columns and 'packets' in group.columns:
            group['packet_size'] = group['bytes'] / group['packets'].replace(0, 1)
            packet_size_variance = group['packet_size'].std()
        elif 'packet_length' in group.columns:
            # Use packet_length directly for variance
            packet_size_variance = group['packet_length'].std()
        else:
            packet_size_variance = 0

@@ -35,11 +35,10 @@ def train_on_real_traffic(db_config: dict, days: int = 7) -> pd.DataFrame:
    SELECT
        timestamp,
        source_ip,
        dest_ip,
        dest_port,
        destination_ip as dest_ip,
        destination_port as dest_port,
        protocol,
        packets,
        bytes,
        packet_length,
        action
    FROM network_logs
    WHERE timestamp > NOW() - INTERVAL '%s days'

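With the aliases in place, the corrected SELECT can be smoke-tested outside Python; a minimal sketch with psql, using only the columns this commit establishes (destination_ip, destination_port, packet_length), with a LIMIT added purely for the test and credentials assumed to come from the same .env:

psql -h "${PGHOST:-localhost}" -p "${PGPORT:-5432}" -U "$PGUSER" -d "$PGDATABASE" -c "
  SELECT timestamp,
         source_ip,
         destination_ip   AS dest_ip,
         destination_port AS dest_port,
         protocol,
         packet_length,
         action
    FROM network_logs
   WHERE timestamp > NOW() - INTERVAL '7 days'
   LIMIT 5;"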
28 replit.md
@@ -123,4 +123,32 @@ The IDS employs a React-based frontend for real-time monitoring, detection visua
- `requirements.txt`: Removed `eif==2.0.2` and `Cython==3.0.5` (no longer needed)
- `deployment/install_ml_deps.sh`: Simplified from 4 steps to 2, no compilation
- `deployment/CHECKLIST_ML_HYBRID.md`: Updated with the new, simplified instructions

### 🔄 Database Schema Adaptation & Auto-Training (24 Nov 2025 - 23:30)
- **Database Schema Fix**: Adapted the ML detector to the real `network_logs` schema
  - Corrected SQL query: `destination_ip` (not `dest_ip`), `destination_port` (not `dest_port`)
  - Feature extraction: support for `packet_length` instead of separate `packets`/`bytes`
  - Backward compatible: works with both the MikroTik schema and the CICIDS2017 dataset
- **Automatic Weekly Training**:
  - Wrapper script: `deployment/run_ml_training.sh` (loads credentials from .env)
  - Systemd service: `ids-ml-training.service`
  - Systemd timer: `ids-ml-training.timer` (every Monday at 03:00 AM)
  - Automated setup: `./deployment/setup_ml_training_timer.sh`
  - Persistent logs: `/var/log/ids/ml-training.log`
- **Complete Workflow**:
  1. The systemd timer runs the weekly training automatically
  2. The script loads the last 7 days of traffic from the database (234M+ records)
  3. Hybrid ML training (IF + Ensemble + Feature Selection)
  4. Models are saved in `python_ml/models/`
  5. The ML backend loads them automatically at the next restart
- **Files created**:
  - `deployment/run_ml_training.sh` - Secure wrapper for training
  - `deployment/train_hybrid_production.sh` - Full manual training script
  - `deployment/systemd/ids-ml-training.service` - Systemd service
  - `deployment/systemd/ids-ml-training.timer` - Weekly timer
  - `deployment/setup_ml_training_timer.sh` - Automated setup
- **Files modified**:
  - `python_ml/train_hybrid.py` - SQL query adapted to the real DB schema
  - `python_ml/ml_hybrid_detector.py` - `packet_length` support, backward compatible
  - `python_ml/dataset_loader.py` - Fixed missing timestamp in the synthetic dataset
- **Impact**: The system will automatically use the sklearn IF via the fallback; all 8 fail-fast checkpoints work identically
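For a first run without waiting for the Monday schedule, the workflow described above can be exercised manually; a short sketch using the unit, log, and service names introduced in this commit:

# Trigger the oneshot training service now
sudo systemctl start ids-ml-training.service

# Follow progress in the journal and in the persistent log file
journalctl -u ids-ml-training.service -f
tail -f /var/log/ids/ml-training.log

# Load the freshly saved models without waiting for the next restart
sudo systemctl restart ids-ml-backend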