Scan History and Persistence

Purpose and Scope

This document describes WSHawk's Infrastructure Persistence Plane, specifically the SQLite-backed scan history system that provides zero-loss data persistence for all security assessments. This page covers the database architecture, storage mechanisms, Write-Ahead Logging (WAL) mode for crash recovery, and the web dashboard's scan history management interface.

For information about launching the web dashboard and authentication, see Dashboard Overview and Launch. For REST API programmatic access to scan history, see REST API Reference. For broader architectural context, see Infrastructure Persistence Layer.

SQLite Database Architecture

WSHawk v3.0.0 replaces memory-resident data structures with a persistent SQLite database that survives crashes, system reboots, and scanner terminations. The database implements a "zero-loss persistence" design philosophy where security data is never ephemeral.

Storage Location

The database file is stored at one of the following locations, depending on deployment mode:

| Deployment Mode | Database Path | Additional Files |
|----------------|---------------|------------------|
| Local Development | ./scans.db | scans.db-wal, scans.db-shm |
| Production/Docker | ~/.wshawk/scans.db | ~/.wshawk/scans.db-wal, ~/.wshawk/scans.db-shm |
| Docker Volume Mount | /app/.wshawk/scans.db | /app/.wshawk/scans.db-wal, /app/.wshawk/scans.db-shm |

The .gitignore file explicitly excludes these database files from version control: .gitignore:90-93

Sources: README.md:132, RELEASE_SUMMARY.md:17, .gitignore:90-93

Database Schema and Stored Data

The SQLite database maintains four primary data categories for comprehensive security audit trails:

erDiagram
    SCANS ||--o{ VULNERABILITIES : contains
    SCANS ||--o{ TRAFFIC_LOGS : records
    SCANS ||--o{ SCAN_METRICS : tracks
    
    SCANS {
        integer id PK
        string target_url
        datetime start_time
        datetime end_time
        string scan_type
        string status
        string report_path
    }
    
    VULNERABILITIES {
        integer id PK
        integer scan_id FK
        string vuln_type
        string severity
        float cvss_score
        string cvss_vector
        text payload
        text response
        text evidence
        string confidence
    }
    
    TRAFFIC_LOGS {
        integer id PK
        integer scan_id FK
        datetime timestamp
        string direction
        text message_content
        integer frame_size
    }
    
    SCAN_METRICS {
        integer id PK
        integer scan_id FK
        integer total_payloads
        integer messages_sent
        integer messages_received
        float avg_rps
        integer connection_errors
    }

Data Categories

1. Scan Metadata

Target WebSocket URL
Start/end timestamps
Scan type (quick, advanced, defensive)
Final status (completed, failed, interrupted)
Generated report file path

2. Vulnerability Findings

Vulnerability type (SQL Injection, XSS, XXE, etc.)
CVSS v3.1 score and vector string
Proof-of-concept payload
Server response that confirmed exploitation
Evidence artifacts (screenshots for XSS, OAST callbacks for XXE/SSRF)
Confidence level (LOW/MEDIUM/HIGH)

3. Traffic Logs

Every WebSocket frame sent and received
Timestamps for temporal analysis
Message direction (client→server, server→client)
Frame size for bandwidth analysis

4. Performance Metrics

Total payloads tested
Average requests per second (RPS)
Connection stability metrics
Error rate tracking
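
The four categories above map onto the four tables in the ER diagram. As a rough illustration, the schema could be created with Python's standard sqlite3 module as follows. The column names come from the diagram; the SQLite types, the NOT NULL and DEFAULT constraints, the ON DELETE CASCADE clauses, and the function name create_schema are assumptions made for this sketch, not WSHawk's actual DDL.

```python
import sqlite3

# Column names follow the ER diagram; types and constraints are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS scans (
    id          INTEGER PRIMARY KEY,
    target_url  TEXT NOT NULL,
    start_time  TEXT NOT NULL,      -- ISO 8601 timestamp
    end_time    TEXT,
    scan_type   TEXT,
    status      TEXT,
    report_path TEXT
);
CREATE TABLE IF NOT EXISTS vulnerabilities (
    id          INTEGER PRIMARY KEY,
    scan_id     INTEGER NOT NULL REFERENCES scans(id) ON DELETE CASCADE,
    vuln_type   TEXT,
    severity    TEXT,
    cvss_score  REAL,
    cvss_vector TEXT,
    payload     TEXT,
    response    TEXT,
    evidence    TEXT,
    confidence  TEXT
);
CREATE TABLE IF NOT EXISTS traffic_logs (
    id              INTEGER PRIMARY KEY,
    scan_id         INTEGER NOT NULL REFERENCES scans(id) ON DELETE CASCADE,
    timestamp       TEXT,
    direction       TEXT,
    message_content TEXT,
    frame_size      INTEGER
);
CREATE TABLE IF NOT EXISTS scan_metrics (
    id                INTEGER PRIMARY KEY,
    scan_id           INTEGER NOT NULL REFERENCES scans(id) ON DELETE CASCADE,
    total_payloads    INTEGER DEFAULT 0,
    messages_sent     INTEGER DEFAULT 0,
    messages_received INTEGER DEFAULT 0,
    avg_rps           REAL DEFAULT 0,
    connection_errors INTEGER DEFAULT 0
);
"""


def create_schema(conn: sqlite3.Connection) -> list[str]:
    """Create the four tables and return their names in alphabetical order."""
    # SQLite leaves foreign-key enforcement off unless it is enabled per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]
```

Calling create_schema(sqlite3.connect(":memory:")) is enough to experiment with the layout without touching a real scans.db file.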

Sources: docs/V3_COMPLETE_GUIDE.md:293-297, RELEASE_SUMMARY.md:17

WAL Mode and Zero-Loss Design

WSHawk configures SQLite to use Write-Ahead Logging (WAL) mode, a critical feature for high-throughput scanning operations that ensures data integrity even during unexpected terminations.
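
For readers unfamiliar with how WAL mode is switched on, the following is a minimal sketch using Python's standard sqlite3 module. The function names open_wal_database and demo are hypothetical, and the choice of synchronous=FULL is an assumption for illustration; the source does not state which synchronous level WSHawk uses.

```python
import os
import sqlite3
import tempfile


def open_wal_database(path: str) -> tuple[sqlite3.Connection, str]:
    """Open a database file and switch it to WAL journaling.

    Returns the connection and the journal mode SQLite reports back.
    """
    conn = sqlite3.connect(path)
    # The pragma returns the resulting mode; once set, WAL persists in the file.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    # Assumption: FULL syncs the WAL on every commit, trading speed for durability.
    conn.execute("PRAGMA synchronous=FULL")
    return conn, mode


def demo() -> str:
    """Enable WAL on a throwaway database file and report the journal mode."""
    with tempfile.TemporaryDirectory() as tmp:
        conn, mode = open_wal_database(os.path.join(tmp, "scans.db"))
        conn.close()
        return mode
```

WAL requires a file-backed database; an in-memory database reports the mode "memory" instead, which is why the demo uses a temporary directory.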

Write-Ahead Logging Mechanism

sequenceDiagram
    participant Scanner as "WSHawkV2 Scanner"
    participant WAL as "scans.db-wal"
    participant DB as "scans.db"
    participant SHM as "scans.db-shm (Shared Memory)"
    
    Scanner->>WAL: "Write vulnerability finding"
    Note over WAL: "Append-only log<br/>(no blocking)"
    WAL-->>Scanner: "Write acknowledged"
    
    Scanner->>WAL: "Write traffic log"
    WAL-->>Scanner: "Write acknowledged"
    
    Note over WAL,SHM: "Checkpoint threshold reached"
    
    WAL->>DB: "Checkpoint: copy WAL pages into main DB"
    WAL->>SHM: "Update shared memory index"
    
    Note over Scanner: "Scanner crashes mid-scan"
    
    Note over WAL: "On restart: WAL intact"
    WAL->>DB: "Recover committed pages not yet checkpointed"
    Note over DB: "Data recovered successfully"

Crash Recovery Behavior

When WSHawk is terminated unexpectedly (kill signal, system crash, power loss):

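WAL Preservation: Transactions already committed to scans.db-wal remain on disk
Automatic Recovery: On the next connection, SQLite automatically recovers the WAL and applies committed transactions that had not yet been checkpointed
Transaction Integrity: Only complete transactions are recovered; partial writes are discarded
No Data Loss: All committed vulnerability findings and traffic logs are restored after a process crash. After an operating system crash or power loss, the most recent commits can be rolled back if PRAGMA synchronous is set to NORMAL; with FULL, every commit is synced to disk. The database stays consistent in either case.

This architecture ensures that even if WSHawk is forcibly killed during a 10-hour scan, all data up to the last committed transaction is preserved.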
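
The recovery behavior can be observed with a small experiment. The sketch below is not WSHawk code: it starts a child Python process that commits one row, begins a second insert without committing, and then exits abruptly via os._exit to imitate a kill. The parent then reopens the file and reads back what survived. The findings table and the function name simulate_crash_and_recover are made up for this example.

```python
import os
import sqlite3
import subprocess
import sys
import tempfile
import textwrap

# Script run in a child process; it never closes the connection cleanly.
CHILD = textwrap.dedent("""
    import os, sqlite3, sys
    conn = sqlite3.connect(sys.argv[1])
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("CREATE TABLE findings (id INTEGER PRIMARY KEY, note TEXT)")
    conn.execute("INSERT INTO findings (note) VALUES ('committed')")
    conn.commit()
    conn.execute("INSERT INTO findings (note) VALUES ('uncommitted')")
    os._exit(1)  # abrupt exit: no commit, no close, no checkpoint
""")


def simulate_crash_and_recover() -> list[str]:
    """Crash a writer mid-transaction, then return the rows a new reader sees."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "scans.db")
        subprocess.run([sys.executable, "-c", CHILD, db_path], check=False)
        # Opening the database triggers SQLite's automatic WAL recovery.
        conn = sqlite3.connect(db_path)
        try:
            rows = conn.execute("SELECT note FROM findings ORDER BY id").fetchall()
        finally:
            conn.close()
        return [note for (note,) in rows]
```

Only the committed row comes back, which matches the transaction-integrity rule above. The experiment imitates a process kill only; it says nothing about power loss, where the synchronous setting matters.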

Sources: docs/V3_COMPLETE_GUIDE.md:124, RELEASE_SUMMARY.md:3

Scan Lifecycle and Persistence Flow

The following diagram shows how scan data flows from the scanner engine through to persistent storage:

graph TB
    subgraph "WSHawkV2 Scanner Execution"
        Init["Scanner Initialization<br/>scanner_v2.WSHawkV2"]
        Connect["WebSocket Connection<br/>ws.connect()"]
        Inject["Payload Injection Loop<br/>send_payload()"]
        Analyze["VulnerabilityVerifier<br/>analyze_response()"]
    end
    
    subgraph "Database Persistence Layer"
        CreateScan["Create Scan Record<br/>INSERT INTO scans<br/>status='running'"]
        LogTraffic["Log Traffic Frame<br/>INSERT INTO traffic_logs<br/>timestamp, direction, content"]
        StoreVuln["Store Vulnerability<br/>INSERT INTO vulnerabilities<br/>cvss_score, payload, evidence"]
        UpdateMetrics["Update Metrics<br/>UPDATE scan_metrics<br/>total_payloads, avg_rps"]
        FinalizeScan["Finalize Scan<br/>UPDATE scans<br/>status='completed', end_time"]
    end
    
    subgraph "File System Storage"
        GenReport["Generate HTML Report<br/>wshawk_report_*.html"]
        SaveScreenshot["Save Screenshots<br/>xss_evidence_*.png"]
        StoreReportPath["UPDATE scans<br/>SET report_path=?"]
    end
    
    subgraph "SQLite Database Files"
        WAL["scans.db-wal<br/>Write-Ahead Log"]
        DB["scans.db<br/>Main Database"]
        SHM["scans.db-shm<br/>Shared Memory Index"]
    end
    
    Init --> CreateScan
    CreateScan --> WAL
    
    Connect --> Inject
    Inject --> LogTraffic
    LogTraffic --> WAL
    
    Inject --> Analyze
    Analyze --> StoreVuln
    StoreVuln --> WAL
    
    Inject --> UpdateMetrics
    UpdateMetrics --> WAL
    
    Analyze --> GenReport
    GenReport --> SaveScreenshot
    SaveScreenshot --> StoreReportPath
    StoreReportPath --> WAL
    
    WAL --> DB
    WAL --> SHM
    
    Inject --> FinalizeScan
    FinalizeScan --> WAL

Persistence Workflow

Scan Initialization: When WSHawkV2.run_heuristic_scan() starts, a new record is inserted into the scans table with status='running'
Real-Time Logging: Every WebSocket frame (send/receive) is immediately written to traffic_logs via WAL
Vulnerability Recording: When VulnerabilityVerifier confirms a finding, it's written to vulnerabilities with full CVSS scoring
Metric Tracking: Performance counters (RPS, payload counts) are periodically flushed to scan_metrics
Report Linking: Generated HTML reports and screenshot files are linked back to the scan record
Finalization: On scan completion, the status is updated to completed with an end timestamp
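
The workflow above can be sketched as a small recorder object that commits after each step. ScanRecorder and its method names are hypothetical and are not WSHawk's actual persistence API; the tables are trimmed versions of the ER diagram, and the metrics step is omitted to keep the example short.

```python
import sqlite3
from datetime import datetime, timezone


def _now() -> str:
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


class ScanRecorder:
    """Hypothetical helper that persists each lifecycle step as its own transaction."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        conn.executescript("""
            CREATE TABLE IF NOT EXISTS scans (
                id INTEGER PRIMARY KEY, target_url TEXT, start_time TEXT,
                end_time TEXT, status TEXT, report_path TEXT);
            CREATE TABLE IF NOT EXISTS traffic_logs (
                id INTEGER PRIMARY KEY, scan_id INTEGER, timestamp TEXT,
                direction TEXT, message_content TEXT, frame_size INTEGER);
            CREATE TABLE IF NOT EXISTS vulnerabilities (
                id INTEGER PRIMARY KEY, scan_id INTEGER, vuln_type TEXT,
                severity TEXT, cvss_score REAL, payload TEXT);
        """)

    def start_scan(self, target_url: str) -> int:
        with self.conn:  # commits on success, rolls back on exception
            cur = self.conn.execute(
                "INSERT INTO scans (target_url, start_time, status) "
                "VALUES (?, ?, 'running')",
                (target_url, _now()),
            )
        return cur.lastrowid

    def log_frame(self, scan_id: int, direction: str, content: str) -> None:
        with self.conn:
            self.conn.execute(
                "INSERT INTO traffic_logs "
                "(scan_id, timestamp, direction, message_content, frame_size) "
                "VALUES (?, ?, ?, ?, ?)",
                (scan_id, _now(), direction, content, len(content.encode("utf-8"))),
            )

    def record_vulnerability(self, scan_id: int, vuln_type: str, severity: str,
                             cvss_score: float, payload: str) -> None:
        with self.conn:
            self.conn.execute(
                "INSERT INTO vulnerabilities "
                "(scan_id, vuln_type, severity, cvss_score, payload) "
                "VALUES (?, ?, ?, ?, ?)",
                (scan_id, vuln_type, severity, cvss_score, payload),
            )

    def finalize(self, scan_id: int, report_path: str,
                 status: str = "completed") -> None:
        with self.conn:
            self.conn.execute(
                "UPDATE scans SET status = ?, end_time = ?, report_path = ? "
                "WHERE id = ?",
                (status, _now(), report_path, scan_id),
            )
```

Committing per step is what keeps the amount of data at risk small: a crash between two calls loses at most the step that was in flight. A real scanner would likely batch traffic frames to reduce commit overhead, at the cost of a larger window.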

Sources: docs/V3_COMPLETE_GUIDE.md:115-120, RELEASE_SUMMARY.md:16-19

Web Dashboard Scan History Interface

The web management dashboard provides a visual interface for browsing, filtering, and managing historical scans.

Dashboard Storage Architecture

graph LR
    subgraph "Browser Client"
        UI["Web Dashboard UI<br/>wshawk/web/templates/"]
        JS["JavaScript<br/>Real-time Updates"]
    end
    
    subgraph "Flask Web Server"
        Routes["Flask Routes<br/>@app.route('/api/scans')"]
        Auth["Authentication Middleware<br/>SHA-256 Password Check"]
        Query["Database Query Layer<br/>SELECT FROM scans"]
    end
    
    subgraph "SQLite Backend"
        ScanTable["scans table"]
        VulnTable["vulnerabilities table"]
        Indexes["Indexes:<br/>idx_scan_timestamp<br/>idx_vuln_severity"]
    end
    
    UI --> JS
    JS -->|"HTTP GET /api/scans"| Routes
    Routes --> Auth
    Auth --> Query
    Query --> ScanTable
    Query --> VulnTable
    Query --> Indexes
    
    ScanTable --> Query
    VulnTable --> Query
    
    Query --> Routes
    Routes -->|"JSON Response"| JS
    JS --> UI

Visual Progress Tracking

The dashboard provides real-time visibility into scan execution:

| Progress Indicator | Data Source | Update Frequency |
|-------------------|-------------|------------------|
| Scan Status | scans.status column | Real-time (WebSocket) |
| Payloads Tested | scan_metrics.total_payloads | Every 10 payloads |
| Vulnerabilities Found | COUNT(vulnerabilities) | On each finding |
| Current RPS | scan_metrics.avg_rps | Every 5 seconds |
| Connection Health | scan_metrics.connection_errors | On each error |

The dashboard uses JavaScript polling or WebSocket updates to refresh these metrics without page reload, providing a "live view of the scan brain" as described in docs/V3_COMPLETE_GUIDE.md:305.
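
As an illustration of how the indicators in the table could be gathered in one query, the sketch below builds a JSON-ready dictionary for a single scan. The function names scan_progress and build_demo_db, the dictionary keys, and the assumption of one scan_metrics row per scan are all made up for this example; the actual dashboard endpoint may be structured differently.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with one running scan (demo data only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE scans (id INTEGER PRIMARY KEY, target_url TEXT, status TEXT);
        CREATE TABLE vulnerabilities (
            id INTEGER PRIMARY KEY, scan_id INTEGER, severity TEXT);
        CREATE TABLE scan_metrics (
            id INTEGER PRIMARY KEY, scan_id INTEGER,
            total_payloads INTEGER, avg_rps REAL, connection_errors INTEGER);
        INSERT INTO scans VALUES (1, 'ws://example.test/socket', 'running');
        INSERT INTO vulnerabilities (scan_id, severity) VALUES (1, 'HIGH'), (1, 'LOW');
        INSERT INTO scan_metrics (scan_id, total_payloads, avg_rps, connection_errors)
            VALUES (1, 120, 42.5, 3);
    """)
    return conn


def scan_progress(conn: sqlite3.Connection, scan_id: int) -> dict:
    """Collect the progress indicators for one scan as a JSON-ready dict."""
    row = conn.execute(
        """
        SELECT s.status,
               COALESCE(m.total_payloads, 0),
               COALESCE(m.avg_rps, 0.0),
               COALESCE(m.connection_errors, 0),
               (SELECT COUNT(*) FROM vulnerabilities v WHERE v.scan_id = s.id)
        FROM scans s
        LEFT JOIN scan_metrics m ON m.scan_id = s.id  -- assumes one metrics row per scan
        WHERE s.id = ?
        """,
        (scan_id,),
    ).fetchone()
    if row is None:
        raise KeyError(f"no scan with id {scan_id}")
    status, payloads, rps, errors, vulns = row
    return {
        "status": status,
        "payloads_tested": payloads,
        "vulnerabilities_found": vulns,
        "current_rps": rps,
        "connection_errors": errors,
    }
```

The LEFT JOIN keeps a scan visible even before its first metrics row has been written, so a freshly started scan reports zeros rather than disappearing from the view.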

Sources: README.md:133, docs/V3_COMPLETE_GUIDE.md:305

Report Management and Retention

Report File System Structure

graph TB
    subgraph "Report Storage Directory"
        ReportsDir["./reports/ or ~/.wshawk/reports/"]
        HTMLReports["HTML Reports<br/>wshawk_report_YYYYMMDD_HHMMSS.html"]
        Screenshots["Screenshots<br/>xss_evidence_<scan_id>_<vuln_id>.png"]
        TrafficLogs["Traffic Dumps<br/>traffic_<scan_id>.json"]
        SARIFExports["SARIF Exports<br/>wshawk_<scan_id>.sarif"]
    end
    
    subgraph "Database References"
        ScanRecord["scans.report_path<br/>'./reports/wshawk_report_*.html'"]
        VulnRecord["vulnerabilities.evidence<br/>'./reports/xss_evidence_*.png'"]
    end
    
    ReportsDir --> HTMLReports
    ReportsDir --> Screenshots
    ReportsDir --> TrafficLogs
    ReportsDir --> SARIFExports
    
    HTMLReports -.->|"referenced by"| ScanRecord
    Screenshots -.->|"referenced by"| VulnRecord

Interactive Report Management

The web dashboard provides operations on historical scans:

View Operations:

List All Scans: Paginated view with filtering by date, target, status, severity
Scan Details: Drill-down into individual scan showing all findings, traffic logs, metrics
Vulnerability Timeline: Chronological view of when each vulnerability was discovered during the scan

Management Operations:

Delete Scan: Removes database record and associated report files
Re-export: Regenerate report in different format (JSON, CSV, SARIF)
Compare Scans: Side-by-side comparison of two scans against the same target to track remediation

Data Retention: By default, WSHawk retains all scan history indefinitely. For production deployments with disk constraints, administrators can configure automatic cleanup policies:

# Example wshawk.yaml configuration (not included by default)
persistence:
  retention_days: 90  # Delete scans older than 90 days
  max_scans: 1000     # Keep only most recent 1000 scans
  auto_cleanup: true  # Enable automatic cleanup on dashboard startup
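
A cleanup pass honoring retention_days and max_scans could look roughly like the sketch below. This is an assumption about how such a policy might be implemented, not WSHawk's code: cleanup_scans is a hypothetical function, start_time is assumed to hold ISO 8601 strings in a single consistent format, and the current time is passed in explicitly so the behavior is reproducible.

```python
import sqlite3
from datetime import datetime, timedelta


def cleanup_scans(conn: sqlite3.Connection, now: datetime,
                  retention_days: int = 90, max_scans: int = 1000) -> int:
    """Apply both retention rules and return the number of scans deleted."""
    # ISO 8601 strings in one consistent format sort chronologically as text.
    cutoff = (now - timedelta(days=retention_days)).isoformat()
    with conn:  # both deletes commit together or not at all
        too_old = conn.execute(
            "DELETE FROM scans WHERE start_time < ?", (cutoff,)
        ).rowcount
        over_limit = conn.execute(
            "DELETE FROM scans WHERE id NOT IN "
            "(SELECT id FROM scans ORDER BY start_time DESC LIMIT ?)",
            (max_scans,),
        ).rowcount
    return too_old + over_limit
```

The sketch removes rows from scans only. Findings, traffic logs, and metrics would need ON DELETE CASCADE foreign keys or explicit deletes, and report files on disk would have to be removed separately.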

Sources: README.md:134, .gitignore:50-53

Historical Comparison and Regression Detection

WSHawk's persistent history enables security teams to track remediation progress over time.

Comparison Workflow

sequenceDiagram
    participant User as "Security Analyst"
    participant Dashboard as "Web Dashboard"
    participant DB as "scans.db"
    participant Diff as "Comparison Engine"
    
    User->>Dashboard: "Select Scan A (Baseline)"
    Dashboard->>DB: "SELECT * FROM scans WHERE id=A"
    DB-->>Dashboard: "Scan A data (10 vulns)"
    
    User->>Dashboard: "Select Scan B (Current)"
    Dashboard->>DB: "SELECT * FROM scans WHERE id=B"
    DB-->>Dashboard: "Scan B data (3 vulns)"
    
    Dashboard->>Diff: "Compare(A, B)"
    
    Diff->>Diff: "Identify Fixed: 7 vulns"
    Diff->>Diff: "Identify Persistent: 3 vulns"
    Diff->>Diff: "Identify New: 0 vulns"
    
    Diff-->>Dashboard: "Comparison Report"
    Dashboard-->>User: "Display: 7 Fixed, 3 Persistent, 0 New"

Regression Detection

When comparing two scans of the same target:

Fixed Vulnerabilities: Present in baseline scan but absent in current scan (indicates successful remediation)
Persistent Vulnerabilities: Present in both scans with identical payloads (indicates incomplete remediation)
New Vulnerabilities: Absent in baseline but present in current scan (indicates regression or new attack surface)
Changed Exploitability: Same vulnerability type but different CVSS score (indicates partial fix or changed conditions)

This historical analysis is critical for tracking security posture improvements and identifying regressions introduced by code changes.
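
The four categories can be expressed as a small set comparison. The sketch below is one possible reading of the rules, not WSHawk's comparison engine: findings are matched on the pair (vuln_type, payload), a match with an equal CVSS score counts as persistent, and a match with a different score counts as changed. The Finding and Comparison types and the compare_scans function are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Finding:
    vuln_type: str
    payload: str
    cvss_score: float


@dataclass
class Comparison:
    fixed: list = field(default_factory=list)       # only in baseline
    persistent: list = field(default_factory=list)  # in both, same score
    new: list = field(default_factory=list)         # only in current
    changed: list = field(default_factory=list)     # in both, score differs


def compare_scans(baseline: list[Finding], current: list[Finding]) -> Comparison:
    """Classify findings from two scans of the same target."""
    # Assumption: (vuln_type, payload) identifies "the same" finding across scans.
    base = {(f.vuln_type, f.payload): f for f in baseline}
    curr = {(f.vuln_type, f.payload): f for f in current}
    result = Comparison()
    for key, finding in base.items():
        if key not in curr:
            result.fixed.append(finding)
        elif curr[key].cvss_score != finding.cvss_score:
            result.changed.append((finding, curr[key]))
        else:
            result.persistent.append(finding)
    result.new = [f for key, f in curr.items() if key not in base]
    return result
```

Matching on the exact payload is strict: a fix that blocks one payload while a variant still succeeds would show up as one fixed and one new finding. Matching on vulnerability type and injection point instead would be a reasonable alternative.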

Sources: docs/V3_COMPLETE_GUIDE.md:307

Database Performance and Optimization

Indexing Strategy

To support fast queries on large scan histories, WSHawk creates indexes on the columns most frequently used for filtering and sorting:

-- Conceptual indexes (actual implementation in scanner code)
CREATE INDEX idx_scan_timestamp ON scans(start_time DESC);
CREATE INDEX idx_scan_target ON scans(target_url);
CREATE INDEX idx_vuln_severity ON vulnerabilities(severity, cvss_score DESC);
CREATE INDEX idx_vuln_type ON vulnerabilities(vuln_type);
CREATE INDEX idx_traffic_scan ON traffic_logs(scan_id, timestamp);
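
Whether a query actually uses one of these indexes can be checked with SQLite's EXPLAIN QUERY PLAN. The sketch below creates two of the conceptual indexes on a trimmed scans table and returns the planner's description of a query; the helper names build_indexed_db and query_plan are hypothetical.

```python
import sqlite3


def build_indexed_db() -> sqlite3.Connection:
    """Create a trimmed scans table with two of the conceptual indexes."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE scans (
            id INTEGER PRIMARY KEY, target_url TEXT,
            start_time TEXT, status TEXT);
        CREATE INDEX idx_scan_timestamp ON scans(start_time DESC);
        CREATE INDEX idx_scan_target ON scans(target_url);
    """)
    return conn


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> list[str]:
    """Return the human-readable steps SQLite plans to take for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # Each row is (id, parent, notused, detail); the detail column holds the text.
    return [row[3] for row in rows]
```

For example, query_plan(build_indexed_db(), "SELECT id, status FROM scans WHERE target_url = ?", ("ws://example.test",)) yields a step that names idx_scan_target. The exact wording of the plan text varies between SQLite versions, so it is safer to look for the index name than to match the whole string.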

Query Optimization

Common dashboard queries are optimized for performance:

| Query Type | Optimization | Expected Performance |
|-----------|--------------|---------------------|
| Recent Scans List | idx_scan_timestamp descending | < 50ms for 10,000 scans |
| Vulnerability by Severity | idx_vuln_severity with covering index | < 100ms for 50,000 findings |
| Target Scan History | idx_scan_target with JOIN | < 200ms for 1,000 target scans |
| Traffic Log Replay | idx_traffic_scan sequential read | Streaming, no memory limit |

Disk Space Considerations

The SQLite database grows based on scan activity:

Typical Scan: 2-5 MB (22,000 payloads, 1 hour scan)
With Full Traffic Logging: 20-50 MB (every frame recorded)
With Screenshots: +5 MB per XSS finding (Playwright evidence)
Annual Production Usage: ~50 GB (daily scans, 1 year retention)

For long-term deployments, consider implementing retention policies or archiving old scans to separate databases.

Sources: docs/V3_COMPLETE_GUIDE.md:293-297

Docker Volume Persistence

For containerized deployments, proper volume mounting ensures scan history survives container recreation.

Volume Mount Configuration

# Recommended Docker run command for persistence
docker run --rm \
  -v wshawk_data:/app/.wshawk \
  -v wshawk_reports:/app/reports \
  rothackers/wshawk --web --host 0.0.0.0

Docker Compose Configuration

# Example docker-compose.yml excerpt
services:
  wshawk:
    image: rothackers/wshawk:latest
    volumes:
      - wshawk_data:/app/.wshawk      # Database persistence
      - wshawk_reports:/app/reports    # Report file persistence
    environment:
      - WSHAWK_WEB_PASSWORD=${WEB_PASSWORD}

volumes:
  wshawk_data:
    driver: local
  wshawk_reports:
    driver: local

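Without volume mounts, the database and reports live only in the container's writable layer and are lost when the container is removed (immediately on stop when it was started with --rm, as in the docker run command above). The volume approach ensures that: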

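scans.db persists across container restarts
The WAL companion files (scans.db-wal, scans.db-shm) stay in the same directory as scans.db, which SQLite requires for WAL recovery
Report HTML files and screenshots remain accessible
Multiple container instances on the same host can share the same data volume (SQLite handles the file locking); WAL mode relies on shared memory, so it does not work across hosts on a network filesystem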

Sources: README.md:64-78

Summary

WSHawk's Infrastructure Persistence Plane provides enterprise-grade data durability through:

SQLite with WAL Mode: Zero-loss persistence even during crashes
Comprehensive Schema: Stores scan metadata, vulnerabilities, traffic logs, and performance metrics
Web Dashboard Interface: Visual scan history browsing and management
Historical Analysis: Comparison and regression detection across scans
Production-Ready: Optimized indexes, volume persistence, and retention policies

This architecture transforms WSHawk from a one-time scanning tool into a persistent security monitoring platform suitable for continuous assessment and long-term security posture tracking.

Sources: README.md:23, README.md:112-136, RELEASE_SUMMARY.md:15-19, docs/V3_COMPLETE_GUIDE.md:122-125, docs/V3_COMPLETE_GUIDE.md:289-307