Infrastructure Persistence Layer


Purpose and Scope

The Infrastructure Persistence Layer implements WSHawk's "zero-loss persistence" philosophy, ensuring that all security data—scan results, vulnerabilities, traffic logs, and reports—survives process crashes, network interruptions, and system restarts. This layer combines SQLite database storage with file system artifacts to provide comprehensive historical tracking and audit capabilities.

This document covers the database architecture, WAL mode configuration, file system organization, and crash recovery mechanisms. For information about the Web Management Dashboard that consumes this persistence layer, see Web Management Dashboard. For details on report generation and export formats, see Report Format and Output.


Architecture Overview

The persistence layer operates on two parallel storage systems: a SQLite database for structured metadata and queryable history, and a file system hierarchy for binary artifacts and generated reports.

Persistence Architecture Diagram

graph TB
    subgraph "Scanning Components"
        Scanner["WSHawkV2"]
        Session["SessionHijackingTester"]
        Browser["HeadlessBrowserXSSVerifier"]
        OAST["OASTProvider"]
    end
    
    subgraph "Persistence Layer"
        DBLayer["Database Layer"]
        FSLayer["File System Layer"]
    end
    
    subgraph "SQLite Database (~/.wshawk/scans.db)"
        ScansTable["scans table<br/>metadata, status, timestamps"]
        VulnsTable["vulnerabilities table<br/>findings, CVSS, payloads"]
        LogsTable["traffic_logs table<br/>request/response frames"]
        StatsTable["scan_statistics table<br/>performance metrics"]
    end
    
    subgraph "File System (~/.wshawk/)"
        Reports["reports/<br/>wshawk_report_*.html"]
        Screenshots["screenshots/<br/>xss_verification_*.png"]
        Traffic["traffic/<br/>websocket_frames_*.log"]
        SARIF["exports/<br/>wshawk_*.sarif"]
    end
    
    subgraph "Consumers"
        WebDash["Web Dashboard<br/>Flask App"]
        RESTAPI["REST API<br/>/api/scans"]
        CLI["CLI Report Viewer"]
    end
    
    Scanner --> DBLayer
    Session --> DBLayer
    Browser --> FSLayer
    OAST --> DBLayer
    
    DBLayer --> ScansTable
    DBLayer --> VulnsTable
    DBLayer --> LogsTable
    DBLayer --> StatsTable
    
    FSLayer --> Reports
    FSLayer --> Screenshots
    FSLayer --> Traffic
    FSLayer --> SARIF
    
    ScansTable --> WebDash
    VulnsTable --> WebDash
    Reports --> WebDash
    
    ScansTable --> RESTAPI
    VulnsTable --> RESTAPI
    
    Reports --> CLI

Sources: README.md:130-135, RELEASE_SUMMARY.md:15-19, docs/V3_COMPLETE_GUIDE.md:121-125


SQLite Database Layer

Database Location and Initialization

WSHawk stores its persistent database at ~/.wshawk/scans.db. This location ensures user-level isolation and survives Python virtual environment changes. The database is automatically created on first scan or when launching the web dashboard.

| Configuration | Value | Purpose |
|---------------|-------|---------|
| Database Path | ~/.wshawk/scans.db | Primary database file |
| WAL File | ~/.wshawk/scans.db-wal | Write-ahead log for crash recovery |
| Shared Memory | ~/.wshawk/scans.db-shm | WAL index for concurrent access |
| Journal Mode | WAL | Write-Ahead Logging enabled |
| Synchronous Mode | NORMAL | Balance between safety and performance |

Sources: README.md:132, RELEASE_SUMMARY.md:17, .gitignore:90-92

WAL Mode Configuration

Write-Ahead Logging (WAL) is critical to WSHawk's zero-loss guarantee. Unlike rollback journaling, WAL allows readers to access the database while a scan is writing vulnerability data, and ensures that committed transactions survive process termination.

WAL Transaction Flow

sequenceDiagram
    participant Scanner as "WSHawkV2"
    participant DB as "scans.db"
    participant WAL as "scans.db-wal"
    participant Reader as "Web Dashboard"
    
    Scanner->>DB: "BEGIN TRANSACTION"
    Scanner->>WAL: "Write scan metadata"
    Scanner->>WAL: "Write vulnerability #1"
    Scanner->>WAL: "Write vulnerability #2"
    
    Note over Scanner,WAL: "Process crashes here"
    
    Reader->>DB: "SELECT * FROM scans"
    DB->>WAL: "Check uncommitted data"
    WAL-->>DB: "Return last checkpoint"
    DB-->>Reader: "All committed scans"
    
    Scanner->>WAL: "COMMIT TRANSACTION"
    WAL->>DB: "Checkpoint (merge WAL to main DB)"

WAL Advantages:

  • Concurrent Access: Readers don't block writers, enabling real-time dashboard updates during scans
  • Crash Recovery: Uncommitted transactions in WAL are discarded; committed transactions are preserved
  • Performance: Sequential writes to WAL are faster than random writes to main database
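The WAL configuration described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not WSHawk's actual code; the function name open_scan_db is hypothetical:

```python
import sqlite3
from pathlib import Path

def open_scan_db(db_path: str) -> sqlite3.Connection:
    """Open a scans database with WAL journaling and NORMAL synchronous,
    matching the configuration table above."""
    Path(db_path).parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    # Write-ahead logging: readers never block the writer, and committed
    # transactions survive process termination.
    conn.execute("PRAGMA journal_mode=WAL")
    # NORMAL syncs at checkpoints rather than every commit, trading a
    # little power-loss durability for write throughput while remaining
    # consistent after a process crash.
    conn.execute("PRAGMA synchronous=NORMAL")
    return conn
```

Once set, journal_mode=WAL is persistent in the database file, so subsequent connections inherit it automatically.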

Sources: docs/V3_COMPLETE_GUIDE.md:123-125, RELEASE_SUMMARY.md:15-19

Database Schema

The database schema is optimized for fast historical queries, vulnerability lookups, and traffic log retrieval. The exact implementation can be found in the web dashboard initialization code, but the logical structure follows this design:

Database Schema Diagram

erDiagram
    scans ||--o{ vulnerabilities : "contains"
    scans ||--o{ traffic_logs : "captures"
    scans ||--|| scan_statistics : "has"
    
    scans {
        INTEGER id PK
        TEXT target_url
        TEXT scan_mode
        DATETIME start_time
        DATETIME end_time
        TEXT status
        TEXT report_path
        INTEGER vuln_count
    }
    
    vulnerabilities {
        INTEGER id PK
        INTEGER scan_id FK
        TEXT vuln_type
        TEXT severity
        REAL cvss_score
        TEXT cvss_vector
        TEXT payload
        TEXT evidence
        TEXT confidence
        DATETIME detected_at
    }
    
    traffic_logs {
        INTEGER id PK
        INTEGER scan_id FK
        TEXT direction
        TEXT frame_type
        TEXT payload
        INTEGER frame_size
        DATETIME timestamp
    }
    
    scan_statistics {
        INTEGER scan_id FK
        INTEGER total_payloads
        INTEGER frames_sent
        INTEGER frames_received
        REAL avg_response_time
        INTEGER rate_limit_hits
    }

Key Design Decisions:

| Decision | Rationale |
|----------|-----------|
| Indexed scan_id | Enables fast filtering by scan for dashboard pagination |
| TEXT for payloads | SQLite handles large text blobs efficiently; no binary encoding needed |
| Separate statistics table | Keeps scan metadata lightweight for list queries |
| DATETIME with ISO8601 | SQLite's native date functions for time-range queries |
| No CASCADE DELETE | Preserves orphaned data for forensic analysis if scan record corrupted |
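Based on the diagram above, the two central tables might be created with DDL along these lines. This is a sketch derived from the logical schema; WSHawk's exact column constraints are not documented here:

```python
import sqlite3

# Assumed DDL for the scans and vulnerabilities tables from the erDiagram.
# The scan_id index reflects the "Indexed scan_id" design decision.
SCHEMA = """
CREATE TABLE IF NOT EXISTS scans (
    id          INTEGER PRIMARY KEY,
    target_url  TEXT,
    scan_mode   TEXT,
    start_time  DATETIME,
    end_time    DATETIME,
    status      TEXT,
    report_path TEXT,
    vuln_count  INTEGER
);
CREATE TABLE IF NOT EXISTS vulnerabilities (
    id          INTEGER PRIMARY KEY,
    scan_id     INTEGER REFERENCES scans(id),
    vuln_type   TEXT,
    severity    TEXT,
    cvss_score  REAL,
    cvss_vector TEXT,
    payload     TEXT,
    evidence    TEXT,
    confidence  TEXT,
    detected_at DATETIME
);
CREATE INDEX IF NOT EXISTS idx_vulns_scan ON vulnerabilities(scan_id);
"""

def init_schema(conn: sqlite3.Connection) -> None:
    """Create the tables if they do not exist (idempotent)."""
    conn.executescript(SCHEMA)
```

Note there is no ON DELETE CASCADE clause, consistent with the design decision to preserve orphaned rows for forensics.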

Sources: docs/V3_COMPLETE_GUIDE.md:293-297, README.md:132-135


File System Storage

The persistence layer organizes file system artifacts in a structured hierarchy under ~/.wshawk/. Each artifact type has a dedicated subdirectory to prevent naming collisions and enable efficient cleanup.

File System Hierarchy

graph TB
    Root["~/.wshawk/"]
    
    Root --> DB["scans.db<br/>scans.db-wal<br/>scans.db-shm"]
    Root --> Reports["reports/"]
    Root --> Screenshots["screenshots/"]
    Root --> Traffic["traffic/"]
    Root --> Exports["exports/"]
    Root --> Config["wshawk.yaml"]
    
    Reports --> HTMLReport["wshawk_report_20250115_143022.html<br/>wshawk_report_20250115_150311.html"]
    Screenshots --> XSSScreenshots["xss_verification_1737123456.png<br/>xss_verification_1737124789.png"]
    Traffic --> FrameLogs["websocket_frames_20250115_143022.log<br/>raw_traffic_20250115_143022.pcap"]
    Exports --> SARIFExport["wshawk_20250115_143022.sarif<br/>wshawk_20250115_143022.json<br/>wshawk_20250115_143022.csv"]
    
    style DB fill:#f9f9f9
    style Reports fill:#f9f9f9
    style Screenshots fill:#f9f9f9
    style Traffic fill:#f9f9f9
    style Exports fill:#f9f9f9

Report Files

HTML reports are the primary human-readable output format. Each report is self-contained with embedded CSS and includes CVSS scores, screenshots, and remediation guidance.

| File Pattern | Example | Contents |
|--------------|---------|----------|
| wshawk_report_YYYYMMDD_HHMMSS.html | wshawk_report_20250115_143022.html | Complete vulnerability report with evidence |

The path of each report is stored in the scans.report_path column, linking the scan record to its report file for the dashboard.

Sources: README.md:185, docs/V3_COMPLETE_GUIDE.md:293-297

Screenshots

Playwright-generated screenshots are stored as PNG files with timestamps. These provide visual evidence of XSS exploitation in real browsers.

| File Pattern | Example | Purpose |
|--------------|---------|---------|
| xss_verification_&lt;timestamp&gt;.png | xss_verification_1737123456.png | Browser screenshot showing alert dialog execution |

Screenshot paths are referenced in the vulnerabilities.evidence field, where the database stores them as JSON.

Sources: README.md:177

Traffic Logs

Complete WebSocket frame captures enable deep forensic analysis. Traffic logs contain both the raw binary frames and decoded text payloads.

| File Pattern | Example | Format |
|--------------|---------|--------|
| websocket_frames_YYYYMMDD_HHMMSS.log | websocket_frames_20250115_143022.log | Text format with timestamps and directions |

Critical frames are optionally mirrored in the traffic_logs table for query performance.

Sources: README.md:181, docs/V3_COMPLETE_GUIDE.md:297

Export Formats

Multi-format exports support integration with external platforms and SIEM systems.

| Format | File Pattern | Use Case |
|--------|--------------|----------|
| SARIF | wshawk_*.sarif | GitHub Security tab integration |
| JSON | wshawk_*.json | Machine-readable structured data |
| CSV | wshawk_*.csv | Spreadsheet analysis and reporting |

Sources: README.md:54, RELEASE_SUMMARY.md:53-54


Zero-Loss Persistence Design

WSHawk's persistence layer implements several guarantees to prevent data loss during adverse conditions:

Zero-Loss Guarantees

| Scenario | Mechanism | Recovery |
|----------|-----------|----------|
| Process Crash | WAL mode commits | Restart resumes from last checkpoint; uncommitted work discarded |
| Network Interruption | In-memory buffer flush | Periodic WAL checkpoints (every 1000 frames) |
| Disk Full | Pre-flight space check | Scan aborts gracefully with error; existing data preserved |
| Database Corruption | WAL integrity check | Automatic rollback to last valid checkpoint |
| Concurrent Access | SQLite locking | Readers never block writers; writers queue with timeout |

Data Persistence Flow During Scan

sequenceDiagram
    participant Scanner as "WSHawkV2"
    participant Buffer as "In-Memory Buffer"
    participant WAL as "WAL File"
    participant DB as "Main DB"
    participant Disk as "File System"
    
    Scanner->>Buffer: "Vulnerability detected"
    Scanner->>Buffer: "Store metadata"
    
    alt "Buffer reaches threshold (100 items)"
        Buffer->>WAL: "BEGIN TRANSACTION"
        Buffer->>WAL: "INSERT vulnerabilities"
        Buffer->>WAL: "INSERT traffic_logs"
        WAL->>WAL: "COMMIT TRANSACTION"
        Note over WAL: "Data now durable"
    end
    
    Scanner->>Disk: "Write HTML report"
    Scanner->>Disk: "Write screenshot"
    
    alt "Scan completes"
        Scanner->>WAL: "UPDATE scan status = 'completed'"
        WAL->>DB: "Checkpoint (merge WAL)"
    end
    
    alt "Process crashes"
        Note over WAL,DB: "Uncommitted buffer lost"
        Note over WAL: "Committed transactions safe"
    end

Checkpoint Strategy:

  • Frequency: Every 1000 vulnerability records or 10 MB WAL size
  • Mode: PASSIVE - doesn't block ongoing transactions
  • Fallback: Manual checkpoint on scan completion
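The checkpoint strategy above can be sketched as a helper around SQLite's wal_checkpoint pragma. The function maybe_checkpoint is hypothetical, illustrating only the record-count trigger and PASSIVE mode:

```python
import sqlite3

def maybe_checkpoint(conn: sqlite3.Connection, records_written: int,
                     interval: int = 1000) -> bool:
    """Run a PASSIVE WAL checkpoint every `interval` records.

    PASSIVE merges as much of the WAL into the main database as it can
    without blocking or waiting for other connections. Returns True when
    a checkpoint ran to completion, False otherwise."""
    if records_written % interval != 0:
        return False
    # wal_checkpoint returns (busy, wal_frames, checkpointed_frames);
    # busy == 0 means the checkpoint was not obstructed by other readers.
    busy, wal_frames, ckpt_frames = conn.execute(
        "PRAGMA wal_checkpoint(PASSIVE)").fetchone()
    return busy == 0
```

On scan completion, the same pragma would be invoked unconditionally as the fallback manual checkpoint.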

Sources: docs/V3_COMPLETE_GUIDE.md:96, docs/V3_COMPLETE_GUIDE.md:123-125


Integration Points

The persistence layer exposes data through multiple interfaces to support different use cases:

Persistence Consumer Interfaces

graph LR
    subgraph "Persistence Layer"
        DB[("scans.db<br/>(SQLite)")]
        FS["File System<br/>(~/.wshawk/)"]
    end
    
    subgraph "Direct Consumers"
        Flask["Flask Web App<br/>web/app.py"]
        API["REST API<br/>/api/scans"]
    end
    
    subgraph "External Integrations"
        Jira["Jira Integration<br/>Read vuln details"]
        DD["DefectDojo<br/>Push findings"]
        Webhooks["Webhooks<br/>Scan completion alerts"]
    end
    
    subgraph "CLI Tools"
        ReportViewer["wshawk --view-scan ID"]
        ExportTool["wshawk --export-sarif"]
    end
    
    DB --> Flask
    DB --> API
    FS --> Flask
    
    DB --> Jira
    DB --> DD
    DB --> Webhooks
    
    DB --> ReportViewer
    FS --> ReportViewer
    DB --> ExportTool

Web Dashboard Integration

The Flask web application queries the database for scan history display and provides a REST API for programmatic access. Dashboard access is authenticated using SHA-256 password hashing.

| Endpoint | Query | Purpose |
|----------|-------|---------|
| GET /scans | SELECT * FROM scans ORDER BY start_time DESC LIMIT 50 | List recent scans |
| GET /scan/&lt;id&gt; | SELECT * FROM vulnerabilities WHERE scan_id = ? | Vulnerability details |
| GET /scan/&lt;id&gt;/report | Read report_path from database, serve file | HTML report download |
| POST /api/scans | INSERT INTO scans (target_url, ...) VALUES (?, ...) | Trigger new scan |
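Stripped of the Flask scaffolding, the queries behind the first two endpoints might look like this. The function names are illustrative, not WSHawk's actual API:

```python
import sqlite3

def list_recent_scans(conn: sqlite3.Connection, limit: int = 50):
    """Query behind GET /scans: most recent scans first, paginated."""
    conn.row_factory = sqlite3.Row
    rows = conn.execute(
        "SELECT * FROM scans ORDER BY start_time DESC LIMIT ?", (limit,))
    return [dict(r) for r in rows]

def scan_vulnerabilities(conn: sqlite3.Connection, scan_id: int):
    """Query behind GET /scan/<id>: all findings for one scan.

    The filter on scan_id is fast because the column is indexed."""
    conn.row_factory = sqlite3.Row
    rows = conn.execute(
        "SELECT * FROM vulnerabilities WHERE scan_id = ?", (scan_id,))
    return [dict(r) for r in rows]
```

Parameterized queries (the ? placeholders) keep the endpoints safe from SQL injection via user-supplied scan IDs.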

Sources: README.md:112-135, RELEASE_SUMMARY.md:15-19

External Platform Integration

Integration modules read from the database to construct API payloads for external platforms:

  • Jira: Reads vulnerabilities table filtered by severity >= HIGH, creates tickets with CVSS details
  • DefectDojo: Exports all findings for a scan as JSON, POSTs to /api/v2/import-scan/
  • Webhooks: Aggregates scan statistics on completion, sends rich notifications to Slack/Discord/Teams

Sources: RELEASE_SUMMARY.md:30-34


Data Lifecycle Management

Scan Lifecycle States

The scans.status field tracks the execution state of each scan:

| Status | Meaning | Persistence State |
|--------|---------|-------------------|
| pending | Scan queued via API | Row created, no vulnerabilities yet |
| running | Active scan in progress | Vulnerabilities being written incrementally |
| completed | Scan finished successfully | Final report generated, all data committed |
| failed | Scan terminated with error | Partial data preserved, error logged |
| cancelled | User-initiated abort | Data up to cancellation point preserved |

Retention and Cleanup

WSHawk does not implement automatic data retention policies. All scans are preserved indefinitely unless manually deleted. The web dashboard provides a cleanup interface:

  • Delete Scan: Removes row from scans table, deletes associated report file
  • Cascade Behavior: Manually deletes related vulnerabilities and traffic_logs rows
  • Orphan Detection: Dashboard UI flags reports on disk without database entries
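Because the schema deliberately avoids CASCADE DELETE, a delete operation must remove the related rows and the report file itself. A sketch of that manual cascade (delete_scan is a hypothetical helper, not WSHawk code):

```python
import sqlite3
from pathlib import Path

def delete_scan(conn: sqlite3.Connection, scan_id: int) -> None:
    """Delete one scan plus its related rows and report file.

    The related-row deletes are explicit because the schema has no
    ON DELETE CASCADE (a deliberate forensic-preservation choice)."""
    row = conn.execute(
        "SELECT report_path FROM scans WHERE id = ?", (scan_id,)).fetchone()
    if row and row[0]:
        # Remove the HTML report from disk; tolerate an already-missing file.
        Path(row[0]).unlink(missing_ok=True)
    conn.execute("DELETE FROM vulnerabilities WHERE scan_id = ?", (scan_id,))
    conn.execute("DELETE FROM traffic_logs WHERE scan_id = ?", (scan_id,))
    conn.execute("DELETE FROM scans WHERE id = ?", (scan_id,))
    conn.commit()
```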

Manual Cleanup Commands:

# Delete database (preserves reports)
rm ~/.wshawk/scans.db*

# Delete all artifacts
rm -rf ~/.wshawk/

# Delete reports older than 30 days
find ~/.wshawk/reports/ -name "*.html" -mtime +30 -delete

Sources: .gitignore:90-96, docs/V3_COMPLETE_GUIDE.md:293-297


Performance Characteristics

The persistence layer is optimized for write-heavy workloads with occasional read queries:

| Operation | Performance | Notes |
|-----------|-------------|-------|
| Write vulnerability | ~0.1ms (buffered) | In-memory buffer, periodic flush |
| Write traffic log | ~0.05ms (buffered) | Higher throughput than vulnerabilities |
| Query scan list | ~10ms | Indexed by start_time DESC |
| Query vulnerability details | ~50ms | Full-text search on payloads possible |
| Generate HTML report | ~500ms | Template rendering + file I/O |
| WAL checkpoint | ~200ms | Non-blocking, passive mode |

Scaling Limits:

  • Max database size: Limited by SQLite (281 TB theoretical, 100 GB practical)
  • Max concurrent readers: Unlimited (WAL mode)
  • Max concurrent writers: 1 (SQLite limitation)
  • Recommended cleanup: Every 10,000 scans or 1 GB database size

Sources: docs/V3_COMPLETE_GUIDE.md:123-125


Configuration Options

The persistence layer can be configured through environment variables and the wshawk.yaml configuration file:

| Setting | Environment Variable | Default | Purpose |
|---------|---------------------|---------|---------|
| Database Path | WSHAWK_DB_PATH | ~/.wshawk/scans.db | Override default database location |
| WAL Mode | WSHAWK_WAL_MODE | 1 (enabled) | Disable for network file systems |
| Checkpoint Interval | WSHAWK_WAL_CHECKPOINT | 1000 | Records between checkpoints |
| Reports Directory | WSHAWK_REPORTS_DIR | ~/.wshawk/reports/ | Custom report output location |
| Log Retention | WSHAWK_LOG_RETENTION_DAYS | 0 (infinite) | Auto-delete old traffic logs |
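A sketch of the path resolution these settings imply. The precedence order here (environment variable over YAML over default) is an assumption, and resolve_db_path is an illustrative name:

```python
import os
from pathlib import Path
from typing import Optional

def resolve_db_path(config: Optional[dict] = None) -> Path:
    """Resolve the database path: WSHAWK_DB_PATH env var first,
    then the persistence.database_path key from wshawk.yaml,
    then the documented default of ~/.wshawk/scans.db."""
    env = os.environ.get("WSHAWK_DB_PATH")
    if env:
        return Path(env)
    if config and config.get("persistence", {}).get("database_path"):
        return Path(config["persistence"]["database_path"])
    return Path.home() / ".wshawk" / "scans.db"
```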

Example Configuration:

# wshawk.yaml
persistence:
  database_path: "/data/wshawk/scans.db"
  wal_mode: true
  checkpoint_interval: 5000
  reports_directory: "/data/wshawk/reports/"
  log_retention_days: 90

Sources: README.md:137-150


Summary

The Infrastructure Persistence Layer provides enterprise-grade data durability through SQLite WAL mode, structured file system organization, and zero-loss crash recovery mechanisms. The combination of database storage for queryable metadata and file system artifacts for binary evidence enables comprehensive historical analysis, real-time dashboard updates, and seamless integration with external security platforms.

Key Design Principles:

  1. Zero-Loss Guarantee: WAL mode ensures committed transactions survive all crash scenarios
  2. Concurrent Access: Readers never block writers, enabling real-time dashboard during scans
  3. Structured Organization: Clear separation between database metadata and file system artifacts
  4. Integration-Ready: Direct SQL access for custom integrations and external platforms

Sources: docs/V3_COMPLETE_GUIDE.md:96-98, docs/V3_COMPLETE_GUIDE.md:121-131, RELEASE_SUMMARY.md:15-19