Infrastructure Persistence Layer
The following files were used as context for generating this wiki page:
- .github/workflows/ghcr-publish.yml
- .gitignore
- LICENSE
- README.md
- RELEASE_3.0.0.md
- RELEASE_SUMMARY.md
- docs/V3_COMPLETE_GUIDE.md
Purpose and Scope
The Infrastructure Persistence Layer implements WSHawk's "zero-loss persistence" philosophy, ensuring that all security data—scan results, vulnerabilities, traffic logs, and reports—survives process crashes, network interruptions, and system restarts. This layer combines SQLite database storage with file system artifacts to provide comprehensive historical tracking and audit capabilities.
This document covers the database architecture, WAL mode configuration, file system organization, and crash recovery mechanisms. For information about the Web Management Dashboard that consumes this persistence layer, see Web Management Dashboard. For details on report generation and export formats, see Report Format and Output.
Architecture Overview
The persistence layer operates on two parallel storage systems: a SQLite database for structured metadata and queryable history, and a file system hierarchy for binary artifacts and generated reports.
Persistence Architecture Diagram
```mermaid
graph TB
subgraph "Scanning Components"
Scanner["WSHawkV2"]
Session["SessionHijackingTester"]
Browser["HeadlessBrowserXSSVerifier"]
OAST["OASTProvider"]
end
subgraph "Persistence Layer"
DBLayer["Database Layer"]
FSLayer["File System Layer"]
end
subgraph "SQLite Database (~/.wshawk/scans.db)"
ScansTable["scans table<br/>metadata, status, timestamps"]
VulnsTable["vulnerabilities table<br/>findings, CVSS, payloads"]
LogsTable["traffic_logs table<br/>request/response frames"]
StatsTable["scan_statistics table<br/>performance metrics"]
end
subgraph "File System (~/.wshawk/)"
Reports["reports/<br/>wshawk_report_*.html"]
Screenshots["screenshots/<br/>xss_verification_*.png"]
Traffic["traffic/<br/>websocket_frames_*.log"]
SARIF["exports/<br/>wshawk_*.sarif"]
end
subgraph "Consumers"
WebDash["Web Dashboard<br/>Flask App"]
RESTAPI["REST API<br/>/api/scans"]
CLI["CLI Report Viewer"]
end
Scanner --> DBLayer
Session --> DBLayer
Browser --> FSLayer
OAST --> DBLayer
DBLayer --> ScansTable
DBLayer --> VulnsTable
DBLayer --> LogsTable
DBLayer --> StatsTable
FSLayer --> Reports
FSLayer --> Screenshots
FSLayer --> Traffic
FSLayer --> SARIF
ScansTable --> WebDash
VulnsTable --> WebDash
Reports --> WebDash
ScansTable --> RESTAPI
VulnsTable --> RESTAPI
Reports --> CLI
```
Sources: README.md:130-135, RELEASE_SUMMARY.md:15-19, docs/V3_COMPLETE_GUIDE.md:121-125
SQLite Database Layer
Database Location and Initialization
WSHawk stores its persistent database at ~/.wshawk/scans.db. This location ensures user-level isolation and survives Python virtual environment changes. The database is automatically created on first scan or when launching the web dashboard.
| Configuration | Value | Purpose |
|--------------|-------|---------|
| Database Path | ~/.wshawk/scans.db | Primary database file |
| WAL File | ~/.wshawk/scans.db-wal | Write-ahead log for crash recovery |
| Shared Memory | ~/.wshawk/scans.db-shm | WAL index for concurrent access |
| Journal Mode | WAL | Write-Ahead Logging enabled |
| Synchronous Mode | NORMAL | Balance between safety and performance |
Sources: README.md:132, RELEASE_SUMMARY.md:17, .gitignore:90-92
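A minimal sketch of how the settings in the table above could be applied with Python's `sqlite3` module. The function name `open_database` is illustrative, not WSHawk's actual API:

```python
import sqlite3
from pathlib import Path

DB_PATH = Path.home() / ".wshawk" / "scans.db"  # default location

def open_database(db_path: Path = DB_PATH) -> sqlite3.Connection:
    """Open the scan database with WAL journaling and NORMAL synchronous mode."""
    db_path.parent.mkdir(parents=True, exist_ok=True)  # create ~/.wshawk/ on first run
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")    # also creates scans.db-wal / scans.db-shm
    conn.execute("PRAGMA synchronous=NORMAL")  # durable at checkpoint, faster commits
    return conn
```

Both pragmas are cheap to issue on every connection: `journal_mode=WAL` is persistent in the database file, while `synchronous` is per-connection and must be set each time.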
WAL Mode Configuration
Write-Ahead Logging (WAL) is critical to WSHawk's zero-loss guarantee. Unlike rollback journaling, WAL allows readers to access the database while a scan is writing vulnerability data, and ensures that committed transactions survive process termination.
WAL Transaction Flow
```mermaid
sequenceDiagram
participant Scanner as "WSHawkV2"
participant DB as "scans.db"
participant WAL as "scans.db-wal"
participant Reader as "Web Dashboard"
Scanner->>DB: "BEGIN TRANSACTION"
Scanner->>WAL: "Write scan metadata"
Scanner->>WAL: "Write vulnerability #1"
Scanner->>WAL: "Write vulnerability #2"
Note over Scanner,WAL: "Process crashes here"
Reader->>DB: "SELECT * FROM scans"
DB->>WAL: "Check uncommitted data"
WAL-->>DB: "Return last checkpoint"
DB-->>Reader: "All committed scans"
Scanner->>WAL: "COMMIT TRANSACTION"
WAL->>DB: "Checkpoint (merge WAL to main DB)"
```
WAL Advantages:
- Concurrent Access: Readers don't block writers, enabling real-time dashboard updates during scans
- Crash Recovery: Uncommitted transactions in WAL are discarded; committed transactions are preserved
- Performance: Sequential writes to WAL are faster than random writes to main database
Sources: docs/V3_COMPLETE_GUIDE.md:123-125, RELEASE_SUMMARY.md:15-19
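The concurrency guarantee above can be demonstrated with two plain `sqlite3` connections; this is a standalone illustration of WAL semantics, not WSHawk code:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "scans.db")

# Writer connection in autocommit mode so transaction control is explicit
writer = sqlite3.connect(path, isolation_level=None)
writer.execute("PRAGMA journal_mode=WAL")
writer.execute("CREATE TABLE scans (id INTEGER PRIMARY KEY, status TEXT)")
writer.execute("INSERT INTO scans (status) VALUES ('completed')")

# Open a write transaction and leave it uncommitted, like the crash window above
writer.execute("BEGIN IMMEDIATE")
writer.execute("INSERT INTO scans (status) VALUES ('running')")

# A concurrent reader is not blocked and sees only the committed snapshot
reader = sqlite3.connect(path)
rows = reader.execute("SELECT COUNT(*) FROM scans").fetchone()[0]
print(rows)  # 1: the uncommitted insert is invisible until COMMIT
```

Under rollback journaling the reader would instead block (or fail with `database is locked`) while the write transaction is open.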
Database Schema
The database schema is optimized for fast historical queries, vulnerability lookups, and traffic log retrieval. The exact implementation can be found in the web dashboard initialization code, but the logical structure follows this design:
Database Schema Diagram
```mermaid
erDiagram
scans ||--o{ vulnerabilities : "contains"
scans ||--o{ traffic_logs : "captures"
scans ||--|| scan_statistics : "has"
scans {
INTEGER id PK
TEXT target_url
TEXT scan_mode
DATETIME start_time
DATETIME end_time
TEXT status
TEXT report_path
INTEGER vuln_count
}
vulnerabilities {
INTEGER id PK
INTEGER scan_id FK
TEXT vuln_type
TEXT severity
REAL cvss_score
TEXT cvss_vector
TEXT payload
TEXT evidence
TEXT confidence
DATETIME detected_at
}
traffic_logs {
INTEGER id PK
INTEGER scan_id FK
TEXT direction
TEXT frame_type
TEXT payload
INTEGER frame_size
DATETIME timestamp
}
scan_statistics {
INTEGER scan_id FK
INTEGER total_payloads
INTEGER frames_sent
INTEGER frames_received
REAL avg_response_time
INTEGER rate_limit_hits
}
```
Key Design Decisions:
| Decision | Rationale |
|----------|-----------|
| Indexed `scan_id` | Enables fast filtering by scan for dashboard pagination |
| `TEXT` for payloads | SQLite handles large text blobs efficiently; no binary encoding needed |
| Separate statistics table | Keeps scan metadata lightweight for list queries |
| `DATETIME` with ISO 8601 | SQLite's native date functions support time-range queries |
| No `CASCADE DELETE` | Preserves orphaned data for forensic analysis if a scan record is corrupted |
Sources: docs/V3_COMPLETE_GUIDE.md:293-297, README.md:132-135
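The logical structure above could be expressed as DDL along the following lines. This is an illustrative reconstruction (`traffic_logs` and `scan_statistics` omitted for brevity); the shipped schema may differ in column names and constraints:

```python
import sqlite3

# Illustrative DDL for the two central tables, derived from the ER diagram above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS scans (
    id          INTEGER PRIMARY KEY,
    target_url  TEXT NOT NULL,
    scan_mode   TEXT,
    start_time  DATETIME,
    end_time    DATETIME,
    status      TEXT DEFAULT 'pending',
    report_path TEXT,
    vuln_count  INTEGER DEFAULT 0
);

CREATE TABLE IF NOT EXISTS vulnerabilities (
    id          INTEGER PRIMARY KEY,
    scan_id     INTEGER REFERENCES scans(id),
    vuln_type   TEXT,
    severity    TEXT,
    cvss_score  REAL,
    cvss_vector TEXT,
    payload     TEXT,
    evidence    TEXT,
    confidence  TEXT,
    detected_at DATETIME
);

-- Indexed scan_id for fast per-scan filtering (see design decisions above)
CREATE INDEX IF NOT EXISTS idx_vulns_scan ON vulnerabilities(scan_id);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```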
File System Storage
The persistence layer organizes file system artifacts in a structured hierarchy under ~/.wshawk/. Each artifact type has a dedicated subdirectory to prevent naming collisions and enable efficient cleanup.
File System Hierarchy
```mermaid
graph TB
Root["~/.wshawk/"]
Root --> DB["scans.db<br/>scans.db-wal<br/>scans.db-shm"]
Root --> Reports["reports/"]
Root --> Screenshots["screenshots/"]
Root --> Traffic["traffic/"]
Root --> Exports["exports/"]
Root --> Config["wshawk.yaml"]
Reports --> HTMLReport["wshawk_report_20250115_143022.html<br/>wshawk_report_20250115_150311.html"]
Screenshots --> XSSScreenshots["xss_verification_1737123456.png<br/>xss_verification_1737124789.png"]
Traffic --> FrameLogs["websocket_frames_20250115_143022.log<br/>raw_traffic_20250115_143022.pcap"]
Exports --> SARIFExport["wshawk_20250115_143022.sarif<br/>wshawk_20250115_143022.json<br/>wshawk_20250115_143022.csv"]
style DB fill:#f9f9f9
style Reports fill:#f9f9f9
style Screenshots fill:#f9f9f9
style Traffic fill:#f9f9f9
style Exports fill:#f9f9f9
```
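Creating this hierarchy is a small idempotent operation; a sketch (the helper name `ensure_layout` is illustrative, not WSHawk's actual API):

```python
from pathlib import Path

# Subdirectories from the hierarchy above; wshawk.yaml and the database
# files live directly under the root.
SUBDIRS = ("reports", "screenshots", "traffic", "exports")

def ensure_layout(root: Path) -> None:
    """Create the artifact hierarchy under root if it does not already exist."""
    for sub in SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
```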
Report Files
HTML reports are the primary human-readable output format. Each report is self-contained with embedded CSS and includes CVSS scores, screenshots, and remediation guidance.
| File Pattern | Example | Contents |
|--------------|---------|----------|
| wshawk_report_YYYYMMDD_HHMMSS.html | wshawk_report_20250115_143022.html | Complete vulnerability report with evidence |
The path of each generated report is stored in the `scans.report_path` column, which the dashboard uses to link scans to their reports.
Sources: README.md:185, docs/V3_COMPLETE_GUIDE.md:293-297
Screenshots
Playwright-generated screenshots are stored as PNG files with timestamps. These provide visual evidence of XSS exploitation in real browsers.
| File Pattern | Example | Purpose |
|--------------|---------|---------|
| xss_verification_<timestamp>.png | xss_verification_1737123456.png | Browser screenshot showing alert dialog execution |
Screenshot paths are stored as JSON in the `vulnerabilities.evidence` field.
Sources: README.md:177
Traffic Logs
Complete WebSocket frame captures enable deep forensic analysis. Traffic logs contain both the raw binary frames and decoded text payloads.
| File Pattern | Example | Format |
|--------------|---------|--------|
| websocket_frames_YYYYMMDD_HHMMSS.log | websocket_frames_20250115_143022.log | Text format with timestamps and directions |
Critical frames are optionally mirrored into the `traffic_logs` table for faster queries.
Sources: README.md:181, docs/V3_COMPLETE_GUIDE.md:297
Export Formats
Multi-format exports support integration with external platforms and SIEM systems.
| Format | File Pattern | Use Case |
|--------|--------------|----------|
| SARIF | wshawk_*.sarif | GitHub Security tab integration |
| JSON | wshawk_*.json | Machine-readable structured data |
| CSV | wshawk_*.csv | Spreadsheet analysis and reporting |
Sources: README.md:54, RELEASE_SUMMARY.md:53-54
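A minimal sketch of how vulnerability rows might map onto a SARIF 2.1.0 run for the GitHub Security tab. The function `to_sarif` and the severity-to-level mapping are illustrative assumptions, not WSHawk's actual exporter:

```python
import json

def to_sarif(findings: list[dict]) -> str:
    """Map vulnerability rows onto a minimal SARIF 2.1.0 log (illustrative)."""
    log = {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [{
            "tool": {"driver": {"name": "WSHawk", "rules": []}},
            "results": [
                {
                    "ruleId": f["vuln_type"],
                    # GitHub renders "error" as a high-severity alert
                    "level": "error" if f["severity"] in ("CRITICAL", "HIGH") else "warning",
                    "message": {"text": f"{f['vuln_type']} (CVSS {f['cvss_score']})"},
                }
                for f in findings
            ],
        }],
    }
    return json.dumps(log, indent=2)
```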
Zero-Loss Persistence Design
WSHawk's persistence layer implements several guarantees to prevent data loss during adverse conditions:
Zero-Loss Guarantees
| Scenario | Mechanism | Recovery |
|----------|-----------|----------|
| Process crash | WAL mode commits | Restart resumes from last checkpoint; uncommitted work discarded |
| Network interruption | In-memory buffer flush | Periodic WAL checkpoints (every 1000 frames) |
| Disk full | Pre-flight space check | Scan aborts gracefully with an error; existing data preserved |
| Database corruption | WAL integrity check | Automatic rollback to last valid checkpoint |
| Concurrent access | SQLite locking | Readers never block writers; writers queue with a timeout |
Data Persistence Flow During Scan
```mermaid
sequenceDiagram
participant Scanner as "WSHawkV2"
participant Buffer as "In-Memory Buffer"
participant WAL as "WAL File"
participant DB as "Main DB"
participant Disk as "File System"
Scanner->>Buffer: "Vulnerability detected"
Scanner->>Buffer: "Store metadata"
alt "Buffer reaches threshold (100 items)"
Buffer->>WAL: "BEGIN TRANSACTION"
Buffer->>WAL: "INSERT vulnerabilities"
Buffer->>WAL: "INSERT traffic_logs"
WAL->>WAL: "COMMIT TRANSACTION"
Note over WAL: "Data now durable"
end
Scanner->>Disk: "Write HTML report"
Scanner->>Disk: "Write screenshot"
alt "Scan completes"
Scanner->>WAL: "UPDATE scan status = 'completed'"
WAL->>DB: "Checkpoint (merge WAL)"
end
alt "Process crashes"
Note over WAL,DB: "Uncommitted buffer lost"
Note over WAL: "Committed transactions safe"
end
```
Checkpoint Strategy:
- Frequency: Every 1000 vulnerability records or 10 MB WAL size
- Mode: `PASSIVE` (doesn't block ongoing transactions)
- Fallback: Manual checkpoint on scan completion
Sources: docs/V3_COMPLETE_GUIDE.md:96, docs/V3_COMPLETE_GUIDE.md:123-125
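A passive checkpoint can be triggered explicitly through a pragma; a minimal sketch (the helper name `checkpoint_passive` is illustrative):

```python
import sqlite3

def checkpoint_passive(conn: sqlite3.Connection) -> int:
    """Run a PASSIVE checkpoint and return the number of WAL frames merged.

    PASSIVE never blocks readers or writers; frames still in use are simply
    left in the WAL for a later attempt.
    """
    busy, log_frames, checkpointed = conn.execute(
        "PRAGMA wal_checkpoint(PASSIVE)"
    ).fetchone()
    return checkpointed
```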
Integration Points
The persistence layer exposes data through multiple interfaces to support different use cases:
Persistence Consumer Interfaces
```mermaid
graph LR
subgraph "Persistence Layer"
DB[("scans.db<br/>(SQLite)")]
FS["File System<br/>(~/.wshawk/)"]
end
subgraph "Direct Consumers"
Flask["Flask Web App<br/>web/app.py"]
API["REST API<br/>/api/scans"]
end
subgraph "External Integrations"
Jira["Jira Integration<br/>Read vuln details"]
DD["DefectDojo<br/>Push findings"]
Webhooks["Webhooks<br/>Scan completion alerts"]
end
subgraph "CLI Tools"
ReportViewer["wshawk --view-scan ID"]
ExportTool["wshawk --export-sarif"]
end
DB --> Flask
DB --> API
FS --> Flask
DB --> Jira
DB --> DD
DB --> Webhooks
DB --> ReportViewer
FS --> ReportViewer
DB --> ExportTool
```
Web Dashboard Integration
The Flask web application queries the database for scan history display and provides a REST API for programmatic access. Dashboard access is authenticated using SHA-256 password hashing.
| Endpoint | Query | Purpose |
|----------|-------|---------|
| GET /scans | SELECT * FROM scans ORDER BY start_time DESC LIMIT 50 | List recent scans |
| GET /scan/<id> | SELECT * FROM vulnerabilities WHERE scan_id = ? | Vulnerability details |
| GET /scan/<id>/report | Read report_path from database, serve file | HTML report download |
| POST /api/scans | INSERT INTO scans (target_url, ...) VALUES (?, ...) | Trigger new scan |
Sources: README.md:112-135, RELEASE_SUMMARY.md:15-19
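The query behind `GET /scans` can be sketched as a plain `sqlite3` helper; the function name and wiring into Flask are illustrative, not the actual `web/app.py` code:

```python
import sqlite3

def list_scans(conn: sqlite3.Connection, limit: int = 50) -> list[dict]:
    """Back the GET /scans endpoint: newest scans first, capped for pagination."""
    conn.row_factory = sqlite3.Row  # rows become dict-convertible
    rows = conn.execute(
        "SELECT * FROM scans ORDER BY start_time DESC LIMIT ?", (limit,)
    ).fetchall()
    return [dict(r) for r in rows]
```

In the Flask app, a route handler would call this and return `jsonify(list_scans(db))`.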
External Platform Integration
Integration modules read from the database to construct API payloads for external platforms:
- Jira: Reads the `vulnerabilities` table filtered by severity >= HIGH, creates tickets with CVSS details
- DefectDojo: Exports all findings for a scan as JSON, POSTs to `/api/v2/import-scan/`
- Webhooks: Aggregates scan statistics on completion, sends rich notifications to Slack/Discord/Teams
Sources: RELEASE_SUMMARY.md:30-34
Data Lifecycle Management
Scan Lifecycle States
The scans.status field tracks the execution state of each scan:
| Status | Meaning | Persistence State |
|--------|---------|-------------------|
| pending | Scan queued via API | Row created, no vulnerabilities yet |
| running | Active scan in progress | Vulnerabilities being written incrementally |
| completed | Scan finished successfully | Final report generated, all data committed |
| failed | Scan terminated with error | Partial data preserved, error logged |
| cancelled | User-initiated abort | Data up to cancellation point preserved |
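The status column also supports crash detection on startup: any scan still in a non-terminal state after a restart must have been interrupted. The following sweep is a hypothetical illustration (the helper name and the choice of `failed` as the recovery state are assumptions, not documented WSHawk behavior):

```python
import sqlite3

def mark_interrupted_scans(conn: sqlite3.Connection) -> int:
    """Hypothetical startup sweep: flag scans that never reached a terminal state."""
    cur = conn.execute(
        "UPDATE scans SET status = 'failed' "
        "WHERE status IN ('pending', 'running')"
    )
    conn.commit()
    return cur.rowcount  # number of scans flagged
```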
Retention and Cleanup
WSHawk does not implement automatic data retention policies. All scans are preserved indefinitely unless manually deleted. The web dashboard provides a cleanup interface:
- Delete Scan: Removes the row from the `scans` table and deletes the associated report file
- Cascade Behavior: Manually deletes related `vulnerabilities` and `traffic_logs` rows
- Orphan Detection: Dashboard UI flags reports on disk without database entries
Manual Cleanup Commands:
```bash
# Delete database (preserves reports)
rm ~/.wshawk/scans.db*

# Delete all artifacts
rm -rf ~/.wshawk/

# Delete reports older than 30 days
find ~/.wshawk/reports/ -name "*.html" -mtime +30 -delete
```
Sources: .gitignore:90-96, docs/V3_COMPLETE_GUIDE.md:293-297
Performance Characteristics
The persistence layer is optimized for write-heavy workloads with occasional read queries:
| Operation | Performance | Notes |
|-----------|-------------|-------|
| Write vulnerability | ~0.1ms (buffered) | In-memory buffer, periodic flush |
| Write traffic log | ~0.05ms (buffered) | Higher throughput than vulnerabilities |
| Query scan list | ~10ms | Indexed by start_time DESC |
| Query vulnerability details | ~50ms | Full-text search on payloads possible |
| Generate HTML report | ~500ms | Template rendering + file I/O |
| WAL checkpoint | ~200ms | Non-blocking, passive mode |
Scaling Limits:
- Max database size: Limited by SQLite (281 TB theoretical, 100 GB practical)
- Max concurrent readers: Unlimited (WAL mode)
- Max concurrent writers: 1 (SQLite limitation)
- Recommended cleanup: Every 10,000 scans or 1 GB database size
Sources: docs/V3_COMPLETE_GUIDE.md:123-125
Configuration Options
The persistence layer can be configured through environment variables and the wshawk.yaml configuration file:
| Setting | Environment Variable | Default | Purpose |
|---------|---------------------|---------|---------|
| Database Path | WSHAWK_DB_PATH | ~/.wshawk/scans.db | Override default database location |
| WAL Mode | WSHAWK_WAL_MODE | 1 (enabled) | Disable for network file systems |
| Checkpoint Interval | WSHAWK_WAL_CHECKPOINT | 1000 | Records between checkpoints |
| Reports Directory | WSHAWK_REPORTS_DIR | ~/.wshawk/reports/ | Custom report output location |
| Log Retention | WSHAWK_LOG_RETENTION_DAYS | 0 (infinite) | Auto-delete old traffic logs |
Example Configuration:
```yaml
# wshawk.yaml
persistence:
  database_path: "/data/wshawk/scans.db"
  wal_mode: true
  checkpoint_interval: 5000
  reports_directory: "/data/wshawk/reports/"
  log_retention_days: 90
```
Sources: README.md:137-150
Summary
The Infrastructure Persistence Layer provides enterprise-grade data durability through SQLite WAL mode, structured file system organization, and zero-loss crash recovery mechanisms. The combination of database storage for queryable metadata and file system artifacts for binary evidence enables comprehensive historical analysis, real-time dashboard updates, and seamless integration with external security platforms.
Key Design Principles:
- Zero-Loss Guarantee: WAL mode ensures committed transactions survive all crash scenarios
- Concurrent Access: Readers never block writers, enabling real-time dashboard during scans
- Structured Organization: Clear separation between database metadata and file system artifacts
- Integration-Ready: Direct SQL access for custom integrations and external platforms
Sources: docs/V3_COMPLETE_GUIDE.md:96-98, docs/V3_COMPLETE_GUIDE.md:121-131, RELEASE_SUMMARY.md:15-19