ResilientSession and Resilience Control
The following files were used as context for generating this wiki page:
- .github/workflows/ghcr-publish.yml
- README.md
- RELEASE_3.0.0.md
- RELEASE_SUMMARY.md
- docs/V3_COMPLETE_GUIDE.md
- requirements.txt
- wshawk/advanced_cli.py
- wshawk/scanner_v2.py
Purpose and Scope
This document describes WSHawk's Resilience Control Plane, the production-grade fault tolerance layer that ensures stable scanning operations against unstable targets, rate-limited APIs, and unreliable network conditions. The resilience architecture wraps all network I/O operations with exponential backoff, circuit breakers, and adaptive rate limiting to prevent cascading failures and enable graceful degradation.
Scope covered:
- ResilientSession wrapper architecture for HTTP, WebSocket, and external API calls
- Exponential backoff with jitter for automatic retry logic
- Circuit breaker state machine for failure isolation
- Error classification (transient vs. permanent)
- TokenBucketRateLimiter for adaptive rate control
- Integration patterns within the scanning engine
Related pages:
- For scanner engine integration, see WSHawkV2 Scanner Engine
- For rate limiting and session state, see Rate Limiting and Session State
- For external platform integrations, see Configuration and Integration
Sources: README.md:1-310, RELEASE_SUMMARY.md:1-60, docs/V3_COMPLETE_GUIDE.md:1-443
Architectural Overview
The Resilience Control Plane sits between the scanning logic and all external network operations. It provides a unified fault-tolerance layer that prevents target instability from disrupting scan execution.
Resilience Control Plane Architecture
```mermaid
graph TB
subgraph "Scanning Engine"
Scanner["WSHawkV2 Scanner<br/>scanner_v2.py"]
Integrations["External Integrations<br/>Jira / DefectDojo / Webhooks"]
OAST["OAST Provider<br/>interact.sh"]
Browser["HeadlessBrowserXSSVerifier<br/>Playwright"]
end
subgraph "Resilience Control Plane"
ResilientSession["ResilientSession<br/>Core Wrapper"]
ErrorClassifier["Error Classifier<br/>Transient vs Permanent"]
RetryLogic["Retry Logic<br/>Max Attempts Counter"]
Backoff["Exponential Backoff<br/>with Jitter"]
CircuitBreaker["Circuit Breaker<br/>State Machine"]
RateLimiter["TokenBucketRateLimiter<br/>rate_limiter.py"]
end
subgraph "Network Layer"
WebSocket["websockets library"]
HTTP["aiohttp client"]
ExternalAPIs["External API endpoints"]
end
Scanner --> ResilientSession
Integrations --> ResilientSession
OAST --> ResilientSession
Browser --> ResilientSession
ResilientSession --> ErrorClassifier
ErrorClassifier -->|Transient| RetryLogic
ErrorClassifier -->|Permanent| Fail["Fast Fail"]
RetryLogic -->|attempt < max| Backoff
RetryLogic -->|attempt >= max| CircuitBreaker
Backoff --> ResilientSession
CircuitBreaker -->|OPEN| Block["Block Requests<br/>60s cooldown"]
CircuitBreaker -->|HALF_OPEN| CanaryTest["Send Test Request"]
CircuitBreaker -->|CLOSED| ResilientSession
ResilientSession --> RateLimiter
RateLimiter --> WebSocket
RateLimiter --> HTTP
RateLimiter --> ExternalAPIs
```
Sources: RELEASE_SUMMARY.md:9-13, docs/V3_COMPLETE_GUIDE.md:126-131
Design Principles
The resilience layer is built on three core principles from the v3.0.0 architectural redesign:
| Principle | Description | Implementation |
|-----------|-------------|----------------|
| Self-Healing | Automatically recover from transient failures without user intervention | Exponential backoff with jitter prevents thundering herd |
| Fail-Fast on Permanent Errors | Don't retry operations that will never succeed | Error classification distinguishes 401/403 from 429/500 |
| Circuit Breaking | Isolate failing subsystems to prevent cascading failures | State machine blocks requests when threshold exceeded |
Sources: RELEASE_SUMMARY.md:9-13, RELEASE_3.0.0.md:9-13
ResilientSession Core
ResilientSession is the central wrapper class that encapsulates all resilience logic. It wraps every network operation performed by WSHawk, including:
- WebSocket connections to scan targets
- HTTP requests to Jira, DefectDojo, and webhook endpoints
- OAST provider registration and polling
- Playwright browser automation commands
Request Flow Through ResilientSession
```mermaid
sequenceDiagram
participant Caller as "Scanner / Integration"
participant RS as "ResilientSession"
participant EC as "Error Classifier"
participant CB as "Circuit Breaker"
participant Backoff as "Backoff Calculator"
participant Network as "Network I/O"
Caller->>RS: execute_resiliently(payload)
RS->>CB: check_state()
alt Circuit OPEN
CB-->>RS: blocked (cooling down)
RS->>RS: await asyncio.sleep(60)
RS->>CB: check_state()
end
CB-->>RS: CLOSED (proceed)
RS->>Network: send(payload)
alt Success
Network-->>RS: response
RS->>CB: record_success()
RS-->>Caller: return response
else Error
Network-->>RS: exception
RS->>EC: classify_error(exception)
alt Transient Error
EC-->>RS: TRANSIENT
RS->>RS: increment attempt counter
alt attempt < max_retries
RS->>Backoff: calculate_delay(attempt)
Backoff-->>RS: delay + jitter
RS->>RS: await asyncio.sleep(delay)
RS->>Network: retry send(payload)
else attempt >= max_retries
RS->>CB: record_failure()
CB->>CB: check threshold
CB->>CB: transition to OPEN
RS-->>Caller: raise exception
end
else Permanent Error
EC-->>RS: PERMANENT
RS->>CB: record_failure()
RS-->>Caller: raise exception (no retry)
end
end
```
Sources: docs/V3_COMPLETE_GUIDE.md:136-160
Conceptual Implementation Pattern
The following pattern demonstrates the resilience logic as described in the v3.0.0 architecture documentation:
```python
# From docs/V3_COMPLETE_GUIDE.md:139-160
async def execute_resiliently(self, payload):
    attempt = 0
    while attempt < self.max_retries:
        if self.circuit_breaker.is_open():
            Logger.warning("Circuit Breaker OPEN - Cooling down...")
            await asyncio.sleep(60)
            continue
        try:
            return await self.raw_send(payload)
        except Exception as e:
            if self.is_transient(e):
                attempt += 1
                delay = self.calculate_backoff(attempt)
                Logger.info(f"Retrying in {delay}s (Attempt {attempt})...")
                await asyncio.sleep(delay)
            else:
                self.circuit_breaker.record_failure()
                raise e
    # Retries exhausted: record the failure so the breaker can trip, then surface the error
    self.circuit_breaker.record_failure()
    raise RuntimeError("Max retries exceeded")
```
This pattern is applied to all network operations throughout WSHawk, ensuring consistent fault tolerance.
Sources: docs/V3_COMPLETE_GUIDE.md:136-160
Exponential Backoff Strategy
Exponential backoff prevents overwhelming a recovering server by increasing the delay between retry attempts exponentially.
Backoff Calculation Formula
The delay calculation implements exponential backoff with randomized jitter:
Wait = Min(Cap, Base * 2^Attempt) + Random_Jitter
Where:
- Base: Initial delay (typically 1 second)
- Attempt: Current retry attempt number (0-indexed)
- Cap: Maximum delay ceiling (typically 30-60 seconds)
- Random_Jitter: Random value between 0-1 seconds to prevent thundering herd
Backoff Progression Example
| Attempt | Base Calculation | Actual Delay (with jitter) | Cumulative Time |
|---------|-----------------|---------------------------|-----------------|
| 0 | 1 * 2^0 = 1s | 1.0s - 2.0s | 1-2s |
| 1 | 1 * 2^1 = 2s | 2.0s - 3.0s | 3-5s |
| 2 | 1 * 2^2 = 4s | 4.0s - 5.0s | 7-10s |
| 3 | 1 * 2^3 = 8s | 8.0s - 9.0s | 15-19s |
| 4 | 1 * 2^4 = 16s | 16.0s - 17.0s | 31-36s |
| 5 | 1 * 2^5 = 32s | 30.0s (capped) + jitter | 61-67s |
Jitter Purpose
Randomized jitter prevents the "thundering herd" problem where multiple failed requests all retry simultaneously, causing another spike that overwhelms the recovering server. By adding randomness, retries are naturally staggered.
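The formula above can be sketched directly in a few lines. The function name and defaults here are illustrative, not WSHawk's actual API:

```python
import random

def calculate_backoff(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with jitter: min(cap, base * 2^attempt) + U(0, 1)."""
    delay = min(cap, base * (2 ** attempt))
    # Random jitter staggers simultaneous retries (thundering-herd prevention)
    return delay + random.uniform(0.0, 1.0)
```

With these defaults, attempt 5 computes 1 * 2^5 = 32s, which the 30s cap truncates before jitter is added, matching the last row of the table above.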
Sources: docs/V3_COMPLETE_GUIDE.md:162-164, RELEASE_SUMMARY.md:12
Circuit Breaker Pattern
The circuit breaker prevents cascading failures by temporarily blocking requests to a failing service, allowing it time to recover without being bombarded with additional requests.
Circuit Breaker State Machine
```mermaid
stateDiagram-v2
[*] --> CLOSED: "Initial State"
CLOSED --> CLOSED: "Success<br/>(record_success)"
CLOSED --> CLOSED: "Failure < Threshold<br/>(record_failure)"
CLOSED --> OPEN: "Failures >= Threshold<br/>(10 failures in 60s)"
OPEN --> OPEN: "All Requests Blocked"
OPEN --> HALF_OPEN: "After 60s Cooldown<br/>(recovery_timeout)"
HALF_OPEN --> CLOSED: "Canary Request Success<br/>(service recovered)"
HALF_OPEN --> OPEN: "Canary Request Failed<br/>(still unhealthy)"
note right of CLOSED
Normal operation
All requests allowed
Failure counter active
end note
note right of OPEN
Fast-fail mode
Requests blocked
60s cooldown timer
end note
note right of HALF_OPEN
Testing recovery
Single probe request
Transition pending
end note
```
State Descriptions
| State | Behavior | Transition Conditions |
|-------|----------|----------------------|
| CLOSED | Normal operation. All requests proceed. Failures are counted. | → OPEN when failure threshold exceeded (e.g., 10 failures in 60s) |
| OPEN | Fast-fail mode. All requests blocked immediately without attempting network I/O. | → HALF_OPEN after 60-second recovery timeout |
| HALF_OPEN | Testing mode. One "canary" request allowed to test if service recovered. | → CLOSED on success; → OPEN on failure |
Failure Threshold Configuration
The circuit breaker tracks failures within a sliding time window:
- Threshold: 10 failures
- Time Window: 60 seconds
- Recovery Timeout: 60 seconds (OPEN state duration)
This configuration prevents a single burst of errors from permanently disabling scanning while still protecting against sustained failures.
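Under these assumptions (a sliding failure window and a monotonic clock), the state machine can be sketched as follows. Class and method names mirror the diagram above, not WSHawk's internal code:

```python
import time

class CircuitBreaker:
    """Illustrative sliding-window circuit breaker (a sketch, not WSHawk's implementation)."""
    CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"

    def __init__(self, threshold: int = 10, window: float = 60.0,
                 recovery_timeout: float = 60.0):
        self.threshold = threshold
        self.window = window
        self.recovery_timeout = recovery_timeout
        self.failures = []           # timestamps of recent failures
        self.state = self.CLOSED
        self.opened_at = 0.0

    def record_failure(self):
        now = time.monotonic()
        # Keep only failures inside the sliding window
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if self.state == self.HALF_OPEN or len(self.failures) >= self.threshold:
            self.state = self.OPEN   # canary failed, or threshold hit
            self.opened_at = now

    def record_success(self):
        self.failures.clear()
        self.state = self.CLOSED     # canary succeeded: resume normal operation

    def is_open(self) -> bool:
        if self.state == self.OPEN and time.monotonic() - self.opened_at >= self.recovery_timeout:
            self.state = self.HALF_OPEN  # cooldown elapsed: allow one probe request
        return self.state == self.OPEN
```

A caller checks `is_open()` before each request; the HALF_OPEN probe is simply the next request allowed through after the cooldown.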
Sources: docs/V3_COMPLETE_GUIDE.md:166-169, RELEASE_SUMMARY.md:13, RELEASE_3.0.0.md:13
Error Classification
The error classifier analyzes exceptions to determine whether they represent transient (retryable) or permanent (non-retryable) conditions.
Error Classification Decision Tree
```mermaid
graph TB
Start["Exception Raised"] --> Type{"Exception Type"}
Type -->|"Network Errors"| Network["asyncio.TimeoutError<br/>ConnectionResetError<br/>ConnectionRefusedError"]
Type -->|"HTTP Status"| HTTP{"HTTP Status Code"}
Type -->|"WebSocket Errors"| WS["websockets.ConnectionClosed<br/>InvalidStatusCode"]
Type -->|"Other"| Other["Application Errors"]
Network --> Transient1["TRANSIENT<br/>Retry with backoff"]
HTTP -->|"429 Too Many Requests"| Transient2["TRANSIENT<br/>Rate limited"]
HTTP -->|"500, 502, 503, 504"| Transient3["TRANSIENT<br/>Server error"]
HTTP -->|"401, 403"| Permanent1["PERMANENT<br/>Auth error"]
HTTP -->|"404, 400"| Permanent2["PERMANENT<br/>Client error"]
WS -->|"Code 1000-1001"| Permanent3["PERMANENT<br/>Normal closure"]
WS -->|"Code 1008-1011"| Transient4["TRANSIENT<br/>Abnormal closure"]
Other --> Permanent4["PERMANENT<br/>Unknown error"]
style Transient1 fill:#e8f5e9
style Transient2 fill:#e8f5e9
style Transient3 fill:#e8f5e9
style Transient4 fill:#e8f5e9
style Permanent1 fill:#ffebee
style Permanent2 fill:#ffebee
style Permanent3 fill:#ffebee
style Permanent4 fill:#ffebee
```
Classification Categories
Transient Errors (Retry with backoff):
- Network timeouts (asyncio.TimeoutError)
- Connection resets (ConnectionResetError)
- HTTP 429 (Too Many Requests) - rate limiting
- HTTP 500-504 (Server errors) - temporary server issues
- WebSocket abnormal closures (codes 1008-1011)
Permanent Errors (Fail fast, no retry):
- HTTP 401/403 (Authentication/Authorization) - credentials invalid
- HTTP 404/400 (Not Found/Bad Request) - endpoint doesn't exist or malformed request
- WebSocket normal closures (codes 1000-1001) - intentional disconnect
- Application-level exceptions - logic errors in scanner code
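A classifier following these categories might look like the sketch below. The `.status` and `.code` attribute lookups are assumptions modeled on aiohttp's `ClientResponseError` and the websockets library's close exceptions; WSHawk's actual classifier is not shown in the cited sources:

```python
import asyncio

TRANSIENT_HTTP = frozenset({429, 500, 502, 503, 504})
TRANSIENT_WS_CODES = frozenset({1008, 1009, 1010, 1011})

def is_transient(exc: Exception) -> bool:
    """Return True if the error is worth retrying with backoff (illustrative)."""
    # Network-level failures: always retryable
    if isinstance(exc, (asyncio.TimeoutError, ConnectionResetError,
                        ConnectionRefusedError)):
        return True
    # HTTP errors exposing a status code (e.g. aiohttp.ClientResponseError.status)
    status = getattr(exc, "status", None)
    if status is not None:
        return status in TRANSIENT_HTTP
    # WebSocket close codes (e.g. websockets.ConnectionClosed)
    code = getattr(exc, "code", None)
    if code is not None:
        return code in TRANSIENT_WS_CODES
    # Unknown application-level errors: fail fast
    return False
```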
Sources: docs/V3_COMPLETE_GUIDE.md:136-160
TokenBucketRateLimiter
The TokenBucketRateLimiter implements adaptive rate control using the token bucket algorithm. It ensures WSHawk sends payloads at a controlled rate to avoid overwhelming the target server or triggering aggressive rate limiting.
Token Bucket Algorithm
```mermaid
graph LR
subgraph "Token Bucket State"
Bucket["Token Bucket<br/>Current: X tokens<br/>Capacity: 20 tokens"]
Refill["Refill Rate<br/>10 tokens/second"]
end
subgraph "Request Flow"
Request["await rate_limiter.acquire()"]
Check{"Tokens Available?"}
Consume["Consume 1 token"]
Wait["await asyncio.sleep()"]
Proceed["Request Proceeds"]
end
Refill -.->|"Adds tokens<br/>over time"| Bucket
Request --> Check
Check -->|Yes| Consume
Check -->|No| Wait
Consume --> Bucket
Consume --> Proceed
Wait --> Check
```
TokenBucketRateLimiter Initialization
The rate limiter is configured when initializing WSHawkV2:
```python
# From scanner_v2.py:62-66
self.rate_limiter = TokenBucketRateLimiter(
    tokens_per_second=rate_limit,   # Default: 10 req/s
    bucket_size=rate_limit * 2,     # Default: 20 tokens (2x burst)
    enable_adaptive=True            # Server health monitoring
)
```
Configuration Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| tokens_per_second | 10 | Base rate at which tokens are added to bucket |
| bucket_size | 20 | Maximum tokens (allows short bursts) |
| enable_adaptive | True | Dynamically adjusts rate based on server health |
Usage in Scanner
The rate limiter is invoked before sending payloads:
```python
# From scanner_v2.py:560
await self.rate_limiter.acquire()
```
This call blocks until a token is available, ensuring the configured rate limit is never exceeded.
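The acquire semantics can be illustrated with a minimal token bucket. This is a sketch of the algorithm only; it omits the `enable_adaptive` behavior and is not WSHawk's actual implementation:

```python
import asyncio
import time

class TokenBucketRateLimiter:
    """Minimal token bucket mirroring the documented interface (illustrative)."""

    def __init__(self, tokens_per_second: float, bucket_size: float):
        self.rate = tokens_per_second
        self.capacity = bucket_size
        self.tokens = bucket_size        # start full: permits an initial burst
        self.updated = time.monotonic()

    async def acquire(self):
        while True:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at bucket capacity
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Sleep just long enough for one token to accrue
            await asyncio.sleep((1 - self.tokens) / self.rate)
```

Because the bucket starts full, the first `bucket_size` calls return immediately (the "2x burst"); after that, callers are paced to `tokens_per_second`.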
Adaptive Rate Control
When enable_adaptive=True, the rate limiter monitors server response times and error rates:
- Fast Responses: Rate can increase up to 150% of base rate
- Slow Responses (>2s): Rate decreases to 50% of base rate
- 429 Errors: Rate immediately drops to 25% of base rate for 60s
This adaptive behavior prevents detection by aggressive WAFs while maintaining scanning efficiency against healthy targets.
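These three rules reduce to a simple policy function. The function below is hypothetical; the real limiter's internals are not shown in the cited sources:

```python
def adjust_rate(base_rate: float, avg_response_time: float, saw_429: bool) -> float:
    """Map observed server health to a target send rate (sketch of the rules above)."""
    if saw_429:
        return base_rate * 0.25   # rate-limited: drop to 25% for the penalty window
    if avg_response_time > 2.0:
        return base_rate * 0.5    # slow responses: halve the rate
    return base_rate * 1.5        # healthy target: ramp up to the 150% ceiling
```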
Sources: wshawk/scanner_v2.py:62-66, wshawk/scanner_v2.py:560, README.md:48
Integration with WSHawkV2 Scanner
The resilience layer is deeply integrated into the WSHawkV2 scanner engine. Every network operation passes through the resilience control plane.
Protected Operations
```mermaid
graph TB
subgraph "WSHawkV2 Scanner"
Init["WSHawkV2.__init__<br/>scanner_v2.py:40-100"]
Connect["connect()<br/>WebSocket handshake"]
Learning["learning_phase()<br/>Message collection"]
TestSQL["test_sql_injection_v2()"]
TestXSS["test_xss_v2()"]
TestCMD["test_command_injection_v2()"]
SessionTest["SessionHijackingTester"]
end
subgraph "Resilience Layer"
RateLimit["TokenBucketRateLimiter<br/>Line 62-66"]
Resilient["ResilientSession<br/>(Wraps all I/O)"]
end
subgraph "External Services"
Jira["Jira Integration<br/>Ticket Creation"]
DD["DefectDojo<br/>Findings Push"]
Webhook["Webhook Notifier<br/>Slack/Discord/Teams"]
OAST["OAST Provider<br/>interact.sh"]
Browser["HeadlessBrowserXSSVerifier<br/>Playwright"]
end
Init --> RateLimit
Connect --> Resilient
Learning --> Resilient
TestSQL --> RateLimit
TestSQL --> Resilient
TestXSS --> RateLimit
TestXSS --> Resilient
TestCMD --> RateLimit
TestCMD --> Resilient
SessionTest --> Resilient
Resilient --> Jira
Resilient --> DD
Resilient --> Webhook
Resilient --> OAST
Resilient --> Browser
```
Scanner Initialization with Rate Limiter
```python
# From scanner_v2.py:40-66
class WSHawkV2:
    def __init__(self, url: str, headers: Optional[Dict] = None,
                 auth_sequence: Optional[str] = None,
                 max_rps: int = 10,
                 config: Optional['WSHawkConfig'] = None):
        # ...
        rate_limit = self.config.get('scanner.rate_limit', max_rps)
        # Initialize rate limiter with resilience
        self.rate_limiter = TokenBucketRateLimiter(
            tokens_per_second=rate_limit,
            bucket_size=rate_limit * 2,
            enable_adaptive=True
        )
```
Rate-Limited Payload Injection
Every payload injection in vulnerability testing methods uses the rate limiter:
```python
# From scanner_v2.py:176-255 (SQL injection example)
for payload in payloads:
    try:
        # Rate limiting before each payload
        await self.rate_limiter.acquire()
        if self.learning_complete and self.message_analyzer.detected_format == MessageFormat.JSON:
            injected_messages = self.message_analyzer.inject_payload_into_message(
                base_message, payload
            )
        else:
            injected_messages = [payload]
        for msg in injected_messages:
            await ws.send(msg)  # Protected by ResilientSession
            self.messages_sent += 1
            try:
                response = await asyncio.wait_for(ws.recv(), timeout=2.0)
                # Verification logic...
            except asyncio.TimeoutError:
                pass
            await asyncio.sleep(0.05)  # Additional throttling
    except Exception:
        # (Per-payload error handling elided in this excerpt)
        continue
```
This pattern appears in all vulnerability testing methods (test_sql_injection_v2, test_xss_v2, test_command_injection_v2, etc.), ensuring consistent rate control across all attack vectors.
Sources: wshawk/scanner_v2.py:40-100, wshawk/scanner_v2.py:176-255, wshawk/scanner_v2.py:258-341
Configuration and Tuning
The resilience layer can be configured through the hierarchical configuration system (see Configuration System) or via CLI flags.
Configuration via wshawk.yaml
```yaml
scanner:
  rate_limit: 10          # Requests per second
  max_retries: 3          # Maximum retry attempts
  backoff_base: 1         # Base delay in seconds
  backoff_cap: 30         # Maximum delay cap
circuit_breaker:
  threshold: 10           # Failures to trigger OPEN
  window: 60              # Time window in seconds
  recovery_timeout: 60    # OPEN state duration
```
Configuration via CLI
```bash
# From advanced_cli.py:43-44
wshawk-advanced ws://target.com --rate 5   # 5 requests/second (stealthy)
wshawk-advanced ws://target.com --rate 50  # 50 requests/second (aggressive)
```
The --rate flag directly sets the TokenBucketRateLimiter's tokens_per_second parameter.
Tuning Recommendations
| Scenario | Rate Limit | Adaptive | Notes |
|----------|-----------|----------|-------|
| Production Targets | 5-10 req/s | Enabled | Avoid triggering rate limiting |
| Internal Testing | 20-50 req/s | Disabled | Faster scans on controlled environments |
| WAF-Protected | 2-5 req/s | Enabled | Minimize detection probability |
| CI/CD Pipelines | 10-20 req/s | Enabled | Balance speed and reliability |
Network Partitioning Handling
If WSHawk detects complete network loss (e.g., Wi-Fi disconnection), it enters a "Safe-Pause" state:
- All active timers are paused
- Circuit breaker states are preserved
- When network returns, scanning resumes from the exact payload that was interrupted
- No scan progress is lost
This ensures that temporary network issues don't corrupt scan state or require restart from scratch.
Sources: wshawk/advanced_cli.py:43-44, docs/V3_COMPLETE_GUIDE.md:171-172
Resilience in Action: Example Scenarios
Scenario 1: Rate-Limited API (HTTP 429)
1. Scanner sends payload → 429 Too Many Requests
2. Error Classifier: TRANSIENT (rate limit)
3. Exponential Backoff: Wait 2s (attempt 1)
4. Retry → 429 Again
5. Exponential Backoff: Wait 4s (attempt 2)
6. Retry → 429 Again
7. Exponential Backoff: Wait 8s (attempt 3)
8. Retry → Success (200 OK)
9. Circuit Breaker: Record success, remain CLOSED
Scenario 2: Cascading Failures (Service Down)
1. Scanner sends 10 payloads → All fail with ConnectionRefused
2. Circuit Breaker: Threshold exceeded (10 failures)
3. Circuit Breaker: Transition to OPEN
4. All subsequent requests blocked for 60s (fast-fail)
5. After 60s: Transition to HALF_OPEN
6. Send single canary request
- If success: → CLOSED (resume normal operation)
- If failure: → OPEN (wait another 60s)
Scenario 3: Intermittent Network
1. Scanner sends payload → TimeoutError
2. Error Classifier: TRANSIENT
3. Exponential Backoff: Wait 1s + jitter
4. Retry → Success
5. Circuit Breaker: Record success
6. Continue scanning (no service degradation)
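The transient path in Scenario 3 can be exercised end-to-end with a stub transport. All names here are illustrative, and the base delay is shortened from the production default of 1 second:

```python
import asyncio
import random

async def send_with_retries(send, max_retries: int = 3, base: float = 1.0):
    """Retry a coroutine on timeouts with exponential backoff (illustrative)."""
    for attempt in range(max_retries):
        try:
            return await send()
        except asyncio.TimeoutError:
            # Backoff formula from the earlier section: min(cap, base * 2^attempt) + jitter
            delay = min(30.0, base * 2 ** attempt) + random.random()
            await asyncio.sleep(delay)
    raise RuntimeError("Max retries exceeded")
```

A stub that times out once and then succeeds returns on the second attempt, with no error surfaced to the caller and no service degradation.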
Sources: docs/V3_COMPLETE_GUIDE.md:136-172
Benefits of Resilience Architecture
The resilience control plane provides critical operational benefits for production security scanning:
| Benefit | Description | Impact |
|---------|-------------|--------|
| Zero-Loss Scans | Network issues don't corrupt scan state | Complete vulnerability coverage |
| Target Protection | Rate limiting prevents overwhelming targets | Safe for production environments |
| Integration Stability | Circuit breakers isolate failing external services | Jira/DefectDojo failures don't block scanning |
| CI/CD Reliability | Retries handle ephemeral cloud infrastructure issues | Fewer false CI failures |
| WAF Evasion | Adaptive rate control avoids detection | Higher bypass success rate |
The resilience layer transforms WSHawk from a research tool into a production-grade enterprise scanner capable of operating reliably in hostile network conditions and against hardened targets.
Sources: RELEASE_SUMMARY.md:1-60, RELEASE_3.0.0.md:1-62, docs/V3_COMPLETE_GUIDE.md:95-98