ResilientSession and Resilience Control


Purpose and Scope

This document describes WSHawk's Resilience Control Plane, the production-grade fault tolerance layer that ensures stable scanning operations against unstable targets, rate-limited APIs, and unreliable network conditions. The resilience architecture wraps all network I/O operations with exponential backoff, circuit breakers, and adaptive rate limiting to prevent cascading failures and enable graceful degradation.

Scope covered:

  • ResilientSession wrapper architecture for HTTP, WebSocket, and external API calls
  • Exponential backoff with jitter for automatic retry logic
  • Circuit breaker state machine for failure isolation
  • Error classification (transient vs. permanent)
  • TokenBucketRateLimiter for adaptive rate control
  • Integration patterns within the scanning engine

Sources: README.md:1-310, RELEASE_SUMMARY.md:1-60, docs/V3_COMPLETE_GUIDE.md:1-443


Architectural Overview

The Resilience Control Plane sits between the scanning logic and all external network operations. It provides a unified fault-tolerance layer that prevents target instability from disrupting scan execution.

Resilience Control Plane Architecture

graph TB
    subgraph "Scanning Engine"
        Scanner["WSHawkV2 Scanner<br/>scanner_v2.py"]
        Integrations["External Integrations<br/>Jira / DefectDojo / Webhooks"]
        OAST["OAST Provider<br/>interact.sh"]
        Browser["HeadlessBrowserXSSVerifier<br/>Playwright"]
    end
    
    subgraph "Resilience Control Plane"
        ResilientSession["ResilientSession<br/>Core Wrapper"]
        ErrorClassifier["Error Classifier<br/>Transient vs Permanent"]
        RetryLogic["Retry Logic<br/>Max Attempts Counter"]
        Backoff["Exponential Backoff<br/>with Jitter"]
        CircuitBreaker["Circuit Breaker<br/>State Machine"]
        RateLimiter["TokenBucketRateLimiter<br/>rate_limiter.py"]
    end
    
    subgraph "Network Layer"
        WebSocket["websockets library"]
        HTTP["aiohttp client"]
        ExternalAPIs["External API endpoints"]
    end
    
    Scanner --> ResilientSession
    Integrations --> ResilientSession
    OAST --> ResilientSession
    Browser --> ResilientSession
    
    ResilientSession --> ErrorClassifier
    ErrorClassifier -->|Transient| RetryLogic
    ErrorClassifier -->|Permanent| Fail["Fast Fail"]
    
    RetryLogic -->|attempt < max| Backoff
    RetryLogic -->|attempt >= max| CircuitBreaker
    
    Backoff --> ResilientSession
    
    CircuitBreaker -->|OPEN| Block["Block Requests<br/>60s cooldown"]
    CircuitBreaker -->|HALF_OPEN| CanaryTest["Send Test Request"]
    CircuitBreaker -->|CLOSED| ResilientSession
    
    ResilientSession --> RateLimiter
    RateLimiter --> WebSocket
    RateLimiter --> HTTP
    RateLimiter --> ExternalAPIs

Sources: RELEASE_SUMMARY.md:9-13, docs/V3_COMPLETE_GUIDE.md:126-131


Design Principles

The resilience layer is built on three core principles from the v3.0.0 architectural redesign:

| Principle | Description | Implementation |
|-----------|-------------|----------------|
| Self-Healing | Automatically recover from transient failures without user intervention | Exponential backoff with jitter prevents thundering herd |
| Fail-Fast on Permanent Errors | Don't retry operations that will never succeed | Error classification distinguishes 401/403 from 429/500 |
| Circuit Breaking | Isolate failing subsystems to prevent cascading failures | State machine blocks requests when threshold exceeded |

Sources: RELEASE_SUMMARY.md:9-13, RELEASE_3.0.0.md:9-13


ResilientSession Core

ResilientSession is the central wrapper class that encapsulates all resilience logic. It wraps every network operation performed by WSHawk, including:

  • WebSocket connections to scan targets
  • HTTP requests to Jira, DefectDojo, and webhook endpoints
  • OAST provider registration and polling
  • Playwright browser automation commands

Request Flow Through ResilientSession

sequenceDiagram
    participant Caller as "Scanner / Integration"
    participant RS as "ResilientSession"
    participant EC as "Error Classifier"
    participant CB as "Circuit Breaker"
    participant Backoff as "Backoff Calculator"
    participant Network as "Network I/O"
    
    Caller->>RS: execute_resiliently(payload)
    RS->>CB: check_state()
    
    alt Circuit OPEN
        CB-->>RS: blocked (cooling down)
        RS->>RS: await asyncio.sleep(60)
        RS->>CB: check_state()
    end
    
    CB-->>RS: CLOSED (proceed)
    
    RS->>Network: send(payload)
    
    alt Success
        Network-->>RS: response
        RS->>CB: record_success()
        RS-->>Caller: return response
    else Error
        Network-->>RS: exception
        RS->>EC: classify_error(exception)
        
        alt Transient Error
            EC-->>RS: TRANSIENT
            RS->>RS: increment attempt counter
            
            alt attempt < max_retries
                RS->>Backoff: calculate_delay(attempt)
                Backoff-->>RS: delay + jitter
                RS->>RS: await asyncio.sleep(delay)
                RS->>Network: retry send(payload)
            else attempt >= max_retries
                RS->>CB: record_failure()
                CB->>CB: check threshold
                CB->>CB: transition to OPEN
                RS-->>Caller: raise exception
            end
        else Permanent Error
            EC-->>RS: PERMANENT
            RS->>CB: record_failure()
            RS-->>Caller: raise exception (no retry)
        end
    end

Sources: docs/V3_COMPLETE_GUIDE.md:136-160

Conceptual Implementation Pattern

The following pattern demonstrates the resilience logic as described in the v3.0.0 architecture documentation:

# From docs/V3_COMPLETE_GUIDE.md:139-160
async def execute_resiliently(self, payload):
    attempt = 0
    last_error = None
    while attempt < self.max_retries:
        if self.circuit_breaker.is_open():
            Logger.warning("Circuit Breaker OPEN - Cooling down...")
            await asyncio.sleep(60)
            continue

        try:
            return await self.raw_send(payload)
        except Exception as e:
            if self.is_transient(e):
                attempt += 1
                last_error = e
                delay = self.calculate_backoff(attempt)
                Logger.info(f"Retrying in {delay}s (Attempt {attempt})...")
                await asyncio.sleep(delay)
            else:
                # Permanent errors fail fast - no retry
                self.circuit_breaker.record_failure()
                raise

    # Retries exhausted: record the failure and propagate the last error
    self.circuit_breaker.record_failure()
    raise last_error

This pattern is applied to all network operations throughout WSHawk, ensuring consistent fault tolerance.

Sources: docs/V3_COMPLETE_GUIDE.md:136-160


Exponential Backoff Strategy

Exponential backoff prevents overwhelming a recovering server by increasing the delay between retry attempts exponentially.

Backoff Calculation Formula

The delay calculation implements exponential backoff with randomized jitter:

Wait = Min(Cap, Base * 2^Attempt) + Random_Jitter

Where:

  • Base: Initial delay (typically 1 second)
  • Attempt: Current retry attempt number (0-indexed)
  • Cap: Maximum delay ceiling (typically 30-60 seconds)
  • Random_Jitter: Random value between 0-1 seconds to prevent thundering herd
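
The formula above can be sketched as a small helper. This is illustrative only: the function name and default values are assumptions based on the parameters listed, not WSHawk's actual implementation.

```python
import random

def calculate_backoff(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with jitter: min(cap, base * 2^attempt) + U(0, 1)."""
    # Cap the exponential term first, then add jitter to stagger retries
    return min(cap, base * (2 ** attempt)) + random.uniform(0.0, 1.0)
```

With these defaults, attempt 0 yields a delay between 1s and 2s, and attempt 5 is capped at 30s plus jitter, matching the progression table below.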

Backoff Progression Example

| Attempt | Base Calculation | Actual Delay (with jitter) | Cumulative Time |
|---------|------------------|----------------------------|-----------------|
| 0 | 1 * 2^0 = 1s | 1.0s - 2.0s | 1-2s |
| 1 | 1 * 2^1 = 2s | 2.0s - 3.0s | 3-5s |
| 2 | 1 * 2^2 = 4s | 4.0s - 5.0s | 7-10s |
| 3 | 1 * 2^3 = 8s | 8.0s - 9.0s | 15-19s |
| 4 | 1 * 2^4 = 16s | 16.0s - 17.0s | 31-36s |
| 5 | 1 * 2^5 = 32s | 30.0s (capped) + jitter | 61-67s |

Jitter Purpose

Randomized jitter prevents the "thundering herd" problem where multiple failed requests all retry simultaneously, causing another spike that overwhelms the recovering server. By adding randomness, retries are naturally staggered.

Sources: docs/V3_COMPLETE_GUIDE.md:162-164, RELEASE_SUMMARY.md:12


Circuit Breaker Pattern

The circuit breaker prevents cascading failures by temporarily blocking requests to a failing service, allowing it time to recover without being bombarded with additional requests.

Circuit Breaker State Machine

stateDiagram-v2
    [*] --> CLOSED: "Initial State"
    
    CLOSED --> CLOSED: "Success<br/>(record_success)"
    CLOSED --> CLOSED: "Failure < Threshold<br/>(record_failure)"
    CLOSED --> OPEN: "Failures >= Threshold<br/>(10 failures in 60s)"
    
    OPEN --> OPEN: "All Requests Blocked"
    OPEN --> HALF_OPEN: "After 60s Cooldown<br/>(recovery_timeout)"
    
    HALF_OPEN --> CLOSED: "Canary Request Success<br/>(service recovered)"
    HALF_OPEN --> OPEN: "Canary Request Failed<br/>(still unhealthy)"
    
    note right of CLOSED
        Normal operation
        All requests allowed
        Failure counter active
    end note
    
    note right of OPEN
        Fast-fail mode
        Requests blocked
        60s cooldown timer
    end note
    
    note right of HALF_OPEN
        Testing recovery
        Single probe request
        Transition pending
    end note

State Descriptions

| State | Behavior | Transition Conditions |
|-------|----------|-----------------------|
| CLOSED | Normal operation. All requests proceed. Failures are counted. | → OPEN when failure threshold exceeded (e.g., 10 failures in 60s) |
| OPEN | Fast-fail mode. All requests blocked immediately without attempting network I/O. | → HALF_OPEN after 60-second recovery timeout |
| HALF_OPEN | Testing mode. One "canary" request allowed to test if service recovered. | → CLOSED on success; → OPEN on failure |

Failure Threshold Configuration

The circuit breaker tracks failures within a sliding time window:

  • Threshold: 10 failures
  • Time Window: 60 seconds
  • Recovery Timeout: 60 seconds (OPEN state duration)

This configuration prevents a single burst of errors from permanently disabling scanning while still protecting against sustained failures.
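
The state machine and thresholds above can be sketched as follows. This is a minimal illustration of the pattern, not WSHawk's actual class; the names and structure are assumptions.

```python
import time

class CircuitBreaker:
    """Minimal three-state circuit breaker (CLOSED / OPEN / HALF_OPEN)."""

    def __init__(self, threshold: int = 10, window: float = 60.0,
                 recovery_timeout: float = 60.0):
        self.threshold = threshold
        self.window = window
        self.recovery_timeout = recovery_timeout
        self.failures = []          # timestamps of recent failures
        self.state = "CLOSED"
        self.opened_at = 0.0

    def record_failure(self) -> None:
        now = time.monotonic()
        # Keep only failures inside the sliding window
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        # A failed canary (HALF_OPEN) or exceeding the threshold re-opens the circuit
        if self.state == "HALF_OPEN" or len(self.failures) >= self.threshold:
            self.state = "OPEN"
            self.opened_at = now

    def record_success(self) -> None:
        self.failures.clear()
        self.state = "CLOSED"

    def is_open(self) -> bool:
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"  # allow one canary request through
                return False
            return True
        return False
```

Note the sliding window: old failures age out, so a slow trickle of errors never trips the breaker, while a burst of 10 within 60 seconds does.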

Sources: docs/V3_COMPLETE_GUIDE.md:166-169, RELEASE_SUMMARY.md:13, RELEASE_3.0.0.md:13


Error Classification

The error classifier analyzes exceptions to determine whether they represent transient (retryable) or permanent (non-retryable) conditions.

Error Classification Decision Tree

graph TB
    Start["Exception Raised"] --> Type{"Exception Type"}
    
    Type -->|"Network Errors"| Network["asyncio.TimeoutError<br/>ConnectionResetError<br/>ConnectionRefusedError"]
    Type -->|"HTTP Status"| HTTP{"HTTP Status Code"}
    Type -->|"WebSocket Errors"| WS["websockets.ConnectionClosed<br/>InvalidStatusCode"]
    Type -->|"Other"| Other["Application Errors"]
    
    Network --> Transient1["TRANSIENT<br/>Retry with backoff"]
    
    HTTP -->|"429 Too Many Requests"| Transient2["TRANSIENT<br/>Rate limited"]
    HTTP -->|"500, 502, 503, 504"| Transient3["TRANSIENT<br/>Server error"]
    HTTP -->|"401, 403"| Permanent1["PERMANENT<br/>Auth error"]
    HTTP -->|"404, 400"| Permanent2["PERMANENT<br/>Client error"]
    
    WS -->|"Code 1000-1001"| Permanent3["PERMANENT<br/>Normal closure"]
    WS -->|"Code 1008-1011"| Transient4["TRANSIENT<br/>Abnormal closure"]
    
    Other --> Permanent4["PERMANENT<br/>Unknown error"]
    
    style Transient1 fill:#e8f5e9
    style Transient2 fill:#e8f5e9
    style Transient3 fill:#e8f5e9
    style Transient4 fill:#e8f5e9
    style Permanent1 fill:#ffebee
    style Permanent2 fill:#ffebee
    style Permanent3 fill:#ffebee
    style Permanent4 fill:#ffebee

Classification Categories

Transient Errors (Retry with backoff):

  • Network timeouts (asyncio.TimeoutError)
  • Connection resets (ConnectionResetError)
  • HTTP 429 (Too Many Requests) - rate limiting
  • HTTP 500-504 (Server errors) - temporary server issues
  • WebSocket abnormal closures (codes 1008-1011)

Permanent Errors (Fail fast, no retry):

  • HTTP 401/403 (Authentication/Authorization) - credentials invalid
  • HTTP 404/400 (Not Found/Bad Request) - endpoint doesn't exist or malformed request
  • WebSocket normal closures (codes 1000-1001) - intentional disconnect
  • Application-level exceptions - logic errors in scanner code
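
A classifier implementing these categories might look like the sketch below. The `status` and `code` attributes are hypothetical stand-ins for however the real exceptions expose HTTP status and WebSocket close codes; only the category boundaries come from the lists above.

```python
import asyncio

TRANSIENT_HTTP = {429, 500, 502, 503, 504}
TRANSIENT_WS_CODES = {1008, 1009, 1010, 1011}

def is_transient(exc: Exception) -> bool:
    """Return True if the error is worth retrying with backoff."""
    # Network-level failures are always retryable
    if isinstance(exc, (asyncio.TimeoutError, ConnectionResetError,
                        ConnectionRefusedError)):
        return True
    # Hypothetical attribute: an HTTP status code carried on the exception
    status = getattr(exc, "status", None)
    if status is not None:
        return status in TRANSIENT_HTTP
    # Hypothetical attribute: a WebSocket close code
    close_code = getattr(exc, "code", None)
    if close_code is not None:
        return close_code in TRANSIENT_WS_CODES
    return False  # unknown application errors fail fast
```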

Sources: docs/V3_COMPLETE_GUIDE.md:136-160


TokenBucketRateLimiter

The TokenBucketRateLimiter implements adaptive rate control using the token bucket algorithm. It ensures WSHawk sends payloads at a controlled rate to avoid overwhelming the target server or triggering aggressive rate limiting.

Token Bucket Algorithm

graph LR
    subgraph "Token Bucket State"
        Bucket["Token Bucket<br/>Current: X tokens<br/>Capacity: 20 tokens"]
        Refill["Refill Rate<br/>10 tokens/second"]
    end
    
    subgraph "Request Flow"
        Request["await rate_limiter.acquire()"]
        Check{"Tokens Available?"}
        Consume["Consume 1 token"]
        Wait["await asyncio.sleep()"]
        Proceed["Request Proceeds"]
    end
    
    Refill -.->|"Adds tokens<br/>over time"| Bucket
    Request --> Check
    Check -->|Yes| Consume
    Check -->|No| Wait
    Consume --> Bucket
    Consume --> Proceed
    Wait --> Check

TokenBucketRateLimiter Initialization

The rate limiter is configured when initializing WSHawkV2:

# From scanner_v2.py:62-66
self.rate_limiter = TokenBucketRateLimiter(
    tokens_per_second=rate_limit,      # Default: 10 req/s
    bucket_size=rate_limit * 2,        # Default: 20 tokens (2x burst)
    enable_adaptive=True               # Server health monitoring
)

Configuration Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| tokens_per_second | 10 | Base rate at which tokens are added to the bucket |
| bucket_size | 20 | Maximum tokens (allows short bursts) |
| enable_adaptive | True | Dynamically adjusts rate based on server health |

Usage in Scanner

The rate limiter is invoked before sending payloads:

# From scanner_v2.py:560
await self.rate_limiter.acquire()

This call blocks until a token is available, ensuring the configured rate limit is never exceeded.
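
The blocking behavior of `acquire()` can be sketched with a minimal token bucket. This is an illustration of the algorithm, not WSHawk's `TokenBucketRateLimiter`; the class name and internals are assumptions.

```python
import asyncio
import time

class TokenBucket:
    """Minimal async token bucket: refills continuously; acquire() blocks
    until a whole token is available."""

    def __init__(self, tokens_per_second: float, bucket_size: float):
        self.rate = tokens_per_second
        self.capacity = bucket_size
        self.tokens = bucket_size          # start full, allowing an initial burst
        self.last_refill = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        # Add tokens proportional to elapsed time, never exceeding capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now

    async def acquire(self) -> None:
        while True:
            self._refill()
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Sleep just long enough for one token to accrue
            await asyncio.sleep((1 - self.tokens) / self.rate)
```

Because the bucket starts full, the first `bucket_size` calls return immediately (the burst), after which callers are paced at `tokens_per_second`.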

Adaptive Rate Control

When enable_adaptive=True, the rate limiter monitors server response times and error rates:

  • Fast Responses: Rate can increase up to 150% of base rate
  • Slow Responses (>2s): Rate decreases to 50% of base rate
  • 429 Errors: Rate immediately drops to 25% of base rate for 60s

This adaptive behavior prevents detection by aggressive WAFs while maintaining scanning efficiency against healthy targets.
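
As a sketch, the adjustment policy in the bullets above reduces to a simple function. The 150%/50%/25% factors and the 2-second latency threshold come from the list; the function shape itself is an assumption for illustration.

```python
def adjusted_rate(base_rate: float, avg_latency: float, saw_429: bool) -> float:
    """Pick an effective request rate from recent server-health signals."""
    if saw_429:
        return base_rate * 0.25   # explicit rate limiting: back off hard
    if avg_latency > 2.0:
        return base_rate * 0.5    # slow responses: halve the rate
    return base_rate * 1.5        # healthy target: up to 150% of base
```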

Sources: wshawk/scanner_v2.py:62-66, wshawk/scanner_v2.py:560, README.md:48


Integration with WSHawkV2 Scanner

The resilience layer is deeply integrated into the WSHawkV2 scanner engine. Every network operation passes through the resilience control plane.

Protected Operations

graph TB
    subgraph "WSHawkV2 Scanner"
        Init["WSHawkV2.__init__<br/>scanner_v2.py:40-100"]
        Connect["connect()<br/>WebSocket handshake"]
        Learning["learning_phase()<br/>Message collection"]
        TestSQL["test_sql_injection_v2()"]
        TestXSS["test_xss_v2()"]
        TestCMD["test_command_injection_v2()"]
        SessionTest["SessionHijackingTester"]
    end
    
    subgraph "Resilience Layer"
        RateLimit["TokenBucketRateLimiter<br/>Line 62-66"]
        Resilient["ResilientSession<br/>(Wraps all I/O)"]
    end
    
    subgraph "External Services"
        Jira["Jira Integration<br/>Ticket Creation"]
        DD["DefectDojo<br/>Findings Push"]
        Webhook["Webhook Notifier<br/>Slack/Discord/Teams"]
        OAST["OAST Provider<br/>interact.sh"]
        Browser["HeadlessBrowserXSSVerifier<br/>Playwright"]
    end
    
    Init --> RateLimit
    
    Connect --> Resilient
    Learning --> Resilient
    TestSQL --> RateLimit
    TestSQL --> Resilient
    TestXSS --> RateLimit
    TestXSS --> Resilient
    TestCMD --> RateLimit
    TestCMD --> Resilient
    SessionTest --> Resilient
    
    Resilient --> Jira
    Resilient --> DD
    Resilient --> Webhook
    Resilient --> OAST
    Resilient --> Browser

Scanner Initialization with Rate Limiter

# From scanner_v2.py:40-66
class WSHawkV2:
    def __init__(self, url: str, headers: Optional[Dict] = None, 
                 auth_sequence: Optional[str] = None,
                 max_rps: int = 10,
                 config: Optional['WSHawkConfig'] = None):
        # ...
        rate_limit = self.config.get('scanner.rate_limit', max_rps)
        
        # Initialize rate limiter with resilience
        self.rate_limiter = TokenBucketRateLimiter(
            tokens_per_second=rate_limit,
            bucket_size=rate_limit * 2,
            enable_adaptive=True
        )

Rate-Limited Payload Injection

Every payload injection in vulnerability testing methods uses the rate limiter:

# From scanner_v2.py:176-255 (SQL injection example)
for payload in payloads:
    try:
        # Rate limiting before each payload
        await self.rate_limiter.acquire()

        if self.learning_complete and self.message_analyzer.detected_format == MessageFormat.JSON:
            injected_messages = self.message_analyzer.inject_payload_into_message(
                base_message, payload
            )
        else:
            injected_messages = [payload]

        for msg in injected_messages:
            await ws.send(msg)  # Protected by ResilientSession
            self.messages_sent += 1

            try:
                response = await asyncio.wait_for(ws.recv(), timeout=2.0)
                # Verification logic...
            except asyncio.TimeoutError:
                pass

            await asyncio.sleep(0.05)  # Additional throttling

This pattern appears in all vulnerability testing methods (test_sql_injection_v2, test_xss_v2, test_command_injection_v2, etc.), ensuring consistent rate control across all attack vectors.

Sources: wshawk/scanner_v2.py:40-100, wshawk/scanner_v2.py:176-255, wshawk/scanner_v2.py:258-341


Configuration and Tuning

The resilience layer can be configured through the hierarchical configuration system (see Configuration System) or via CLI flags.

Configuration via wshawk.yaml

scanner:
  rate_limit: 10              # Requests per second
  max_retries: 3              # Maximum retry attempts
  backoff_base: 1             # Base delay in seconds
  backoff_cap: 30             # Maximum delay cap
  circuit_breaker:
    threshold: 10             # Failures to trigger OPEN
    window: 60                # Time window in seconds
    recovery_timeout: 60      # OPEN state duration

Configuration via CLI

# From advanced_cli.py:43-44
wshawk-advanced ws://target.com --rate 5    # 5 requests/second (stealthy)
wshawk-advanced ws://target.com --rate 50   # 50 requests/second (aggressive)

The --rate flag directly controls the TokenBucketRateLimiter tokens-per-second parameter.

Tuning Recommendations

| Scenario | Rate Limit | Adaptive | Notes |
|----------|------------|----------|-------|
| Production Targets | 5-10 req/s | Enabled | Avoid triggering rate limiting |
| Internal Testing | 20-50 req/s | Disabled | Faster scans on controlled environments |
| WAF-Protected | 2-5 req/s | Enabled | Minimize detection probability |
| CI/CD Pipelines | 10-20 req/s | Enabled | Balance speed and reliability |

Network Partitioning Handling

If WSHawk detects complete network loss (e.g., Wi-Fi disconnection), it enters a "Safe-Pause" state:

  1. All active timers are paused
  2. Circuit breaker states are preserved
  3. When network returns, scanning resumes from the exact payload that was interrupted
  4. No scan progress is lost

This ensures that temporary network issues don't corrupt scan state or require restart from scratch.

Sources: wshawk/advanced_cli.py:43-44, docs/V3_COMPLETE_GUIDE.md:171-172


Resilience in Action: Example Scenarios

Scenario 1: Rate-Limited API (HTTP 429)

1. Scanner sends payload → 429 Too Many Requests
2. Error Classifier: TRANSIENT (rate limit)
3. Exponential Backoff: Wait 2s (attempt 1)
4. Retry → 429 Again
5. Exponential Backoff: Wait 4s (attempt 2)
6. Retry → 429 Again
7. Exponential Backoff: Wait 8s (attempt 3)
8. Retry → Success (200 OK)
9. Circuit Breaker: Record success, remain CLOSED

Scenario 2: Cascading Failures (Service Down)

1. Scanner sends 10 payloads → All fail with ConnectionRefused
2. Circuit Breaker: Threshold exceeded (10 failures)
3. Circuit Breaker: Transition to OPEN
4. All subsequent requests blocked for 60s (fast-fail)
5. After 60s: Transition to HALF_OPEN
6. Send single canary request
   - If success: → CLOSED (resume normal operation)
   - If failure: → OPEN (wait another 60s)

Scenario 3: Intermittent Network

1. Scanner sends payload → TimeoutError
2. Error Classifier: TRANSIENT
3. Exponential Backoff: Wait 1s + jitter
4. Retry → Success
5. Circuit Breaker: Record success
6. Continue scanning (no service degradation)

Sources: docs/V3_COMPLETE_GUIDE.md:136-172


Benefits of Resilience Architecture

The resilience control plane provides critical operational benefits for production security scanning:

| Benefit | Description | Impact |
|---------|-------------|--------|
| Zero-Loss Scans | Network issues don't corrupt scan state | Complete vulnerability coverage |
| Target Protection | Rate limiting prevents overwhelming targets | Safe for production environments |
| Integration Stability | Circuit breakers isolate failing external services | Jira/DefectDojo failures don't block scanning |
| CI/CD Reliability | Retries handle ephemeral cloud infrastructure issues | Fewer false CI failures |
| WAF Evasion | Adaptive rate control avoids detection | Higher bypass success rate |

The resilience layer transforms WSHawk from a research tool into a production-grade enterprise scanner capable of operating reliably in hostile network conditions and against hardened targets.

Sources: RELEASE_SUMMARY.md:1-60, RELEASE_3.0.0.md:1-62, docs/V3_COMPLETE_GUIDE.md:95-98