ADR-024: Telemetry Retry with Configurable Backoff

Status

Accepted

Date: 2026-03-31

Context

When a telemetry read fails (BLE timeout, CalDAV unreachable, serial error), the framework logs the error, publishes to the error topic, and waits for the next full poll interval. For long-interval apps this causes disproportionate data gaps:

App              Interval   Gap from one failure
airthings2mqtt   25 min     25 min (BLE flaky)
caldates2mqtt    2 h        2 h (CalDAV unreachable)
vito2mqtt        60 s       60 s (serial timeout)
gas2mqtt         30 s       30 s (meter read failure)

No app has implemented custom retry — they all accept the gap. A framework-level solution closes this gap transparently.

Current error flow (no retry)

run_telemetry loop:
  while not shutdown:
    try:
      result = handler(**kwargs)
      → publish / persist / clear-error
    except Exception:
      → log + publish error + set health "error"
    await ctx.sleep(interval)          ← always full interval after failure

Constraints

  • ADR-011 (Error Handling): fire-and-forget error publishing, error deduplication by exception type, recovery resets health to "ok".
  • ADR-012 (Health Reporting): per-device status in heartbeat payload, string-typed status values.
  • ADR-013 (Publish Strategies): retry is transparent to publish strategies. The strategy only sees the final successful result, never intermediate retry failures.
  • Existing MQTT backoff model (_mqtt_client.py): exponential backoff with ±20% jitter, delay = min(delay * 2, max_interval), reset on success.

Decision

Add configurable retry with backoff strategies and optional circuit breaker to @app.telemetry. Retry logic lives in _telemetry_runner.py and wraps the handler invocation, transparent to the publish strategy layer.

API surface

New parameters on @app.telemetry and app.add_telemetry():

@app.telemetry(
    interval=300,
    retry=3,                              # max retry attempts (0 = no retry)
    retry_on=(OSError,),                  # exception types to retry on
    backoff=ExponentialBackoff(            # backoff strategy (optional)
        base=2.0,
        max_delay=60.0,
    ),
    circuit_breaker=CircuitBreaker(       # optional circuit breaker
        threshold=5,
    ),
)
async def read_sensor(adapter: BlePort) -> dict[str, object]:
    ...

Decision 1: retry_on default — (OSError,)

The default is (OSError,), which covers all I/O and OS-level failures:

  • ConnectionError (ConnectionRefusedError, ConnectionResetError, BrokenPipeError)
  • TimeoutError (subclass of OSError since Python 3.3, PEP 3151)
  • FileNotFoundError, PermissionError

ValueError is excluded from the default because it conflates two scenarios: garbled sensor data (transient, retryable) and programming bugs (permanent, should fail fast). Users who handle garbled data opt in explicitly:

@app.telemetry(retry=3, retry_on=(OSError, ValueError))

When retry > 0 and retry_on is empty (explicitly passed retry_on=()), the framework raises ValueError at registration time — this is almost certainly a configuration error.
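As a sketch, the registration-time check could look like the following; validate_retry_config is a hypothetical helper name, not the framework's actual function:

```python
def validate_retry_config(
    retry: int,
    retry_on: tuple[type[BaseException], ...],
) -> None:
    """Hypothetical registration-time validation for retry parameters."""
    if retry > 0 and not retry_on:
        # retry > 0 with nothing to retry on is almost certainly a
        # configuration error, so fail fast at registration time.
        raise ValueError(
            "retry > 0 requires at least one exception type in retry_on"
        )
```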

Decision 2: Configurable backoff strategies

Backoff is configurable from the start, using a BackoffStrategy protocol with built-in implementations:

class BackoffStrategy(Protocol):
    def delay(self, attempt: int) -> float:
        """Return delay in seconds for the given attempt number (1-based)."""
        ...

Built-in strategies:

Strategy                                      Formula                               Use case
ExponentialBackoff(base=2.0, max_delay=60.0)  min(base × 2^(attempt-1), max_delay)  Default. Most transient failures.
LinearBackoff(step=2.0, max_delay=60.0)       min(step × attempt, max_delay)        Predictable delay growth.
FixedBackoff(delay=5.0)                       delay                                 Constant wait between retries.

All strategies enforce a max_delay ceiling to prevent unbounded waits.

Default: When backoff= is omitted and retry > 0, the framework uses ExponentialBackoff(base=2.0, max_delay=60.0).

Jitter: All built-in strategies apply ±20% jitter (matching the MQTT reconnect model in _mqtt_client.py) to prevent thundering-herd effects when multiple devices retry simultaneously.
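A minimal sketch of ExponentialBackoff combining the formula and jitter described above (here jitter is applied before the max_delay cap so the ceiling always holds; the framework's actual clamping order may differ):

```python
import random


class ExponentialBackoff:
    """Exponential backoff with ±20% jitter and a max_delay ceiling."""

    def __init__(self, base: float = 2.0, max_delay: float = 60.0) -> None:
        self.base = base
        self.max_delay = max_delay

    def delay(self, attempt: int) -> float:
        # min(base × 2^(attempt-1), max_delay), with ±20% jitter.
        raw = self.base * 2 ** (attempt - 1)
        jittered = raw * random.uniform(0.8, 1.2)
        return min(jittered, self.max_delay)
```

With the defaults, attempts 1, 2, and 3 wait roughly 2 s, 4 s, and 8 s (±20%), and no delay ever exceeds 60 s.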

Decision 3: Cumulative retry counter with optional circuit breaker

The retry counter persists across poll cycles, so exponential backoff keeps growing toward max_delay during a sustained outage. If the counter reset every cycle, a 25-minute interval with retry=3 would restart at attempt 1 each cycle, and the delay would never grow beyond the first few steps.

The counter resets to zero on a successful read.

Circuit breaker (optional):

When provided, the circuit breaker tracks consecutive failed cycles (cycles where all retry attempts were exhausted). After threshold consecutive failures, the circuit opens:

@app.telemetry(
    retry=3,
    circuit_breaker=CircuitBreaker(threshold=5),
)

State      Behavior
closed     Normal operation. Handler runs, retries on failure.
open       Handler is skipped. Device status set to "circuit_open". The framework logs at WARNING each skipped cycle.
half-open  After one full interval in open state, the framework makes a single probe attempt. On success → closed. On failure → open.
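The state machine above can be sketched as follows; the method names (record_failure, record_success, allow_probe) are illustrative, with the runner expected to call allow_probe once per interval while the circuit is open:

```python
from enum import Enum


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    """Sketch of the ADR's open/half-open/closed state machine."""

    def __init__(self, threshold: int = 5) -> None:
        self.threshold = threshold
        self.failures = 0  # consecutive cycles where all retries were exhausted
        self.state = CircuitState.CLOSED

    def record_success(self) -> None:
        # Any successful read fully resets the breaker.
        self.failures = 0
        self.state = CircuitState.CLOSED

    def record_failure(self) -> None:
        if self.state is CircuitState.HALF_OPEN:
            # Failed probe: re-open the circuit.
            self.state = CircuitState.OPEN
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.state = CircuitState.OPEN

    def allow_probe(self) -> bool:
        # Called after one full interval in the open state: permits a
        # single half-open probe attempt.
        if self.state is CircuitState.OPEN:
            self.state = CircuitState.HALF_OPEN
            return True
        return False
```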

Circuit breaker state is exposed via:

  • Health reporter: set_device_status(name, "circuit_open") — visible in heartbeat payload at {app}/status.
  • Introspection endpoint: retry configuration (retry count, retry_on types, backoff strategy, circuit breaker threshold) is included in the registry snapshot. Runtime state (current attempt count, circuit state, consecutive failure count) is not currently exposed.
  • Logging: WARNING logs are emitted when the handler is skipped due to an open circuit and when retry attempts are exhausted.

Decision 4: Retry transparent to publish strategies

Retry wraps the handler invocation inside the telemetry runner, before the result reaches _handle_telemetry_outcome():

run_telemetry loop:
  while not shutdown:
    if circuit_breaker.is_open:
      → probe on half-open transition, skip otherwise
    else:
      for attempt in range(1, retry + 2):    ← one initial try + up to `retry` retries
        try:
          result = handler(**kwargs)
          → reset retry counter + circuit breaker
          → _handle_telemetry_outcome (publish / persist / clear-error)
          break
        except retry_on_exceptions:
          → log WARNING with attempt number
          if attempt <= retry:
            await ctx.sleep(backoff.delay(attempt))
          else:
            → retry exhausted: _handle_telemetry_error as today
            → increment circuit breaker consecutive failures
    await ctx.sleep(interval)

The publish strategy only sees successful results. OnChange deduplication, Every immediate publish — all work unchanged. A retry that eventually succeeds produces a result that flows through the normal pipeline. A retry that exhausts all attempts flows through _handle_telemetry_error() as today.
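In Python, the inner attempt loop might be factored as below. invoke_with_retry is an illustrative name, the sketch assumes retry counts additional attempts after the first (so retry=3 allows up to four invocations), and ctx.sleep is replaced by asyncio.sleep for brevity:

```python
import asyncio
import logging

log = logging.getLogger("telemetry")


async def invoke_with_retry(handler, retry, retry_on, backoff, **kwargs):
    """Return the handler result, or raise the last exception once all
    attempts are exhausted. Callers (publish strategies, error handling)
    see exactly one outcome, never the intermediate failures."""
    last_attempt = retry + 1  # one initial try plus `retry` retries
    for attempt in range(1, last_attempt + 1):
        try:
            return await handler(**kwargs)
        except retry_on as exc:
            log.warning("attempt %d/%d failed: %r", attempt, last_attempt, exc)
            if attempt == last_attempt:
                raise  # exhausted: flows to _handle_telemetry_error
            await asyncio.sleep(backoff.delay(attempt))
```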

Decision 5: No retry on @app.command

Commands have fire-once semantics — user-initiated, expecting immediate response. Retry is restricted to @app.telemetry for v1. If a command handler needs retry, the app implements it manually.

Retry during shutdown

Retry attempts use ctx.sleep() (shutdown-aware). If shutdown is requested during a backoff wait, the sleep completes immediately and the retry loop exits cleanly — no further attempts are made.
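The ADR does not show ctx.sleep itself; one plausible shape for a shutdown-aware sleep, sketched here with illustrative names, is an asyncio.Event raced against a timeout:

```python
import asyncio


class Ctx:
    """Illustrative context object with a shutdown-aware sleep."""

    def __init__(self) -> None:
        self._shutdown = asyncio.Event()

    def request_shutdown(self) -> None:
        self._shutdown.set()

    async def sleep(self, seconds: float) -> bool:
        """Wait up to `seconds`; return True immediately if shutdown is
        requested, so retry/backoff loops can exit cleanly."""
        try:
            await asyncio.wait_for(self._shutdown.wait(), timeout=seconds)
            return True
        except asyncio.TimeoutError:
            return False
```

A backoff wait then becomes `if await ctx.sleep(backoff.delay(attempt)): break`.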

Retry and coalescing groups

For grouped handlers (group=), retry applies per handler invocation within the group tick. Handlers in a batch execute sequentially (matching the existing scheduler design), so a retrying handler's backoff delays affect subsequent handlers in the same tick.

The group runner executes retries inline and does not currently preempt or abandon in-flight retries when the nominal group tick interval is exceeded. If a handler's retries run long, the current batch completes first and the next group tick is effectively delayed; subsequent ticks are scheduled after the batch finishes.

Error deduplication interaction

Retry attempts are not published as individual errors. Only the final failure (after all retries exhausted) flows through _handle_telemetry_error() and the existing deduplication logic. This prevents flooding the error topic with transient failures that may self-resolve.

Implementation location

Component                                   File
BackoffStrategy protocol + implementations  _retry.py (new)
CircuitBreaker class                        _retry.py (new)
Retry loop integration                      _telemetry_runner.py
New registration fields                     _registration.py
Decorator parameters                        _app.py
Public API exports                          __init__.py
Introspection updates                       _introspect.py

Decision Drivers

  • Disproportionate data gaps. A single transient failure at 25-minute intervals causes a 25-minute gap. Retry with 2s→4s→8s backoff recovers within seconds.
  • No app has solved this. Four apps accept the gap — framework-level solution benefits all.
  • Existing backoff precedent. _mqtt_client.py already implements exponential backoff with jitter. The telemetry retry should follow the same pattern.
  • Configurable from day one. Backoff needs vary: BLE adapters benefit from exponential, CalDAV from fixed delays, serial from linear. Making it pluggable via a protocol avoids redesign later.
  • Circuit breaker prevents futile retries. When an adapter is truly down (not transient), retrying every cycle wastes resources and floods logs.

Considered Options

Option A: Flat decorator parameters, per-cycle reset

retry=3, retry_on=(OSError,), backoff=2.0 as flat float parameters. Retry counter resets each cycle. No circuit breaker.

  • Advantages: Minimal API surface. Simple implementation.
  • Disadvantages: Fixed exponential only — no strategy choice. Per-cycle reset means backoff never grows beyond a single cycle, limiting effectiveness. No protection against prolonged outages.

Option B: RetryPolicy object, cumulative counter, circuit breaker (chosen)

Backoff strategies via protocol. Cumulative retry counter with reset on success. Optional circuit breaker with open/half-open/closed states.

  • Advantages: Pluggable strategies. Meaningful exponential backoff across cycles. Circuit breaker prevents futile retries. Full observability via health reporter.
  • Disadvantages: More API surface. Circuit breaker adds state machine complexity. Cumulative counter requires careful interaction with error deduplication.

Option C: Tenacity library integration

Wrap handler calls with tenacity retry decorators.

  • Advantages: Battle-tested. Rich retry features out of the box.
  • Disadvantages: New dependency. Doesn't integrate with ctx.sleep() for shutdown-awareness. Leaks third-party API into public surface. Can't expose circuit breaker state in health reporter.

Decision Matrix

Criterion            A: Flat/per-cycle  B: Policy/cumulative  C: Tenacity
Backoff flexibility          2                   5                 5
Shutdown awareness           4                   5                 2
Observability                2                   5                 2
API simplicity               5                   3                 3
No new dependencies          5                   5                 2
Covers long outages          2                   5                 4
Total                       20                  28                18

Scale: 1 (poor) to 5 (excellent)

Consequences

Positive

  • Transient failures recover in seconds instead of waiting a full poll interval. For airthings2mqtt (25 min), this reduces worst-case data gaps from 25 minutes to ~14 seconds (3 retries with exponential backoff).
  • Backoff strategy protocol is open for extension — users can implement custom strategies without framework changes (Open/Closed Principle).
  • Circuit breaker prevents infinite retry churn during prolonged outages, with clear state visibility in health heartbeats.
  • Retry is transparent to publish strategies — no changes needed to OnChange, Every, or future strategies.
  • Follows the existing MQTT reconnect backoff model, maintaining consistency within the framework.
  • Cumulative counter across cycles means exponential backoff is meaningful even for short-interval telemetry.

Negative

  • Four new public types (ExponentialBackoff, LinearBackoff, FixedBackoff, CircuitBreaker) increase the API surface.
  • Circuit breaker state machine (closed/open/half-open) adds complexity to _telemetry_runner.py — careful testing of state transitions is essential.
  • Cumulative retry counter interacts with error deduplication — must ensure only the final exhausted failure publishes an error, not intermediate attempts.
  • Retry within coalescing groups may cause handler execution to extend beyond the group tick — since in-flight retries are not preempted, the resulting delay to subsequent group ticks must be clearly documented.
