ADR-029: Adapter Auto-Restart Strategy

Status

Accepted

Date: 2026-04-03

Context

ADR-028 introduced the adapter health check protocol — periodic probing of HealthCheckable adapters with availability reporting. When an adapter becomes unhealthy, the framework sets affected devices to "offline" and logs warnings, but takes no corrective action. Operators must still manually restart the container to recover from a wedged adapter (BlueZ crash, serial port glitch, hardware reset).

Auto-restart closes the loop: detect → restart → recover, without human intervention. This is critical for unattended deployments where adapters may experience transient failures hours or days apart.

Existing infrastructure

HealthCheckRunner tracks consecutive_failures per adapter via AdapterHealthStatus (ADR-028).
Adapter lifecycle uses duck-typed __aenter__/__aexit__ managed by an AsyncExitStack (ADR-016).
build_adapter_device_map() maps adapter types to dependent device names via DI introspection (ADR-028).
Device tasks are created by start_device_tasks() and cancelled by cancel_tasks() during shutdown.
Commands are received via MQTT subscriptions and routed to device handlers by the command router (ADR-025).

Decision

1. Restart trigger: consecutive failure threshold

When an adapter's consecutive_failures reaches a configurable threshold, the framework initiates a restart. The threshold is set via App.__init__:

App(
    restart_after_failures=5,  # default: 5; 0 disables auto-restart
)

Five consecutive failures at the default 30-second health check interval give the adapter about 2.5 minutes (5 × 30 s) to recover naturally before the framework intervenes. This avoids restarting on transient blips while still responding within a reasonable window.
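The trigger logic can be illustrated with a minimal sketch. FailureTracker and its record() method are hypothetical names used only for illustration; the framework's own counter lives in AdapterHealthStatus (ADR-028) and may be structured differently.

```python
from dataclasses import dataclass


@dataclass
class FailureTracker:
    """Hypothetical stand-in for the per-adapter failure counter."""

    restart_after_failures: int = 5  # 0 disables auto-restart
    consecutive_failures: int = 0

    def record(self, healthy: bool) -> bool:
        """Record one health check result; return True when a restart is due."""
        if healthy:
            self.consecutive_failures = 0  # any success breaks the streak
            return False
        self.consecutive_failures += 1
        if self.restart_after_failures == 0:
            return False  # auto-restart disabled
        # Fire once, at the moment the streak reaches the threshold.
        return self.consecutive_failures == self.restart_after_failures


tracker = FailureTracker()
checks = (False, False, True, False, False, False, False, False)
results = [tracker.record(ok) for ok in checks]
print(results)
```

In this run the single healthy check resets the streak, so the restart only becomes due on the fifth failure in a row that follows it.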

2. Restart mechanism: cancel + recreate

On restart, the framework:

1. Cancels all device tasks that depend on the failing adapter.
2. Calls adapter.__aexit__() to tear down the adapter.
3. Waits for the cooldown period (Decision 4).
4. Calls adapter.__aenter__() to re-initialise the adapter.
5. Runs a startup health check to verify the adapter recovered.
6. Recreates cancelled device tasks from their original registrations.
7. Resets consecutive_failures to 0.

Device tasks are cancelled and recreated from scratch rather than suspended. This ensures a clean state with no stale references to the old adapter instance. @app.telemetry handlers are stateless by design, so recreation is seamless. @app.device handlers lose in-flight state (local variables, closures), but DeviceStore data survives because it is persisted (ADR-015).

This trade-off is acceptable: an adapter restart implies the underlying hardware experienced a failure, so any in-memory state derived from hardware interaction is likely stale anyway. Adopter projects should persist critical device state via DeviceStore if it must survive restarts.
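The restart sequence described in this decision could look roughly like the sketch below. It is an illustration under assumptions, not the framework's implementation: FakeAdapter, restart_adapter, and the task factories are hypothetical, a plain asyncio.sleep stands in for the shutdown-aware cooldown, and resetting consecutive_failures is left to the caller.

```python
import asyncio


class FakeAdapter:
    """Hypothetical adapter that records its lifecycle calls."""

    def __init__(self):
        self.events = []

    async def __aenter__(self):
        self.events.append("enter")
        return self

    async def __aexit__(self, *exc):
        self.events.append("exit")

    async def health_check(self) -> bool:
        self.events.append("health_check")
        return True


async def restart_adapter(adapter, tasks, factories, cooldown):
    """Cancel dependent tasks, cycle the adapter, then recreate the tasks."""
    for task in tasks:
        task.cancel()
    # Wait until every cancelled task has actually finished.
    await asyncio.gather(*tasks, return_exceptions=True)
    await adapter.__aexit__(None, None, None)  # tear down
    await asyncio.sleep(cooldown)  # let hardware and OS resources release
    await adapter.__aenter__()  # re-initialise
    if not await adapter.health_check():
        return []  # adapter did not recover; leave devices offline
    # Recreate tasks from scratch so nothing references the old state.
    return [asyncio.create_task(factory()) for factory in factories]


async def demo():
    adapter = FakeAdapter()

    async def device_loop():
        await asyncio.sleep(3600)  # stands in for a long-running device task

    old = [asyncio.create_task(device_loop())]
    await asyncio.sleep(0)  # give the task a chance to start
    new = await restart_adapter(adapter, old, [device_loop], cooldown=0.01)
    outcome = (adapter.events, old[0].cancelled(), len(new))
    for task in new:  # clean up the demo tasks
        task.cancel()
    await asyncio.gather(*new, return_exceptions=True)
    return outcome


events, old_cancelled, new_count = asyncio.run(demo())
print(events, old_cancelled, new_count)
```

Passing factories rather than task objects reflects the cancel + recreate choice: the new tasks are fresh coroutines, so none of them can hold local variables from before the restart.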

3. Max restarts with sustained health reset

A configurable maximum prevents restart loops:

App(
    max_restarts=3,  # default: 3; per adapter, until reset by sustained health
    sustained_health_reset=300.0,  # seconds (5 min); resets restart counter
)

When max_restarts is exhausted, the adapter stays offline permanently until the container is restarted. A CRITICAL-level log message is emitted.

To handle adapters with rare, widely-spaced failures (e.g., a hardware glitch once every few days), the restart counter resets to 0 after sustained_health_reset seconds of consecutive healthy checks. The default of 5 minutes provides confidence that the adapter has genuinely recovered, not merely passed a single check before failing again.
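One way to picture the interaction between max_restarts and sustained_health_reset is the sketch below. RestartBudget, on_check, and try_consume are hypothetical names, and timestamps are passed in explicitly so the behaviour is easy to follow; the framework may track this state differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RestartBudget:
    """Hypothetical per-adapter restart counter with sustained-health reset."""

    max_restarts: int = 3
    sustained_health_reset: float = 300.0  # seconds
    restarts_used: int = 0
    healthy_since: Optional[float] = None

    def on_check(self, healthy: bool, now: float) -> None:
        """Update the healthy streak; reset the counter once it is long enough."""
        if not healthy:
            self.healthy_since = None  # streak broken
            return
        if self.healthy_since is None:
            self.healthy_since = now  # first healthy check of a new streak
        if now - self.healthy_since >= self.sustained_health_reset:
            self.restarts_used = 0

    def try_consume(self) -> bool:
        """Return True if a restart is still allowed, and count it."""
        if self.restarts_used >= self.max_restarts:
            return False  # exhausted: adapter stays offline
        self.restarts_used += 1
        return True


budget = RestartBudget()
granted = [budget.try_consume() for _ in range(4)]
# Eleven healthy checks, 30 s apart, span the full 300 s window.
for t in range(0, 301, 30):
    budget.on_check(True, float(t))
after_reset = budget.try_consume()
print(granted, after_reset)
```

The fourth request is refused because the budget is exhausted, and a restart is allowed again only after the healthy streak has lasted the full reset window.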

4. Restart cooldown

A configurable delay between __aexit__ and __aenter__ gives the underlying hardware and OS resources time to release:

App(
    restart_cooldown=5.0,  # seconds, default: 5
)

Five seconds accommodates typical resource release times for Bluetooth daemons, serial port drivers, and GPIO subsystems used by current adopter projects. The cooldown uses shutdown-aware sleep so it can be interrupted if the application is shutting down.
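A shutdown-aware sleep can be built from an asyncio.Event and asyncio.wait_for. The sketch below assumes the application signals shutdown through such an event; shutdown_aware_sleep is a hypothetical helper name, not necessarily what the framework calls it.

```python
import asyncio


async def shutdown_aware_sleep(delay: float, shutdown: asyncio.Event) -> bool:
    """Sleep for `delay` seconds; return True if shutdown was requested first."""
    try:
        await asyncio.wait_for(shutdown.wait(), timeout=delay)
    except asyncio.TimeoutError:
        return False  # full delay elapsed without a shutdown request
    return True  # woken early by the shutdown event


async def demo():
    shutdown = asyncio.Event()
    # No shutdown requested: the sleep runs to completion.
    completed = await shutdown_aware_sleep(0.01, shutdown)
    # Shutdown requested shortly after the sleep starts: it returns early.
    asyncio.get_running_loop().call_later(0.01, shutdown.set)
    interrupted = await shutdown_aware_sleep(5.0, shutdown)
    return completed, interrupted


outcome = asyncio.run(demo())
print(outcome)
```

A caller in the restart path would skip the __aenter__ step when the helper returns True, since re-initialising an adapter during shutdown serves no purpose.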

5. Restart opt-out

By default, all adapters that satisfy both HealthCheckable and the lifecycle protocol (__aenter__/__aexit__) are eligible for auto-restart. Adapters that implement health_check() but lack lifecycle methods cannot be restarted (a WARNING is logged at startup).

Some adapters may not survive an exit + re-enter cycle — for example, hardware that requires a one-time initialisation sequence not repeatable at runtime. These adapters can opt out by setting a class attribute:

from typing import ClassVar


class MyAdapter:
    restartable: ClassVar[bool] = False  # opt out of auto-restart

    async def __aenter__(self): ...
    async def __aexit__(self, *exc): ...
    async def health_check(self) -> bool: ...

The framework checks getattr(adapter, 'restartable', True) — adapters are restartable by default unless they explicitly set restartable = False. This follows the principle of safe defaults: most adapters that implement lifecycle methods can tolerate exit + re-enter.

When an opted-out adapter's health check fails beyond the threshold, the framework logs a WARNING indicating the adapter is unhealthy but not restartable, and continues reporting "offline" availability without attempting restart.
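The eligibility rules in this decision can be summarised as a small classification function. This is a sketch under assumptions: restart_eligibility and its return labels are hypothetical, and only the getattr(adapter, 'restartable', True) check is taken directly from the text above.

```python
from typing import ClassVar


def restart_eligibility(adapter) -> str:
    """Classify an adapter by duck-typing, mirroring the rules described above."""
    has_health = callable(getattr(adapter, "health_check", None))
    has_lifecycle = all(
        callable(getattr(adapter, name, None))
        for name in ("__aenter__", "__aexit__")
    )
    if not has_health:
        return "not-monitored"  # no health check, so nothing can trigger a restart
    if not has_lifecycle:
        return "no-lifecycle"  # cannot be restarted; WARNING at startup
    if not getattr(adapter, "restartable", True):
        return "opted-out"  # reported offline, never restarted
    return "restartable"


class BleAdapter:
    async def __aenter__(self): ...
    async def __aexit__(self, *exc): ...
    async def health_check(self) -> bool: ...


class FragileAdapter(BleAdapter):
    restartable: ClassVar[bool] = False  # explicit opt-out


class ProbeOnlyAdapter:
    async def health_check(self) -> bool: ...


labels = [
    restart_eligibility(BleAdapter()),
    restart_eligibility(FragileAdapter()),
    restart_eligibility(ProbeOnlyAdapter()),
]
print(labels)
```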

6. In-flight command handling during restart

Commands that arrive while an adapter is restarting are queued and delivered after device tasks are recreated:

The command router holds incoming commands in a bounded queue during the restart window.
After device tasks are recreated, queued commands are dispatched in order.
If the queue overflows (unlikely given the short restart window), the oldest commands are dropped with a WARNING log.

Commands are asynchronous by nature, so a delay of a few seconds during restart is acceptable. Dropping commands silently (Option A in the In-flight commands table under Considered Options) risks lost user actions, and rejecting with an error (Option C) pushes retry logic onto the caller unnecessarily.
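The queuing behaviour, including the drop-oldest overflow policy, can be sketched as follows. RestartCommandBuffer, its capacity of 100, and the logger name are assumptions made for illustration; the ADR does not specify the queue size or the command router's internals.

```python
import logging
from collections import deque

logger = logging.getLogger("restart_buffer")  # hypothetical logger name


class RestartCommandBuffer:
    """Hypothetical bounded FIFO that drops the oldest command on overflow."""

    def __init__(self, maxlen: int = 100):  # capacity is an assumed value
        self._queue = deque()
        self._maxlen = maxlen
        self.dropped = 0

    def put(self, command) -> None:
        """Buffer a command that arrived during the restart window."""
        if len(self._queue) >= self._maxlen:
            oldest = self._queue.popleft()
            self.dropped += 1
            logger.warning("Restart queue full, dropping oldest command: %r", oldest)
        self._queue.append(command)

    def drain(self) -> list:
        """Return buffered commands in arrival order and empty the buffer."""
        commands = list(self._queue)
        self._queue.clear()
        return commands


buffer = RestartCommandBuffer(maxlen=3)
for n in range(5):
    buffer.put(f"cmd-{n}")
delivered = buffer.drain()
print(delivered, buffer.dropped)
```

With a capacity of three and five incoming commands, the two oldest are dropped and the remaining three are delivered in order once device tasks exist again.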

Decision Drivers

Unattended deployments need automatic recovery from transient adapter failures
Restart must be safe: no stale references, no leaked resources, no partial state
Framework conventions: duck-typed protocols, AsyncExitStack lifecycle, DI-based wiring
Simplicity: global thresholds over per-adapter configuration for v1
Current adopter projects use adapters (BLE, serial, GPIO) that tolerate exit + re-enter

Considered Options

Device task handling during restart

Option	Behaviour	Pros	Cons	Chosen
A: Cancel + recreate	Cancel tasks, restart adapter, recreate tasks	Clean state, no stale refs	Loses `@app.device` loop state	Yes
B: Suspend + resume	Pause flag, device checks it	Preserves loop state	Complex, device must cooperate	No
C: Replace adapter ref	Swap DI reference, devices keep running	Minimal disruption	Thread safety, stale handles	No

Max restarts reset

Option	Behaviour	Pros	Cons	Chosen
A: Never reset	Hard lifetime limit	Simple	Penalises rare failures days apart	No
B: Reset after sustained health	5 min healthy → counter resets	Handles intermittent failures	Slightly more complex	Yes

Restart opt-in vs. opt-out

Option	Behaviour	Pros	Cons	Chosen
A: Implicit (no control)	All eligible adapters restart	Zero config	No escape hatch for fragile adapters	No
B: Default-on with opt-out	`restartable = False` to disable	Safe default, escape hatch	Class attribute convention	Yes
C: Explicit opt-in	Must implement `Restartable` protocol	Maximum control	Extra ceremony for common case	No

In-flight commands

Option	Behaviour	Pros	Cons	Chosen
A: Drop	Commands lost during restart	Simple	Silent data loss	No
B: Queue	Buffer and deliver after restart	No lost commands	Bounded queue, slight delay	Yes
C: Reject	Publish error to error topic	Explicit failure	Pushes retry to caller	No

Consequences

Positive

Transient adapter failures are recovered automatically without operator intervention
Clean restart via cancel + recreate avoids stale state and reference leaks
Sustained health reset prevents permanent offline state from rare, widely-spaced failures
Opt-out mechanism accommodates adapters that cannot tolerate re-initialisation
Command queuing preserves user actions during the brief restart window
Builds directly on ADR-028 infrastructure (consecutive_failures, adapter→device mapping, HealthCheckable protocol)

Negative

@app.device handlers lose in-flight state on restart — adopter projects must persist critical state via DeviceStore (mitigated by the fact that adapter failure likely invalidates in-memory hardware state anyway)
Restart introduces a brief period (~5s cooldown + re-init) where affected devices are unresponsive
The command queue is bounded and could overflow under extreme command rates during restart, in which case the oldest commands are dropped (mitigated by WARNING-level overflow logging)
restartable = False is a class-level attribute, not configurable at runtime (sufficient for the known use case of hardware that fundamentally cannot re-initialise)
