ADR-029: Adapter Auto-Restart Strategy
Status
Accepted
Date: 2026-04-03
Context
ADR-028 introduced the adapter health check protocol — periodic probing of
HealthCheckable adapters with availability reporting. When an adapter becomes
unhealthy, the framework sets affected devices to "offline" and logs warnings, but
takes no corrective action. Operators must still manually restart the container to
recover from a wedged adapter (BlueZ crash, serial port glitch, hardware reset).
Auto-restart closes the loop: detect → restart → recover, without human intervention. This is critical for unattended deployments where adapters may experience transient failures hours or days apart.
Existing infrastructure
- HealthCheckRunner tracks consecutive_failures per adapter via AdapterHealthStatus (ADR-028).
- Adapter lifecycle uses duck-typed __aenter__/__aexit__ managed by an AsyncExitStack (ADR-016).
- build_adapter_device_map() maps adapter types to dependent device names via DI introspection (ADR-028).
- Device tasks are created by start_device_tasks() and cancelled by cancel_tasks() during shutdown.
- Commands are received via MQTT subscriptions and routed to device handlers by the command router (ADR-025).
Decision
1. Restart trigger: consecutive failure threshold
When an adapter's consecutive_failures reaches a configurable threshold, the
framework initiates a restart. The threshold is a global setting passed to App.__init__.
Five consecutive failures at the default 30-second health check interval gives the adapter 2.5 minutes to recover naturally before the framework intervenes. This avoids restarting on transient blips while still responding within a reasonable window.
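The arithmetic and the trigger condition can be sketched as follows. This is a minimal illustration, not the framework's code: RESTART_THRESHOLD, HEALTH_CHECK_INTERVAL, grace_period, and should_restart are hypothetical names, and only consecutive_failures is taken from ADR-028.

```python
# Illustrative sketch; every name except consecutive_failures is an assumption.
RESTART_THRESHOLD = 5          # consecutive failed checks before a restart
HEALTH_CHECK_INTERVAL = 30.0   # seconds between health checks


def grace_period(threshold: int, interval: float) -> float:
    """Seconds of continuous failure tolerated before the framework intervenes."""
    return threshold * interval


def should_restart(consecutive_failures: int,
                   threshold: int = RESTART_THRESHOLD) -> bool:
    """True once the failure streak reaches the threshold."""
    return consecutive_failures >= threshold


# 5 checks x 30 s = 150 s, i.e. the 2.5 minutes quoted in the text.
window = grace_period(RESTART_THRESHOLD, HEALTH_CHECK_INTERVAL)
```

A single healthy check resets the streak, so a brief blip never accumulates towards the threshold.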
2. Restart mechanism: cancel + recreate
On restart, the framework:
- Cancels all device tasks that depend on the failing adapter.
- Calls adapter.__aexit__() to tear down the adapter.
- Waits for the cooldown period (Decision 4).
- Calls adapter.__aenter__() to re-initialise the adapter.
- Runs a startup health check to verify the adapter recovered.
- Recreates cancelled device tasks from their original registrations.
- Resets consecutive_failures to 0.
Device tasks are cancelled and recreated from scratch rather than suspended.
This ensures a clean state with no stale references to the old adapter instance.
@app.telemetry handlers are stateless by design, so recreation is seamless.
@app.device handlers lose in-flight state (local variables, closures), but
DeviceStore data survives because it is persisted (ADR-015).
This trade-off is acceptable: an adapter restart implies the underlying hardware
experienced a failure, so any in-memory state derived from hardware interaction
is likely stale anyway. Adopter projects should persist critical device state
via DeviceStore if it must survive restarts.
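The restart sequence in this decision could look roughly like the asyncio sketch below. It rests on several assumptions: restart_adapter, HealthStatus, and the factories argument are hypothetical stand-ins rather than framework APIs, the cooldown is a plain sleep, and raising on a failed post-restart health check is one possible policy that this ADR does not specify.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Callable, Coroutine


@dataclass
class HealthStatus:
    """Hypothetical stand-in for the framework's per-adapter health record."""
    consecutive_failures: int = 0


async def restart_adapter(
    adapter: Any,
    status: HealthStatus,
    tasks: list[asyncio.Task],
    factories: list[Callable[[], Coroutine[Any, Any, None]]],
    cooldown: float = 5.0,
) -> list[asyncio.Task]:
    """Cancel dependent tasks, cycle the adapter, then recreate the tasks."""
    # 1. Cancel every device task that depends on the failing adapter
    #    and wait until the cancellations have been processed.
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)

    # 2. Tear down the adapter.
    await adapter.__aexit__(None, None, None)

    # 3. Cooldown (plain sleep here; Decision 4 describes a shutdown-aware one).
    await asyncio.sleep(cooldown)

    # 4. Re-initialise the adapter.
    await adapter.__aenter__()

    # 5. Startup health check. Raising on failure is an assumption of this sketch.
    if not await adapter.health_check():
        raise RuntimeError("adapter still unhealthy after restart")

    # 6. Recreate device tasks from their original registrations.
    new_tasks = [asyncio.create_task(factory()) for factory in factories]

    # 7. Reset the failure streak.
    status.consecutive_failures = 0
    return new_tasks
```

Because the new tasks are built from the factories rather than from the old task objects, nothing created after the restart holds a reference to pre-restart state.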
3. Max restarts with sustained health reset
A configurable maximum prevents restart loops:
App(
    max_restarts=3,  # default: 3; per adapter, counted since the last sustained-health reset
    sustained_health_reset=300.0,  # seconds (5 min); resets restart counter
)
When max_restarts is exhausted, the adapter stays offline until
the container is restarted. A CRITICAL-level log message is emitted.
To handle adapters with rare, widely-spaced failures (e.g., a hardware glitch
once every few days), the restart counter resets to 0 after
sustained_health_reset seconds of consecutive healthy checks. The default of
5 minutes provides confidence that the adapter has genuinely recovered, not
merely passed a single check before failing again.
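One way to picture the counter and its reset rule is the small bookkeeping class below. It is a sketch under assumptions: RestartBudget and its method names are hypothetical, timestamps are passed in explicitly to keep the example deterministic, and only max_restarts and sustained_health_reset come from this ADR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RestartBudget:
    """Hypothetical per-adapter restart counter with sustained-health reset."""
    max_restarts: int = 3
    sustained_health_reset: float = 300.0  # seconds of unbroken health
    restarts: int = 0
    healthy_since: Optional[float] = None  # start of the current healthy streak

    def record_restart(self) -> None:
        """Count one restart and discard any healthy streak."""
        self.restarts += 1
        self.healthy_since = None

    def record_check(self, healthy: bool, now: float) -> None:
        """Feed one health check result observed at time `now` (seconds)."""
        if not healthy:
            self.healthy_since = None  # a failure breaks the streak
            return
        if self.healthy_since is None:
            self.healthy_since = now
        if now - self.healthy_since >= self.sustained_health_reset:
            self.restarts = 0  # the adapter has earned a fresh budget

    @property
    def exhausted(self) -> bool:
        """True when no further restarts should be attempted."""
        return self.restarts >= self.max_restarts
```

A failed check in the middle of the window restarts the clock, which is what distinguishes sustained health from a single lucky probe.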
4. Restart cooldown
A configurable delay between __aexit__ and __aenter__ gives the underlying
hardware and OS resources time to release.
Five seconds accommodates typical resource release times for Bluetooth daemons, serial port drivers, and GPIO subsystems used by current adopter projects. The cooldown uses shutdown-aware sleep so it can be interrupted if the application is shutting down.
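A shutdown-aware sleep is commonly built by waiting on a shutdown event with a timeout. The sketch below assumes the shutdown signal is an asyncio.Event; cooldown_sleep is a hypothetical helper name, not a framework function.

```python
import asyncio


async def cooldown_sleep(delay: float, shutdown: asyncio.Event) -> bool:
    """Wait up to `delay` seconds.

    Returns True if the full cooldown elapsed, False if shutdown was
    signalled first (the caller should then abandon the restart).
    """
    try:
        await asyncio.wait_for(shutdown.wait(), timeout=delay)
    except asyncio.TimeoutError:
        return True   # timeout hit: nobody asked us to stop
    return False      # event was set: shutdown is in progress
```

Compared with a bare asyncio.sleep, this returns as soon as shutdown is requested instead of holding the process open for the remainder of the cooldown.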
5. Restart opt-out
By default, all adapters that satisfy both HealthCheckable and the lifecycle
protocol (__aenter__/__aexit__) are eligible for auto-restart. Adapters
that implement health_check() but lack lifecycle methods cannot be restarted
(a WARNING is logged at startup).
Some adapters may not survive an exit + re-enter cycle — for example, hardware that requires a one-time initialisation sequence not repeatable at runtime. These adapters can opt out by setting a class attribute:
from typing import ClassVar
class MyAdapter:
    restartable: ClassVar[bool] = False  # opt out of auto-restart
    async def __aenter__(self): ...
    async def __aexit__(self, *exc): ...
    async def health_check(self) -> bool: ...
The framework checks getattr(adapter, 'restartable', True) — adapters are
restartable by default unless they explicitly set restartable = False. This
follows the principle of safe defaults: most adapters that implement lifecycle
methods can tolerate exit + re-enter.
When an opted-out adapter's health check fails beyond the threshold, the framework
logs a WARNING indicating the adapter is unhealthy but not restartable, and
continues reporting "offline" availability without attempting restart.
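The eligibility rules in this decision amount to three duck-typed checks. The function below is an illustrative sketch: restart_eligibility and its string return values are hypothetical, while the getattr(adapter, 'restartable', True) lookup mirrors the one described above.

```python
def restart_eligibility(adapter: object) -> str:
    """Classify an adapter for auto-restart using duck typing only."""
    has_health = callable(getattr(adapter, "health_check", None))
    has_lifecycle = all(
        callable(getattr(adapter, name, None))
        for name in ("__aenter__", "__aexit__")
    )
    if not has_health:
        return "not-monitored"   # no health_check(): nothing to trigger on
    if not has_lifecycle:
        return "no-lifecycle"    # cannot be restarted; WARNING at startup
    if not getattr(adapter, "restartable", True):
        return "opted-out"       # explicit restartable = False
    return "restartable"         # default for lifecycle-aware adapters
```

Because the lookup defaults to True, an adapter only needs to declare the attribute when it wants to opt out.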
6. In-flight command handling during restart
Commands that arrive while an adapter is restarting are queued and delivered after device tasks are recreated:
- The command router holds incoming commands in a bounded queue during the restart window.
- After device tasks are recreated, queued commands are dispatched in order.
- If the queue overflows (unlikely given the short restart window), the oldest commands are dropped with a WARNING log.
Commands are asynchronous by nature — a few seconds delay during restart is acceptable. Dropping commands silently (Option A) risks lost user actions, and rejecting with an error (Option C) pushes retry logic onto the caller unnecessarily.
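The queueing behaviour can be illustrated with a bounded deque that evicts the oldest entry on overflow. RestartCommandBuffer, its method names, and the default size of 100 are assumptions made for this sketch; the ADR does not state the queue capacity.

```python
import logging
from collections import deque
from typing import Any

logger = logging.getLogger(__name__)


class RestartCommandBuffer:
    """Hypothetical bounded buffer for commands arriving during a restart."""

    def __init__(self, maxsize: int = 100) -> None:
        if maxsize < 1:
            raise ValueError("maxsize must be at least 1")
        self._queue: deque = deque(maxlen=maxsize)
        self.dropped = 0  # number of commands lost to overflow

    def put(self, command: Any) -> None:
        """Buffer a command, dropping the oldest one if the buffer is full."""
        if len(self._queue) == self._queue.maxlen:
            self.dropped += 1
            logger.warning(
                "Command queue full during restart; dropping oldest: %r",
                self._queue[0],
            )
        # deque(maxlen=...) discards from the left when appending to a full deque.
        self._queue.append(command)

    def drain(self) -> list:
        """Return buffered commands in arrival order and empty the buffer."""
        items = list(self._queue)
        self._queue.clear()
        return items
```

After device tasks are recreated, the router would call drain() once and dispatch the returned commands in order.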
Decision Drivers
- Unattended deployments need automatic recovery from transient adapter failures
- Restart must be safe: no stale references, no leaked resources, no partial state
- Framework conventions: duck-typed protocols, AsyncExitStack lifecycle, DI-based wiring
- Simplicity: global thresholds over per-adapter configuration for v1
- Current adopter projects use adapters (BLE, serial, GPIO) that tolerate exit + re-enter
Considered Options
Device task handling during restart
| Option | Behaviour | Pros | Cons | Chosen |
|---|---|---|---|---|
| A: Cancel + recreate | Cancel tasks, restart adapter, recreate tasks | Clean state, no stale refs | Loses @app.device loop state | Yes |
| B: Suspend + resume | Pause flag, device checks it | Preserves loop state | Complex, device must cooperate | No |
| C: Replace adapter ref | Swap DI reference, devices keep running | Minimal disruption | Thread safety, stale handles | No |
Max restarts reset
| Option | Behaviour | Pros | Cons | Chosen |
|---|---|---|---|---|
| A: Never reset | Hard lifetime limit | Simple | Penalises rare failures days apart | No |
| B: Reset after sustained health | 5 min healthy → counter resets | Handles intermittent failures | Slightly more complex | Yes |
Restart opt-in vs. opt-out
| Option | Behaviour | Pros | Cons | Chosen |
|---|---|---|---|---|
| A: Implicit (no control) | All eligible adapters restart | Zero config | No escape hatch for fragile adapters | No |
| B: Default-on with opt-out | restartable = False to disable | Safe default, escape hatch | Class attribute convention | Yes |
| C: Explicit opt-in | Must implement Restartable protocol | Maximum control | Extra ceremony for common case | No |
In-flight commands
| Option | Behaviour | Pros | Cons | Chosen |
|---|---|---|---|---|
| A: Drop | Commands lost during restart | Simple | Silent data loss | No |
| B: Queue | Buffer and deliver after restart | No lost commands | Bounded queue, slight delay | Yes |
| C: Reject | Publish error to error topic | Explicit failure | Pushes retry to caller | No |
Consequences
Positive
- Transient adapter failures are recovered automatically without operator intervention
- Clean restart via cancel + recreate avoids stale state and reference leaks
- Sustained health reset prevents permanent offline state from rare, widely-spaced failures
- Opt-out mechanism accommodates adapters that cannot tolerate re-initialisation
- Command queuing preserves user actions during the brief restart window
- Builds directly on ADR-028 infrastructure (consecutive_failures, adapter→device mapping, HealthCheckable protocol)
Negative
- @app.device handlers lose in-flight state on restart; adopter projects must persist critical state via DeviceStore (mitigated by the fact that adapter failure likely invalidates in-memory hardware state anyway)
- Restart introduces a brief period (~5s cooldown + re-init) where affected devices are unresponsive
- The command queue is bounded and could overflow under extreme command rates during restart (mitigated by dropping the oldest commands first and logging a WARNING)
- restartable = False is a class-level attribute, not configurable at runtime (sufficient for the known use case of hardware that fundamentally cannot re-initialise)