ADR-028: Adapter Health Check Protocol

Status

Accepted

Date: 2026-04-02

Context

cosalette adapters (BLE, serial, GPIO) can enter a wedged state in which all operations fail indefinitely, for example after a BlueZ daemon crash, a hardware reset, or a serial cable disconnect. Today, the framework publishes "online" availability as long as the process is running, regardless of whether the adapter is actually functional. Operators must notice repeated errors in logs and manually restart the container.

A health check protocol lets adapters report their own readiness. The framework periodically probes adapters that implement the protocol, and sets per-device availability to "offline" when the adapter is unhealthy. This is detection and reporting only — automatic restart belongs to Epic 6.

Existing infrastructure

HealthReporter (API reference) manages per-device availability topics ({prefix}/{device}/availability) with retained "online"/"offline" messages.
Adapter lifecycle uses duck-typed __aenter__/__aexit__ (ADR-016).
DI injection plans (build_injection_plan()) record which types each device handler requests — this is introspectable at wiring time.
TelemetryRunner already calls health_reporter.set_device_status("error") on telemetry failures, but this only affects the heartbeat payload, not the availability topic.

Decision

1. HealthCheckable Protocol

Add a @runtime_checkable structural protocol:

from typing import Protocol, runtime_checkable

@runtime_checkable
class HealthCheckable(Protocol):
    async def health_check(self) -> bool: ...

The framework detects adapters implementing this protocol via isinstance() after adapter lifecycle entry, consistent with the duck-typing detection pattern used for __aenter__/__aexit__ (ADR-016).
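As a minimal sketch of how this detection might behave, the snippet below checks two hypothetical adapters (`FakeSerialAdapter` and `PlainAdapter`, neither of which is part of cosalette) against the protocol. Note that `isinstance()` on a `@runtime_checkable` protocol only verifies that a `health_check` attribute exists; it does not validate the signature or that the method is a coroutine.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class HealthCheckable(Protocol):
    async def health_check(self) -> bool: ...


class FakeSerialAdapter:
    """Hypothetical adapter, used only for illustration."""

    def __init__(self) -> None:
        self.port_open = True

    async def health_check(self) -> bool:
        # A real adapter would probe the hardware or daemon here.
        return self.port_open


class PlainAdapter:
    """Hypothetical adapter without a health_check() method."""


adapter = FakeSerialAdapter()

# Structural check: passes because the method is present, no inheritance needed.
detected = isinstance(adapter, HealthCheckable)
# Fails because PlainAdapter has no health_check attribute.
not_detected = isinstance(PlainAdapter(), HealthCheckable)

healthy = asyncio.run(adapter.health_check())
```

Adapters opt in simply by defining the method; nothing needs to be registered or subclassed.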

2. Global health check interval

A single health_check_interval parameter on App.__init__ controls the check frequency for all adapters.

App(
    health_check_interval=30.0,  # seconds, default 30; None disables
)

Per-adapter overrides are deferred — 30 seconds is sufficient for all current adapter types (BLE, serial, GPIO). The interval controls detection latency, not precision; the difference between 30s and 60s is cosmetic for availability reporting.
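The interval semantics could be sketched roughly as follows. This is not the framework's actual scheduler: `run_health_checks`, its callback parameters, and `max_rounds` are hypothetical names introduced here, and `max_rounds` exists only so the demonstration terminates.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Optional


async def run_health_checks(
    check: Callable[[], Awaitable[bool]],
    on_result: Callable[[bool], None],
    interval: Optional[float],
    max_rounds: Optional[int] = None,  # illustration only
) -> int:
    """Probe `check` once per `interval` seconds; return the number of probes."""
    if interval is None:
        return 0  # None disables health checking entirely
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        on_result(await check())
        rounds += 1
        await asyncio.sleep(interval)
    return rounds


async def always_healthy() -> bool:
    return True


results: list = []
probes = asyncio.run(
    run_health_checks(always_healthy, results.append, interval=0.01, max_rounds=3)
)
disabled = asyncio.run(
    run_health_checks(always_healthy, results.append, interval=None)
)
```

With this shape, a wedged adapter is noticed at most about one interval (plus the duration of the probe itself) after it fails.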

3. Failure behaviour: availability only

When health_check() returns False:

Set availability to "offline" for all devices that depend on the adapter.
Log a WARNING with the adapter type and failure count.

When health_check() returns True after a failure:

Restore availability to "online" for affected devices.
Log an INFO recovery message.

Telemetry polling continues during unhealthy periods. This keeps the health check informational — pausing telemetry is a form of "acting on health" that belongs in Epic 6 (auto-restart). The existing error deduplication suppresses consecutive identical telemetry errors, limiting log noise.
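The failure and recovery transitions described in this section can be sketched as a small state tracker. This is an assumption-laden illustration: `AdapterHealthState` is a hypothetical class, and the `availability` dict stands in for publishing retained messages through HealthReporter.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger("health_sketch")


@dataclass
class AdapterHealthState:
    """Hypothetical per-adapter tracker; not part of the cosalette API."""

    adapter_name: str
    devices: list
    consecutive_failures: int = 0
    # Stand-in for the retained availability topics, keyed by device name.
    availability: dict = field(default_factory=dict)

    def record(self, healthy: bool) -> None:
        if not healthy:
            self.consecutive_failures += 1
            logger.warning(
                "%s unhealthy (consecutive failures: %d)",
                self.adapter_name,
                self.consecutive_failures,
            )
            status = "offline"
        else:
            if self.consecutive_failures > 0:
                logger.info(
                    "%s recovered after %d failed checks",
                    self.adapter_name,
                    self.consecutive_failures,
                )
            self.consecutive_failures = 0
            status = "online"
        # Every device that depends on the adapter gets the same status.
        for device in self.devices:
            self.availability[device] = status


state = AdapterHealthState("BleAdapter", ["thermometer", "scale"])
state.record(False)
state.record(False)
after_failures = (state.consecutive_failures, dict(state.availability))
state.record(True)
after_recovery = (state.consecutive_failures, dict(state.availability))
```

Nothing in the tracker touches telemetry scheduling, which matches the decision to keep the check purely informational.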

4. HealthCheckable is independent of adapter lifecycle

An adapter can implement health_check() without implementing __aenter__/__aexit__. The framework calls health_check() on any adapter that satisfies the HealthCheckable protocol, regardless of lifecycle support.

This matters for Epic 6: auto-restart only works for adapters with lifecycle methods. A stateless adapter can report unhealthy but cannot be restarted.
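To illustrate the independence of the two capabilities, the sketch below classifies two hypothetical adapters. The `has_lifecycle` helper uses a plain `hasattr` check as an assumption about how duck-typed lifecycle detection might look; the actual mechanism is defined by ADR-016.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class HealthCheckable(Protocol):
    async def health_check(self) -> bool: ...


def has_lifecycle(adapter: object) -> bool:
    # Assumed duck-typed check: both async context manager methods exist.
    return hasattr(adapter, "__aenter__") and hasattr(adapter, "__aexit__")


class StatelessProbe:
    """Hypothetical adapter: health-checkable, but no lifecycle methods."""

    async def health_check(self) -> bool:
        return True


class ManagedBle:
    """Hypothetical adapter: health-checkable and lifecycle-managed."""

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        return None

    async def health_check(self) -> bool:
        return True


def capabilities(adapter: object) -> tuple:
    """Return (can report health, could be restarted via lifecycle)."""
    return isinstance(adapter, HealthCheckable), has_lifecycle(adapter)


stateless_caps = capabilities(StatelessProbe())
managed_caps = capabilities(ManagedBle())
```

Both adapters would be probed; only the second would be a candidate for the restart logic planned for Epic 6.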

5. Adapter→device mapping via DI introspection

The framework builds the adapter→device mapping automatically by scanning each device handler's injection_plan at wiring time:

# For each device registration:
adapter_deps = {
    t for _, t in reg.injection_plan
    if t not in KNOWN_INJECTABLE_TYPES and t in resolved_adapters
}
# adapter_deps → set of adapter port types this device depends on

Inverting these per-device sets produces a dict[type, list[str]] that maps each adapter port type to the names of its dependent devices. No user configuration is required: the mapping is derived from the same type annotations already used for DI.
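A self-contained sketch of the full scan, including the inversion from per-device dependencies to a per-adapter device list, might look like this. The port types, the `Settings` stand-in, and the `injection_plans` dict are hypothetical; the plan is assumed to be a sequence of (parameter name, type) pairs, as the loop above suggests.

```python
from collections import defaultdict


# Hypothetical adapter port types and a framework-provided injectable.
class BlePort: ...
class SerialPort: ...
class Settings: ...


KNOWN_INJECTABLE_TYPES = {Settings}
resolved_adapters = {BlePort: object(), SerialPort: object()}

# Device name -> injection plan as (parameter name, type) pairs (assumed shape).
injection_plans = {
    "thermometer": [("ble", BlePort), ("settings", Settings)],
    "scale": [("ble", BlePort)],
    "meter": [("serial", SerialPort)],
}

adapter_devices = defaultdict(list)
for device_name, plan in injection_plans.items():
    # Same filter as in the ADR: keep only types that are resolved adapters.
    adapter_deps = {
        t for _, t in plan
        if t not in KNOWN_INJECTABLE_TYPES and t in resolved_adapters
    }
    # Invert: record this device under every adapter it depends on.
    for adapter_type in adapter_deps:
        adapter_devices[adapter_type].append(device_name)
```

When the BLE adapter later reports unhealthy, `adapter_devices[BlePort]` yields exactly the devices whose availability should change.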

6. Startup health check

The framework runs one initial health check for each HealthCheckable adapter before starting device tasks. If the check fails:

Affected devices start with availability "offline" (instead of the default "online").
A WARNING is logged.
Device tasks still start — the failure is non-blocking.

This ensures availability is accurate from the first heartbeat, rather than being optimistically "online" until the first periodic check runs.
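One way the startup pass could compute initial availability is sketched below. The function `initial_availability` and the stub adapters are hypothetical, and the rule that a device starts "offline" if any adapter it depends on fails is an assumption for devices with several adapters.

```python
import asyncio


async def initial_availability(adapters: dict, adapter_devices: dict) -> dict:
    """Probe each adapter once; return device name -> "online" or "offline"."""
    availability = {}
    for adapter_type, adapter in adapters.items():
        healthy = await adapter.health_check()
        for device in adapter_devices.get(adapter_type, []):
            if not healthy:
                # Assumed rule: one failing dependency marks the device offline.
                availability[device] = "offline"
            else:
                # Do not overwrite an "offline" set by another adapter.
                availability.setdefault(device, "online")
    return availability


class StubAdapter:
    """Hypothetical adapter whose health result is fixed at construction."""

    def __init__(self, healthy: bool) -> None:
        self._healthy = healthy

    async def health_check(self) -> bool:
        return self._healthy


class BlePort(StubAdapter): ...
class SerialPort(StubAdapter): ...


adapters = {BlePort: BlePort(False), SerialPort: SerialPort(True)}
adapter_devices = {BlePort: ["thermometer"], SerialPort: ["meter"]}

startup = asyncio.run(initial_availability(adapters, adapter_devices))
```

Device tasks would be started after this pass regardless of the result, since the failure is non-blocking.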

Decision Drivers

Operators need automated detection of wedged adapters without log-spelunking
Health check is informational (detection + reporting) — restart logic is separate
Framework conventions: @runtime_checkable protocols, duck-typed detection, type-based DI
Simplicity for v1 — defer per-adapter intervals and telemetry pausing

Considered Options

Health check interval

Option	API	Pros	Cons	Chosen
A: Global only	`App(health_check_interval=30)`	Simple, one setting	BLE may want different interval	Yes
B: Per-adapter override	Global + per-adapter	Flexible	Extra API surface, configuration complexity	No
C: Protocol method	`health_check_interval() -> float`	Adapter knows best	Framework can't control; hard to override in settings	No

Failure behaviour

Option	Behaviour	Pros	Cons	Chosen
A: Availability only	Set offline, telemetry continues	Simple, informational	Log noise from expected failures	Yes
B: Availability + pause telemetry	Set offline, skip telemetry	Clean logs	More complex, belongs in Epic 6	No

Adapter→device mapping

Option	Mechanism	Pros	Cons	Chosen
A: DI introspection	Scan `injection_plan`	Automatic, zero config	Introspection at wiring time	Yes
B: Explicit mapping	`App(adapter_devices={...})`	Clear	Manual, error-prone, duplicates DI	No

Consequences

Positive

Wedged adapters are detected and reported automatically via availability topics
Operators and home automation systems (Home Assistant) see accurate device availability without manual intervention
Zero-config adapter→device mapping via DI introspection — no user burden
Consistent with existing protocol conventions (@runtime_checkable, duck-typing)
Non-breaking — adapters that don't implement HealthCheckable are unaffected
Consecutive failure count is tracked, providing the foundation for Epic 6 (auto-restart)

Negative

Telemetry continues during unhealthy periods, producing expected errors (mitigated by error deduplication)
Global interval may not suit all adapter types (mitigated by the fact that detection latency ≤30s is acceptable for availability reporting)
DI introspection for adapter→device mapping adds wiring-time complexity (one-time scan, not a runtime cost)
