ADR-011: Error Handling and Publishing¶

Status¶

Accepted Date: 2026-02-14

Context¶

cosalette applications run as unattended daemons. When errors occur (invalid commands, hardware failures, out-of-range values), there is no user present to observe them. Errors must be reported to a remote monitoring system via MQTT so that operators can detect and diagnose problems without SSH-ing into individual devices.

The velux2mqtt reference implementation includes a 251-line ErrorPublisher that converts domain exceptions into structured JSON payloads and publishes them to MQTT error topics. This pattern needs to be generalised: velux2mqtt maps specific domain error classes (InvalidCommandError, PositionOutOfRangeError, etc.) to machine- readable error_type strings — the framework must make this mapping pluggable while providing the publication machinery.

Key design requirements from the reference implementation:

Errors are published as structured JSON (not plain text)
Publication is fire-and-forget — a failed error publication must not crash the daemon
Both global and per-device error topics are used
Wall-clock timestamps (not monotonic) for operator correlation with real time
Errors are logged locally AND published to MQTT (dual observability)

Decision¶

Use structured ErrorPayload → JSON → MQTT with pluggable error type mapping, fire-and-forget publishing, and per-device + global error topics because unattended daemon operation requires observable, machine-parseable error reporting that never crashes the main control loop.

Error payload schema¶

{
  "error_type": "invalid_command",
  "message": "Invalid command: 'hello' (not a recognised command)",
  "device": "blind",
  "timestamp": "2026-02-14T12:34:56+00:00",
  "details": {"payload": "hello"}
}

Topic layout¶

{app}/error              ← all errors (global, always published)
{app}/{device}/error     ← per-device errors (when device name is known)

Publication behaviour¶

Not retained — errors are events, not last-known state
QoS 1 — at-least-once delivery; errors should survive brief network hiccups
Fire-and-forget — publication failures are logged but never propagated
Dual output — errors are both logged locally and published to MQTT

Pluggable error types¶

The framework provides a base ErrorPublisher with build_error_payload(). Projects register their own domain error → error_type string mappings:

_ERROR_TYPE_MAP: dict[type[DomainError], str] = {
    InvalidCommandError: "invalid_command",
    PositionOutOfRangeError: "position_out_of_range",
}

Decision Drivers¶

Unattended daemon operation — no local user to observe errors
Machine-parseable error format for monitoring dashboards
Fire-and-forget — error reporting must never crash the main application
Per-device granularity for targeted alerting
Pluggable error types — each project has its own domain error hierarchy
Wall-clock timestamps for operator correlation with real-world events

Considered Options¶

Option 1: Logging only¶

Report errors through the logging system exclusively (JSON log lines).

Advantages: Simple, no additional infrastructure. Log aggregators can capture errors from the log stream.
Disadvantages: Requires a log aggregation system to be deployed and configured (not yet available). Does not enable MQTT-based monitoring dashboards. Cannot trigger HA automations on errors. Mixes error signals with operational logs.

Option 2: Exception propagation¶

Let exceptions propagate to a global handler that logs and optionally publishes.

Advantages: Standard Python error handling. Less infrastructure code.
Disadvantages: Global handlers lose per-device context. Unhandled exceptions can crash the daemon. Does not support the fire-and-forget requirement.

Option 3: Dead letter queue¶

Publish failed messages to a dead letter topic for later analysis.

Advantages: No message loss, supports replay and forensic analysis.
Disadvantages: Over-engineered for the scope. Requires infrastructure for queue management. The devices are simple IoT bridges — error events are informational, not transactional.

Option 4: Structured ErrorPayload → MQTT (chosen)¶

Convert domain errors to structured JSON payloads and publish to MQTT error topics with fire-and-forget semantics.

Advantages: Machine-parseable errors for monitoring. Fire-and-forget ensures the daemon never crashes due to error reporting. Per-device + global topics enable both targeted and aggregate monitoring. Pluggable error type mapping supports project-specific domain errors. Clock injection enables deterministic test assertions.
Disadvantages: Adds MQTT publishing overhead for every error (mitigated by QoS 1, small payloads). Error schema becomes a contract that must be maintained.

Decision Matrix¶

Criterion	Logging Only	Exception Propagation	Dead Letter Queue	Structured MQTT
Remote observability	2	2	4	5
Resilience (fire-and-forget)	4	1	3	5
Per-device granularity	2	1	3	5
Machine parseability	3	2	4	5
Implementation complexity	5	4	2	3

Scale: 1 (poor) to 5 (excellent)

Consequences¶

Positive¶

Operators can monitor all deployed applications by subscribing to +/error
Machine-parseable JSON enables dashboards, alerting, and Home Assistant automations
Fire-and-forget publication ensures errors never cascade into application crashes
Per-device error topics allow targeted monitoring of specific hardware
Pluggable error type mapping lets each project define its own domain error vocabulary
Dual output (log + MQTT) provides both local and remote observability

Negative¶

Error schema (error_type, message, actuator, timestamp, details) becomes a contract — changes require coordinated updates to monitoring consumers
Fire-and-forget means error publication failures are silently logged — errors about errors could be missed
Per-device + global topic publishing doubles MQTT messages for device-specific errors

2026-02-14