ADR-011: Error Handling and Publishing¶
Status¶
Accepted
Date: 2026-02-14
Context¶
cosalette applications run as unattended daemons. When errors occur (invalid commands, hardware failures, out-of-range values), there is no user present to observe them. Errors must be reported to a remote monitoring system via MQTT so that operators can detect and diagnose problems without SSH-ing into individual devices.
The velux2mqtt reference implementation includes a 251-line ErrorPublisher that
converts domain exceptions into structured JSON payloads and publishes them to MQTT
error topics. This pattern needs to be generalised: velux2mqtt maps specific domain
error classes (InvalidCommandError, PositionOutOfRangeError, etc.) to machine-
readable error_type strings — the framework must make this mapping pluggable while
providing the publication machinery.
Key design requirements from the reference implementation:
- Errors are published as structured JSON (not plain text)
- Publication is fire-and-forget — a failed error publication must not crash the daemon
- Both global and per-device error topics are used
- Wall-clock timestamps (not monotonic) for operator correlation with real time
- Errors are logged locally AND published to MQTT (dual observability)
Decision¶
Use structured ErrorPayload → JSON → MQTT with pluggable error type mapping,
fire-and-forget publishing, and per-device + global error topics because
unattended daemon operation requires observable, machine-parseable error reporting that
never crashes the main control loop.
Error payload schema¶
{
  "error_type": "invalid_command",
  "message": "Invalid command: 'hello' (not a recognised command)",
  "device": "blind",
  "timestamp": "2026-02-14T12:34:56+00:00",
  "details": {"payload": "hello"}
}
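As a minimal sketch of how this schema could be modelled in code, the dataclass below mirrors the five fields and serialises them to JSON. The class layout, the `to_json()` method name, and the use of `None` for an unknown device are assumptions made for illustration; the ADR only fixes the field names shown in the schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class ErrorPayload:
    """Illustrative container mirroring the schema fields above."""

    error_type: str
    message: str
    device: Optional[str]  # assumption: None when the device name is unknown
    timestamp: str  # ISO 8601 wall-clock time including the UTC offset
    details: dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        # asdict() keeps the field declaration order, so the JSON keys
        # appear in the same order as in the schema.
        return json.dumps(asdict(self))


payload = ErrorPayload(
    error_type="invalid_command",
    message="Invalid command: 'hello' (not a recognised command)",
    device="blind",
    timestamp="2026-02-14T12:34:56+00:00",
    details={"payload": "hello"},
)
print(payload.to_json())
```

Freezing the dataclass is a design choice of this sketch: it prevents a payload from being rebound to different field values between being logged and being published.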
Topic layout¶
{app}/error ← all errors (global, always published)
{app}/{device}/error ← per-device errors (when device name is known)
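The layout above could be derived with a small helper like the following. The function name `error_topics` and its signature are hypothetical; only the two topic patterns come from this ADR.

```python
from typing import Optional


def error_topics(app: str, device: Optional[str] = None) -> list[str]:
    """Return the topics an error should be published to."""
    # The global topic is always included.
    topics = [f"{app}/error"]
    # The per-device topic is added only when the device name is known.
    if device:
        topics.append(f"{app}/{device}/error")
    return topics


print(error_topics("velux2mqtt"))
print(error_topics("velux2mqtt", "blind"))
```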
Publication behaviour¶
- Not retained — errors are events, not last-known state
- QoS 1 — at-least-once delivery; errors should survive brief network hiccups
- Fire-and-forget — publication failures are logged but never propagated
- Dual output — errors are both logged locally and published to MQTT
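The four behaviours above might combine roughly as follows. This is a sketch under stated assumptions: `MqttPort` is a hypothetical minimal publish interface (not the API of any real MQTT library), `publish_error` is an illustrative name, and `FlakyClient` is a test double that exists only to show that a failed publication does not propagate.

```python
import asyncio
import logging
from typing import Protocol

logger = logging.getLogger(__name__)


class MqttPort(Protocol):
    """Hypothetical publish interface; an assumption of this sketch."""

    async def publish(
        self, topic: str, payload: str, *, qos: int, retain: bool
    ) -> None: ...


async def publish_error(mqtt: MqttPort, topics: list[str], payload_json: str) -> None:
    # Dual output: log locally first, so the error is recorded even if
    # the broker is unreachable.
    logger.error("Error event: %s", payload_json)
    for topic in topics:
        try:
            # QoS 1 and not retained, as described above.
            await mqtt.publish(topic, payload_json, qos=1, retain=False)
        except Exception:
            # Fire-and-forget: record the failure but never re-raise.
            # asyncio.CancelledError derives from BaseException, so task
            # cancellation is not swallowed here.
            logger.exception("Failed to publish error to %s", topic)


class FlakyClient:
    """Test double whose first publish call fails."""

    def __init__(self) -> None:
        self.sent: list[tuple[str, int, bool]] = []
        self._calls = 0

    async def publish(self, topic: str, payload: str, *, qos: int, retain: bool) -> None:
        self._calls += 1
        if self._calls == 1:
            raise ConnectionError("broker unreachable")
        self.sent.append((topic, qos, retain))


client = FlakyClient()
asyncio.run(publish_error(client, ["app/error", "app/blind/error"], "{}"))
```

Catching the failure per topic, rather than around the whole loop, means a failed global publication does not prevent the per-device publication from being attempted.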
Pluggable error types¶
The framework provides a base ErrorPublisher with build_error_payload().
Projects register their own domain error → error_type string mappings:
_ERROR_TYPE_MAP: dict[type[DomainError], str] = {
    InvalidCommandError: "invalid_command",
    PositionOutOfRangeError: "position_out_of_range",
}
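One way such a mapping could feed payload construction is sketched below. The exception classes are stand-ins for a project's own hierarchy, and the keyword parameters of `build_error_payload()`, the `"error"` fallback type, and the `utc_now` helper are all assumptions of this example rather than the framework's confirmed interface. The injectable `clock` shows how wall-clock timestamps can remain deterministic in tests.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Optional


class DomainError(Exception):
    """Stand-in for a project's domain error base class."""


class InvalidCommandError(DomainError):
    pass


class PositionOutOfRangeError(DomainError):
    pass


ERROR_TYPE_MAP: dict[type[Exception], str] = {
    InvalidCommandError: "invalid_command",
    PositionOutOfRangeError: "position_out_of_range",
}


def utc_now() -> datetime:
    # Wall-clock time (not monotonic), timezone-aware.
    return datetime.now(timezone.utc)


def build_error_payload(
    error: Exception,
    *,
    error_type_map: dict[type[Exception], str],
    device: Optional[str] = None,
    details: Optional[dict[str, Any]] = None,
    clock: Callable[[], datetime] = utc_now,
) -> dict[str, Any]:
    # Walk the MRO so subclasses of a mapped error inherit its error_type.
    error_type = "error"  # assumed fallback for unmapped exceptions
    for cls in type(error).__mro__:
        if cls in error_type_map:
            error_type = error_type_map[cls]
            break
    return {
        "error_type": error_type,
        "message": str(error),
        "device": device,
        "timestamp": clock().isoformat(),
        "details": details or {},
    }


def fixed_clock() -> datetime:
    return datetime(2026, 2, 14, 12, 34, 56, tzinfo=timezone.utc)


result = build_error_payload(
    InvalidCommandError("Invalid command: 'hello'"),
    error_type_map=ERROR_TYPE_MAP,
    device="blind",
    details={"payload": "hello"},
    clock=fixed_clock,
)
```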
Decision Drivers¶
- Unattended daemon operation — no local user to observe errors
- Machine-parseable error format for monitoring dashboards
- Fire-and-forget — error reporting must never crash the main application
- Per-device granularity for targeted alerting
- Pluggable error types — each project has its own domain error hierarchy
- Wall-clock timestamps for operator correlation with real-world events
Considered Options¶
Option 1: Logging only¶
Report errors through the logging system exclusively (JSON log lines).
- Advantages: Simple, no additional infrastructure. Log aggregators can capture errors from the log stream.
- Disadvantages: Requires a log aggregation system to be deployed and configured (not yet available). Does not enable MQTT-based monitoring dashboards. Cannot trigger Home Assistant automations on errors. Mixes error signals with operational logs.
Option 2: Exception propagation¶
Let exceptions propagate to a global handler that logs and optionally publishes.
- Advantages: Standard Python error handling. Less infrastructure code.
- Disadvantages: Global handlers lose per-device context. Unhandled exceptions can crash the daemon. Does not support the fire-and-forget requirement.
Option 3: Dead letter queue¶
Publish failed messages to a dead letter topic for later analysis.
- Advantages: No message loss, supports replay and forensic analysis.
- Disadvantages: Over-engineered for the scope. Requires infrastructure for queue management. The devices are simple IoT bridges — error events are informational, not transactional.
Option 4: Structured ErrorPayload → MQTT (chosen)¶
Convert domain errors to structured JSON payloads and publish to MQTT error topics with fire-and-forget semantics.
- Advantages: Machine-parseable errors for monitoring. Fire-and-forget ensures the daemon never crashes due to error reporting. Per-device + global topics enable both targeted and aggregate monitoring. Pluggable error type mapping supports project-specific domain errors. Clock injection enables deterministic test assertions.
- Disadvantages: Adds MQTT publishing overhead for every error (mitigated by small payloads; QoS 1 adds only a single acknowledgement per message). Error schema becomes a contract that must be maintained.
Decision Matrix¶
| Criterion | Logging Only | Exception Propagation | Dead Letter Queue | Structured MQTT |
|---|---|---|---|---|
| Remote observability | 2 | 2 | 4 | 5 |
| Resilience (fire-and-forget) | 4 | 1 | 3 | 5 |
| Per-device granularity | 2 | 1 | 3 | 5 |
| Machine parseability | 3 | 2 | 4 | 5 |
| Implementation simplicity | 5 | 4 | 2 | 3 |
Scale: 1 (poor) to 5 (excellent)
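For a quick comparison, the column totals can be computed as below. Summing the rows with equal weight is an assumption of this sketch; the ADR does not state how the criteria are weighted or aggregated.

```python
# Columns are the four options; rows follow the table order above.
options = [
    "Logging Only",
    "Exception Propagation",
    "Dead Letter Queue",
    "Structured MQTT",
]
scores = [
    [2, 2, 4, 5],
    [4, 1, 3, 5],
    [2, 1, 3, 5],
    [3, 2, 4, 5],
    [5, 4, 2, 3],
]

# Unweighted sum per option (equal weighting is an assumption).
totals = {name: sum(row[i] for row in scores) for i, name in enumerate(options)}
print(totals)
```

Under equal weighting, Structured MQTT totals 23, Logging Only and Dead Letter Queue 16 each, and Exception Propagation 10.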
Consequences¶
Positive¶
- Operators can monitor all deployed applications by subscribing to +/error
- Machine-parseable JSON enables dashboards, alerting, and Home Assistant automations
- Fire-and-forget publication ensures errors never cascade into application crashes
- Per-device error topics allow targeted monitoring of specific hardware
- Pluggable error type mapping lets each project define its own domain error vocabulary
- Dual output (log + MQTT) provides both local and remote observability
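On the consumer side, a monitoring tool could tell global and per-device error topics apart with a helper like this one. The function name `parse_error_topic` is hypothetical, and the sketch assumes app and device names contain no `/`. Note that in MQTT the `+` wildcard matches exactly one topic level, so the `+/error` filter covers the global topics, while per-device topics would need a filter such as `+/+/error`.

```python
from typing import Optional


def parse_error_topic(topic: str) -> Optional[tuple[str, Optional[str]]]:
    """Return (app, device) for an error topic, or None if it is not one.

    device is None for the global {app}/error topic.
    """
    parts = topic.split("/")
    if len(parts) == 2 and parts[1] == "error":
        return parts[0], None
    if len(parts) == 3 and parts[2] == "error":
        return parts[0], parts[1]
    return None


print(parse_error_topic("velux2mqtt/error"))
print(parse_error_topic("velux2mqtt/blind/error"))
```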
Negative¶
- Error schema (error_type, message, device, timestamp, details) becomes a contract; changes require coordinated updates to monitoring consumers
- Fire-and-forget means a failed error publication is only logged locally, so errors about errors could be missed
- Per-device + global topic publishing doubles MQTT messages for device-specific errors