Observability Model
Purpose
This document defines initial observability requirements.
Observability must support human operators and future AI-assisted support workflows.
Services must use structured logs.
Logs should include:
- Timestamp.
- Log level.
- Service name.
- Message.
- Correlation ID where available.
- Tenant ID where safe and relevant.
- Organization ID where safe and relevant.
- Device ID where safe and relevant.
- Error code where relevant.
Logs must not include secrets or sensitive raw credentials.
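One way to enforce the no-secrets rule is to redact known sensitive keys before a record is serialized. The sketch below is illustrative only: the deny-list contents and the `redact` helper are assumptions, not part of the platform code.

```typescript
// Hypothetical deny-list; a real service would define its own set of sensitive keys.
const SENSITIVE_KEYS = new Set(["password", "token", "secret", "authorization", "apikey"]);

type LogFields = Record<string, unknown>;

// True only for plain object literals, so Dates, arrays, and class instances pass through untouched.
function isPlainObject(value: unknown): value is LogFields {
  return typeof value === "object" && value !== null && Object.getPrototypeOf(value) === Object.prototype;
}

// Returns a copy of `fields` with sensitive values replaced, recursing into nested plain objects.
function redact(fields: LogFields): LogFields {
  const out: LogFields = {};
  for (const [key, value] of Object.entries(fields)) {
    if (SENSITIVE_KEYS.has(key.toLowerCase())) {
      out[key] = "[REDACTED]";
    } else if (isPlainObject(value)) {
      out[key] = redact(value);
    } else {
      out[key] = value;
    }
  }
  return out;
}

const safeFields = redact({
  message: "login failed",
  password: "hunter2",
  request: { Authorization: "Bearer abc" },
});
console.log(JSON.stringify(safeFields));
```

A key-based deny-list only catches secrets stored under recognizable names; it does not detect credentials embedded in free-text messages, so it complements rather than replaces careful logging at the call site.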
The Phase 4 service baseline uses JSON logs for backend deployables. Every log record includes:
- service name,
- runtime environment,
- log level,
- timestamp,
- message.
Logs include correlationId when request, event, or worker context provides
one. The x-correlation-id HTTP header is the initial transport header for
operation endpoints and future HTTP APIs.
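The baseline record described above could be produced by a small formatter like the following. This is a minimal sketch under stated assumptions: the JSON key names other than `correlationId`, the `formatLogRecord` helper, and the `local` environment value are hypothetical and may differ from the actual service code.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

interface LogContext {
  correlationId?: string;
}

// Builds one JSON log line containing the baseline fields.
// Key names other than correlationId are assumptions for illustration.
function formatLogRecord(
  serviceName: string,
  environment: string,
  level: LogLevel,
  message: string,
  context: LogContext = {},
  now: Date = new Date(),
): string {
  const record: Record<string, unknown> = {
    timestamp: now.toISOString(),
    level,
    serviceName,
    environment,
    message,
  };
  // Attach correlationId only when the request, event, or worker context provides one.
  if (context.correlationId !== undefined) {
    record.correlationId = context.correlationId;
  }
  return JSON.stringify(record);
}

console.log(formatLogRecord("platform-api", "local", "info", "service started"));
console.log(formatLogRecord("platform-api", "local", "info", "request handled", { correlationId: "req-123" }));
```

Omitting the `correlationId` key entirely when no ID exists, rather than writing an empty string, keeps "no correlation context" distinguishable from a malformed ID when logs are queried later.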
Metrics
Initial metrics should cover:
- HTTP request count.
- HTTP error count.
- HTTP latency.
- MQTT connection state.
- MQTT messages received.
- MQTT messages rejected.
- Internal stream backlog.
- Telemetry messages processed.
- Decoder failures.
- Unknown device messages.
- Database write failures.
- Export jobs created.
- Export jobs completed.
- Export job failures.
Backend services expose Prometheus-compatible metrics at:
    GET /metrics
The Phase 4 baseline exposes:
- sens_service_info{service_name,environment} 1
- sens_service_ready{service_name,environment}
- sens_http_requests_total{service_name,environment,method,route,status_code}
- sens_http_request_duration_seconds
- Node.js process metrics from prom-client
Metrics must not use tenant IDs, organization IDs, device IDs, user IDs, tokens, raw payload values, or other high-cardinality or sensitive labels.
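The label rule can be checked mechanically, for example in a unit test that runs over every metric definition. The following is a sketch only: the `findForbiddenLabels` helper and the snake_case spellings in the deny-list are assumptions, not names taken from the codebase.

```typescript
// Label names ruled out by the requirement above; the exact spellings are assumptions.
const FORBIDDEN_LABELS = new Set([
  "tenant_id",
  "organization_id",
  "device_id",
  "user_id",
  "token",
  "payload",
]);

// Returns the forbidden names found in a metric's label set (empty when the set is acceptable).
function findForbiddenLabels(labelNames: readonly string[]): string[] {
  return labelNames.filter((name) => FORBIDDEN_LABELS.has(name.toLowerCase()));
}

// Label set of sens_http_requests_total from the Phase 4 baseline.
const httpRequestLabels = ["service_name", "environment", "method", "route", "status_code"];

console.log(findForbiddenLabels(httpRequestLabels)); // no forbidden labels
console.log(findForbiddenLabels([...httpRequestLabels, "tenant_id"])); // reports tenant_id
```

A name-based check cannot detect a permitted label that is filled with high-cardinality values, such as a `route` label carrying raw URLs with embedded IDs, so label values still need review.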
Local Observability Verification
The repository includes a local dev cockpit for Phase 4 observability checks. It can start the backend service skeletons, capture their stdout and stderr, inspect structured JSON logs, call health and readiness endpoints, validate baseline Prometheus metrics, and verify correlation ID behavior.
The dev cockpit is a local development tool only. It is not a log aggregation stack, not a production dashboard, not a Kubernetes component, and not part of the public Platform API.
Health and Readiness
Every service must expose health and readiness signals.
Health answers whether the process is alive.
Readiness answers whether the service can serve traffic or process work.
Readiness should consider critical dependencies such as database or broker where appropriate.
Backend deployables expose these operation endpoints:
    GET /healthz
    GET /readyz
    GET /metrics
GET /healthz returns HTTP 200 when the process is alive. It does not check
external dependencies.
GET /readyz returns HTTP 200 when all readiness checks pass and HTTP 503 when
one or more checks fail. Phase 4 only includes a config readiness check,
because database, NATS, MQTT, decoder, and storage integrations are not
implemented yet.
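The readiness rule (200 when every check passes, 503 otherwise) can be expressed as a small aggregation function. This is a minimal sketch, not the service's actual handler: the `ReadinessCheck` shape, the response body layout, and the example config check are all hypothetical.

```typescript
interface ReadinessCheck {
  name: string;
  run: () => boolean;
}

interface ReadinessResult {
  statusCode: 200 | 503;
  checks: Record<string, "ok" | "failed">;
}

// Runs every check and maps the combined outcome to an HTTP status code.
function evaluateReadiness(checks: readonly ReadinessCheck[]): ReadinessResult {
  const results: Record<string, "ok" | "failed"> = {};
  let allPassed = true;
  for (const check of checks) {
    let passed = false;
    try {
      passed = check.run();
    } catch {
      passed = false; // a check that throws counts as failed
    }
    results[check.name] = passed ? "ok" : "failed";
    if (!passed) {
      allPassed = false;
    }
  }
  return { statusCode: allPassed ? 200 : 503, checks: results };
}

// Hypothetical config check: ready only when a required setting is non-empty.
const loadedConfig = { serviceName: "platform-api" };
const configReadiness: ReadinessCheck = {
  name: "config",
  run: () => loadedConfig.serviceName.length > 0,
};

console.log(evaluateReadiness([configReadiness]));
```

Keeping checks in a list means later database, NATS, or MQTT checks could be added without changing the aggregation logic. Real dependency checks would likely be asynchronous; this sketch keeps them synchronous for brevity.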
Responses include x-correlation-id. If the caller supplies a valid
x-correlation-id, the service reuses it; otherwise the service generates a new
correlation ID.
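The reuse-or-generate behavior might look like the sketch below. The text above does not define what makes a correlation ID "valid", so the pattern used here (1 to 128 characters from a conservative set) is an assumption, as is the `resolveCorrelationId` name. `randomUUID` is the standard Node.js `node:crypto` function.

```typescript
import { randomUUID } from "node:crypto";

// Assumed validity rule: 1 to 128 characters from a conservative, log-safe set.
const CORRELATION_ID_PATTERN = /^[A-Za-z0-9._-]{1,128}$/;

// Reuses a valid caller-supplied ID; otherwise generates a fresh one.
function resolveCorrelationId(headerValue: string | string[] | undefined): string {
  // Node's HTTP layer may present repeated headers as an array; use the first value.
  const candidate = Array.isArray(headerValue) ? headerValue[0] : headerValue;
  if (candidate !== undefined && CORRELATION_ID_PATTERN.test(candidate)) {
    return candidate;
  }
  return randomUUID();
}

console.log(resolveCorrelationId("req-123")); // reused as-is
console.log(resolveCorrelationId("not valid\n")); // replaced with a generated UUID
console.log(resolveCorrelationId(undefined)); // generated UUID
```

Rejecting values with whitespace or control characters keeps caller-supplied input from injecting extra lines or fields into logs and response headers.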
The platform-api also exposes:
    GET /test
GET /test returns { "success": true } and emits a structured JSON log entry
with the message "test endpoint called". This endpoint exists only as an early
deployment smoke test and must not be used as a product API contract.
Tracing
Correlation IDs should flow through HTTP requests, internal events, and worker logs.
Full distributed tracing may be introduced later.
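One possible way to carry a correlation ID from a request into internal events and worker logs, without passing it through every function signature, is Node's `AsyncLocalStorage`. The sketch below is an assumption about how this could be done, not a description of the current services; the `InternalEvent` envelope, the helper names, and the `telemetry.received` event type are hypothetical.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

interface RequestContext {
  correlationId: string;
}

const contextStore = new AsyncLocalStorage<RequestContext>();

// Runs `fn` with the given correlation ID visible to everything it calls.
function withCorrelationId<T>(correlationId: string, fn: () => T): T {
  return contextStore.run({ correlationId }, fn);
}

// Reads the current correlation ID, or undefined outside any context.
function currentCorrelationId(): string | undefined {
  return contextStore.getStore()?.correlationId;
}

// Hypothetical event envelope that carries the ID across a process boundary.
interface InternalEvent<P> {
  type: string;
  correlationId: string | undefined;
  payload: P;
}

// Producer side: stamp the event with the ID of the current context.
function buildEvent<P>(type: string, payload: P): InternalEvent<P> {
  return { type, correlationId: currentCorrelationId(), payload };
}

// Worker side: restore the context from the event before handling it.
function handleEvent<P>(event: InternalEvent<P>, handler: (payload: P) => void): void {
  if (event.correlationId !== undefined) {
    withCorrelationId(event.correlationId, () => handler(event.payload));
  } else {
    handler(event.payload);
  }
}

const telemetryEvent = withCorrelationId("req-123", () =>
  buildEvent("telemetry.received", { value: 1 }),
);
handleEvent(telemetryEvent, () => console.log(currentCorrelationId()));
```

With this arrangement a logger can call `currentCorrelationId()` when it writes a record, so worker logs pick up the same ID as the originating HTTP request.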
Alerting
Initial alerting should consider:
- Service down.
- Database unavailable.
- Broker unavailable.
- MQTT disconnected.
- Queue backlog growing.
- Decoder errors above threshold.
- Unknown devices above threshold.
- Export job failures.
- Disk/storage pressure.
Open Decisions
- Metrics stack.
- Log aggregation stack.
- Dashboard tooling.
- Alert manager.