Monitoring & Telemetry Workflow¶

This page covers how device metrics flow from the edge to TimescaleDB (PostgreSQL 17 + TimescaleDB extension) and how to use the Grafana dashboards.

Architecture¶

graph LR
    subgraph edge[Edge Device]
        MQC[MQTT client]
    end

    subgraph tenant[Tenant-Stack]
        TB[ThingsBoard]
        TSDB_T[TimescaleDB]
        GRF_T[Grafana]
        TSDB_T --> GRF_T
    end

    subgraph provider[Provider-Stack]
        RMQ[RabbitMQ]
        TLG[Telegraf]
        TSDB_P[TimescaleDB]
        GRF_P["Grafana (platform)"]
        TSDB_P --> GRF_P
    end

    MQC -->|MQTTS| TB
    TB -->|Rule Engine| ALM["Alarms / Notifications"]
    TB -->|"AMQP (optional)"| RMQ
    RMQ -.->|"MQTT consumer (mTLS)\ncdm/provider/#"| TLG
    TLG -->|SQL| TSDB_P
    TLG -.->|"HTTP health checks"| TB

Principle: The Provider-Stack Telegraf instance collects platform infrastructure health data from three sources. It polls the HTTP health endpoints of Keycloak, Grafana, the IoT Bridge API, RabbitMQ, and step-ca; it collects RabbitMQ management metrics; and it subscribes via mqtt_consumer (mTLS) to the cdm/provider/# topic on RabbitMQ for system-monitor metrics. All of this data is written to the Provider TimescaleDB (cdm database) and visualized in the provider Grafana instance. Device telemetry (CPU, memory, sensors) does not take this path; it goes through ThingsBoard in the Tenant-Stack.

Telegraf Configuration¶

The Telegraf config is at provider-stack/monitoring/telegraf/telegraf.conf. Key sections:

Global tags¶

[global_tags]
  component = "provider"

All measurements are tagged with component = "provider". There is no per-device tag; Provider Telegraf monitors platform infrastructure, not individual devices.

Inputs¶

Plugin	Metrics	Interval
`http_response`	HTTP status code (2xx expected) and `response_time` for Keycloak, Grafana, IoT Bridge API, RabbitMQ, step-ca	60s
`rabbitmq`	Queue depths, message rates, connection counts (Basic Auth)	60s
`mqtt_consumer` (mTLS)	Platform system metrics published on `cdm/provider/#` via RabbitMQ	on message

The mqtt_consumer input connects to tls://rabbitmq:8883 using a client certificate (CN=telegraf) issued by the Provider step-ca. RabbitMQ maps the certificate CN to the telegraf user via EXTERNAL SASL, so no password is required.

Output — Provider TimescaleDB¶

[[outputs.postgresql]]
  connection = "postgresql://telegraf:${TSDB_TELEGRAF_PASSWORD}@${TSDB_HOST}:${TSDB_PORT}/${TSDB_DATABASE}?sslmode=disable"
  schema = "public"
  tags_as_foreign_keys = false

Telegraf automatically creates one hypertable per measurement in the cdm database.
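
To see which measurements actually exist as hypertables, TimescaleDB's information views can be queried. This is a minimal sketch, assuming TimescaleDB 2.x and a session connected to the cdm database; it only reads metadata and changes nothing.

```sql
-- List the hypertables TimescaleDB knows about in the current database.
SELECT hypertable_schema,
       hypertable_name,
       num_chunks
FROM timescaledb_information.hypertables
ORDER BY hypertable_schema, hypertable_name;
```

A table that exists in the public schema but is missing from this result is a plain PostgreSQL table rather than a hypertable.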

TimescaleDB Tables¶

Tenant TimescaleDB (${TENANT_ID} database) — device-facing:

Table (hypertable)	Retention	Content
`device_telemetry`	30 days	All Telegraf metrics from devices
`device_events`	90 days	Device state changes, OTA events
`device_audit`	90 days	Enrollment, certificate events
`<measurement>`	configurable	Auto-created by Telegraf per measurement

Provider TimescaleDB (cdm database) — platform-facing:

Table (hypertable)	Retention	Content
`http_response`	30 days	HTTP health-check results (response code + time) per provider service
`rabbitmq`	30 days	RabbitMQ queue depths, message rates, connection counts
`provider_system`	90 days	System-monitor metrics from `cdm/provider/#` MQTT topic
`<measurement>`	configurable	Auto-created by Telegraf per MQTT measurement name

Tables and hypertables are created automatically by monitoring/timescaledb/init-scripts/01-init-schema.sh. Telegraf creates additional tables on first write.
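
Retention periods like the ones listed above are typically enforced with TimescaleDB retention policies. The following sketch shows how the 90-day retention for provider_system could be expressed and inspected, assuming TimescaleDB 2.x. Whether 01-init-schema.sh uses exactly these calls is an assumption, so treat this as an illustration rather than a copy of the script.

```sql
-- Drop chunks of provider_system older than 90 days.
-- if_not_exists avoids an error when a policy is already registered.
SELECT add_retention_policy('provider_system', INTERVAL '90 days', if_not_exists => TRUE);

-- Inspect the retention jobs that are currently scheduled.
SELECT hypertable_name,
       schedule_interval,
       config
FROM timescaledb_information.jobs
WHERE proc_name = 'policy_retention'
ORDER BY hypertable_name;
```

The config column of each job contains the drop_after interval, which can be compared against the retention values in the tables.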

Grafana Dashboards¶

Grafana is pre-provisioned with the System Monitor dashboard.

System Monitor¶

The System Monitor dashboard (provider-stack/monitoring/grafana/dashboards/system-monitor.json) shows provider infrastructure health:

CPU load averages (1m / 5m / 15m) — timeseries + gauge
RAM used % — timeseries + gauge + current value

All queries read from the provider_system hypertable. The CPU load panel, for example, uses:

SELECT time AS "time",
       cpu_load_1m AS "1m",
       cpu_load_5m AS "5m",
       cpu_load_15m AS "15m"
FROM provider_system
WHERE $__timeFilter(time)
ORDER BY time;
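
Over long time ranges the raw query can return more points than a panel needs. One possible variant, shown here as a sketch, downsamples with TimescaleDB's time_bucket function. The 5-minute bucket width is an arbitrary value chosen for illustration and is not taken from the dashboard JSON.

```sql
-- Average the load values into 5-minute buckets to reduce the number of points.
SELECT time_bucket('5 minutes', time) AS "time",
       avg(cpu_load_1m)  AS "1m",
       avg(cpu_load_5m)  AS "5m",
       avg(cpu_load_15m) AS "15m"
FROM provider_system
WHERE $__timeFilter(time)
GROUP BY 1
ORDER BY 1;
```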

The datasource type is grafana-postgresql-datasource (not the legacy postgres type) with timescaledb: true set in jsonData.

Alerting¶

ThingsBoard handles device-level alerting (offline devices, alarms, OTA status changes).

Grafana can alert on Provider infrastructure health using data in TimescaleDB:

Service endpoint down (no successful http_response for 5 min) → Grafana Alert
RabbitMQ queue depth exceeding threshold → warning alert
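
The first rule above could be backed by a query like the following sketch. It assumes the http_response table has a server column (the plugin's server tag) and an integer http_response_code column (the plugin's field of that name), which is what Telegraf's http_response input normally emits when tags_as_foreign_keys is false. Verify the column names against the actual table before using it.

```sql
-- Count successful (2xx) health checks per endpoint over the last 5 minutes.
-- An alert condition of "successes_5m = 0" marks an endpoint as down.
SELECT server,
       count(*) FILTER (WHERE http_response_code BETWEEN 200 AND 299) AS successes_5m
FROM http_response
WHERE time > now() - INTERVAL '5 minutes'
GROUP BY server;
```

An endpoint with no rows at all in the window (for example because Telegraf itself is down) does not appear in the result, so the alert rule's "no data" handling should also be set to alerting.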

Troubleshooting¶

Symptom	Cause	Fix
No metrics in TimescaleDB	Telegraf can't reach TimescaleDB	Check `TSDB_HOST`, `TSDB_PORT`, `TSDB_DATABASE`, and `TSDB_TELEGRAF_PASSWORD` in `provider-stack/.env`
Grafana shows "No data"	Wrong table name or time range	Verify table is `http_response`, `rabbitmq`, or `provider_system`; widen time range
ThingsBoard telemetry stops	MQTT client disconnected	Check `mqtt-client` logs; verify cert validity
Telegraf MQTT consumer errors	mTLS cert not issued	`docker compose restart mqtt-certs-init telegraf`
Telegraf: `password authentication failed`	Wrong `TSDB_TELEGRAF_PASSWORD`	Check `.env` and run `docker compose restart telegraf`
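
When it is unclear whether the problem lies in Telegraf or in Grafana, checking the newest row in each provider table can narrow it down. This is a small sketch, assuming a session connected to the cdm database and the default Telegraf timestamp column named time.

```sql
-- Show the timestamp of the most recent row in each provider table.
SELECT 'http_response' AS table_name, max(time) AS last_row FROM http_response
UNION ALL
SELECT 'rabbitmq', max(time) FROM rabbitmq
UNION ALL
SELECT 'provider_system', max(time) FROM provider_system;
```

If last_row is recent, data is arriving and the issue is more likely on the Grafana side (datasource, table name, or time range). If it is stale or NULL, look at the Telegraf logs first.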