Monitoring & Telemetry Workflow
This page covers how device metrics flow from the edge to TimescaleDB (PostgreSQL 17 + TimescaleDB extension) and how to use the Grafana dashboards.
Architecture
graph LR
subgraph edge[Edge Device]
MQC[MQTT client]
end
subgraph tenant[Tenant-Stack]
TB[ThingsBoard]
TSDB_T[TimescaleDB]
GRF_T[Grafana]
TSDB_T --> GRF_T
end
subgraph provider[Provider-Stack]
RMQ[RabbitMQ]
TLG[Telegraf]
TSDB_P[TimescaleDB]
GRF_P["Grafana (platform)"]
TSDB_P --> GRF_P
end
MQC -->|MQTTS| TB
TB -->|Rule Engine| ALM["Alarms / Notifications"]
TB -->|"AMQP (optional)"| RMQ
RMQ -.->|"MQTT consumer (mTLS)\ncdm/provider/#"| TLG
TLG -->|SQL| TSDB_P
TLG -.->|"HTTP health checks"| TB
Principle: The Provider-Stack Telegraf instance collects platform infrastructure
health data — it polls HTTP health endpoints for Keycloak, Grafana, IoT Bridge API,
RabbitMQ, and step-ca, collects RabbitMQ management metrics, and subscribes via
mqtt_consumer (mTLS) to the cdm/provider/# topic on RabbitMQ for system-monitor
metrics. All data is written to the Provider TimescaleDB (cdm database) and
visualized in the provider Grafana instance. Device telemetry (CPU, memory, sensors)
goes through ThingsBoard in the Tenant-Stack.
Telegraf Configuration
The Telegraf config is at provider-stack/monitoring/telegraf/telegraf.conf. Key sections:
Global tags
All measurements are tagged with component = "provider". There is no per-device tag;
Provider Telegraf monitors platform infrastructure, not individual devices.
Inputs
| Plugin | Metrics | Interval |
|---|---|---|
| http_response | HTTP 2xx / response_time for Keycloak, Grafana, IoT Bridge API, RabbitMQ, step-ca | 60s |
| rabbitmq | Queue depths, message rates, connection counts (Basic Auth) | 60s |
| mqtt_consumer (mTLS) | Platform system metrics published on cdm/provider/# via RabbitMQ | on message |
The mqtt_consumer connects to tls://rabbitmq:8883 using a client certificate
(CN=telegraf) issued by Provider step-ca. RabbitMQ maps the CN to the telegraf
user via EXTERNAL SASL — no password required.
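To see which topics the cdm/provider/# subscription would pick up, the standard MQTT wildcard rules can be sketched in a few lines of Python. This is an illustration of the matching rules only, not code from the stack; the example topic names (host1/cpu) are hypothetical.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if `topic` matches the MQTT `topic_filter`.

    '+' matches exactly one level; '#' matches the parent level and
    everything below it, and is only valid as the last filter level.
    """
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")

    for i, level in enumerate(filter_levels):
        if level == "#":
            # '#' swallows the rest of the topic, but must end the filter.
            return i == len(filter_levels) - 1
        if i >= len(topic_levels):
            # The topic is shorter than the filter.
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    # Without a trailing '#', both must have the same number of levels.
    return len(filter_levels) == len(topic_levels)


# Hypothetical topic names, for illustration only.
print(topic_matches("cdm/provider/#", "cdm/provider/host1/cpu"))
print(topic_matches("cdm/provider/#", "cdm/tenant/host1/cpu"))
```

Under these rules, anything published below cdm/provider/ reaches Telegraf, while topics under other prefixes do not.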
Output: Provider TimescaleDB
[[outputs.postgresql]]
connection = "postgresql://telegraf:${TSDB_TELEGRAF_PASSWORD}@${TSDB_HOST}:${TSDB_PORT}/${TSDB_DATABASE}?sslmode=disable"
schema = "public"
tags_as_foreign_keys = false
Telegraf automatically creates one hypertable per measurement in the cdm database.
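When debugging the connection, it can help to see what the expanded URL looks like. The sketch below rebuilds it in Python from the same four variables. It assumes the variables are substituted into the URL verbatim, so a password containing characters such as @ or / would need to be percent-encoded; the host name and port in the example are hypothetical values.

```python
from urllib.parse import quote


def build_connection_url(env: dict) -> str:
    """Rebuild the outputs.postgresql connection URL from env values.

    The password is percent-encoded so that characters like '@' or '/'
    cannot be mistaken for URL delimiters.
    """
    password = quote(env["TSDB_TELEGRAF_PASSWORD"], safe="")
    return (
        f"postgresql://telegraf:{password}"
        f"@{env['TSDB_HOST']}:{env['TSDB_PORT']}/{env['TSDB_DATABASE']}"
        "?sslmode=disable"
    )


# Hypothetical values; only the database name "cdm" comes from this page.
example_env = {
    "TSDB_TELEGRAF_PASSWORD": "p@ss/word",
    "TSDB_HOST": "timescaledb",
    "TSDB_PORT": "5432",
    "TSDB_DATABASE": "cdm",
}
print(build_connection_url(example_env))
```

If the printed URL differs from what the raw password would produce, the value stored in .env probably needs the encoded form.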
TimescaleDB Tables
Tenant TimescaleDB (${TENANT_ID} database) — device-facing:
| Table (hypertable) | Retention | Content |
|---|---|---|
| device_telemetry | 30 days | All Telegraf metrics from devices |
| device_events | 90 days | Device state changes, OTA events |
| device_audit | 90 days | Enrollment, certificate events |
| <measurement> | configurable | Auto-created by Telegraf per measurement |
Provider TimescaleDB (cdm database) — platform-facing:
| Table (hypertable) | Retention | Content |
|---|---|---|
| http_response | 30 days | HTTP health-check results (response code + time) per provider service |
| rabbitmq | 30 days | RabbitMQ queue depths, message rates, connection counts |
| provider_system | 90 days | System-monitor metrics from cdm/provider/# MQTT topic |
| <measurement> | configurable | Auto-created by Telegraf per MQTT measurement name |
Tables and hypertables are created automatically by monitoring/timescaledb/init-scripts/01-init-schema.sh.
Telegraf creates additional tables on first write.
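This page does not say how the retention periods are enforced. One plausible mechanism is TimescaleDB's add_retention_policy() function; the sketch below generates the corresponding SQL statements for the Provider tables. Treat it as an assumption about the approach, not a copy of what 01-init-schema.sh actually runs.

```python
# Retention periods taken from the Provider TimescaleDB table above.
PROVIDER_RETENTION_DAYS = {
    "http_response": 30,
    "rabbitmq": 30,
    "provider_system": 90,
}


def retention_statements(retention_days: dict) -> list:
    """Build one add_retention_policy() call per hypertable."""
    return [
        f"SELECT add_retention_policy('{table}', "
        f"INTERVAL '{days} days', if_not_exists => TRUE);"
        for table, days in retention_days.items()
    ]


for statement in retention_statements(PROVIDER_RETENTION_DAYS):
    print(statement)
```

The same helper could be pointed at an auto-created measurement table to give it an explicit retention period.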
Grafana Dashboards
Grafana is pre-provisioned with the System Monitor dashboard.
System Monitor
The System Monitor dashboard (provider-stack/monitoring/grafana/dashboards/system-monitor.json)
shows provider infrastructure health:
- CPU load averages (1m / 5m / 15m) — timeseries + gauge
- RAM used % — timeseries + gauge + current value
All queries read from the provider_system hypertable:
SELECT time AS "time",
cpu_load_1m AS "1m",
cpu_load_5m AS "5m",
cpu_load_15m AS "15m"
FROM provider_system
WHERE $__timeFilter(time)
ORDER BY time;
The datasource type is grafana-postgresql-datasource (not the legacy postgres type)
with timescaledb: true set in jsonData.
Alerting
ThingsBoard handles device-level alerting (offline devices, alarms, OTA status changes).
Grafana can alert on Provider infrastructure health using data in TimescaleDB:
- Service endpoint down (no successful http_response for 5 min) → Grafana Alert
- RabbitMQ queue depth exceeding threshold → warning alert
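The "endpoint down" condition above can be expressed as a small predicate. The Python sketch below only illustrates the logic; a real Grafana alert rule would evaluate an equivalent query against the http_response table. The sample data and the (timestamp, status code) layout are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def endpoint_down(samples, now, window=timedelta(minutes=5)):
    """Return True if no 2xx response was recorded within `window` of `now`.

    `samples` is an iterable of (timestamp, http_status_code) pairs.
    """
    cutoff = now - window
    return not any(
        ts >= cutoff and 200 <= code < 300
        for ts, code in samples
    )


# Hypothetical samples: the last success is 10 minutes old.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
samples = [
    (now - timedelta(minutes=10), 200),
    (now - timedelta(minutes=1), 503),
]
print(endpoint_down(samples, now))
```

An empty sample list also counts as "down" here, which matches the case where Telegraf itself has stopped writing.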
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| No metrics in TimescaleDB | Telegraf can't reach TimescaleDB | Check TSDB_DATABASE, TSDB_TELEGRAF_PASSWORD in provider-stack/.env |
| Grafana shows "No data" | Wrong table name or time range | Verify table is http_response, rabbitmq, or provider_system; widen time range |
| ThingsBoard telemetry stops | MQTT client disconnected | Check mqtt-client logs; verify cert validity |
| Telegraf MQTT consumer errors | mTLS cert not issued | docker compose restart mqtt-certs-init telegraf |
| Telegraf: password authentication failed | Wrong TSDB_TELEGRAF_PASSWORD | Check .env and run docker compose restart telegraf |
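Several of the fixes above come down to a missing or empty value in provider-stack/.env. A quick check along the following lines can rule that out before restarting containers. It is a minimal sketch that handles only plain KEY=VALUE lines, and the list of required keys is inferred from the connection string on this page.

```python
# Keys referenced by the outputs.postgresql connection string.
REQUIRED_KEYS = (
    "TSDB_HOST",
    "TSDB_PORT",
    "TSDB_DATABASE",
    "TSDB_TELEGRAF_PASSWORD",
)


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text: str, required=REQUIRED_KEYS) -> list:
    """Return the required keys that are absent or empty."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]


# Hypothetical file contents with an empty password.
sample = "TSDB_HOST=timescaledb\nTSDB_PORT=5432\nTSDB_DATABASE=cdm\nTSDB_TELEGRAF_PASSWORD=\n"
print(missing_keys(sample))
```

Against the real file, the call would be `missing_keys(open("provider-stack/.env").read())`; an empty list means all four values are set.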