# Monitoring
FoehnCast keeps monitoring as an operator surface, not a rider-facing one. The monitoring stack turns pipeline summaries, retained prediction events, drift signals, and hosted sync state into scrapeable metrics and checked-in alert rules without changing the runtime contracts those signals describe.
This page records the current monitoring design that is validated in the local stack and in regression tests. It focuses on what is measured today, where the evidence comes from, and how the operator path stays separate from the public docs surface.
> **Scope:** This page describes the current validated monitoring contract. It is not a roadmap. Future changes should be documented after they are chosen and implemented.
## Signal Path
The important split is that monitoring consumes persisted or runtime monitoring state after the pipeline and serving paths run. It does not own feature engineering, model scoring, or orchestration itself.
## Surface Roles
| Surface | What it exposes | Why it matters |
|---|---|---|
| `airflow/reports/*.json` | latest feature and training pipeline summaries plus timestamped history | gives the app and operator tooling a stable pipeline-run contract |
| `.state/monitoring/prediction-events.jsonl` | retained prediction-event history by model version | preserves inference evidence across restarts |
| `.state/monitoring/prediction-log.jsonl` | bounded working set used for prediction-side drift evaluation | keeps drift windows small enough for local evaluation |
| `.state/online-compose-sync/last-success.json` | latest hosted sync success marker | lets operators detect stale hosted updates |
| app `/metrics` | pipeline summaries, retained prediction history, hosted sync state, and in-process monitoring counters | gives Prometheus one stable scrape surface for app-owned monitoring state |
| StatsD exporter | drift metrics pushed from Evidently-backed checks | keeps drift signals scrapeable without inventing a second app-owned registry |
| Prometheus and Grafana | time-series storage, dashboard panels, and alert evaluation | provides the operator view and alerting layer |
## Durable And Ephemeral Signals
The monitoring stack uses both durable and ephemeral signals on purpose.
Durable signals survive restarts and provide historical evidence:
- feature and training pipeline summaries under `airflow/reports/`
- timestamped summary history alongside the latest pipeline summaries
- retained prediction-event history in `.state/monitoring/prediction-events.jsonl`
- hosted compose sync status in `.state/online-compose-sync/last-success.json`
Ephemeral signals are runtime-only counters that reset on restart:
- prediction-monitoring schedule counts
- prediction-monitoring execution counts
- last successful background monitoring execution timestamps
The app combines both kinds of signal on `/metrics`, but the distinction still matters operationally: durable files support audits and restarts, while ephemeral counters describe current process health.
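The split can be pictured as a small sketch. The file paths come from this page, but the function, counter, and field names below are illustrative assumptions, not the app's real API:

```python
import json
import time
from collections import Counter
from pathlib import Path

# Durable: the retained prediction-event file named on this page.
EVENTS_PATH = Path(".state/monitoring/prediction-events.jsonl")

# Ephemeral: lives only in process memory and resets on restart.
runtime_counters: Counter = Counter()

def record_prediction_event(model_version: str, payload: dict) -> None:
    """Append a durable prediction event and bump an ephemeral counter."""
    EVENTS_PATH.parent.mkdir(parents=True, exist_ok=True)
    event = {"ts": time.time(), "model_version": model_version, **payload}
    with EVENTS_PATH.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")   # durable: survives restarts
    runtime_counters["prediction_events_recorded"] += 1  # ephemeral
```

After a restart the JSONL history is still on disk, while `runtime_counters` starts again from zero.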
## Metrics Surface
The serving application publishes one composed Prometheus payload on `/metrics`. That payload includes:
- feature pipeline summary metrics rendered from the latest persisted summary
- training pipeline summary metrics rendered from the latest persisted summary
- retained prediction-log metrics grouped by model version
- hosted online-compose sync status metrics
- in-process prediction-monitoring counters for scheduling and execution outcomes
This keeps the operator scrape path simple: Prometheus reads the app surface for app-owned monitoring state and the StatsD exporter for drift metrics.
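Composing such a payload amounts to rendering each signal group into Prometheus text exposition format. A minimal sketch, with invented metric names (the app's real metric names will differ):

```python
def render_metrics(sections: dict[str, dict[str, float]]) -> str:
    """Compose one Prometheus text-exposition payload from named gauge groups.

    Keys of each inner dict are pre-rendered label strings ('' for no labels).
    """
    lines: list[str] = []
    for name, series in sections.items():
        lines.append(f"# TYPE {name} gauge")
        for labels, value in series.items():
            lines.append(f"{name}{{{labels}}} {value}" if labels
                         else f"{name} {value}")
    return "\n".join(lines) + "\n"

# Hypothetical usage combining pipeline, prediction, and sync signals:
payload = render_metrics({
    "foehncast_feature_pipeline_rows": {"": 1440.0},
    "foehncast_prediction_events_total": {'model_version="v3"': 2150.0},
    "foehncast_hosted_sync_age_seconds": {"": 86.0},
})
```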
## Drift Monitoring
Drift detection is backed by Evidently and exported through StatsD.
The current contract is:
- feature drift compares reference and current feature datasets on shared columns
- prediction drift compares retained prediction history across reference and recent windows
- drift thresholds and evaluation windows come from the monitoring config, with validated fallbacks in code
- emitted StatsD metrics are mapped into Prometheus as `foehncast_drift_metric`
This means the Grafana dashboard can show both dataset-level drift share and column-level drift signals without making the app own another custom metric family.
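To make the contract concrete without depending on Evidently itself, here is a self-contained sketch of the same shape of computation: a drift share over shared columns, pushed as a StatsD gauge. The hand-rolled KS statistic stands in for Evidently's column tests, and the threshold, function names, and exporter port (the StatsD exporter's default UDP port 9125) are assumptions:

```python
import socket

def ks_statistic(ref: list[float], cur: list[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: max distance between empirical CDFs."""
    def cdf(sample: list[float], x: float) -> float:
        return sum(1 for v in sample if v <= x) / len(sample)
    points = sorted(set(ref) | set(cur))
    return max(abs(cdf(ref, x) - cdf(cur, x)) for x in points)

def drift_share(reference: dict[str, list[float]],
                current: dict[str, list[float]],
                threshold: float = 0.3) -> float:
    """Fraction of shared columns whose KS statistic exceeds the threshold."""
    shared = reference.keys() & current.keys()
    if not shared:
        return 0.0
    drifted = sum(1 for col in shared
                  if ks_statistic(reference[col], current[col]) > threshold)
    return drifted / len(shared)

def format_statsd_gauge(name: str, value: float) -> str:
    """Render the StatsD wire line for a gauge, e.g. 'foehncast_drift_metric:0.5|g'."""
    return f"{name}:{value}|g"

def emit_statsd_gauge(name: str, value: float,
                      host: str = "statsd-exporter", port: int = 9125) -> None:
    """Push one gauge datagram to the StatsD exporter over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_statsd_gauge(name, value).encode(), (host, port))
```

The exporter then translates each gauge line into a Prometheus sample that lands under the `foehncast_drift_metric` mapping described above.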
## Prometheus And Grafana
The checked-in Prometheus config scrapes three operator targets:
- the app on `/metrics`
- the StatsD exporter
- Grafana itself
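A minimal `prometheus.yml` sketch matching those three targets; job names, service hostnames, and ports are assumptions (StatsD exporter and Grafana defaults), not the repository's checked-in values:

```yaml
scrape_configs:
  - job_name: foehncast-app        # app-owned /metrics surface
    metrics_path: /metrics
    static_configs:
      - targets: ["app:8000"]      # hypothetical service name and port
  - job_name: statsd-exporter      # Evidently drift metrics via StatsD
    static_configs:
      - targets: ["statsd-exporter:9102"]
  - job_name: grafana              # Grafana's own /metrics
    static_configs:
      - targets: ["grafana:3000"]
```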
Grafana is provisioned from repository-owned dashboards and alert rules. The validated monitoring dashboard includes panels for:
- feature and training pipeline summary counts
- stage durations and stage failure counts
- feature and inference drift share
- retained prediction-log size and model coverage
- prediction-monitoring schedule failures
- seconds since the last hosted sync
That makes the dashboard a view over checked-in metrics contracts rather than a manually assembled local-only artifact.
## Alert Rules
The checked-in alert rules cover the main operator failure modes.
| Rule family | What it guards |
|---|---|
| Prediction Monitoring Schedule Failures | background monitoring jobs that fail to schedule for one endpoint |
| Prediction Monitoring Execution Failures | background monitoring jobs that start but fail during execution |
| Prediction Monitoring Stale Success | endpoints that keep scheduling work without recording a recent successful execution |
| Feature Stage Failures | persisted feature-pipeline stage failure counts |
| Training Stage Failures | persisted training-pipeline stage failure counts |
| Hosted Sync Stale | hosted compose sync status that has not reported a recent success |
The contact point and policy tree stay checked in too, so the alert routing contract is reviewable without depending on a live Grafana instance.
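As an illustration of one rule family, a Hosted Sync Stale rule could look like the following, written here in Prometheus rule syntax for readability even though the repository provisions its rules through Grafana; the metric name and both thresholds are assumptions:

```yaml
groups:
  - name: foehncast-operator-alerts
    rules:
      - alert: HostedSyncStale
        # Metric name is an assumption; the real contract may differ.
        expr: time() - foehncast_online_sync_last_success_timestamp_seconds > 3600
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Hosted compose sync has not reported success in over an hour"
```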
## Reading Evidence Safely
The docs site should explain monitoring with rendered evidence, not live control-plane embeds.
Preferred public-safe evidence sources are:
- rendered `/metrics` snippets or screenshots
- the checked-in Prometheus scrape config
- the checked-in Grafana dashboard and alert-rule definitions
- pipeline summary JSON artifacts under `airflow/reports/`
- retained prediction-event files under `.state/monitoring/` when showing structure rather than sensitive runtime data
This keeps the public docs understandable in review while leaving live Grafana, Prometheus, and Airflow as operator tools.
## Why This Structure Works
- it keeps operator monitoring separate from the rider-facing app and docs pages
- it preserves durable monitoring evidence across restarts without turning runtime state into product data
- it keeps alerting reviewable because dashboards, rules, and scrape config live in git
- it lets one `/metrics` surface describe the app-owned monitoring contract while StatsD handles Evidently drift export
See Architecture, Feature Pipeline, and Getting Started for the surrounding system and local operator path.