# Monitoring

FoehnCast keeps monitoring as an operator surface, not a rider-facing one. The monitoring stack turns pipeline summaries, retained prediction events, drift signals, and hosted sync state into scrapeable metrics and checked-in alert rules without changing the runtime contracts those signals describe.

This page records the current monitoring design that is validated in the local stack and in regression tests. It focuses on what is measured today, where the evidence comes from, and how the operator path stays separate from the public docs surface.

## Scope

This page describes the current validated monitoring contract. It is not a roadmap. Future changes should be documented after they are chosen and implemented.

## Signal Path

```mermaid
flowchart LR
    AIR[Airflow pipeline summaries] --> APP[/App /metrics/]
    PRED[Prediction requests] --> LOG[Prediction log and event history]
    LOG --> DRIFT[Evidently drift checks]
    DRIFT --> STATSD[StatsD exporter]
    SYNC[Hosted sync status file] --> APP
    LOG --> APP
    APP --> PROM[Prometheus]
    STATSD --> PROM
    GRAFANA[Grafana] --> PROM
    PROM --> DASH[Dashboards and alert rules]
```

The important split is that monitoring consumes persisted state and runtime counters after the pipeline and serving paths have run. It does not own feature engineering, model scoring, or orchestration itself.

## Surface Roles

| Surface | What it exposes | Why it matters |
| --- | --- | --- |
| `airflow/reports/*.json` | latest feature and training pipeline summaries plus timestamped history | gives the app and operator tooling a stable pipeline-run contract |
| `.state/monitoring/prediction-events.jsonl` | retained prediction-event history by model version | preserves inference evidence across restarts |
| `.state/monitoring/prediction-log.jsonl` | bounded working set used for prediction-side drift evaluation | keeps drift windows small enough for local evaluation |
| `.state/online-compose-sync/last-success.json` | latest hosted sync success marker | lets operators detect stale hosted updates |
| app `/metrics` | pipeline summaries, retained prediction history, hosted sync state, and in-process monitoring counters | gives Prometheus one stable scrape surface for app-owned monitoring state |
| StatsD exporter | drift metrics pushed from Evidently-backed checks | keeps drift signals scrapeable without inventing a second app-owned registry |
| Prometheus and Grafana | time-series storage, dashboard panels, and alert evaluation | provides the operator view and alerting layer |

## Durable And Ephemeral Signals

The monitoring stack uses both durable and ephemeral signals on purpose.

Durable signals survive restarts and provide historical evidence:

- feature and training pipeline summaries under `airflow/reports/`
- timestamped summary history alongside the latest pipeline summaries
- retained prediction-event history in `.state/monitoring/prediction-events.jsonl`
- hosted compose sync status in `.state/online-compose-sync/last-success.json`

Ephemeral signals are runtime-only counters that reset on restart:

- prediction-monitoring schedule counts
- prediction-monitoring execution counts
- last successful background monitoring execution timestamps

The app combines both kinds of signals on `/metrics`, but the distinction still matters operationally: durable files support audits and restarts, while ephemeral counters describe current process health.
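
A minimal sketch of that split, assuming the `prometheus_client` library and the file layout described above; the metric names, the refresh helper, and the `model_version` field are illustrative assumptions rather than the project's actual identifiers:

```python
import json
from pathlib import Path

from prometheus_client import Counter, Gauge

# Assumed path, matching the layout described above.
PREDICTION_EVENTS_PATH = Path(".state/monitoring/prediction-events.jsonl")

# Ephemeral: lives in process memory and resets to zero on restart.
MONITORING_RUNS = Counter(
    "foehncast_prediction_monitoring_runs_total",  # assumed metric name
    "Background prediction-monitoring executions since process start",
)

# Durable: recomputed from the retained event history, so it survives
# restarts for as long as the file does.
RETAINED_EVENTS = Gauge(
    "foehncast_retained_prediction_events",  # assumed metric name
    "Retained prediction events by model version",
    ["model_version"],
)


def refresh_durable_metrics() -> None:
    """Rebuild the durable gauge from the prediction-event history."""
    counts: dict[str, int] = {}
    if PREDICTION_EVENTS_PATH.exists():
        for line in PREDICTION_EVENTS_PATH.read_text().splitlines():
            event = json.loads(line)  # one JSON event per line
            version = event.get("model_version", "unknown")  # assumed field
            counts[version] = counts.get(version, 0) + 1
    for version, count in counts.items():
        RETAINED_EVENTS.labels(model_version=version).set(count)
```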

## Metrics Surface

The serving application publishes one composed Prometheus payload on /metrics.

That payload includes:

- feature pipeline summary metrics rendered from the latest persisted summary
- training pipeline summary metrics rendered from the latest persisted summary
- retained prediction-log metrics grouped by model version
- hosted online-compose sync status metrics
- in-process prediction-monitoring counters for scheduling and execution outcomes

This keeps the operator scrape path simple: Prometheus reads the app surface for app-owned monitoring state and the StatsD exporter for drift metrics.
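
As a minimal sketch of that composition, assuming a FastAPI app and `prometheus_client`; the summary filename and its `row_count` field are assumptions about the persisted schema, not the validated contract:

```python
import json
from pathlib import Path

from fastapi import FastAPI, Response
from prometheus_client import CONTENT_TYPE_LATEST, Gauge, generate_latest

app = FastAPI()

# Assumed filename under the reports directory described above.
FEATURE_SUMMARY = Path("airflow/reports/feature-pipeline-summary.json")

FEATURE_ROWS = Gauge(
    "foehncast_feature_pipeline_rows",  # assumed metric name
    "Row count from the latest persisted feature-pipeline summary",
)


@app.get("/metrics")
def metrics() -> Response:
    # Re-render persisted pipeline state on each scrape so the payload
    # always reflects the latest summary on disk.
    if FEATURE_SUMMARY.exists():
        summary = json.loads(FEATURE_SUMMARY.read_text())
        FEATURE_ROWS.set(summary.get("row_count", 0))  # assumed field
    # In-process counters registered elsewhere are included automatically
    # because they live in the default registry.
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
```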

## Drift Monitoring

Drift detection is backed by Evidently and exported through StatsD.

The current contract is:

- feature drift compares reference and current feature datasets on shared columns
- prediction drift compares retained prediction history across reference and recent windows
- drift thresholds and evaluation windows come from the monitoring config, with validated fallbacks in code
- emitted StatsD metrics are mapped into Prometheus as `foehncast_drift_metric`

This means the Grafana dashboard can show both dataset-level drift share and column-level drift signals without making the app own another custom metric family.
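
A sketch of that export path under stated assumptions: Evidently's `Report` API (0.4-style) on the drift side, the `statsd` client library on the export side, and an assumed gauge name, exporter address, and dict path into the report result:

```python
import pandas as pd
import statsd
from evidently.metrics import DatasetDriftMetric
from evidently.report import Report

# Assumed exporter address; the checked-in compose stack defines the real one.
client = statsd.StatsClient("statsd-exporter", 8125)


def push_feature_drift(reference: pd.DataFrame, current: pd.DataFrame) -> None:
    # Compare reference and current datasets on their shared columns only.
    shared = [column for column in reference.columns if column in current.columns]
    report = Report(metrics=[DatasetDriftMetric()])
    report.run(reference_data=reference[shared], current_data=current[shared])
    result = report.as_dict()["metrics"][0]["result"]  # assumed result shape
    # The StatsD exporter turns this gauge into a Prometheus series, where
    # mapping rules surface it as foehncast_drift_metric.
    client.gauge("foehncast.drift.share", result["share_of_drifted_columns"])
```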

## Prometheus And Grafana

The checked-in Prometheus config scrapes three operator targets:

- the app on `/metrics`
- the StatsD exporter
- Grafana itself

Grafana is provisioned from repository-owned dashboards and alert rules. The validated monitoring dashboard includes panels for:

- feature and training pipeline summary counts
- stage durations and stage failure counts
- feature and inference drift share
- retained prediction-log size and model coverage
- prediction-monitoring schedule failures
- seconds since the last hosted sync

That makes the dashboard a view over checked-in metrics contracts rather than a manually assembled local-only artifact.

## Alert Rules

The checked-in alert rules cover the main operator failure modes.

| Rule family | What it guards |
| --- | --- |
| Prediction Monitoring Schedule Failures | background monitoring jobs that fail to schedule for one endpoint |
| Prediction Monitoring Execution Failures | background monitoring jobs that start but fail during execution |
| Prediction Monitoring Stale Success | endpoints that keep scheduling work without recording a recent successful execution |
| Feature Stage Failures | persisted feature-pipeline stage failure counts |
| Training Stage Failures | persisted training-pipeline stage failure counts |
| Hosted Sync Stale | hosted compose sync status that has not reported a recent success |

The contact point and policy tree stay checked in too, so the alert routing contract is reviewable without depending on a live Grafana instance.
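
For illustration, a minimal sketch of the staleness signal behind the Hosted Sync Stale family, assuming the marker file stores an ISO-8601 timestamp with a UTC offset; the `completed_at` field and the metric name are assumed, not the checked-in contract:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

from prometheus_client import Gauge

SYNC_MARKER = Path(".state/online-compose-sync/last-success.json")

SECONDS_SINCE_SYNC = Gauge(
    "foehncast_seconds_since_hosted_sync",  # assumed metric name
    "Seconds elapsed since the last recorded hosted sync success",
)


def refresh_sync_staleness() -> None:
    """Recompute the staleness gauge from the durable sync marker."""
    if not SYNC_MARKER.exists():
        return
    marker = json.loads(SYNC_MARKER.read_text())
    # Assumed field name holding an ISO-8601 timestamp with an offset.
    completed = datetime.fromisoformat(marker["completed_at"])
    elapsed = datetime.now(timezone.utc) - completed
    SECONDS_SINCE_SYNC.set(elapsed.total_seconds())
```

An alert rule in this family can then fire when the gauge stays above a threshold for its evaluation window.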

## Reading Evidence Safely

The docs site should explain monitoring with rendered evidence, not live control-plane embeds.

Preferred public-safe evidence sources are:

- rendered `/metrics` snippets or screenshots
- the checked-in Prometheus scrape config
- the checked-in Grafana dashboard and alert-rule definitions
- pipeline summary JSON artifacts under `airflow/reports/`
- retained prediction-event files under `.state/monitoring/` when showing structure rather than sensitive runtime data

This keeps the public docs understandable in review while leaving live Grafana, Prometheus, and Airflow as operator tools.

## Why This Structure Works

- it keeps operator monitoring separate from the rider-facing app and docs pages
- it preserves durable monitoring evidence across restarts without turning runtime state into product data
- it keeps alerting reviewable because dashboards, rules, and scrape config live in git
- it lets one `/metrics` surface describe the app-owned monitoring contract while StatsD handles Evidently drift export

See Architecture, Feature Pipeline, and Getting Started for the surrounding system and local operator path.