Feature Pipeline¶
FoehnCast keeps the feature pipeline as a clear set of boundaries. Forecast data is ingested, turned into curated features, checked against explicit bounds, stored through a backend abstraction, and then reshaped for Feast serving without changing the meaning of the feature set.
This page records the current design that has been validated in the local stack and in the review notebook. It focuses on what each stage owns today and what stays outside its scope.
Scope
This page describes the current validated feature-path contract. It is not a roadmap. Future changes should be documented after they are chosen and implemented.
Pipeline Shape¶
The key point is that each stage has one clear job:
- ingest proves the upstream weather contract
- engineering creates the curated feature frame
- validation rejects structurally broken outputs
- storage preserves the curated contract without mutating it
- Feast preparation consumes curated rows instead of reimplementing the feature pipeline
- Airflow publishes the downstream asset hand-offs after persistence and Feast preparation succeed
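Read as a shape, this is a single linear hand-off. A minimal sketch of that sequence is shown below; every callable is a placeholder standing in for the project's real stage implementations, not the actual API.

```python
# Shape-only sketch of the stage boundaries. The injected callables are
# placeholders for the real ingest, engineering, validation, storage, and
# Feast-preparation code; only the ordering reflects the contract on this page.
import pandas as pd


def run_feature_pipeline(spot: str, dataset: str,
                         ingest, engineer, validate, store, prepare_feast) -> pd.DataFrame:
    raw = ingest(spot)                # ingest: prove the upstream forecast contract
    curated = engineer(raw)           # engineering: build the curated feature frame
    validate(curated)                 # validation: reject structurally broken outputs
    store(curated, spot, dataset)     # storage: persist without mutating the contract
    prepare_feast(spot, dataset)      # Feast preparation: project curated rows, don't recompute
    return curated
```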
Stage Responsibilities¶
| Stage | Main responsibility | Must not become |
|---|---|---|
| Ingest | fetch forecast rows and make source assumptions explicit | hidden enrichment or silent schema rewriting |
| Engineering | turn raw rows into stable, curated features | a place where storage or training concerns leak in |
| Validation | stop missing columns, null-heavy outputs, and impossible numeric ranges | a semantic quality model |
| Storage | persist and restore curated rows faithfully | a second feature-engineering stage |
| Feast preparation | project curated rows into offline and entity frames | a replacement for the base feature store |
| Export | materialize a stable artifact for local Feast workflows | another place where feature semantics can drift |
Ingest Boundary¶
The ingest stage uses the live forecast helper rather than a notebook-only mock. That matters because the first contract to defend is upstream shape and timestamp behavior.
The validated ingest assumptions are:
- expected forecast columns are checked explicitly
- missing and unexpected columns are surfaced rather than ignored
- timestamp ordering and duplicate timestamps are inspected
- timezone semantics are made explicit before later storage and Feast hand-offs
- the upstream wind-unit contract is explicit: Open-Meteo is requested with `wind_speed_unit=kmh`, the returned `hourly_units` map is validated at ingest, and the persisted pipeline summary records those units for later review
For this project, raw weather features stay in source weather units, which currently means km/h for wind speed and gusts. That is separate from domain-facing rideability thresholds, which remain configured in knots and are converted at scoring time instead of changing the stored feature contract.
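A minimal sketch of those ingest-side checks is below. The column names, the expected-unit map, and the helper signature are illustrative assumptions, not the project's real API; it also assumes the frame carries a timezone-aware DatetimeIndex.

```python
# Sketch of the structural ingest checks: explicit columns, timestamp ordering,
# timezone awareness, and the wind_speed_unit=kmh contract. Names are illustrative.
import pandas as pd

EXPECTED_COLUMNS = {"wind_speed_10m", "wind_gusts_10m", "wind_direction_10m", "temperature_2m"}
EXPECTED_UNITS = {"wind_speed_10m": "km/h", "wind_gusts_10m": "km/h"}


def check_ingest_contract(frame: pd.DataFrame, hourly_units: dict[str, str]) -> None:
    missing = EXPECTED_COLUMNS - set(frame.columns)
    unexpected = set(frame.columns) - EXPECTED_COLUMNS
    if missing or unexpected:
        raise ValueError(f"column contract violated: missing={missing}, unexpected={unexpected}")

    if not frame.index.is_monotonic_increasing:
        raise ValueError("forecast timestamps are not sorted")
    if frame.index.has_duplicates:
        raise ValueError("duplicate forecast timestamps")
    if frame.index.tz is None:
        raise ValueError("timestamp index must be timezone-aware before storage and Feast hand-offs")

    # Open-Meteo is requested with wind_speed_unit=kmh; confirm the returned units match.
    for column, unit in EXPECTED_UNITS.items():
        if hourly_units.get(column) != unit:
            raise ValueError(f"unexpected unit for {column}: {hourly_units.get(column)!r}")
```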
If the project grows a more formal landing layer, ingest should still preserve the upstream payload faithfully instead of mixing raw capture with curated enrichment.
Curated Feature Boundary¶
The engineering layer creates the project's curated feature frame. The current feature set already reflects several design choices that should remain stable:
- cyclical time variables are encoded with sine and cosine instead of plain integers
- shoreline fit is represented with circular math through `shore_alignment`
- gustiness is carried as both a stable absolute surplus (`gust_excess_10m`) for the model and a ratio (`gust_factor`) retained for label semantics
- steadiness remains an operational wind-quality signal through `wind_steadiness`
- raw columns remain available alongside engineered columns so downstream validation and storage operate on one complete curated frame
- the datetime index and time basis are preserved so later storage and Feast preparation can add persistence and serving context without redefining the feature set
The current training path is tree-based, so the design priority is feature representation quality rather than blanket feature scaling. Circular wind-direction encoding and the shift from ratio-heavy gustiness toward `gust_excess_10m` are part of that same representation choice, not a generic normalization layer.
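A sketch of how those encodings might look is below. The formulas, the shore bearing, and the steadiness definition are assumptions for illustration; the project's engineering module remains the source of truth, and the sketch assumes the curated frame keeps its timezone-aware DatetimeIndex.

```python
# Illustrative curated-feature encodings: cyclical time, circular shoreline fit,
# absolute gust surplus, gust ratio, and a simple steadiness signal.
import numpy as np
import pandas as pd


def engineer_features(raw: pd.DataFrame, shore_bearing_deg: float = 230.0) -> pd.DataFrame:
    out = raw.copy()
    hour = out.index.hour  # assumes a DatetimeIndex

    # cyclical time encoding instead of plain integers
    out["hour_sin"] = np.sin(2 * np.pi * hour / 24)
    out["hour_cos"] = np.cos(2 * np.pi * hour / 24)

    # shoreline fit via circular math: cosine of the angular difference
    delta = np.deg2rad(out["wind_direction_10m"] - shore_bearing_deg)
    out["shore_alignment"] = np.cos(delta)

    # gustiness: absolute surplus for the model, ratio retained for label semantics
    out["gust_excess_10m"] = out["wind_gusts_10m"] - out["wind_speed_10m"]
    out["gust_factor"] = out["wind_gusts_10m"] / out["wind_speed_10m"]  # degenerates near zero wind

    # steadiness as an operational wind-quality signal (one plausible definition)
    out["wind_steadiness"] = out["wind_speed_10m"] / out["wind_gusts_10m"]

    return out  # raw columns stay alongside engineered ones in a single curated frame
```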
The engineering stage should also stay narrow. It creates the curated feature frame, but it should not add serving metadata or turn into a Feast-specific layer. That downstream hand-off remains intentional: engineer first, then validate, store, and only later project into Feast-friendly shapes.
Validation Boundary¶
Validation is structural, not semantic. It is there to stop broken feature frames before they reach storage, training, or Feast preparation.
The current validated contract is:
- required columns cover the actual curated feature frame, not only raw ingest fields
- configured range checks cover the engineered features that later storage and Feast preparation depend on
- cyclical features and `shore_alignment` are bounded where the math makes the valid range explicit
- `gust_excess_10m`, `gust_factor`, and `wind_steadiness` are lower-bounded rather than aggressively clipped
- completeness checks remain important because ratio-based features can go null when sustained wind approaches zero
- validation is the explicit gate before downstream persistence and Feast projection
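A compact sketch of that structural gate is below. The column list, bounds, and null threshold are illustrative assumptions; the project's configured checks define the real contract.

```python
# Illustrative structural validation: required columns, configured range checks,
# and a completeness gate before persistence and Feast projection.
import pandas as pd

REQUIRED_COLUMNS = ["wind_speed_10m", "hour_sin", "hour_cos", "shore_alignment",
                    "gust_excess_10m", "gust_factor", "wind_steadiness"]
RANGE_CHECKS = {
    "hour_sin": (-1.0, 1.0),
    "hour_cos": (-1.0, 1.0),
    "shore_alignment": (-1.0, 1.0),
    "gust_excess_10m": (0.0, None),   # lower-bounded, not aggressively clipped
    "gust_factor": (1.0, None),
    "wind_steadiness": (0.0, None),
}
MAX_NULL_FRACTION = 0.05


def validate_features(frame: pd.DataFrame) -> None:
    missing = [c for c in REQUIRED_COLUMNS if c not in frame.columns]
    if missing:
        raise ValueError(f"curated frame is missing required columns: {missing}")

    for column, (low, high) in RANGE_CHECKS.items():
        values = frame[column].dropna()
        if low is not None and (values < low).any():
            raise ValueError(f"{column} below {low}")
        if high is not None and (values > high).any():
            raise ValueError(f"{column} above {high}")

    # completeness matters because ratio features can go null near zero wind
    null_fraction = frame[REQUIRED_COLUMNS].isna().mean()
    if (null_fraction > MAX_NULL_FRACTION).any():
        raise ValueError(f"null-heavy columns: {null_fraction[null_fraction > MAX_NULL_FRACTION].to_dict()}")
```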
This layer is not supposed to decide whether a forecast is good for riding. That belongs to the downstream ranking and prediction logic.
Storage Boundary¶
Storage works only if it behaves like persistence rather than transformation. A stored feature frame should come back with the same schema, index semantics, and numeric values that validation approved.
The current storage contract is:
- local, S3-compatible, and BigQuery backends may use different write-time metadata internally
- rerunning one logical spot-and-dataset write must replace that slice instead of accumulating cloud-only duplicates
- downstream reads must restore the same curated feature frame shape
- backend-specific columns must not leak into consumers after read-back
- round-trip checks should show matching row counts, matching columns, matching index behavior, and no numeric drift
- the stored frame should preserve the time basis needed for later Feast projection after read-back
- storage should operate on validation-approved curated rows, not compensate for upstream quality failures
This is why raw landing, curated storage, and Feast should remain separate responsibilities rather than one blurred storage layer.
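A round-trip check in that spirit is sketched below. The `write`/`read` backend interface shown here is an assumption for illustration; the project's backend classes define the actual API.

```python
# Illustrative round-trip contract: a stored slice must come back with the same
# schema, index semantics, and numeric values, with no backend-only columns.
import pandas as pd
from pandas.testing import assert_frame_equal


def check_round_trip(backend, curated: pd.DataFrame, spot: str, dataset: str) -> None:
    backend.write(curated, spot=spot, dataset=dataset)   # rerun replaces the same logical slice
    restored = backend.read(spot=spot, dataset=dataset)

    assert set(restored.columns) == set(curated.columns), "backend metadata leaked into consumers"
    assert_frame_equal(
        restored[curated.columns].sort_index(),
        curated.sort_index(),
        check_exact=False,   # no numeric drift beyond tolerance
    )
```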
Storage Layering¶
| Data role | Local baseline | Cloud direction |
|---|---|---|
| Raw landing | files if retained at all | GCS |
| Curated features | MinIO-backed parquet objects | native BigQuery tables |
| Feast offline source | exported parquet | BigQuery table or view over curated rows |
| Feast registry and staging | local files | GCS |
| Feast online store | Datastore-mode emulator | Datastore |
The local operator path mirrors the cloud storage roles more closely: MinIO-backed object storage for curated objects and MLflow artifacts, exported parquet for the local Feast offline source, and a Datastore-mode emulator for the local Feast online store.
Storage Control Surface¶
The storage split is not only conceptual. The repository exposes it through explicit runtime and infrastructure surfaces so the local path and the cloud target stay aligned.
| Surface | Current implementation | Why it matters |
|---|---|---|
| Backend selection | `storage.backend` in `config.yaml` with supported values `s3` or `bigquery` | keeps the curated persistence mode explicit without carrying a legacy file-backed path |
| Curated local store | `S3FeatureStoreBackend` against the bundled MinIO service | keeps the local object-access layer aligned with the hosted architecture and MLflow artifact flow |
| Curated cloud store | `BigQueryFeatureStoreBackend` writing the configured project, dataset, and table | matches the cloud analytical target and preserves rerun-safe slice replacement |
| Feast offline local source | `export_offline_store(...)` writing `data/feast/<dataset>.parquet` | keeps Feast downstream from curated persistence |
| Feast cloud source | BigQuery table or view referenced by the cloud Feast config | avoids duplicating feature logic in a separate serving path |
| Feast local runtime state | `.state/feast/registry.db` and `.state/feast/feature_store.runtime.yaml` | keeps local integration state separate from the curated dataset while the emulator remains disposable |
| Feast registry and staging | local state files in development, GCS in the cloud path | keeps registry metadata separate from the curated dataset |
| Terraform baseline | Terraform-managed GCS, BigQuery, and Datastore-mode Firestore baseline | wires the cloud storage surfaces without changing the feature contract itself |
Terraform is part of the storage boundary because it provisions the cloud-side GCS and BigQuery surfaces that the feature pipeline expects. It should supply the bucket, dataset, and table baseline while leaving ingest, engineering, validation, storage, and Feast preparation responsible for their own stage contracts.
That means storage normalization is role-specific. The local baseline now normalizes the blob-style surfaces onto MinIO, but the cloud target still keeps curated features in BigQuery and the Feast online store in Datastore rather than flattening everything behind one object API.
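How the `storage.backend` switch might be resolved at runtime is sketched below. Only the backend class names and the `s3`/`bigquery` values come from the table above; the constructor arguments are assumptions, and the backend classes would be imported from the project's storage module (import path omitted here).

```python
# Illustrative backend selection driven by config.yaml's storage.backend value.
def make_feature_store_backend(config: dict):
    backend = config["storage"]["backend"]
    if backend == "s3":
        # local baseline: MinIO-backed, S3-compatible object storage
        return S3FeatureStoreBackend(**config["storage"].get("s3", {}))
    if backend == "bigquery":
        # cloud target: native BigQuery dataset and table for curated features
        return BigQueryFeatureStoreBackend(**config["storage"].get("bigquery", {}))
    raise ValueError(f"unsupported storage backend: {backend!r}")
```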
Feast Boundary¶
Feast is downstream from the curated feature store. It should consume curated rows, not reach back into raw ingestion or recompute engineering logic.
That means:
- `build_offline_store_frame(...)` and `build_entity_rows(...)` stay thin projections over stored curated data
- `export_offline_store(...)` is a deterministic materialization step, not a second feature-engineering stage
- `prepare_feature_store(...)` is the Airflow-owned orchestration step that exports curated rows, renders the runtime config, applies the Feast repo, and materializes the online store without redefining the feature contract
- the local preparation script is an operator wrapper around those helpers, not the real source of truth for the feature contract
- local Feast stays lightweight with exported parquet plus the Datastore-mode emulator, keeping registry and runtime config state outside the workload data root and the curated rows in the local object-store baseline
- the cloud direction stays aligned with curated BigQuery plus Datastore and GCS support
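The "thin projection" idea is sketched below. The bodies, column names, and entity key are illustrative assumptions around the helpers named above, not the project's implementations; only the `data/feast/<dataset>.parquet` layout comes from this page.

```python
# Illustrative thin projections: Feast frames are derived from curated rows,
# never recomputed from raw ingest.
from pathlib import Path
import pandas as pd


def build_offline_store_frame(curated: pd.DataFrame, spot: str) -> pd.DataFrame:
    frame = curated.reset_index(names="event_timestamp")
    frame["spot_id"] = spot               # entity key only, no re-engineering
    return frame


def build_entity_rows(curated: pd.DataFrame, spot: str) -> pd.DataFrame:
    return pd.DataFrame({"spot_id": [spot], "event_timestamp": [curated.index.max()]})


def export_offline_store(curated: pd.DataFrame, spot: str, dataset: str) -> Path:
    path = Path("data/feast") / f"{dataset}.parquet"
    path.parent.mkdir(parents=True, exist_ok=True)
    build_offline_store_frame(curated, spot).to_parquet(path, index=False)
    return path
```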
The Airflow-facing part of the contract matters here: the feature DAG emits explicit curated-feature and Feast-sync assets after `prepare_feature_store(...)` succeeds, then optionally publishes the training-request asset that schedules the training DAG. That makes the feature-to-training boundary visible in Airflow instead of hiding it behind a direct trigger.
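A sketch of that hand-off, assuming Airflow's Dataset-based (data-aware) scheduling, is below. DAG ids, task names, and dataset URIs are illustrative placeholders, not the project's real DAG code.

```python
# Illustrative asset hand-off: the feature DAG emits outlets after Feast
# preparation succeeds, and the training DAG is scheduled on the request asset.
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

CURATED_FEATURES = Dataset("s3://foehncast/curated-features")
FEAST_SYNC = Dataset("s3://foehncast/feast-sync")
TRAINING_REQUEST = Dataset("s3://foehncast/training-request")


@dag(schedule="@hourly", start_date=datetime(2025, 1, 1), catchup=False)
def feature_dag():
    @task(outlets=[CURATED_FEATURES, FEAST_SYNC])
    def prepare_features():
        # would run prepare_feature_store(...); success emits the curated-feature
        # and Feast-sync assets for downstream consumers
        ...

    @task(outlets=[TRAINING_REQUEST])
    def request_training():
        # optional publication of the training-request asset
        ...

    prepare_features() >> request_training()


@dag(schedule=[TRAINING_REQUEST], start_date=datetime(2025, 1, 1), catchup=False)
def training_dag():
    # scheduled by the training-request asset instead of a direct trigger
    ...


feature_dag()
training_dag()
```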
This keeps the same conceptual split available in both local and cloud directions: curated features first, Feast second.
Why This Structure Works¶
- it keeps pipeline boundaries explicit enough for Airflow orchestration
- it keeps local-first development simple without blocking a BigQuery-backed cloud path
- it prevents Feast from becoming a surrogate landing layer or feature store owner
- it gives README and site documentation stable sections that can be explained without embedding run-specific notebook output
See Architecture for the broader system view and Cloud Mapping for the GCP direction.