Understanding the Pollen Forecasting Hybrid Model Workflow

Diagramly Team


This post walks through the pollen forecasting workflow shown in the diagram. It covers how raw data is prepared, how two models are trained, and how their predictions are combined and evaluated in a single final pipeline.

Diagram

[Diagram: pollen forecasting hybrid model workflow]

Overview

Pollen forecasting gets hard fast: data arrives at different frequencies, locations are unevenly sampled, and weather effects lag over time. This workflow is designed to make those constraints explicit instead of hiding them.

The pipeline is split into three phases. First, we build a modeling dataset from pollen, climate, NDVI, and land-cover inputs. Next, we train a classifier and a regressor on a location-based split. Last, we combine predictions in a single evaluation run to measure how the system performs on unseen locations.
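
To make the location-based split concrete, here is a minimal sketch using only the standard library. It is not the project's actual splitting code: the `location_id` field, the `split_by_location` helper, and the toy rows are all assumptions made for illustration. The idea is simply that whole locations are held out, so no test location contributes rows to training.

```python
import random


def split_by_location(rows, n_test_locations, seed=0):
    """Hold out whole locations so no test location appears in training.

    `rows` is a list of dicts with a hypothetical "location_id" key.
    """
    locations = sorted({row["location_id"] for row in rows})
    if not 0 < n_test_locations < len(locations):
        raise ValueError("need at least one location on each side of the split")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    test_locations = set(rng.sample(locations, n_test_locations))
    train = [r for r in rows if r["location_id"] not in test_locations]
    test = [r for r in rows if r["location_id"] in test_locations]
    return train, test, test_locations


# Toy data: 12 hypothetical locations with 3 daily rows each.
rows = [
    {"location_id": f"loc_{i:02d}", "day": d, "pollen": float(i + d)}
    for i in range(12)
    for d in range(3)
]
train_rows, test_rows, held_out = split_by_location(rows, n_test_locations=2)
print(sorted(held_out), len(train_rows), len(test_rows))
```

If scikit-learn is available, a group-aware splitter keyed on the location column should give the same guarantee with less code.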

Diagram Breakdown

The workflow has three connected phases:

  1. Data engineering and feature matrix construction

    • 1_clean_pollen.py converts hourly pollen rows into daily data and creates unique location records.
    • 2_download_climate_V2.py pulls climate variables, including wind direction.
    • AppEEARS exports NDVI and LULC data.
    • 5_merge_all_data_V2.py joins sources and interpolates missing NDVI points.
    • 6_advanced_feature_engineering_V2.py creates GDD, seasonal, lag, rolling, and wind-vector features, then writes FINAL_MODELING_DATASET.csv.
  2. Hybrid model training

    • Data is split by location: 10 locations for training, 2 unseen locations for testing.
    • 7_train_model_A trains an XGBClassifier to detect whether pollen season is active.
    • 8_train_model_B trains an XGBRegressor to estimate pollen amount.
  3. Final evaluation pipeline

    • 9_run_final_pipeline_TUNED.py runs both models and computes metrics.
    • Combination rule: if Model A predicts "inactive season," the final pollen output is 0; otherwise, the final output is Model B's prediction.
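
Several of the features named in phase 1 (GDD, lag, rolling, and wind-vector features) are easy to sketch in plain Python. The version below is an illustration only, not the contents of 6_advanced_feature_engineering_V2.py: the function names, the 5 °C GDD base temperature, the window sizes, and the sine/cosine convention for wind are all assumptions.

```python
import math
from itertools import accumulate


def growing_degree_days(t_min, t_max, base=5.0):
    """Daily GDD: mean temperature above an assumed base, floored at zero."""
    return max(0.0, (t_min + t_max) / 2.0 - base)


def lag(values, k):
    """Shift a daily series back by k days; the first k days have no history."""
    if k <= 0:
        raise ValueError("k must be a positive number of days")
    padded = [None] * k + list(values)
    return padded[: len(values)]


def rolling_mean(values, window):
    """Trailing mean over the current day and the window - 1 days before it."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


def wind_vector(speed, direction_deg):
    """Split wind into two components so 359 and 1 degrees end up close together."""
    rad = math.radians(direction_deg)
    return speed * math.sin(rad), speed * math.cos(rad)


# Toy daily (t_min, t_max) pairs in degrees Celsius.
temps = [(2.0, 10.0), (4.0, 14.0), (6.0, 18.0), (3.0, 15.0)]
mean_temp = [(lo + hi) / 2.0 for lo, hi in temps]
gdd = [growing_degree_days(lo, hi) for lo, hi in temps]
cumulative_gdd = list(accumulate(gdd))
temp_lag1 = lag(mean_temp, 1)
temp_roll3 = rolling_mean(mean_temp, 3)
wind_x, wind_y = wind_vector(10.0, 90.0)
```

Encoding wind as two components rather than a raw angle avoids the artificial jump between 359° and 0°, which tree models would otherwise treat as distant values.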

Key insights

  • Data quality work is not optional. Many forecast errors in this kind of system trace back to merge gaps, inconsistent timestamps, or weak location mapping.
  • Splitting by location is a stronger test than a random split. A random row-level split puts days from the same site on both sides, which can inflate scores; a location split checks whether the model generalizes to places it has never seen.
  • The hybrid rule is simple but useful: first decide whether pollen activity exists, then estimate magnitude only when it does.
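
The hybrid rule can be written in a few lines. This is a sketch of the gating logic only, with a hypothetical `combine_predictions` helper and made-up numbers; it assumes Model A's output has already been turned into a 0/1 "season active" flag and that Model B's estimates are aligned row by row.

```python
def combine_predictions(season_active, pollen_estimates):
    """Gate the regressor output with the classifier's decision.

    season_active:    0/1 (or bool) flags, assumed to come from Model A.
    pollen_estimates: pollen amounts, assumed to come from Model B.
    """
    if len(season_active) != len(pollen_estimates):
        raise ValueError("both prediction lists must have the same length")
    return [
        float(amount) if active else 0.0
        for active, amount in zip(season_active, pollen_estimates)
    ]


# Made-up predictions for four days at one location.
active_flags = [0, 1, 1, 0]
amounts = [12.5, 40.0, 3.2, 7.0]
final = combine_predictions(active_flags, amounts)
print(final)  # [0.0, 40.0, 3.2, 0.0]
```

One consequence of this design is that a false "inactive" from the classifier zeroes out the forecast no matter how good the regressor is, so the classifier's recall on active days deserves close attention.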

Next steps

  • Add feature importance reporting for both models so you can inspect what drives each prediction.
  • Track metrics per location, not just global averages, to find weak regions early.
  • Add retraining checkpoints when new climate or land-use data is ingested.
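
As a starting point for per-location tracking, the sketch below groups absolute errors by location and compares them with the global average. The post does not say which metrics the pipeline computes, so mean absolute error is an assumption here, as are the `mae_by_location` helper and the toy values.

```python
from collections import defaultdict


def mae_by_location(location_ids, y_true, y_pred):
    """Mean absolute error per location (MAE is an assumed metric choice)."""
    errors = defaultdict(list)
    for loc, truth, pred in zip(location_ids, y_true, y_pred):
        errors[loc].append(abs(truth - pred))
    return {loc: sum(errs) / len(errs) for loc, errs in errors.items()}


# Toy evaluation rows for two hypothetical held-out locations.
locs = ["A", "A", "B", "B"]
observed = [10.0, 20.0, 0.0, 5.0]
predicted = [12.0, 18.0, 0.0, 15.0]

per_location = mae_by_location(locs, observed, predicted)
global_mae = sum(abs(t - p) for t, p in zip(observed, predicted)) / len(observed)
print(per_location)  # {'A': 2.0, 'B': 5.0}
print(global_mae)    # 3.5
```

In this toy case the global figure of 3.5 hides the fact that location B is more than twice as bad as location A, which is exactly the kind of gap a per-location report surfaces.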