Machine Learning for Weather Prediction

Project Timeline

Nov 2025 - Dec 2025

OVERVIEW

This project investigates the use of machine learning models to predict daily weather variables for Columbus, Ohio, motivated by the growing need for accurate and accessible data-driven forecasting methods. Long-term historical weather observations from NOAA were cleaned and analyzed, and features such as lagged values, rolling averages, and seasonal indicators were engineered to capture temporal patterns. Two modeling approaches were evaluated: Ridge Regression as a linear baseline and Random Forest as a non-linear ensemble method. Model performance was assessed using a chronological train–test split and standard regression metrics. Results show that Random Forest models provide strong predictive skill for temperature variables, with R² above 0.8 for next-day forecasts and meaningful performance retained up to a seven-day horizon, while Ridge Regression offers a competitive but less flexible baseline. In contrast, precipitation proved difficult to predict from the available features, and all models performed poorly on it. These findings highlight both the strengths and limitations of data-driven weather prediction and motivate future work that integrates physical modeling and neural networks to study long-term climate behavior.
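The chronological split and regression metrics mentioned above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual pipeline: the helper `chronological_split`, the 80/20 cutoff, the single lag feature, and `alpha=1.0` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, r2_score


def chronological_split(df, date_col="DATE", test_fraction=0.2):
    """Split so that every test row is later in time than every training row."""
    ordered = df.sort_values(date_col).reset_index(drop=True)
    cutoff = int(len(ordered) * (1 - test_fraction))
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]


# Synthetic stand-in for the cleaned data: a seasonal cycle plus noise (hypothetical values).
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=730, freq="D")
doy = dates.dayofyear.to_numpy()
tavg = 52 + 22 * np.sin(2 * np.pi * (doy - 110) / 365.25) + rng.normal(0, 4, len(dates))
demo = pd.DataFrame({"DATE": dates, "TAVG": tavg})
demo["TAVG_lag1"] = demo["TAVG"].shift(1)  # previous day's value as the only predictor
demo = demo.dropna()

train, test = chronological_split(demo)
model = Ridge(alpha=1.0).fit(train[["TAVG_lag1"]], train["TAVG"])
pred = model.predict(test[["TAVG_lag1"]])

mae = mean_absolute_error(test["TAVG"], pred)
r2 = r2_score(test["TAVG"], pred)
print(f"MAE: {mae:.2f} F, R^2: {r2:.3f}")
```

Sorting by date and cutting at a fixed position keeps the most recent observations out of training, which is what distinguishes this from a shuffled split.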

HIGHLIGHTS

Key Highlights

  • Large-Scale Historical Dataset
    Trained on nearly 7,000 cleaned daily observations spanning over five decades of NOAA data, enabling the models to learn seasonal patterns, long-term variability, and short-term weather persistence.
  • Realistic Time-Series Forecasting Setup
    Used a chronological train-test split to prevent data leakage and to accurately simulate real-world forecasting conditions.
  • Feature Engineering Grounded in Physical Intuition
    Constructed lagged temperature features, 7-day rolling averages, and seasonal indicators (month and day of year) to capture both short-term trends and annual cycles.
  • Model Comparison
    Implemented and evaluated both Ridge Regression as a linear baseline and Random Forest Regression as a nonlinear ensemble model across multiple forecast horizons (1-day, 3-day, and 7-day).
  • Strong Temperature Prediction Performance
    Random Forest models achieved R² values above 0.8 for next-day temperature predictions, with performance decreasing gradually for longer forecast horizons, consistent with atmospheric uncertainty.
  • Transparent Handling of Precipitation Limitations
    Precipitation predictions showed low predictive skill, highlighting the inherent difficulty of modeling precipitation using daily, location-based features alone.
  • Live Weather Integration
    Compared model forecasts against live OpenWeatherMap observations, providing real-world context and qualitative validation of model performance.
  • Interpretability and Visualization
    Included feature importance analysis, correlation heatmaps, and detailed forecast visualizations to improve interpretability and communicate results clearly.
  • Reproducible and Well-Documented Workflow
    Fully reproducible pipeline with version-tracked libraries, structured analysis, and automatically generated results tables.
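The model comparison across 1-day, 3-day, and 7-day horizons listed above could be set up along the following lines. This is only a sketch on synthetic data: the way targets are built with `shift`, the 80/20 chronological cutoff, `alpha=1.0`, and `n_estimators=50` are assumptions for illustration and may differ from what the project used.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

# Synthetic daily highs (hypothetical data, not the NOAA record).
rng = np.random.default_rng(42)
dates = pd.date_range("2018-01-01", periods=1500, freq="D")
doy = dates.dayofyear.to_numpy()
tmax = 62 + 24 * np.sin(2 * np.pi * (doy - 110) / 365.25) + rng.normal(0, 5, len(dates))
df = pd.DataFrame({"DATE": dates, "TMAX": tmax})

# Predictors on each row use only values observed through the previous day.
df["TMAX_lag1"] = df["TMAX"].shift(1)
df["TMAX_roll7"] = df["TMAX"].rolling(window=7, min_periods=1).mean().shift(1)
df["day_of_year"] = doy
features = ["TMAX_lag1", "TMAX_roll7", "day_of_year"]

results = {}
for horizon in (1, 3, 7):
    data = df.copy()
    # Assumed convention: the target lies `horizon` days after the last observed day.
    data["target"] = data["TMAX"].shift(-(horizon - 1))
    data = data.dropna()
    cutoff = int(len(data) * 0.8)  # chronological cutoff, no shuffling
    train, test = data.iloc[:cutoff], data.iloc[cutoff:]
    models = {
        "ridge": Ridge(alpha=1.0),
        "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    }
    for name, model in models.items():
        model.fit(train[features], train["target"])
        results[(name, horizon)] = r2_score(test["target"], model.predict(test[features]))

for (name, horizon), score in sorted(results.items()):
    print(f"{name:>13} {horizon}-day R^2: {score:.3f}")
```

Fitting one model per horizon keeps each forecast direct, at the cost of training several models; the scores on this toy series say nothing about the project's reported results.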


SKILLS

  • Python
  • pandas, NumPy
  • scikit-learn
  • matplotlib, seaborn
  • NOAA Climate Data API
  • OpenWeatherMap API

SUPPORTING MATERIALS

Additional Details


🌤️ CURRENT WEATHER CONDITIONS (LIVE)
────────────────────────────────────────────────────────────────────────────────
Failed to fetch current weather.

📡 OPENWEATHERMAP 7-DAY FORECAST
────────────────────────────────────────────────────────────────────────────────
Failed to fetch OpenWeatherMap forecast.
Empty DataFrame
Columns: []
Index: []

🤖 MODEL 7-DAY FORECAST (Random Forest)
────────────────────────────────────────────────────────────────────────────────
Date     High (°F)   Low (°F)   Avg (°F)   Precip (in)
Dec 17   36.03       23.76      28.53      0.08
Dec 18   41.55       26.46      33.79      0.24
Dec 19   47.07       29.17      39.04      0.39
Dec 20   44.48       28.20      36.92      0.34
Dec 21   41.88       27.23      34.79      0.29
Dec 22   39.29       26.26      32.66      0.23
Dec 23   36.70       25.29      30.53      0.18

ℹ NOTES
• Current and OpenWeatherMap forecasts are live observations
• Model forecasts are generated using trained Random Forest models
• Differences highlight model uncertainty and real-world variability
def create_features(df):
    """
    Create features for weather prediction:
    - Rolling averages (7-day windows)
    - Lag features (previous day and week)
    - Time-based features (month, day of year)
    """
    df = df.copy()

    # Rolling averages (7-day windows) - shift by 1 to prevent data leakage
    for col in ['TMAX', 'TMIN', 'TAVG', 'PRCP']:
        df[f'{col}_roll7'] = df[col].rolling(window=7, min_periods=1).mean().shift(1)

    # Lag features (previous day and week)
    for col in ['TMAX', 'TMIN', 'TAVG']:
        df[f'{col}_lag1'] = df[col].shift(1)
        df[f'{col}_lag7'] = df[col].shift(7)

    # Time-based features (DATE must already be a datetime column)
    df['month'] = df['DATE'].dt.month
    df['day_of_year'] = df['DATE'].dt.dayofyear

    return df


# Create features (df_clean: the cleaned NOAA DataFrame, defined earlier in the project)
print("Creating features...")
print("=" * 80)

df_features = create_features(df_clean)
# Drop rows with missing values, including the first 7 rows, whose lag-7 features are undefined
df_features = df_features.dropna()

print(f"Features created. Shape: {df_features.shape}")
print("\nFeature columns created:")

feature_cols = [
    'TMAX_roll7', 'TMIN_roll7', 'TAVG_roll7', 'PRCP_roll7',
    'TMAX_lag1', 'TMIN_lag1', 'TAVG_lag1',
    'TMAX_lag7', 'TMIN_lag7', 'TAVG_lag7',
    'month', 'day_of_year'
]

for i, col in enumerate(feature_cols, 1):
    print(f" {i:2d}. {col}")

print("\nSample of engineered features:")
print(df_features[['DATE'] + feature_cols].head())
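To see why the rolling average is shifted by one day, it may help to run the same pandas calls on a toy series. The example below is a small check, not part of the project code: it uses a series whose value on each day equals its position (an assumption chosen so that any leakage is easy to spot).

```python
import numpy as np
import pandas as pd

# Toy series: the value on day i is simply i.
tmax = pd.Series(np.arange(10, dtype=float))

roll7 = tmax.rolling(window=7, min_periods=1).mean().shift(1)
lag1 = tmax.shift(1)
lag7 = tmax.shift(7)

# On day 7 (value 7.0), the shifted rolling feature averages days 0-6 only.
print(roll7.iloc[7])  # 3.0
print(lag1.iloc[7])   # 6.0
print(lag7.iloc[7])   # 0.0

# Without shift(1), the window covers days 1-7 and includes the value being predicted.
leaky = tmax.rolling(window=7, min_periods=1).mean()
print(leaky.iloc[7])  # 4.0
```

The first row has no previous day, so its shifted features are missing, and the lag-7 feature is missing for the first seven rows. That is why a `dropna()` after feature creation removes the start of the series.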
lowinertia
Portfolio Builder for Engineers
Created by Aram Lee
© 2025 Low Inertia. All rights reserved.