MLB Statcast Data Explorer
Data Engineering & ML Pipeline for Baseball Analytics
Personal Project | 2024-2025
You're looking at it! This baseball webapp is powered by the data pipeline described below. The games, players, and pitch data you see throughout this site were collected and processed using these tools.
Overview
A comprehensive MLB data collection and analysis pipeline designed to build a complete database for machine learning models and predictive analytics in baseball. The system ingests pitch-by-pitch tracking data from Baseball Savant, stores it in a normalized SQLite database, and extracts features for predictive modeling.
The pipeline processes every pitch thrown in MLB games—capturing velocity, spin rate, movement, location, and outcome data. This foundation enables ML applications like pitch outcome prediction, hit probability models, and player performance forecasting.
Technical Highlights
Processor Suite Architecture
Abstract base class with specialized implementations for single-game, daily batch, and date-range processing patterns.
Normalized Database Schema
Fully normalized Schema V2 with proper foreign keys, surrogate keys, and 30+ analytics views for efficient querying.
ML Feature Engineering
Temporal-aware feature computation with no data leakage. Platoon splits, recent form tracking, and pitcher-batter matchup features.
Batch Processing Optimization
Intelligent API batching reduces calls significantly—fetching Statcast data once per day instead of per-game.
The pipeline follows a four-stage architecture from raw API data to ML-ready features.
SOURCE APIs → PROCESSOR ENGINE (GameProcessorBase) → SQLITE DB → ML FEATURES (game_starter_features)
The processor suite uses an abstract base class pattern for flexibility. Each processor type is optimized for different collection scenarios.
| Processor | Input | Use Case |
|---|---|---|
| SingleGameProcessor | game_pk + date | Reprocessing specific games, testing |
| DailyGamesProcessor | date | All games for a day (batch optimized) |
| RangeGamesProcessor | start_date, end_date | Full season processing with chunking |
| MLBGameProcessor | game_pk, pitch_df | Core engine (handles all uploads) |
Usage Example
from processors import DailyGamesProcessor

# Process all MLB games for a specific date
processor = DailyGamesProcessor(date="2024-07-15")
processor.run()

# Or process a date range with monthly chunking
from processors import RangeGamesProcessor

processor = RangeGamesProcessor(
    start_date="2024-04-01",
    end_date="2024-09-30",
    chunk_size="monthly",
)
processor.run()
Architecture Pattern
from abc import ABC, abstractmethod

import pybaseball
import statsapi

class GameProcessorBase(ABC):
    """Abstract base class for all processors"""

    @abstractmethod
    def run(self) -> None:
        pass

class DailyGamesProcessor(GameProcessorBase):
    """Batch-optimized: fetches Statcast ONCE per day"""

    def run(self):
        # 1. Get all games for date
        games = statsapi.schedule(date=self.date)
        # 2. Fetch Statcast data ONCE (not per-game)
        pitch_df = pybaseball.statcast(start_dt=self.date, end_dt=self.date)
        # 3. Process each game with shared data
        for game in games:
            processor = MLBGameProcessor(game, pitch_df)
            processor.process()
Key Design Decision: Batch Optimization
The DailyGamesProcessor fetches Statcast data once per day and shares the DataFrame across all games. This reduces API calls from ~30 per day (one per game) to just 2 (one for game schedule, one for pitch data).
Schema V2 is fully normalized with proper foreign key constraints and surrogate keys. The design separates reference data, fact tables, and ML feature tables.
We designed the schema so the facts themselves are the source of truth. Each individual pitch from Statcast is treated as the authoritative record, and everything else rolls up from that grain. That decision keeps the dataset faithful to the raw feed and makes every downstream metric traceable to the original event.
Implementation-wise, we built processors that parse pybaseball's Statcast feeds and MLB Stats API responses into normalized tables, supplementing with GUMBO JSON for game-level metadata like weather conditions and umpire assignments. The goal was one clean copy of the data—no duplicated attributes, no competing sources—and explicit PK/FK relationships so joins are obvious and repeatable.
We separated ML features from raw facts on purpose. Snapshots are computed later from the base tables using time-accurate queries, which makes it easier to validate that feature calculations match Statcast definitions and avoid leakage. It also keeps feature generation flexible: we can revise feature logic without rewriting the underlying facts.
This workflow lets us explore locally using a stable, untouched record of every API response, while feature tables stay lightweight and iteration-friendly during model development. The raw layer stays immutable; the ML layer is the abstraction we can evolve.
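To make the normalization concrete, here is a minimal sketch of what a pitch-grain core might look like in SQLite: reference tables with surrogate keys, a fact table at the individual-pitch grain, and explicit foreign keys. Table and column names are illustrative, not the actual Schema V2.

```python
import sqlite3

# Illustrative normalized core: reference tables carry surrogate keys,
# the fact table sits at the pitch grain and points back via FKs.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE players (
    player_id INTEGER PRIMARY KEY,   -- surrogate key
    mlbam_id  INTEGER UNIQUE,        -- natural key from the API
    full_name TEXT NOT NULL
);
CREATE TABLE games (
    game_id   INTEGER PRIMARY KEY,   -- surrogate key
    game_pk   INTEGER UNIQUE,        -- MLB's natural key
    game_date TEXT NOT NULL
);
CREATE TABLE pitches (
    pitch_id      INTEGER PRIMARY KEY,
    game_id       INTEGER NOT NULL REFERENCES games(game_id),
    pitcher_id    INTEGER NOT NULL REFERENCES players(player_id),
    batter_id     INTEGER NOT NULL REFERENCES players(player_id),
    release_speed REAL,
    events        TEXT
);
""")
```

Because every downstream table references these surrogate keys, joins are unambiguous and the raw pitch record never needs to be duplicated.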
Query Layer: 30+ Analytics Views
Rather than expose raw tables to notebooks and dashboards, we built a view layer that abstracts common query patterns. Downstream consumers get clean interfaces like pitcher_season_stats or batter_platoon_splits without needing to understand the underlying joins. This decouples the schema from its consumers—we can refactor tables without breaking queries.
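A toy version of the view pattern, using a simplified `pitches` table (the real views sit over the full schema and this aggregation logic is only a stand-in):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pitches (pitcher_id INTEGER, game_id INTEGER,
                      release_speed REAL, events TEXT);

-- The view hides the aggregation logic; notebooks and dashboards
-- just SELECT from pitcher_season_stats and never see the joins.
CREATE VIEW pitcher_season_stats AS
SELECT pitcher_id,
       COUNT(DISTINCT game_id) AS games,
       AVG(release_speed)      AS avg_velocity,
       SUM(events = 'strikeout') AS strikeouts
FROM pitches
GROUP BY pitcher_id;
""")
conn.execute(
    "INSERT INTO pitches VALUES (1, 10, 95.2, 'strikeout'), (1, 10, 93.8, NULL)"
)
row = conn.execute("SELECT games, strikeouts FROM pitcher_season_stats").fetchone()
print(row)  # (1, 1)
```

If the underlying tables are later refactored, only the view definition changes; every consumer keeps working.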
Schema V1 → V2 Migration
Schema V2 fixed numerous normalization violations from V1—removing denormalized player names, team records, and derived fields. All computed metrics now live in SQL views or feature tables, keeping the core schema clean and maintainable.
The feature engineering pipeline computes cumulative statistics with strict temporal awareness to prevent data leakage. All features represent what was known before each game.
Critical: No Data Leakage
A common ML mistake is computing features using data from after the prediction target. This pipeline ensures all features use only historical data by filtering on game_date < as_of_date.
-- Correct: Only use data BEFORE the game
WHERE g.game_date < :as_of_date

-- Wrong: Would leak future information
WHERE g.game_date <= :as_of_date
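The same guard is easy to see in a small pandas example. The toy numbers below are made up; the point is how `<=` quietly pulls the target game's own stats into its features:

```python
import pandas as pd

# Toy pitch log; the 2024-07-15 row is the game we're predicting.
games = pd.DataFrame({
    "game_date": pd.to_datetime(["2024-07-01", "2024-07-08", "2024-07-15"]),
    "strikeouts": [7, 9, 5],
})
as_of = pd.Timestamp("2024-07-15")

history = games[games["game_date"] < as_of]   # correct: strictly before
leaky   = games[games["game_date"] <= as_of]  # wrong: includes the target game

print(history["strikeouts"].mean())  # 8.0 -- only prior starts
print(leaky["strikeouts"].mean())    # 7.0 -- contaminated by the label
```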
Feature Tables
pitcher_stats_snapshots
Cumulative pitcher stats at each game date: velocity, spin rate, K%, BB%, zone%.
batter_stats_snapshots
Batter stats with L/R platoon splits: contact rate, whiff%, swing%, chase rate.
game_starter_features
Final ML-ready features combining pitcher, lineup, and matchup data.
Feature Categories
Pitcher Features
- Average velocity by pitch type
- Spin rate and movement metrics
- Zone% (pitches in strike zone)
- K% and BB% rates
- Whiff% and chase% induced
- First-pitch strike%
Batter Features
- Contact quality (exit velo, launch angle)
- Swing rate and chase rate
- K% and BB% at the plate
- xBA, xSLG, xwOBA expected stats
- L/R platoon splits
Recent Form Features
- Last 3 starts (pitcher trends)
- Last 7 days (lineup trends)
- K% trend (recent vs season)
- Days since last start
Platoon Splits
- Pitcher stats vs LHB / vs RHB
- Batter stats vs LHP / vs RHP
- Lineup-weighted contact quality
- Matchup-specific K rates
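Platoon splits fall out of a simple group-by on the batter's side. This is a toy sketch with invented data and column names (Statcast calls the batter side `stand`), not the actual snapshot code:

```python
import pandas as pd

# Toy Statcast-style plate appearances; `stand` is the batter's side (L/R).
pas = pd.DataFrame({
    "pitcher_id": [1, 1, 1, 1],
    "stand": ["L", "L", "R", "R"],
    "is_strikeout": [1, 0, 1, 1],
})

# Platoon split: this pitcher's K rate vs LHB and vs RHB.
splits = pas.groupby(["pitcher_id", "stand"])["is_strikeout"].mean()
print(splits.loc[(1, "L")])  # 0.5
print(splits.loc[(1, "R")])  # 1.0
```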
Feature Computation Example
def compute_pitcher_snapshot(pitcher_id: int, as_of_date: date):
    """Compute cumulative stats BEFORE game date (no data leakage)"""
    return db.execute("""
        SELECT
            COUNT(DISTINCT p.game_id) AS games_started,
            AVG(release_speed) AS avg_velocity,
            SUM(CASE WHEN events = 'strikeout' THEN 1 ELSE 0 END) * 1.0 /
                NULLIF(COUNT(DISTINCT pa_id), 0) AS k_rate
            -- All stats use data BEFORE as_of_date
        FROM pitches p
        JOIN games g ON p.game_id = g.game_id
        WHERE p.pitcher_id = :pitcher_id
          AND g.game_date < :as_of_date  -- Critical: no leakage
    """, {"pitcher_id": pitcher_id, "as_of_date": as_of_date})
Target Variable: Strikeout Prediction
The primary ML target is predicting pitcher strikeout outcomes—relevant for sports analytics and betting applications. Both binary classification (over/under threshold) and regression (exact count) approaches have been tested using the feature set above.
The data pipeline isn't just infrastructure—it powers real analytical tools. Here's what this foundation enables, both today and in future iterations.
Live Now
Pitcher Similarity Engine
KNN-based tool that finds comparable pitchers using 15+ Statcast features. Useful for identifying trade targets, comp contracts, or development templates.
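The core of a KNN similarity engine fits in a few lines. This sketch uses three invented features and three toy pitchers; the live tool uses 15+ Statcast features, but the scale-then-query pattern is the same:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: rows are pitchers, columns are Statcast-style
# features (avg velocity, spin rate, whiff%). Values are made up.
features = np.array([
    [97.5, 2450, 0.32],   # power arm
    [91.0, 2200, 0.22],   # finesse
    [96.8, 2400, 0.30],   # similar profile to the first pitcher
])

# Standardize first so velocity (mph) doesn't dwarf the rate stats.
X = StandardScaler().fit_transform(features)
knn = NearestNeighbors(n_neighbors=2).fit(X)
_, idx = knn.kneighbors(X[0:1])
print(idx[0])  # pitcher 0's neighbors: itself first, then pitcher 2
```

Scaling matters here: without it, the spin-rate column (thousands of rpm) would dominate the distance metric entirely.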
Pitcher Archetypes
Clustering analysis that groups pitchers by repertoire and approach—power arms, finesse pitchers, ground-ball specialists. Helps scouts categorize and compare.
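A minimal version of the archetype clustering, again with invented features and pitchers purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy repertoire/approach features: fastball%, avg velocity, ground-ball%.
pitchers = np.array([
    [0.65, 97.0, 0.38],  # power arm
    [0.62, 96.5, 0.40],  # power arm
    [0.40, 90.0, 0.55],  # ground-ball specialist
    [0.42, 89.5, 0.57],  # ground-ball specialist
])

X = StandardScaler().fit_transform(pitchers)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # two archetypes: the first pair vs the second pair
```

In practice the cluster count is a modeling choice (elbow/silhouette methods help), and each cluster gets a human-readable label after inspecting its centroid.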
Future Explorations
Strikeout Prediction
In Progress: Ridge Regression model predicting pitcher K totals. Current baseline: 55.3% accuracy overall, 66.1% on high-confidence picks. Exploring ensemble methods.
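The modeling setup is straightforward scikit-learn. This sketch uses simulated data standing in for the 9 pre-game features, just to show the shape of the pipeline, not the actual model or its coefficients:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Simulated stand-in for the 9-feature setup: X might hold pitcher K%,
# whiff%, opponent chase rate, etc.; y is the strikeout total.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))
y = 6 + 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

model = Ridge(alpha=1.0).fit(X, y)
print(round(model.coef_[0], 1))  # recovers the simulated effect, ~2.0

# A regression output converts to an over/under pick against a line:
pick = "over" if model.predict(X[:1])[0] > 5.5 else "under"
```

One advantage of Ridge here is interpretability: the coefficients make it easy to see which features carry the prediction, which feeds directly into the iterative feature pruning described below.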
Stuff+ Model
Build a custom pitch quality metric from raw Statcast inputs—velocity, movement, spin, location—benchmarked against league averages.
Development Tracking
Track pitcher evolution over time—velocity trends, new pitch adoption, command improvements. Useful for player development staff.
Where This Started & Where It's Going
The original goal was simple: aggregate Statcast data into something I could query locally, similar to what MLB provides but without rate limits and with full SQL access. I wanted to explore the data on my own terms.
That foundation grew into this webapp, and now the focus has shifted to the ML side. Current work includes improving the KNN pitcher similarity engine, building out a parallel system for hitter comparisons, and refining the strikeout prediction model.
The strikeout model is already showing results—a 9-feature Ridge Regression hitting 55.3% overall accuracy and 66.1% on high-confidence predictions (above the ~52.4% break-even rate needed to beat the vig at standard -110 odds; I use 54% as a margin of safety). Still iterating on feature selection and looking at ensemble approaches, but the baseline seems promising and should (theoretically) be profitable. *Assuming my approach is correct - would love to get feedback on this!
Iterative Feature Refinement
The pipeline supports rapid iteration on feature engineering. Train a model, analyze coefficients and feature importance, drop what doesn't work, add new hypotheses. The strikeout model went from 22 features that were probably redundant to the 9 I ended up with.
Chunked Processing
The RangeGamesProcessor uses monthly chunking by default—processing a full season requires only ~6 API calls for Statcast data instead of ~180 (one per day). Memory-efficient and respectful of rate limits.
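A sketch of the monthly chunking logic (the real `RangeGamesProcessor` internals may differ; `monthly_chunks` is a hypothetical helper):

```python
from datetime import date, timedelta

def monthly_chunks(start: date, end: date):
    """Yield (chunk_start, chunk_end) pairs, one per calendar month.

    Each chunk becomes a single Statcast fetch instead of one per day.
    """
    cur = start
    while cur <= end:
        # Last day of cur's month: 1st of the next month, minus one day.
        next_month = date(cur.year + cur.month // 12, cur.month % 12 + 1, 1)
        chunk_end = min(next_month - timedelta(days=1), end)
        yield cur, chunk_end
        cur = chunk_end + timedelta(days=1)

chunks = list(monthly_chunks(date(2024, 4, 1), date(2024, 9, 30)))
print(len(chunks))  # 6 chunks for a full season -> ~6 Statcast calls
```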
Flexible Game Type Filtering
Currently processing regular season games only (game_type = 'R') to keep ML training data consistent. Adding spring training, playoffs, or All-Star games is a one-line filter change—useful if you want to analyze postseason performance separately or build October-specific models.
Database-First Player Lookup
The processor queries the local database before making API calls for player info. This caching strategy reduces external API load significantly—most players are already stored after the first few games of processing.
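The cache-aside pattern behind this can be sketched in a few lines. The API call here is a stub that counts invocations; the real processor would call the MLB Stats API on a miss:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (mlbam_id INTEGER PRIMARY KEY, full_name TEXT)")

def fetch_from_api(mlbam_id: int) -> str:
    # Stand-in for a Stats API lookup; counts calls to show the savings.
    fetch_from_api.calls += 1
    return f"Player {mlbam_id}"
fetch_from_api.calls = 0

def lookup_player(mlbam_id: int) -> str:
    row = conn.execute(
        "SELECT full_name FROM players WHERE mlbam_id = ?", (mlbam_id,)
    ).fetchone()
    if row:                              # cache hit: no API call
        return row[0]
    name = fetch_from_api(mlbam_id)      # miss: fetch once, then store
    conn.execute("INSERT INTO players VALUES (?, ?)", (mlbam_id, name))
    return name

lookup_player(660271)        # first sight of this player: hits the API
lookup_player(660271)        # second lookup: served from the local DB
print(fetch_from_api.calls)  # 1
```

Once a season's worth of games has been processed, nearly every lookup follows the second path.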