MLB Statcast Data Explorer

Data Engineering & ML Pipeline for Baseball Analytics

Personal Project | 2024-2025

Python · pybaseball · SQLAlchemy · SQLite · pandas · MLB Stats API · Statcast

You're looking at it! This baseball webapp is powered by the data pipeline described below. The games, players, and pitch data you see throughout this site were collected and processed using these tools.

Overview

A comprehensive MLB data collection and analysis pipeline designed to build a complete database for machine learning models and predictive analytics in baseball. The system ingests pitch-by-pitch tracking data from Baseball Savant, stores it in a normalized SQLite database, and extracts features for predictive modeling.

The pipeline processes every pitch thrown in MLB games—capturing velocity, spin rate, movement, location, and outcome data. This foundation enables ML applications like pitch outcome prediction, hit probability models, and player performance forecasting.

Technical Highlights

Processor Suite Architecture

Abstract base class with specialized implementations for single-game, daily batch, and date-range processing patterns.

Normalized Database Schema

Fully normalized Schema V2 with proper foreign keys, surrogate keys, and 30+ analytics views for efficient querying.

ML Feature Engineering

Temporal-aware feature computation with no data leakage. Platoon splits, recent form tracking, and pitcher-batter matchup features.

Batch Processing Optimization

Intelligent API batching reduces calls significantly—fetching Statcast data once per day instead of per-game.

The pipeline follows a four-stage architecture from raw API data to ML-ready features.

1. SOURCE APIs: Baseball Savant and the MLB Stats API, accessed via pybaseball and statsapi. Output: JSON → DataFrame.

2. PROCESSOR ENGINE: base class GameProcessorBase, with SingleGameProcessor, DailyGamesProcessor, and RangeGamesProcessor implementations. Transform: DataFrame → Rows.

3. SQLITE DB: Schema V2, fully normalized. Reference tables: teams, players, ballparks, umpires. Fact tables: games, pitches, plate_appearances. Feature tables: pitcher_stats, batter_stats. Store: Rows → Tables.

4. ML FEATURES: cumulative stats, platoon splits (L/R), recent form (L3, L7), and pitcher-batter matchups, written to the game_starter_features table. Output: Tables → Vector.

The processor suite uses an abstract base class pattern for flexibility. Each processor type is optimized for different collection scenarios.

Processor | Input | Use Case
SingleGameProcessor | game_pk + date | Reprocessing specific games, testing
DailyGamesProcessor | date | All games for a day (batch optimized)
RangeGamesProcessor | start_date, end_date | Full season processing with chunking
MLBGameProcessor | game_pk, pitch_df | Core engine (handles all uploads)

Usage Example

from processors import DailyGamesProcessor

# Process all MLB games for a specific date
processor = DailyGamesProcessor(date="2024-07-15")
processor.run()

# Or process a date range with monthly chunking
from processors import RangeGamesProcessor
processor = RangeGamesProcessor(
    start_date="2024-04-01",
    end_date="2024-09-30",
    chunk_size="monthly"
)
processor.run()

Architecture Pattern

from abc import ABC, abstractmethod

import pybaseball
import statsapi  # MLB-StatsAPI package


class GameProcessorBase(ABC):
    """Abstract base class for all processors"""

    @abstractmethod
    def run(self) -> None:
        pass


class DailyGamesProcessor(GameProcessorBase):
    """Batch-optimized: fetches Statcast ONCE per day"""

    def run(self):
        # 1. Get all games for the date
        games = statsapi.schedule(date=self.date)

        # 2. Fetch Statcast data ONCE (not per game)
        pitch_df = pybaseball.statcast(self.date)

        # 3. Process each game with the shared DataFrame
        for game in games:
            processor = MLBGameProcessor(game, pitch_df)
            processor.process()

Key Design Decision: Batch Optimization

The DailyGamesProcessor fetches Statcast data once per day and shares the DataFrame across all games. This reduces API calls from ~30 per day (one per game) to just 2 (one for game schedule, one for pitch data).

Schema V2 is fully normalized with proper foreign key constraints and surrogate keys. The design separates reference data, fact tables, and ML feature tables.

Database Core Tables diagram (generated from PlantUML)

We designed the schema so the facts themselves are the source of truth. Each individual pitch from Statcast is treated as the authoritative record, and everything else rolls up from that grain. That decision keeps the dataset faithful to the raw feed and makes every downstream metric traceable to the original event.

Implementation-wise, we built processors that parse pybaseball's Statcast feeds and MLB Stats API responses into normalized tables, supplementing with GUMBO JSON for game-level metadata like weather conditions and umpire assignments. The goal was one clean copy of the data—no duplicated attributes, no competing sources—and explicit PK/FK relationships so joins are obvious and repeatable.

We separated ML features from raw facts on purpose. Snapshots are computed later from the base tables using time-accurate queries, which makes it easier to validate that feature calculations match Statcast definitions and avoid leakage. It also keeps feature generation flexible: we can revise feature logic without rewriting the underlying facts.

This workflow lets us explore locally using a stable, untouched record of every API response, while feature tables stay lightweight and iteration-friendly during model development. The raw layer stays immutable; the ML layer is the abstraction we can evolve.

Query Layer: 30+ Analytics Views

Rather than expose raw tables to notebooks and dashboards, we built a view layer that abstracts common query patterns. Downstream consumers get clean interfaces like pitcher_season_stats or batter_platoon_splits without needing to understand the underlying joins. This decouples the schema from its consumers—we can refactor tables without breaking queries.

pitcher_season_stats · pitcher_platoon_splits · batter_platoon_splits · pitcher_batter_matchups · team_records_by_date · games_with_details · +24 more
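The view-layer idea can be sketched with an in-memory SQLite database. Everything below is illustrative: the table, column, and view names are simplified stand-ins, not the actual Schema V2 definitions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE players (player_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE pitches (
        pitch_id INTEGER PRIMARY KEY,
        pitcher_id INTEGER REFERENCES players(player_id),
        release_speed REAL,
        events TEXT
    );

    -- Consumers query the view, never the raw tables, so the
    -- underlying schema can be refactored without breaking them.
    CREATE VIEW pitcher_velocity_summary AS
    SELECT p.pitcher_id,
           pl.name,
           AVG(p.release_speed) AS avg_velocity,
           COUNT(*) AS pitch_count
    FROM pitches p
    JOIN players pl ON pl.player_id = p.pitcher_id
    GROUP BY p.pitcher_id, pl.name;
""")

conn.execute("INSERT INTO players VALUES (1, 'Example Pitcher')")
conn.executemany(
    "INSERT INTO pitches (pitcher_id, release_speed, events) VALUES (?, ?, ?)",
    [(1, 95.0, None), (1, 97.0, "strikeout")],
)

row = conn.execute(
    "SELECT name, avg_velocity, pitch_count FROM pitcher_velocity_summary"
).fetchone()
print(row)  # ('Example Pitcher', 96.0, 2)
```

Notebooks and dashboards only ever see the view name, which is the decoupling described above.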

Schema V1 → V2 Migration

Schema V2 fixed numerous normalization violations from V1—removing denormalized player names, team records, and derived fields. All computed metrics now live in SQL views or feature tables, keeping the core schema clean and maintainable.

The feature engineering pipeline computes cumulative statistics with strict temporal awareness to prevent data leakage. All features represent what was known before each game.

Critical: No Data Leakage

A common ML mistake is computing features using data from after the prediction target. This pipeline ensures all features use only historical data by filtering on game_date < as_of_date.

-- Correct: Only use data BEFORE the game
WHERE g.game_date < :as_of_date

-- Wrong: Would leak future information
WHERE g.game_date <= :as_of_date
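The difference between the two predicates can be demonstrated end to end with a tiny in-memory table (a hypothetical three-game schedule, not the real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE games (game_id INTEGER, game_date TEXT)")
conn.executemany("INSERT INTO games VALUES (?, ?)", [
    (1, "2024-07-13"), (2, "2024-07-14"), (3, "2024-07-15"),
])

as_of = "2024-07-15"  # the game we are predicting

# Correct: strictly earlier games only
clean = conn.execute(
    "SELECT COUNT(*) FROM games WHERE game_date < ?", (as_of,)
).fetchone()[0]

# Wrong: <= silently includes the target game itself
leaky = conn.execute(
    "SELECT COUNT(*) FROM games WHERE game_date <= ?", (as_of,)
).fetchone()[0]

print(clean, leaky)  # 2 3
```

The leaky query counts the very game whose outcome is being predicted, which is exactly the mistake the pipeline's `< :as_of_date` filter prevents.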

Feature Tables

pitcher_stats_snapshots

Cumulative pitcher stats at each game date: velocity, spin rate, K%, BB%, zone%.

batter_stats_snapshots

Batter stats with L/R platoon splits: contact rate, whiff%, swing%, chase rate.

game_starter_features

Final ML-ready features combining pitcher, lineup, and matchup data.
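A rough sketch of how a game_starter_features row might be assembled from the two snapshot tables. The field names and the lineup-averaging step are illustrative assumptions, not the actual implementation:

```python
def build_starter_features(pitcher_snap: dict, lineup_snaps: list[dict]) -> dict:
    """Combine one pitcher snapshot with the opposing lineup's snapshots."""
    lineup_k = [b["k_rate"] for b in lineup_snaps]
    return {
        "pitcher_avg_velocity": pitcher_snap["avg_velocity"],
        "pitcher_k_rate": pitcher_snap["k_rate"],
        # Aggregate the lineup into a single matchup-level feature
        "lineup_avg_k_rate": sum(lineup_k) / len(lineup_k),
        "lineup_size": len(lineup_snaps),
    }

features = build_starter_features(
    {"avg_velocity": 95.2, "k_rate": 0.28},
    [{"k_rate": 0.20}, {"k_rate": 0.30}, {"k_rate": 0.25}],
)
print(features["lineup_avg_k_rate"])  # 0.25
```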

Feature Categories

Pitcher Features
  • Average velocity by pitch type
  • Spin rate and movement metrics
  • Zone% (pitches in strike zone)
  • K% and BB% rates
  • Whiff% and chase% induced
  • First-pitch strike%
Batter Features
  • Contact quality (exit velo, launch angle)
  • Swing rate and chase rate
  • K% and BB% at the plate
  • xBA, xSLG, xwOBA expected stats
  • L/R platoon splits
Recent Form Features
  • Last 3 starts (pitcher trends)
  • Last 7 days (lineup trends)
  • K% trend (recent vs season)
  • Days since last start
Platoon Splits
  • Pitcher stats vs LHB / vs RHB
  • Batter stats vs LHP / vs RHP
  • Lineup-weighted contact quality
  • Matchup-specific K rates

Feature Computation Example

def compute_pitcher_snapshot(pitcher_id: int, as_of_date: date):
    """Compute cumulative stats BEFORE game date (no data leakage)"""

    return db.execute("""
        SELECT
            COUNT(DISTINCT game_id) as games_started,
            AVG(release_speed) as avg_velocity,
            -- All stats use data BEFORE as_of_date
            -- (* 1.0 forces float division in SQLite)
            SUM(CASE WHEN events = 'strikeout' THEN 1 ELSE 0 END) * 1.0 /
                NULLIF(COUNT(DISTINCT pa_id), 0) as k_rate
        FROM pitches p
        JOIN games g ON p.game_id = g.game_id
        WHERE p.pitcher_id = :pitcher_id
          AND g.game_date < :as_of_date  -- Critical: no leakage
    """, {"pitcher_id": pitcher_id, "as_of_date": as_of_date})

Target Variable: Strikeout Prediction

The primary ML target is predicting pitcher strikeout outcomes—relevant for sports analytics and betting applications. Both binary classification (over/under threshold) and regression (exact count) approaches have been tested using the feature set above.
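The two formulations can be expressed as a pair of label functions. This is a sketch of the idea only; the threshold value is illustrative, not the model's actual line.

```python
def binary_target(strikeouts: int, line: float = 5.5) -> int:
    """Classification label: 1 if the pitcher went over the line, else 0."""
    return int(strikeouts > line)

def regression_target(strikeouts: int) -> float:
    """Regression label: the exact strikeout count."""
    return float(strikeouts)

print(binary_target(7), binary_target(4))  # 1 0
```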

The data pipeline isn't just infrastructure—it powers real analytical tools. Here's what this foundation enables, both today and in future iterations.

Live Now

Pitcher Similarity Engine

KNN-based tool that finds comparable pitchers using 15+ Statcast features. Useful for identifying trade targets, comp contracts, or development templates.

Try it live →
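The core of a KNN similarity lookup like this can be sketched in a few lines: standardize each feature, then rank pitchers by Euclidean distance. The pitcher names and feature values below are made up for illustration, and the real engine uses 15+ features rather than three.

```python
import math

pitchers = {
    "Pitcher A": [96.5, 2400, 0.31],   # velocity, spin rate, K%
    "Pitcher B": [89.0, 2100, 0.18],
    "Pitcher C": [95.8, 2350, 0.29],
}

def standardize(vectors):
    # z-score each feature so no single unit dominates the distance
    cols = list(zip(*vectors.values()))
    means = [sum(c) / len(c) for c in cols]
    stds = [max((sum((x - m) ** 2 for x in c) / len(c)) ** 0.5, 1e-9)
            for c, m in zip(cols, means)]
    return {k: [(x - m) / s for x, m, s in zip(v, means, stds)]
            for k, v in vectors.items()}

def nearest(query, vectors, k=1):
    z = standardize(vectors)
    dists = sorted(
        (math.dist(z[query], z[name]), name)
        for name in vectors if name != query
    )
    return [name for _, name in dists[:k]]

print(nearest("Pitcher A", pitchers))  # ['Pitcher C']
```

Standardizing first matters: without it, spin rate (measured in thousands of rpm) would swamp K% (a fraction) in the distance calculation.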
Pitcher Archetypes

Clustering analysis that groups pitchers by repertoire and approach—power arms, finesse pitchers, ground-ball specialists. Helps scouts categorize and compare.

Velocity profiles · Pitch mix

Future Explorations

Strikeout Prediction
In Progress

Ridge Regression model predicting pitcher K totals. Current baseline: 55.3% accuracy overall, 66.1% on high-confidence picks. Exploring ensemble methods.

Stuff+ Model

Build a custom pitch quality metric from raw Statcast inputs—velocity, movement, spin, location—benchmarked against league averages.

Development Tracking

Track pitcher evolution over time—velocity trends, new pitch adoption, command improvements. Useful for player development staff.

Where This Started & Where It's Going

The original goal was simple: aggregate Statcast data into something I could query locally, similar to what MLB provides but without rate limits and with full SQL access. I wanted to explore the data on my own terms.

That foundation grew into this webapp, and now the focus has shifted to the ML side. Current work includes improving the KNN pitcher similarity engine, building out a parallel system for hitter comparisons, and refining the strikeout prediction model.

The strikeout model is already showing results: a 9-feature Ridge Regression hitting 55.3% overall accuracy and 66.1% on high-confidence predictions (above the 54% threshold needed to beat the vig). Still iterating on feature selection and exploring ensemble approaches, but the baseline seems promising and should (theoretically) be profitable. *Assuming my approach is correct; I'd love feedback on this!

Iterative Feature Refinement

The pipeline supports rapid iteration on feature engineering. Train a model, analyze coefficients and feature importance, drop what doesn't work, add new hypotheses. The strikeout model went from 22 features that were probably redundant to the 9 I ended up with.
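The pruning step of that loop might look like the sketch below: drop features whose standardized coefficients fall under a threshold. The feature names, coefficient values, and cutoff are all invented for illustration.

```python
coefficients = {
    "avg_velocity": 0.42, "k_rate_season": 0.61, "spin_rate": 0.05,
    "days_rest": -0.02, "opp_lineup_k_rate": 0.38,
}

def prune(coefs: dict, threshold: float = 0.1) -> list[str]:
    """Keep only features whose |coefficient| clears the threshold."""
    return sorted(f for f, c in coefs.items() if abs(c) >= threshold)

print(prune(coefficients))
# ['avg_velocity', 'k_rate_season', 'opp_lineup_k_rate']
```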

Chunked Processing

The RangeGamesProcessor uses monthly chunking by default—processing a full season requires only ~6 API calls for Statcast data instead of ~180 (one per day). Memory-efficient and respectful of rate limits.
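A minimal sketch of how monthly chunking might split a season into per-call date ranges (the actual RangeGamesProcessor internals may differ):

```python
from datetime import date, timedelta

def monthly_chunks(start: date, end: date):
    """Yield (chunk_start, chunk_end) pairs, one per calendar month."""
    current = start
    while current <= end:
        # First day of the following month
        if current.month == 12:
            next_month = date(current.year + 1, 1, 1)
        else:
            next_month = date(current.year, current.month + 1, 1)
        chunk_end = min(next_month - timedelta(days=1), end)
        yield current, chunk_end
        current = next_month

chunks = list(monthly_chunks(date(2024, 4, 1), date(2024, 9, 30)))
print(len(chunks))  # 6 -> roughly one Statcast call per month for a season
```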

Flexible Game Type Filtering

Currently processing regular season games only (game_type = 'R') to keep ML training data consistent. Adding spring training, playoffs, or All-Star games is a one-line filter change—useful if you want to analyze postseason performance separately or build October-specific models.
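The filter itself is trivial. The schedule rows below are simplified stand-ins for what the MLB Stats API returns, but they show the one-line change:

```python
schedule = [
    {"game_pk": 1, "game_type": "R"},   # regular season
    {"game_pk": 2, "game_type": "S"},   # spring training
    {"game_pk": 3, "game_type": "R"},
    {"game_pk": 4, "game_type": "P"},   # postseason
]

GAME_TYPES = {"R"}  # widen to {"R", "P"} to include playoffs

games = [g for g in schedule if g["game_type"] in GAME_TYPES]
print([g["game_pk"] for g in games])  # [1, 3]
```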

Database-First Player Lookup

The processor queries the local database before making API calls for player info. This caching strategy reduces external API load significantly—most players are already stored after the first few games of processing.
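The cache-first pattern can be sketched as follows. The `fetch_from_api` function here is a hypothetical stand-in for the real MLB Stats API request, and the table is a simplified version of the players table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (player_id INTEGER PRIMARY KEY, name TEXT)")

api_calls = 0

def fetch_from_api(player_id: int) -> str:
    """Stand-in for the external API request."""
    global api_calls
    api_calls += 1
    return f"Player {player_id}"

def get_player(player_id: int) -> str:
    # 1. Try the local database first
    row = conn.execute(
        "SELECT name FROM players WHERE player_id = ?", (player_id,)
    ).fetchone()
    if row:
        return row[0]
    # 2. Cache miss: hit the API once, then store the result
    name = fetch_from_api(player_id)
    conn.execute("INSERT INTO players VALUES (?, ?)", (player_id, name))
    return name

get_player(660271)   # miss -> one API call, row inserted
get_player(660271)   # hit  -> served from the database
print(api_calls)     # 1
```

After the first few days of processing, most lookups resolve against the local table and never touch the network.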