# NBA Playoff Series Outcome Predictor

Predicts which team wins each NBA playoff series using pre-series regular season data. Built as an end-to-end machine learning pipeline: data collection, EDA, feature engineering, model comparison, and SHAP explainability.

**Stack:** Python 3.11 · nba_api · pandas · scikit-learn · XGBoost · SHAP

---

## 2025–26 Playoff Predictions

Current-season metrics sourced from Basketball Reference. Model applied to live second-round matchups.

### Bracket Projection

![Bracket](nba-playoffs/bracket_2025-26.png)

### Championship Probability

Computed by simulating all possible opponent paths through the bracket — not just the single most-likely path.

![Championship probability](nba-playoffs/championship_prob_2025-26.png)
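
Weighting every possible opponent path, rather than following only the favorite in each slot, can be sketched as a recursion over the bracket tree. This is a minimal illustration and not the project's implementation: the team labels, the `RATINGS` numbers, and the logistic `win_prob` function are made-up stand-ins for the trained model's series probabilities.

```python
import math

# Hypothetical strength ratings; in the real pipeline the matchup
# probabilities would come from the trained series model instead.
RATINGS = {"A": 6.0, "B": 2.0, "C": 4.0, "D": 1.0}


def win_prob(team: str, opponent: str) -> float:
    """Assumed stand-in: logistic function of the rating gap."""
    return 1.0 / (1.0 + math.exp(-(RATINGS[team] - RATINGS[opponent]) / 4.0))


def advance_probs(bracket) -> dict[str, float]:
    """Return P(team wins this sub-bracket) for every team in it.

    A bracket is either a team name (leaf) or a pair of sub-brackets.
    """
    if isinstance(bracket, str):
        return {bracket: 1.0}
    left = advance_probs(bracket[0])
    right = advance_probs(bracket[1])
    out = {}
    for side, other in ((left, right), (right, left)):
        for team, p_reach in side.items():
            # Weight each possible matchup by how likely that opponent
            # is to emerge from the other half of the bracket.
            p_win = sum(p_opp * win_prob(team, opp) for opp, p_opp in other.items())
            out[team] = p_reach * p_win
    return out


# Four-team toy bracket: (A vs D) on one side, (C vs B) on the other.
title_probs = advance_probs((("A", "D"), ("C", "B")))
```

Because each team's probability is summed over all opponents it could face, the values in `title_probs` add up to 1 across the bracket.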

---

## Model Performance

![Model performance](nba-playoffs/model_performance.png)

- **Best model:** XGBoost, with **AUC 0.683** on the two held-out seasons (2022–23 and 2023–24)
- **Naive baseline:** 56.7% accuracy (always pick the higher seed)

| Model | Accuracy | ROC-AUC |
|-------|----------|---------|
| XGBoost | 53.3% | **0.683** |
| Random Forest | 56.7% | 0.674 |
| Logistic Regression | 53.3% | 0.647 |

> Accuracy looks modest (no model beats the 56.7% higher-seed baseline) because the held-out 2022–23 and 2023–24 seasons were upset-heavy. AUC is the more informative metric here: it measures how well the model ranks probabilities, not just whether it picks the right binary outcome.
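
To make the accuracy/AUC distinction concrete, here is a small pure-Python sketch on made-up data (label 1 is assumed to mean "higher seed won"). The AUC function uses the pairwise-ranking definition: the share of (winner, loser) pairs in which the actual winner received the higher predicted probability, with ties counted as half.

```python
def accuracy(y_true, p_pred, threshold=0.5):
    """Share of series where the thresholded prediction matches the outcome."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_pred))
    return hits / len(y_true)


def roc_auc(y_true, p_pred):
    """P(random positive is scored above a random negative); ties count 0.5."""
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Made-up outcomes and predicted probabilities for six series.
y = [1, 1, 0, 1, 0, 0]
p = [0.90, 0.55, 0.60, 0.45, 0.40, 0.20]

toy_accuracy = accuracy(y, p)  # 4 of 6 correct at the 0.5 threshold
toy_auc = roc_auc(y, p)        # 7 of 9 positive/negative pairs ranked correctly
```

Two near-coin-flip predictions land on the wrong side of 0.5 and cost accuracy, while the overall ordering of the probabilities is still mostly right, which is what AUC rewards.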

### Why series, not individual games?

Individual game outcomes have ~40% irreducible variance (a 10-point favorite loses 40% of the time). Averaging over 4–7 games in a series dramatically reduces noise, making team quality a much stronger signal. Series-level AUC (~0.68) consistently outperforms game-level AUC (~0.60) with the same feature set.
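
The noise-reduction argument can be illustrated with a simplified calculation. The sketch below assumes a constant per-game win probability and independent games (no home-court effect, injuries, or adjustments), which real series do not satisfy exactly; `series_win_prob` is an illustrative helper, not part of the project.

```python
from math import comb


def series_win_prob(p_game: float, wins_needed: int = 4) -> float:
    """P(winning a best-of-(2 * wins_needed - 1) series).

    Sums over the number of games k the eventual winner loses
    (0 to wins_needed - 1); the winner always takes the final game.
    """
    q = 1.0 - p_game
    return sum(
        comb(wins_needed - 1 + k, k) * p_game**wins_needed * q**k
        for k in range(wins_needed)
    )


# A team that wins 60% of single games wins about 71% of best-of-seven series.
p_series = series_win_prob(0.60)
```

Under these assumptions a modest single-game edge is amplified at the series level, which is consistent with team quality being a stronger signal for series than for individual games.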

---

## What Drives Playoff Series Outcomes (SHAP)

Top features by mean absolute SHAP value:

| Rank | Feature | Interpretation |
|------|---------|---------------|
| 1 | `ortg_diff` | Offensive rating gap: the scoring edge is the strongest signal |
| 2 | `home_tov_pct` | Higher seed's turnover rate: sloppier teams get exposed over 4–7 games |
| 3 | `away_win_pct` | Lower seed's regular season win %: a 6-seed with 48 wins is dangerous |
| 4 | `away_net_rtg` | Lower seed's overall efficiency: quality of the opponent matters |
| 5 | `home_playoff_win_pct_3yr` | Higher seed's playoff experience: teams that have been here before |

**Key finding:** Offensive rating differential is the single strongest predictor — not seeding, not win percentage, not playoff history. And the *lower seed's* quality features rank almost as highly as the higher seed's, capturing the "dangerous underdog" effect.
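
For reference, "mean absolute SHAP value" is the per-feature average of |SHAP| across all series. The sketch below ranks features from a small, made-up SHAP matrix; in practice the matrix would come from an explainer such as `shap.TreeExplainer(model).shap_values(X)`, and the numbers here are illustrative only.

```python
feature_names = ["ortg_diff", "home_tov_pct", "away_win_pct"]

# Made-up SHAP values: one row per series, one column per feature.
shap_values = [
    [0.30, -0.10, 0.05],
    [-0.20, 0.15, -0.10],
    [0.25, -0.05, 0.12],
]


def rank_by_mean_abs(values, names):
    """Sort features by mean |SHAP|, largest first."""
    n_rows = len(values)
    importance = {
        name: sum(abs(row[j]) for row in values) / n_rows
        for j, name in enumerate(names)
    }
    return sorted(importance.items(), key=lambda kv: kv[1], reverse=True)


ranking = rank_by_mean_abs(shap_values, feature_names)
```

Taking the absolute value first matters: a feature that pushes predictions strongly in both directions would otherwise average out to near zero.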

---

## Data

- **Source:** `nba_api` (stats.nba.com — public, no API key required)
- **Scope:** NBA playoffs for the 2015–16 through 2023–24 seasons (9 seasons · 135 series · 748 games)
- **Split:** Train on 2015–16 through 2021–22, hold out 2022–23 and 2023–24

---

## Features (23 total)

| Group | Features |
|-------|---------|
| Team efficiency | ORtg, DRtg, net rating, pace (home + away + differentials) |
| Team quality | Regular season win %, offensive rebound %, turnover % |
| Historical | Playoff win % over prior 3 seasons, Finals appearances in prior 5 years |
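
As a rough sketch of how the differential features could be derived, the helper below subtracts the lower seed's metric from the higher seed's. The `home_*` / `away_*` input names are assumptions modeled on feature names that appear in this README (such as `ortg_diff` and `away_net_rtg`), and the numbers are made up.

```python
def add_differentials(row: dict, metrics=("ortg", "drtg", "net_rtg", "pace")) -> dict:
    """Return a copy of `row` with `<metric>_diff` = home value - away value."""
    out = dict(row)
    for metric in metrics:
        out[f"{metric}_diff"] = row[f"home_{metric}"] - row[f"away_{metric}"]
    return out


# Hypothetical pre-series metrics; "home" is taken to mean the higher seed.
series_row = {
    "home_ortg": 118.0, "away_ortg": 114.5,
    "home_drtg": 110.0, "away_drtg": 112.0,
    "home_net_rtg": 8.0, "away_net_rtg": 2.5,
    "home_pace": 99.0, "away_pace": 101.5,
}
features = add_differentials(series_row)
```

Keeping both the raw per-team values and the differentials lets a tree model use either the gap or one team's absolute level.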

---

## Notebooks

Run in order — each depends on the previous output:

| # | Notebook | What it does |
|---|----------|-------------|
| 01 | `01-data-collection.ipynb` | Fetch playoff game logs + team metrics via nba_api |
| 02 | `02-eda.ipynb` | Home court advantage, efficiency distributions, rest days |
| 03b | `03b-series-feature-engineering.ipynb` | Aggregate games → series, build feature matrix |
| 04b | `04b-series-modeling.ipynb` | Train LR / RF / XGBoost, compare on holdout |
| 05b | `05b-series-explainability.ipynb` | SHAP analysis on best model |
| 06 | `06-predict-current-playoffs.ipynb` | Apply model to 2025–26 bracket; fetch live metrics from Basketball Reference |
| 07 | `07-visuals.ipynb` | Bracket diagram, championship probability, model performance card |

---

## Setup

```bash
git clone https://github.com/<your-username>/Sports.git
cd Sports/nba-playoffs
python3 -m venv .venv && source .venv/bin/activate
pip install nba_api pandas numpy scikit-learn xgboost shap matplotlib seaborn joblib tqdm pyarrow jupyterlab
jupyter lab
```

Then open and run the notebooks in order, starting with `01-data-collection.ipynb`. Data collection takes ~5–10 min on the first run and is cached afterwards.

---

## Round-by-Round Playoff Predictions

Use the JSON state file to run predictions before each playoff stage and record actual winners after the round ends:

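```bash
cd Sports/nba-playoffs
# Uses the virtualenv created in Setup (nba-playoffs/.venv).
# Retrain the default context model, then backtest it.
.venv/bin/python scripts/retrain_series_xgboost.py
.venv/bin/python scripts/backtest_series_models.py
# Predict a single stage, or every stage in the state file.
.venv/bin/python scripts/predict_playoffs.py --stage second_round
.venv/bin/python scripts/predict_playoffs.py --stage all
# Same prediction, plus a PNG chart.
.venv/bin/python scripts/predict_playoffs.py --stage second_round --plot
# Record an actual result: stage, matchup slot, winning team ID.
.venv/bin/python scripts/predict_playoffs.py --record-result second_round east_top 1610612752
```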

`retrain_series_xgboost.py` trains the default context model, `series_xgboost_context.joblib`, from `data/processed/series_features.parquet`. The default state file is `config/playoffs_2025-26.json`. Update that file before each round with the known matchups, keeping the better regular-season team as `higher_seed`. After a round concludes, set each matchup's `winner_team_id` directly or use `--record-result`. Later stages with `source_slots` resolve from actual winners when present, otherwise from projected winners.
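
The "actual winner if recorded, otherwise projected winner" rule can be sketched as follows. The state layout here is a guess for illustration, not the real schema of `config/playoffs_2025-26.json`: only `winner_team_id` and the `stage:matchup` key style appear in this README, while `east_bottom`, the `projected` mapping, and the projected team IDs are hypothetical placeholders.

```python
def resolve_slot(slot: str, state: dict, projected: dict) -> int:
    """Return the team ID that advances out of a 'stage:matchup' slot."""
    stage, name = slot.split(":")
    winner = state[stage][name].get("winner_team_id")
    # Prefer the recorded result; fall back to the model's projection.
    return winner if winner is not None else projected[slot]


# Assumed state layout: one recorded result, one series still undecided.
state = {
    "second_round": {
        "east_top": {"winner_team_id": 1610612752},
        "east_bottom": {"winner_team_id": None},
    }
}
# Hypothetical projected winners keyed by slot.
projected = {
    "second_round:east_top": 1610612738,
    "second_round:east_bottom": 1610612739,
}

source_slots = ["second_round:east_top", "second_round:east_bottom"]
next_round_teams = [resolve_slot(s, state, projected) for s in source_slots]
```

In this toy case the first slot resolves to the recorded winner even though the projection disagreed, and the second falls back to the projection.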

Use `backtest_series_models.py` before changing model defaults. It runs rolling-season tests and compares baseline XGBoost against context variants using accuracy, ROC-AUC, log loss, and Brier score.
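
Log loss and Brier score both evaluate the predicted probabilities themselves rather than the thresholded picks. The definitions below are a plain-Python sketch for binary outcomes, shown only to clarify what the metrics measure; the backtest script's own implementation is not shown in this README and may differ (for example, by using scikit-learn).

```python
import math


def brier_score(y_true, p_pred):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


# Made-up example: a cautious model vs. an overconfident one, same picks.
outcomes = [1, 0, 1, 0]
cautious = [0.65, 0.40, 0.60, 0.55]
overconfident = [0.95, 0.05, 0.90, 0.95]

cautious_ll = log_loss(outcomes, cautious)
overconfident_ll = log_loss(outcomes, overconfident)
```

Both models miss the last series, but the overconfident one is penalized far more for it under log loss, which is why these metrics complement accuracy and ROC-AUC when comparing variants.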

Use `context_adjustments` in the same JSON file for information the trained model does not know, such as late-season form, roster availability, or matchup concerns:

```json
"context_adjustments": {
  "probability_shrinkage": 0.10,
  "team_adjustments": {
    "1610612747": {
      "net_rtg_delta": 2.0,
      "ortg_delta": 1.5
    }
  },
  "matchup_adjustments": {
    "second_round:west_top": {
      "prob_delta": -0.04,
      "reason": "Lakers health/rotation context"
    }
  }
}
```

Positive team deltas improve that team's model inputs; `prob_delta` is relative to the higher seed in that matchup, so negative values favor the lower seed. Keep these entries prospective, not retrofitted after a result is known.
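
One plausible reading of how `probability_shrinkage` and `prob_delta` combine is sketched below. The formula is an assumption: this README does not define the shrinkage semantics or the order of operations, so the sketch pulls the model probability toward 0.5 by the shrinkage fraction, then adds the matchup delta and clips to [0, 1]. `adjust_probability` is a hypothetical helper.

```python
def adjust_probability(p_model: float, shrinkage: float = 0.0, prob_delta: float = 0.0) -> float:
    """Adjust the higher seed's win probability (assumed semantics).

    1. Shrink toward 0.5 by `shrinkage` (0 = unchanged, 1 = coin flip).
    2. Add the matchup-specific `prob_delta` (negative favors the lower seed).
    3. Clip to the valid probability range.
    """
    shrunk = 0.5 + (1.0 - shrinkage) * (p_model - 0.5)
    return min(max(shrunk + prob_delta, 0.0), 1.0)


# Model says 70% for the higher seed; apply the values from the JSON example.
adjusted = adjust_probability(0.70, shrinkage=0.10, prob_delta=-0.04)
```

Under these assumptions a 0.70 model probability becomes 0.68 after shrinkage and 0.64 after the matchup delta.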

Prediction CSVs are written to `outputs/predictions/`, which is gitignored. Add `--plot` to also write a PNG chart.

---

## Project Structure

```
nba-playoffs/
├── config/             ← playoff state files for round-by-round runs
├── scripts/            ← CLI entry points such as predict_playoffs.py
├── notebooks/          ← notebooks 01–07, run in order
├── src/
│   ├── data.py         ← nba_api fetch helpers with parquet caching
│   ├── features.py     ← feature engineering functions
│   ├── models.py       ← pipeline builders, eval, SHAP helpers
│   └── playoff_predictions.py ← reusable round prediction logic
├── data/
│   ├── raw/            ← gitignored — nba_api parquet cache
│   └── processed/      ← gitignored — feature matrices
├── models/             ← gitignored — saved .joblib artifacts
└── PRD.md              ← full project requirements and feature spec
```

---

## Key Design Decisions

**Series vs. game prediction:** Tried individual game prediction first (AUC ~0.60), then switched to series-level aggregation, which improved AUC to ~0.68 with the same features.

**Logistic Regression as baseline:** Always establish an interpretable linear baseline before using tree models. LR showed that the features carry genuine linear signal; XGBoost added ~0.04 AUC (0.647 to 0.683) via non-linear interactions.

**Temporal train/test split:** Used `TimeSeriesSplit` for cross-validation and held out the two most recent seasons (2022–23 and 2023–24) as the final test set. Never trained on future data.
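
The "never train on future data" rule can be illustrated with an expanding-window split keyed on season. This is a sketch of the general idea rather than the notebooks' code: scikit-learn's `TimeSeriesSplit` splits on row order, so it respects time only if rows are sorted chronologically, whereas the hypothetical helper below makes the season boundaries explicit.

```python
def expanding_season_splits(seasons, min_train=3):
    """Yield (train_idx, val_idx) pairs, validating on one season at a time.

    Each fold trains on every season strictly before the validation season.
    """
    ordered = sorted(set(seasons))
    for i in range(min_train, len(ordered)):
        train_seasons = set(ordered[:i])
        val_season = ordered[i]
        train_idx = [k for k, s in enumerate(seasons) if s in train_seasons]
        val_idx = [k for k, s in enumerate(seasons) if s == val_season]
        yield train_idx, val_idx


# Made-up season labels, one per series row.
row_seasons = [2016, 2016, 2017, 2018, 2018, 2019]
folds = list(expanding_season_splits(row_seasons, min_train=2))
```

With these labels the first fold trains on 2016–2017 and validates on 2018; the second trains on 2016–2018 and validates on 2019.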

**Why not player-level stats:** `nba_api` rate limits made fetching per-player data for 135+ team-seasons impractical. Team-level efficiency metrics (ORtg, DRtg) capture aggregate team quality and are well documented as predictive.
