ECON 542 — Applied Econometrics · Fall 2025

Predicting Housing Prices with
Machine Learning & NLP-Extracted Features

Author: Michael Early
Data: Redfin · Costa Mesa, CA
n = 13,346 sales
Stack: Python · GPT-4.1 · OLS · Random Forest
Window: 5 Years

01 · The Question

Standard hedonic price models explain housing values well using structural attributes — square footage, bedrooms, bathrooms, lot size, year built, and location. The theoretical foundation goes back to Rosen (1974): observed prices reveal households' implicit valuations of bundled housing characteristics, and OLS on log-prices recovers those marginal willingness-to-pay estimates cleanly.

But real estate listings contain a second layer of information that never makes it into a Redfin CSV: qualitative language. Phrases like "recently renovated," "near the beach," or "top-rated schools" signal value that doesn't show up in any structured column. Historically, extracting that signal at scale required manual coding — expensive, slow, and not reproducible. Large language models change that calculus entirely.

This project asks: how much additional predictive power can AI-extracted qualitative features from listing descriptions add to a standard housing price model? The approach combines a traditional hedonic framework with a GPT-4.1-powered NLP pipeline, then benchmarks OLS and Random Forest models with and without the extracted features.

02 · Data & Scraping Pipeline

The structured dataset consists of 13,346 single-family residential sales in Costa Mesa, CA over five years, sourced from Redfin's internal GIS API. The API returns price, beds, baths, square footage, lot size, year built, and listing URL — but not the description text. Descriptions required a separate async scraping pass.

Fig 1 — Data & Feature Extraction Pipeline
Redfin GIS API (13,346 listings) → async scraper (aiohttp · 2 req/s) → BeautifulSoup parse → GPT-4.1 junk filter + feature extraction → binary flags (renovated · good_schools · near_beach · private) → models (OLS · Random Forest). Descriptions are also embedded via LangChain with text-embedding-3-large into a vector store for retrieval.

Each listing URL was fetched with rate-limited async HTTP (2 req/sec) and retried with exponential backoff. HTML was parsed via BeautifulSoup targeting Redfin's data-rf-test-id="abp-description" block. Junk listings — boilerplate agent-speak with no substantive content — were flagged and dropped before extraction ran. Noisy descriptions left uncleaned would have attenuated NLP coefficients toward zero.

import asyncio
from aiolimiter import AsyncLimiter

rate_limiter = AsyncLimiter(max_rate=2, time_period=1)  # 2 requests per second

async def fetch_html(session, url, max_retries=3):
    for attempt in range(max_retries):
        try:
            async with rate_limiter:  # acquire a rate slot per attempt, not per call
                async with session.get(url, headers=HEADERS) as r:
                    r.raise_for_status()
                    return await r.text()
        except Exception:
            await asyncio.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
    return None  # caller drops listings that never resolved
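The junk filter mentioned above is not shown in the pipeline code. A minimal heuristic sketch — the length threshold and boilerplate patterns here are illustrative assumptions, not the project's actual rules:

```python
import re

# Hypothetical agent-boilerplate patterns (assumed, for illustration only).
BOILERPLATE = re.compile(
    r"(call (the )?listing agent|information deemed reliable|equal housing)",
    re.IGNORECASE,
)

def is_junk(description: str, min_chars: int = 80) -> bool:
    """Flag listings whose text carries no substantive property information."""
    text = (description or "").strip()
    if len(text) < min_chars:
        return True
    # Treat the listing as junk when boilerplate is essentially all there is.
    stripped = BOILERPLATE.sub("", text)
    return len(stripped.strip()) < min_chars
```

Dropping these rows before extraction matters for inference, not just cost: noisy all-boilerplate descriptions would mechanically set every flag to zero and attenuate the NLP coefficients.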

03 · GPT-4.1 as a Feature Extractor

Cleaned descriptions were passed through GPT-4.1 via LangChain to extract four binary qualitative indicators. This treats the LLM as a zero-shot classifier with natural language supervision — a meaningful upgrade over regex keyword matching that fails on semantic variation.

Fig 2 — Extracted Binary Features
Feature      | Signal               | Example Trigger Phrases
renovated    | Recent upgrades      | "newly remodeled," "updated kitchen," "fresh renovation"
good_schools | School quality cited | "top-rated schools," "award-winning district," "walking to school"
near_beach   | Beach proximity      | "steps to the beach," "ocean views," "coastal living"
private      | Privacy / seclusion  | "private retreat," "secluded lot," "gated community"
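The model's reply must still be validated into clean 0/1 columns before it touches the regression. A sketch of that post-processing step — the prompt text and the stubbed reply are illustrative, since the real call goes through LangChain to GPT-4.1:

```python
import json

FLAGS = ("renovated", "good_schools", "near_beach", "private")

# Illustrative instruction; the project's actual prompt is not reproduced here.
PROMPT = (
    "You are labeling a real-estate listing description. Return JSON with "
    "boolean keys: renovated, good_schools, near_beach, private."
)

def parse_flags(llm_reply: str) -> dict:
    """Validate the model's JSON reply into four 0/1 indicator columns."""
    data = json.loads(llm_reply)
    return {k: int(bool(data.get(k, False))) for k in FLAGS}

# Stubbed model reply standing in for the live API call:
reply = '{"renovated": true, "good_schools": false, "near_beach": true, "private": false}'
```

Defaulting missing keys to 0 keeps a malformed reply from silently becoming a positive label.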

04 · Methodology

Before modeling, standard feature engineering steps were taken: log transformations of price and coastal distance, an age variable computed as 2026 − year_built and clipped at zero, and quadratic terms for beds, baths, and square footage to capture nonlinear returns.

import numpy as np

df["log_price"] = np.log(df.price)
df["log_dist"]  = np.log(df.dist_coast)
df["age"]       = (2026 - df.year_built).clip(lower=0)
df["beds2"]     = df.beds  ** 2
df["baths2"]    = df.baths ** 2
df["sqft2"]     = df.sqft  ** 2

OLS Hedonic Regression: The primary specification is a semi-log hedonic model with neighborhood fixed effects. Neighborhood is defined as a lat/lon grid cell (~0.7-mile resolution) — no shapefiles required, and it absorbs granular location heterogeneity cleanly. Standard errors are heteroskedasticity-robust (White). The four NLP-extracted indicators are added in a second specification to measure their marginal contribution.

log(price) ~ sqft + beds + baths + age
           + log(dist_coast) + C(neighborhood)
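The neighborhood fixed effect is just a deterministic bucketing of coordinates into grid cells. A minimal sketch — the 0.01° cell edge (about 0.7 miles of latitude, since 1° of latitude ≈ 69 miles) is an assumption about the implementation:

```python
import math

CELL_DEG = 0.01  # assumed cell edge: ~0.7 mi of latitude

def neighborhood_id(lat: float, lon: float, cell: float = CELL_DEG) -> str:
    """Assign a lat/lon point to a grid-cell label used as a fixed effect."""
    return f"{math.floor(lat / cell)}_{math.floor(lon / cell)}"
```

Points within the same cell share a label, so C(neighborhood) absorbs their common location premium without any shapefile lookup.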

Random Forest: Two models were trained on an 80/20 train-test split — numeric-only features vs. numeric + NLP-extracted. Comparing test R² and RMSE across specifications isolates the predictive value of the text signals independent of OLS linearity assumptions.
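The specification comparison reduces to two held-out-sample metrics. For reference, both can be written in a few lines:

```python
import math

def r2(y, yhat):
    """Coefficient of determination: 1 - SSE/SST."""
    ybar = sum(y) / len(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, yhat))
    sst = sum((a - ybar) ** 2 for a in y)
    return 1 - sse / sst

def rmse(y, yhat):
    """Root mean squared error on the held-out test set."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))
```

Both models are scored on the same 20% test split, so any gap between the numeric-only and numeric + NLP specifications is attributable to the text features.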

05 · Results

0.898 OLS R²
0.887 Adj. R²
13,346 Observations

The OLS model reaches R² = 0.898 with the NLP features included (0.875 without them) — structural characteristics and neighborhood fixed effects alone explain nearly 88% of the variation in log-prices. This is consistent with well-identified hedonic models in dense urban markets where location controls are rich.

Fig 3 — NLP Feature Price Premiums (OLS Semi-Elasticities)
renovated: +5.1% (p < 0.01) · good_schools: +4.5% (p < 0.01) · near_beach: ≈ 0% (not significant) · private: ≈ 0% (not significant)
Fig 4 — OLS NLP Coefficient Summary
Feature      | Coefficient | Semi-Elasticity | p-value | Significant
renovated    | +0.050      | ~+5.1% premium  | < 0.01  | ✓ Yes
good_schools | +0.044      | ~+4–5% premium  | < 0.01  | ✓ Yes
near_beach   | ~0.000      | not significant | > 0.10  | ✗ No
private      | ~0.000      | not significant | > 0.10  | ✗ No
Fig 5 — Model Comparison: R² with vs. without NLP Features
Test R²       | Numeric only | + NLP features
OLS           | 0.875        | 0.898
Random Forest | 0.820        | 0.845

Note: RF values are approximate test-set estimates from 80/20 split. OLS R² is full-sample.

Renovation carries a statistically significant ~5% price premium conditional on all structural controls. A remodeled 1970s home and an unremodeled 1970s home of identical square footage are not the same good, and buyers price that distinction. School quality is similarly significant at 4–5%, likely reflecting both direct school quality effects and higher-income buyer sorting that persists even after neighborhood fixed effects.
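The premiums quoted above follow from the standard semi-log dummy transformation: a coefficient β on a binary regressor implies a percent premium of exp(β) − 1, not β itself.

```python
import math

def dummy_premium(beta: float) -> float:
    """Percent price premium implied by a dummy coefficient in a semi-log model."""
    return math.exp(beta) - 1
```

For small coefficients the two are close — exp(0.050) − 1 ≈ 5.1% — but the distinction matters for larger effects.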

Beach proximity is fully absorbed by the log(dist_coast) continuous control — once actual distance to the coast is in the model, a binary "near beach" flag adds nothing. Privacy language is insignificant throughout, suggesting it functions as marketing copy rather than a signal of genuine property differentiation.

06 · Takeaways & Future Directions