Standard hedonic price models explain housing values well using structural attributes — square footage, bedrooms, bathrooms, lot size, year built, and location. The theoretical foundation goes back to Rosen (1974): observed prices reveal households' implicit valuations of bundled housing characteristics, and OLS on log-prices recovers those marginal willingness-to-pay estimates cleanly.
But real estate listings contain a second layer of information that never makes it into a Redfin CSV: qualitative language. Phrases like "recently renovated," "near the beach," or "top-rated schools" signal value that doesn't show up in any structured column. Historically, extracting that signal at scale required manual coding — expensive, slow, and not reproducible. Large language models change that calculus entirely.
This project asks: how much additional predictive power can AI-extracted qualitative features from listing descriptions add to a standard housing price model? The approach combines a traditional hedonic framework with a GPT-4.1-powered NLP pipeline, then benchmarks OLS and Random Forest models with and without the extracted features.
The structured dataset consists of 13,346 single-family residential sales in Costa Mesa, CA over five years, sourced from Redfin's internal GIS API. The API returns price, beds, baths, square footage, lot size, year built, and listing URL — but not the description text. Descriptions required a separate async scraping pass.
Each listing URL was fetched with rate-limited async HTTP (2 req/sec) and retried with exponential backoff. HTML was parsed via BeautifulSoup targeting Redfin's data-rf-test-id="abp-description" block. Junk listings — boilerplate agent-speak with no substantive content — were flagged and dropped before extraction ran. Noisy descriptions left uncleaned would have attenuated NLP coefficients toward zero.
```python
import asyncio
from aiolimiter import AsyncLimiter

rate_limiter = AsyncLimiter(max_rate=2, time_period=1)  # 2 requests/sec

async def fetch_html(session, url, max_retries=3):
    for attempt in range(max_retries):
        try:
            async with rate_limiter:  # throttle each attempt, not the whole retry loop
                async with session.get(url, headers=HEADERS) as r:
                    r.raise_for_status()
                    return await r.text()
        except Exception:
            if attempt == max_retries - 1:
                raise  # surface the failure instead of silently returning None
            await asyncio.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
```
Cleaned descriptions were passed through GPT-4.1 via LangChain to extract four binary qualitative indicators. This treats the LLM as a zero-shot classifier with natural language supervision — a meaningful upgrade over regex keyword matching that fails on semantic variation.
| Feature | Signal | Example Trigger Phrases |
|---|---|---|
| renovated | Recent upgrades | "newly remodeled," "updated kitchen," "fresh renovation" |
| good_schools | School quality cited | "top-rated schools," "award-winning district," "walking to school" |
| near_beach | Beach proximity | "steps to the beach," "ocean views," "coastal living" |
| private | Privacy / seclusion | "private retreat," "secluded lot," "gated community" |
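A minimal sketch of the extraction call, assuming a LangChain chat model with structured output. The four schema fields match the table above; the prompt wording, class names, and function signature are hypothetical, not the project's actual code:

```python
from pydantic import BaseModel

class ListingSignals(BaseModel):
    """Four binary qualitative indicators extracted per listing."""
    renovated: bool
    good_schools: bool
    near_beach: bool
    private: bool

# Hypothetical prompt template; the real system prompt is not shown in the text
PROMPT = (
    "Read the real-estate listing description below and decide whether it "
    "signals: recent renovation, school quality, beach proximity, privacy.\n\n"
    "Description: {description}"
)

def extract_signals(llm, description: str) -> ListingSignals:
    # llm is a LangChain chat model, e.g. ChatOpenAI(model="gpt-4.1");
    # with_structured_output constrains the response to the four-field schema
    structured = llm.with_structured_output(ListingSignals)
    return structured.invoke(PROMPT.format(description=description))
```

Constraining the output to a typed schema is what makes zero-shot classification usable downstream: every listing yields exactly four booleans, ready to join onto the structured dataset.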
Before modeling, standard feature engineering was applied: log transformations of price and coastal distance, age computed as 2026 − year_built and clipped at zero, and quadratic terms for beds, baths, and square footage to capture nonlinear returns.
```python
import numpy as np

df["log_price"] = np.log(df.price)
df["age"] = (2026 - df.year_built).clip(lower=0)
df["beds2"] = df.beds ** 2
df["baths2"] = df.baths ** 2
df["sqft2"] = df.sqft ** 2
```
OLS Hedonic Regression: The primary specification is a semi-log hedonic model with neighborhood fixed effects. Neighborhood is defined as a lat/lon grid cell (~0.7-mile resolution) — no shapefiles required, and it absorbs granular location heterogeneity cleanly. Standard errors are heteroskedasticity-robust (White). The four NLP-extracted indicators are added in a second specification to measure their marginal contribution.
```
log(price) ~ sqft + beds + baths + age
           + log(dist_coast) + C(neighborhood)
```
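One way to sketch this specification with statsmodels formulas. The grid-cell construction (0.01° of latitude ≈ 0.7 mi) and the column names are assumptions about the pipeline; `cov_type="HC1"` supplies the White heteroskedasticity-robust standard errors:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def add_neighborhood(df, cell_deg=0.01):
    """Bucket listings into lat/lon grid cells (~0.7-mile resolution)."""
    df = df.copy()
    df["neighborhood"] = (
        (df.lat // cell_deg).astype(int).astype(str)
        + "_"
        + (df.lon // cell_deg).astype(int).astype(str)
    )
    return df

def fit_hedonic(df, nlp_terms=False):
    """Semi-log hedonic OLS with neighborhood fixed effects."""
    rhs = "sqft + beds + baths + age + np.log(dist_coast) + C(neighborhood)"
    if nlp_terms:
        # Second specification: add the four NLP-extracted indicators
        rhs += " + renovated + good_schools + near_beach + private"
    return smf.ols(f"log_price ~ {rhs}", data=df).fit(cov_type="HC1")
```

Because the fixed effects are just grid-cell labels, the whole specification runs off the flat listing table with no GIS dependencies.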
Random Forest: Two models were trained on an 80/20 train-test split — numeric-only features vs. numeric + NLP-extracted. Comparing test R² and RMSE across specifications isolates the predictive value of the text signals independent of OLS linearity assumptions.
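The comparison can be sketched as below. The feature lists follow the text; the hyperparameters, helper name, and `random_state` are illustrative assumptions rather than the tuned pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

NUMERIC = ["sqft", "beds", "baths", "age", "dist_coast"]
NLP = ["renovated", "good_schools", "near_beach", "private"]

def rf_test_scores(df, features):
    """Fit a Random Forest on an 80/20 split; return (test R², test RMSE)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        df[features], df["log_price"], test_size=0.2, random_state=42
    )
    rf = RandomForestRegressor(n_estimators=300, random_state=42)
    rf.fit(X_tr, y_tr)
    pred = rf.predict(X_te)
    return r2_score(y_te, pred), mean_squared_error(y_te, pred) ** 0.5

# Numeric-only vs. numeric + NLP, same split, same hyperparameters:
# base = rf_test_scores(df, NUMERIC)
# full = rf_test_scores(df, NUMERIC + NLP)
```

Holding the split and hyperparameters fixed across the two feature sets means any gap in test R² or RMSE is attributable to the text signals alone.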
The baseline OLS model achieves R² = 0.898 — structural characteristics and neighborhood fixed effects explain nearly 90% of variation in log-prices. This is consistent with well-identified hedonic models in dense urban markets where location controls are rich.
| Feature | Coefficient | Semi-Elasticity | p-value | Significant |
|---|---|---|---|---|
| renovated | +0.050 | ~+5.1% premium | < 0.01 | ✓ Yes |
| good_schools | +0.044 | ~+4–5% premium | < 0.01 | ✓ Yes |
| near_beach | ~0.000 | Not significant | > 0.10 | ✗ No |
| private | ~0.000 | Not significant | > 0.10 | ✗ No |
Note: Random Forest values are approximate test-set estimates from the 80/20 split; the OLS R² is computed on the full sample.
Renovation carries a statistically significant ~5% price premium conditional on all structural controls. A remodeled 1970s home and an unremodeled 1970s home of identical square footage are not the same good, and buyers price that distinction. School quality is similarly significant at 4–5%, likely reflecting both direct school quality effects and higher-income buyer sorting that persists even after neighborhood fixed effects.
Beach proximity is fully absorbed by the log(dist_coast) continuous control — once actual distance to the coast is in the model, a binary "near beach" flag adds nothing. Privacy language is insignificant throughout, suggesting it functions as marketing copy rather than a signal of genuine property differentiation.