ECON 542 — Applied Econometrics · Fall 2025

Predicting Housing Prices with
Machine Learning & NLP-Extracted Features

Author: Michael Early
Data: Redfin · Costa Mesa, CA
n = 13,346 sales
Stack: Python · GPT-4.1 · OLS · Random Forest
Window: 5 Years

01 · The Question

Standard hedonic price models explain housing values well using structural attributes — square footage, bedrooms, bathrooms, lot size, year built, and location. The theoretical foundation goes back to Rosen (1974): observed prices reveal households' implicit valuations of bundled housing characteristics, and OLS on log-prices recovers those marginal willingness-to-pay estimates cleanly.

But real estate listings contain a second layer of information that never makes it into a Redfin CSV: qualitative language. Phrases like "recently renovated," "near the beach," or "top-rated schools" signal value that doesn't show up in any structured column. Historically, extracting that signal at scale required manual coding — expensive, slow, and not reproducible. Large language models change that calculus entirely.

This project asks: how much additional predictive power can AI-extracted qualitative features from listing descriptions add to a standard housing price model? The approach combines a traditional hedonic framework with a GPT-4.1-powered NLP pipeline, then benchmarks OLS and Random Forest models with and without the extracted features.

02 · Data & Scraping Pipeline

The structured dataset consists of 13,346 single-family residential sales in Costa Mesa, CA over five years, sourced from Redfin's internal GIS API. The API returns price, beds, baths, square footage, lot size, year built, and listing URL — but not the description text. Descriptions required a separate async scraping pass.

Fig 1 — Data & Feature Extraction Pipeline
Redfin GIS API (13,346 listings) → async scraper (aiohttp · 2 req/s) → BeautifulSoup parse → GPT-4.1 junk filter + feature extraction → binary flags (renovated · good_schools · near_beach · private) → models (OLS · Random Forest). Descriptions are also embedded via LangChain with text-embedding-3-large into a vector store for retrieval.

Each listing URL was fetched with rate-limited async HTTP (2 req/sec) and retried with exponential backoff. HTML was parsed via BeautifulSoup targeting Redfin's data-rf-test-id="abp-description" block. Junk listings — boilerplate agent-speak with no substantive content — were flagged and dropped before extraction ran. Noisy descriptions left uncleaned would have attenuated NLP coefficients toward zero.

import asyncio
from aiolimiter import AsyncLimiter

rate_limiter = AsyncLimiter(max_rate=2, time_period=1)  # 2 requests per second

async def fetch_html(session, url, max_retries=3):
    for attempt in range(max_retries):
        try:
            async with rate_limiter:  # acquire a rate slot per attempt, not per call
                async with session.get(url, headers=HEADERS) as r:
                    r.raise_for_status()
                    return await r.text()
        except Exception:
            await asyncio.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
    return None  # caller drops listings that never resolved
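The junk filter mentioned above is not shown in the pipeline code. A minimal heuristic sketch — the length threshold and boilerplate patterns here are illustrative assumptions, not the project's actual rules:

```python
import re

# Hypothetical agent-boilerplate patterns (assumed, for illustration only).
BOILERPLATE = re.compile(
    r"(call (the )?listing agent|information deemed reliable|equal housing)",
    re.IGNORECASE,
)

def is_junk(description: str, min_chars: int = 80) -> bool:
    """Flag listings whose text carries no substantive property information."""
    text = (description or "").strip()
    if len(text) < min_chars:
        return True
    # Treat the listing as junk when boilerplate is essentially all there is.
    stripped = BOILERPLATE.sub("", text)
    return len(stripped.strip()) < min_chars
```

Dropping these rows before extraction matters for inference, not just cost: noisy all-boilerplate descriptions would mechanically set every flag to zero and attenuate the NLP coefficients.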

03 · GPT-4.1 as a Feature Extractor

Cleaned descriptions were passed through GPT-4.1 via LangChain to extract four binary qualitative indicators. This treats the LLM as a zero-shot classifier with natural language supervision — a meaningful upgrade over regex keyword matching that fails on semantic variation.

Fig 2 — Extracted Binary Features
Feature      | Signal               | Example Trigger Phrases
renovated    | Recent upgrades      | "newly remodeled," "updated kitchen," "fresh renovation"
good_schools | School quality cited | "top-rated schools," "award-winning district," "walking to school"
near_beach   | Beach proximity      | "steps to the beach," "ocean views," "coastal living"
private      | Privacy / seclusion  | "private retreat," "secluded lot," "gated community"
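The model's reply must still be validated into clean 0/1 columns before it touches the regression. A sketch of that post-processing step — the prompt text and the stubbed reply are illustrative, since the real call goes through LangChain to GPT-4.1:

```python
import json

FLAGS = ("renovated", "good_schools", "near_beach", "private")

# Illustrative instruction; the project's actual prompt is not reproduced here.
PROMPT = (
    "You are labeling a real-estate listing description. Return JSON with "
    "boolean keys: renovated, good_schools, near_beach, private."
)

def parse_flags(llm_reply: str) -> dict:
    """Validate the model's JSON reply into four 0/1 indicator columns."""
    data = json.loads(llm_reply)
    return {k: int(bool(data.get(k, False))) for k in FLAGS}

# Stubbed model reply standing in for the live API call:
reply = '{"renovated": true, "good_schools": false, "near_beach": true, "private": false}'
```

Defaulting missing keys to 0 keeps a malformed reply from silently becoming a positive label.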

04 · Methodology

Before modeling, standard feature engineering steps were taken: log transformations of price and coastal distance, an age variable computed as 2026 − year_built and clipped at zero, and quadratic terms for beds, baths, and square footage to capture nonlinear returns.

import numpy as np

df["log_price"] = np.log(df.price)
df["log_dist"]  = np.log(df.dist_coast)
df["age"]       = (2026 - df.year_built).clip(lower=0)
df["beds2"]     = df.beds  ** 2
df["baths2"]    = df.baths ** 2
df["sqft2"]     = df.sqft  ** 2

OLS Hedonic Regression: The primary specification is a semi-log hedonic model with neighborhood fixed effects. Neighborhood is defined as a lat/lon grid cell (~0.7-mile resolution) — no shapefiles required, and it absorbs granular location heterogeneity cleanly. Standard errors are heteroskedasticity-robust (White). The four NLP-extracted indicators are added in a second specification to measure their marginal contribution.

log(price) ~ sqft + beds + baths + age
           + log(dist_coast) + C(neighborhood)
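The neighborhood fixed effect is just a deterministic bucketing of coordinates into grid cells. A minimal sketch — the 0.01° cell edge (about 0.7 miles of latitude, since 1° of latitude ≈ 69 miles) is an assumption about the implementation:

```python
import math

CELL_DEG = 0.01  # assumed cell edge: ~0.7 mi of latitude

def neighborhood_id(lat: float, lon: float, cell: float = CELL_DEG) -> str:
    """Assign a lat/lon point to a grid-cell label used as a fixed effect."""
    return f"{math.floor(lat / cell)}_{math.floor(lon / cell)}"
```

Points within the same cell share a label, so C(neighborhood) absorbs their common location premium without any shapefile lookup.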

Random Forest: Two models were trained on an 80/20 train-test split — numeric-only features vs. numeric + NLP-extracted. Comparing test R² and RMSE across specifications isolates the predictive value of the text signals independent of OLS linearity assumptions.
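The specification comparison reduces to two held-out-sample metrics. For reference, both can be written in a few lines:

```python
import math

def r2(y, yhat):
    """Coefficient of determination: 1 - SSE/SST."""
    ybar = sum(y) / len(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, yhat))
    sst = sum((a - ybar) ** 2 for a in y)
    return 1 - sse / sst

def rmse(y, yhat):
    """Root mean squared error on the held-out test set."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))
```

Both models are scored on the same 20% test split, so any gap between the numeric-only and numeric + NLP specifications is attributable to the text features.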

05 · Results

0.898 OLS R²
0.887 Adj. R²
13,346 Observations

The OLS model reaches R² = 0.898 with the NLP features included (0.875 without them) — structural characteristics and neighborhood fixed effects alone explain nearly 88% of the variation in log-prices. This is consistent with well-identified hedonic models in dense urban markets where location controls are rich.

Fig 3 — NLP Feature Price Premiums (OLS Semi-Elasticities)
renovated: +5.1% (p < 0.01) · good_schools: +4.5% (p < 0.01) · near_beach: ≈ 0% (not significant) · private: ≈ 0% (not significant)
Fig 4 — OLS NLP Coefficient Summary
Feature      | Coefficient | Semi-Elasticity | p-value | Significant
renovated    | +0.050      | ~+5.1% premium  | < 0.01  | ✓ Yes
good_schools | +0.044      | ~+4–5% premium  | < 0.01  | ✓ Yes
near_beach   | ~0.000      | not significant | > 0.10  | ✗ No
private      | ~0.000      | not significant | > 0.10  | ✗ No
Fig 5 — Model Comparison: R² with vs. without NLP Features
Test R²       | Numeric only | + NLP features
OLS           | 0.875        | 0.898
Random Forest | 0.820        | 0.845

Note: RF values are approximate test-set estimates from 80/20 split. OLS R² is full-sample.

Renovation carries a statistically significant ~5% price premium conditional on all structural controls. A remodeled 1970s home and an unremodeled 1970s home of identical square footage are not the same good, and buyers price that distinction. School quality is similarly significant at 4–5%, likely reflecting both direct school quality effects and higher-income buyer sorting that persists even after neighborhood fixed effects.
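The premiums quoted above follow from the standard semi-log dummy transformation: a coefficient β on a binary regressor implies a percent premium of exp(β) − 1, not β itself.

```python
import math

def dummy_premium(beta: float) -> float:
    """Percent price premium implied by a dummy coefficient in a semi-log model."""
    return math.exp(beta) - 1
```

For small coefficients the two are close — exp(0.050) − 1 ≈ 5.1% — but the distinction matters for larger effects.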

Beach proximity is fully absorbed by the log(dist_coast) continuous control — once actual distance to the coast is in the model, a binary "near beach" flag adds nothing. Privacy language is insignificant throughout, suggesting it functions as marketing copy rather than a signal of genuine property differentiation.

06 · Takeaways & Future Directions