Housing Price Prediction with Machine Learning and NLP

Applied econometrics | Machine learning | NLP features

An Orange County housing price project that tests whether qualitative listing language adds predictive power to a structured hedonic pricing model.

OLS R²: 0.898
Observations: 13,346
Feature extraction: GPT-4.1

Problem

Real estate listing descriptions contain qualitative signals such as renovation quality, neighborhood amenities, school references, and location cues. These signals are not present in standard structured housing data, but they may carry price-relevant information.

Method

The workflow combines structured Redfin housing attributes with NLP-extracted listing features. GPT-4.1 is used as a feature extractor, and the resulting variables are evaluated in a hedonic model alongside structural controls and neighborhood fixed effects.
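
The extraction step can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the feature names, the prompt wording, and the call_model argument (a stand-in for whatever client sends a prompt to GPT-4.1 and returns its text reply) are all assumptions. The sketch covers the part that matters for the regression, which is turning a free-text model reply into a fixed set of 0/1 columns.

```python
import json
from typing import Callable, Dict

# Hypothetical feature names; the project's actual schema is not given here.
FEATURES = ("renovated", "school_reference", "amenity_reference")

# Hypothetical prompt wording, used only for illustration.
PROMPT_TEMPLATE = (
    "Read the listing description and answer with a JSON object whose keys are "
    "{keys} and whose values are 0 or 1.\n\nDescription: {text}"
)


def build_prompt(description: str) -> str:
    """Fill the prompt template with the feature names and the listing text."""
    return PROMPT_TEMPLATE.format(keys=", ".join(FEATURES), text=description.strip())


def parse_features(raw_reply: str) -> Dict[str, int]:
    """Turn a model reply into a fixed set of 0/1 indicators.

    Malformed replies and missing keys fall back to 0, so every listing
    yields the same columns for the downstream regression.
    """
    try:
        payload = json.loads(raw_reply)
    except json.JSONDecodeError:
        payload = {}
    if not isinstance(payload, dict):
        payload = {}
    return {name: 1 if payload.get(name) in (1, True, "1") else 0 for name in FEATURES}


def extract_features(description: str, call_model: Callable[[str], str]) -> Dict[str, int]:
    """call_model is an assumed callable: prompt text in, model reply text out."""
    return parse_features(call_model(build_prompt(description)))
```

Keeping the parser separate from the model call makes it possible to test the column logic with a stub function, and defaulting to 0 on bad replies keeps the feature matrix rectangular at the cost of treating parse failures as "feature absent".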

Result

The NLP-extracted features add modest but interpretable signal. Renovation language and school-quality references are positively associated with price after conditioning on standard controls. The project is a useful example of combining econometric structure with modern language-model feature extraction.
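
One way to see what a modest signal after conditioning on controls looks like is to compare a baseline hedonic regression with one that adds a text-derived indicator. The sketch below does this on synthetic data with made-up coefficients; it does not reproduce the project's data or its reported numbers. Neighborhood fixed effects are absorbed by demeaning within neighborhood, and the variable names, sample size, and data-generating process are assumptions chosen for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def demean_by_group(values: Sequence[float], groups: Sequence[str]) -> List[float]:
    """Subtract each group's mean, which absorbs group fixed effects."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for v, g in zip(values, groups):
        totals[g] += v
        counts[g] += 1
    return [v - totals[g] / counts[g] for v, g in zip(values, groups)]


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def within_ols(y: Sequence[float], columns: Sequence[Sequence[float]],
               groups: Sequence[str]) -> Tuple[List[float], float]:
    """OLS on group-demeaned data. Returns (coefficients, within R^2)."""
    yd = demean_by_group(y, groups)
    xd = [demean_by_group(col, groups) for col in columns]
    k = len(xd)
    xtx = [[sum(p * q for p, q in zip(xd[i], xd[j])) for j in range(k)] for i in range(k)]
    xty = [sum(p * q for p, q in zip(xd[i], yd)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(beta[j] * xd[j][i] for j in range(k)) for i in range(len(yd))]
    ss_res = sum((obs - fit) ** 2 for obs, fit in zip(yd, fitted))
    ss_tot = sum(obs ** 2 for obs in yd)
    return beta, 1.0 - ss_res / ss_tot


def simulate(n: int = 2000, seed: int = 0):
    """Synthetic listings. All numbers here are invented for the example."""
    rng = random.Random(seed)
    hood_effect = {"A": 0.0, "B": 0.2, "C": 0.4, "D": 0.6}
    log_price, log_sqft, renovated, hood = [], [], [], []
    for _ in range(n):
        g = rng.choice(sorted(hood_effect))
        s = rng.gauss(7.5, 0.3)
        r = 1.0 if rng.random() < 0.3 else 0.0
        # Assumed process: 0.8 size elasticity, 0.05 log-point renovation premium.
        p = 12.0 + hood_effect[g] + 0.8 * s + 0.05 * r + rng.gauss(0.0, 0.1)
        log_price.append(p)
        log_sqft.append(s)
        renovated.append(r)
        hood.append(g)
    return log_price, log_sqft, renovated, hood


y, sqft, reno, hood = simulate()
beta_base, r2_base = within_ols(y, [sqft], hood)
beta_full, r2_full = within_ols(y, [sqft, reno], hood)
print(f"within R2: baseline={r2_base:.3f}, with text feature={r2_full:.3f}")
print(f"renovation coefficient={beta_full[1]:.3f}")
```

In a setup like this the structural control explains most of the variation, so the text feature raises R² only slightly even though its coefficient is clearly positive. That is the pattern a reader should have in mind for a small but interpretable gain, though the magnitudes printed here say nothing about the Orange County results.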