Housing Price Prediction with Machine Learning and NLP
An Orange County housing price project that tests whether qualitative listing language adds predictive power to a structured hedonic pricing model.
Problem
Real estate listing descriptions contain qualitative signals such as renovation quality, neighborhood amenities, school references, and location cues. These signals are not present in standard structured housing data, but they may carry price-relevant information.
Method
The workflow combines structured Redfin housing attributes with NLP-extracted listing features. GPT-4.1 is used as a feature extractor, and the resulting variables are evaluated in a hedonic model alongside structural controls and neighborhood fixed effects.
- Structured controls include size, bedrooms, bathrooms, age, lot size, and location.
- NLP features capture renovation, school quality, coastal proximity, and luxury language.
- Model comparisons test whether adding the extracted qualitative features improves explanatory power over the structured-only specification.
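The comparison described above can be sketched as two nested OLS fits: a baseline with structural controls and neighborhood fixed effects, and an augmented model that adds the NLP-derived indicators. This is a minimal illustration on synthetic data, not the project's code or results. The variable names (`renovated`, `school_ref`), the log-price dependent variable, the simulated coefficients, and the use of adjusted R-squared as the comparison metric are all assumptions made for the example.

```python
import numpy as np

def fit_ols(y, X):
    """Fit OLS by least squares; return (adjusted R^2, coefficients).

    X must span an intercept (here via a full set of neighborhood dummies).
    """
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n = len(y)
    k = np.linalg.matrix_rank(X)  # number of estimated parameters
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    adj_r2 = 1.0 - (ss_res / (n - k)) / (ss_tot / (n - 1))
    return adj_r2, beta

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for structured controls (NOT the project's data).
log_sqft = rng.normal(7.5, 0.3, n)
beds = rng.integers(1, 6, n).astype(float)
age = rng.uniform(0, 60, n)
hood = rng.integers(0, 5, n)  # hypothetical neighborhood id

# Synthetic stand-ins for NLP-extracted 0/1 flags (hypothetical names).
renovated = rng.integers(0, 2, n).astype(float)
school_ref = rng.integers(0, 2, n).astype(float)

# Simulated log price; every coefficient here is invented for illustration.
hood_effect = np.array([0.0, 0.1, 0.2, 0.3, 0.4])[hood]
log_price = (
    6.0 + 0.9 * log_sqft + 0.03 * beds - 0.002 * age + hood_effect
    + 0.05 * renovated + 0.04 * school_ref
    + rng.normal(0, 0.1, n)
)

# Neighborhood fixed effects as a full dummy set (no separate intercept).
hood_dummies = (hood[:, None] == np.arange(5)).astype(float)

X_base = np.column_stack([hood_dummies, log_sqft, beds, age])
X_aug = np.column_stack([X_base, renovated, school_ref])

r2_base, _ = fit_ols(log_price, X_base)
r2_aug, beta_aug = fit_ols(log_price, X_aug)

print(f"adjusted R^2, structured only: {r2_base:.4f}")
print(f"adjusted R^2, with NLP flags:  {r2_aug:.4f}")
print(f"renovated coef: {beta_aug[-2]:.3f}, school_ref coef: {beta_aug[-1]:.3f}")
```

Adjusted R-squared is used here because it penalizes the extra regressors, so a gain is not automatic when features are added. An F-test on the added block or out-of-sample error would be reasonable alternatives.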
Result
The NLP-extracted features add modest but interpretable signal. Renovation language and school-quality references are positively associated with price after conditioning on the standard controls. The project is therefore a useful example of combining econometric structure with modern language-model feature extraction.
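To read a positive coefficient on a 0/1 listing feature as a price premium, the short sketch below converts it to a percent effect. It assumes the dependent variable is log price, which is common in hedonic models but is not stated in this summary. The helper name `percent_premium` and the coefficient value 0.05 are hypothetical and are not results from the project.

```python
import math

def percent_premium(beta: float) -> float:
    """Percent price change implied by switching a 0/1 indicator on,
    given its coefficient beta in a log-price regression (assumed spec)."""
    return 100.0 * math.expm1(beta)  # 100 * (exp(beta) - 1)

# Hypothetical coefficient, for illustration only.
print(round(percent_premium(0.05), 2))  # 5.13
```

For small coefficients the result is close to 100 times beta, which is why log-model coefficients are often read directly as approximate percentages.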