1 Introduction

A historic home in Ames, Iowa (405 Hayward Ave.). Source: Wikimedia Commons.

Buying or selling a home is among the largest financial decisions most people ever make. A central question in real estate is fairly simple: what is a house actually worth? In this project, I will build and compare a series of statistical machine learning models that predict the sale price of a residential home from its physical characteristics, location, and condition. Because the quantity I am predicting (SalePrice) is a continuous number measured in U.S. dollars, this is a regression problem. This is not a classification problem, in which the goal would be to predict a discrete category such as “expensive” versus “cheap.” Throughout the report, I measure predictive accuracy using root mean squared error (RMSE), the square root of the average squared difference between predicted and observed values. A smaller RMSE means a model’s predictions land closer to the truth.

The data comes from the Ames Housing dataset, a classic dataset compiled by Dean De Cock describing residential sales in Ames, Iowa between 2006 and 2010. Each row is a single home sale, and the columns record 79 explanatory variables: everything from above- ground living area and basement size to neighborhood, overall quality, and the year the home was built. The dataset is rich and realistic: it mixes continuous measurements, counts, ordered quality ratings, and unordered categories, and it contains a substantial amount of missing data, much of which is structurally meaningful. For example, a missing “pool quality” simply means the home has no pool. Working with data like this requires careful cleaning, thoughtful feature engineering, and honest model evaluation, all of which are important skills in applied machine learning.

This project has two complementary goals. The first is a predictive one: to build a model that estimates sale prices as accurately as possible on homes it has never seen, mirroring the “automated valuation models” used by companies such as Zillow and Redfin. The second goal is descriptive: to understand which features matter most in determining price. To those ends I fit four very different families of models: ordinary linear regression, a regularized elastic net, a random forest, and a gradient-boosted tree (XGBoost). And then, I tune them using cross-validation, and compare their performance. Following standard practice for this dataset, I model the logarithm of SalePrice so that prediction errors are penalized proportionally rather than in absolute dollars, and I report RMSE on that same log scale.

2 Data Citation and Source

Ames within Story County, Iowa (city area highlighted). Source: Wikimedia Commons.

The dataset is publicly available through Kaggle’s House Prices: Advanced Regression Techniques competition and originates from the following peer-reviewed source:

De Cock, D. (2011). Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project. Journal of Statistics Education, 19(3). https://doi.org/10.1080/10691898.2011.11889627

Dataset link: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data

A complete description of every variable is provided in the accompanying codebook (codebook.doc) submitted with this project.

3 Loading the Data

Ames, Iowa. Source: Wikimedia Commons.

homes_raw <- read_csv("data/train.csv", show_col_types = FALSE)
dim(homes_raw)

## [1] 1460   81

The raw training set contains 1460 home sales and 81 columns: an Id index, 79 predictor variables, and the response variable SalePrice. Here is a quick look at a few key columns confirms the mix of numeric and categorical information:

homes_raw %>%
  select(SalePrice, OverallQual, GrLivArea, Neighborhood, YearBuilt, TotalBsmtSF) %>%
  head(6)

4 Tidying and Cleaning the Data

Before any analysis, I am going to perform a small number of cleaning steps.

Dropping the index. The Id column is just a row identifier and carries no predictive information, so I will remove it.

Removing extreme outliers. De Cock specifically recommends removing a handful of very large homes (above 4,000 square feet of above-ground living area) that sold under unusual “partial” conditions and behave as leverage points. There are only a few homes like this and removing them will produce a far more stable model.

Recoding structural missingness. For many categorical variables, an NA does not mean the value is unknown; it means the feature is absent. For example, PoolQC is NA precisely when a home has no pool. For these variables, I recode NA to the explicit category "None" so the model can learn from the absence of the feature.

# Variables where NA means "feature is absent"
none_vars <- c("PoolQC","MiscFeature","Alley","Fence","FireplaceQu",
               "GarageType","GarageFinish","GarageQual","GarageCond",
               "BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1",
               "BsmtFinType2","MasVnrType")

homes <- homes_raw %>%
  select(-Id) %>%
  filter(GrLivArea < 4000) %>%                         # De Cock-recommended outlier removal
  mutate(across(all_of(none_vars), ~ replace_na(as.character(.x), "None"))) %>%
  mutate(across(where(is.character), as.factor))

cat("Rows removed as outliers:", nrow(homes_raw) - nrow(homes), "\n")

## Rows removed as outliers: 4

cat("Final dataset:", nrow(homes), "rows,", ncol(homes), "columns\n")

## Final dataset: 1456 rows, 80 columns

After cleaning, the working dataset has 1456 homes. Any remaining missing values (in genuinely numeric fields such as LotFrontage, and the single missing Electrical value) are handled inside the modeling recipe through imputation. This is described later so that imputation is learned only from training data and never leaks information from the testing set.

5 Missing Data

Understanding the pattern of missingness is very important. The table below shows the variables with the most missing values in the raw data, expressed as a percentage of all of the rows.

missing_summary <- homes_raw %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "n_missing") %>%
  filter(n_missing > 0) %>%
  mutate(pct_missing = round(100 * n_missing / nrow(homes_raw), 1)) %>%
  arrange(desc(n_missing))

missing_summary %>%
  kable(caption = "Variables containing missing values (raw data)") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped","hover"))

Variables containing missing values (raw data)
Variable	n_missing	pct_missing
PoolQC	1453	99.5
MiscFeature	1406	96.3
Alley	1369	93.8
Fence	1179	80.8
FireplaceQu	690	47.3
LotFrontage	259	17.7
GarageType	81	5.5
GarageYrBlt	81	5.5
GarageFinish	81	5.5
GarageQual	81	5.5
GarageCond	81	5.5
BsmtExposure	38	2.6
BsmtFinType2	38	2.6
BsmtQual	37	2.5
BsmtCond	37	2.5
BsmtFinType1	37	2.5
MasVnrType	8	0.5
MasVnrArea	8	0.5
Electrical	1	0.1

missing_summary %>%
  slice_max(pct_missing, n = 15) %>%
  ggplot(aes(x = reorder(Variable, pct_missing), y = pct_missing)) +
  geom_col(fill = "#2c7fb8") + #Code Assisted by Claude
  coord_flip() + #Code Assisted by Claude
  labs(title = "Top 15 variables by percent missing",
       x = NULL, y = "Percent missing")

The missingness is highly concentrated and, importantly, mostly structural. The four variables with the most missing data [PoolQC (~99.5%), MiscFeature (~96%), Alley (~94%), and Fence (~81%)] are missing simply because the overwhelming majority of homes have no pool, no special miscellaneous feature, no alley access, and no fence. The same logic applies to the basement and garage quality variables. As described above, I recoded all of these to an explicit "None" category rather than discarding them, preserving the genuinely useful signal that “this home has no garage/pool/fence.” Only a small set of variables represent true missingness: LotFrontage (~18%), which I impute with the median, and a single missing Electrical value, which I impute with the most common category. Because every variable is either recoded or imputed, no observations are dropped due to missing data.

6 Exploratory Data Analysis

Ames, Iowa. Source: Wikimedia Commons.

I will now explore the data visually to understand the response variable, its relationships with key predictors, and the correlation structure among numeric features.

6.1 The response variable: SalePrice

p1 <- ggplot(homes, aes(SalePrice)) +
  geom_histogram(bins = 50, fill = "#41b6c4", color = "white") +
  scale_x_continuous(labels = scales::dollar) + #Code Assisted by Claude
  labs(title = "Sale price (raw)", x = "Sale price", y = "Count")

p2 <- ggplot(homes, aes(log(SalePrice))) +
  geom_histogram(bins = 50, fill = "#225ea8", color = "white") +
  labs(title = "Sale price (log scale)", x = "log(Sale price)", y = "Count")

# Summary statistics referenced in the narration below (computed, not hard-coded)
sp_median <- median(homes$SalePrice)
sp_mean   <- mean(homes$SalePrice)
skew <- function(x) { x <- x[!is.na(x)]; m <- mean(x); mean((x - m)^3) / mean((x - m)^2)^1.5 }
skew_raw <- skew(homes$SalePrice)
skew_log <- skew(log(homes$SalePrice))

p1 + p2

The raw distribution of SalePrice is strongly right-skewed: most homes sell for between roughly $100,000 and $250,000, but a long tail of expensive homes stretches past $500,000 (median ≈ $163,000, mean ≈ $180,151). This skew is problematic for many models, which implicitly assume errors are symmetric. Taking the natural logarithm (right image) produces a distribution that is very nearly symmetric and bell-shaped. the skewness drops from about 1.6 to roughly 0.1. This is exactly why I model log(SalePrice): it stabilizes the variance, reduces the influence of the most expensive homes, and means the model is penalized for percentage errors rather than raw-dollar errors. This is the more appropriate measure of accuracy when home prices vary by an order of magnitude.

6.2 Overall quality vs. price

ggplot(homes, aes(x = factor(OverallQual), y = SalePrice, fill = factor(OverallQual))) +
  geom_boxplot(show.legend = FALSE) +
  scale_y_continuous(labels = scales::dollar) +
  scale_fill_viridis_d() + #Code Assisted by Claude
  labs(title = "Sale price by overall quality rating",
       x = "Overall quality (1 = very poor, 10 = very excellent)",
       y = "Sale price")

OverallQual : an expert 1-10 rating of the home’s material and finish has an almost monotonic relationship with price. Median sale price climbs from roughly $50,000 for the lowest-quality homes to well over $400,000 for the highest, and each step up the quality scale corresponds to a clear jump in price. This single ordinal variable turns out to be the strongest predictor in the entire data set, which matches real-estate intuition: the actual quality of construction and finish drives value.

6.3 Living area vs. price

ggplot(homes, aes(x = GrLivArea, y = SalePrice)) +
  geom_point(alpha = 0.4, color = "#253494") + #Code Assisted by Claude
  geom_smooth(method = "lm", se = FALSE, color = "#e34a33") + #Code Assisted by Claude
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "Sale price vs. above-ground living area",
       x = "Above-ground living area (sq ft)", y = "Sale price")

There is a strong, roughly linear positive relationship between above-ground living area (GrLivArea) and sale price: bigger homes cost more, with a correlation of about 0.72. The relationship is tight enough that living area alone explains a large portion of price variation. Although the scatter widens at larger sizes, large homes vary more in price because factors like quality and neighborhood increasingly also come into play.

6.4 Correlation among numeric predictors

num_vars <- homes %>%
  select(where(is.numeric)) %>%
  mutate(logSalePrice = log(SalePrice)) %>%
  select(-SalePrice)

cor_mat <- cor(num_vars, use = "pairwise.complete.obs")

# show the predictors most correlated with log price
top_cor <- sort(abs(cor_mat[, "logSalePrice"]), decreasing = TRUE)
keep <- names(top_cor)[1:15]
corrplot(cor_mat[keep, keep], method = "color", type = "upper",
         tl.col = "black", tl.cex = 0.8, addCoef.col = "grey30",
         number.cex = 0.55, title = "Correlations among top numeric predictors",
         mar = c(0,0,1.5,0))

The correlation heatmap (restricted to the 15 numeric variables most associated with log price, for readability) reveals two important things. First, the predictors most strongly correlated with price are OverallQual (0.82 with log price), GrLivArea (0.72), GarageCars/GarageArea (0.68/0.66), TotalBsmtSF and 1stFlrSF (0.64/0.61), and the YearBuilt (0.59). Second, several predictors are strongly correlated with each other: GarageCars with GarageArea, and TotalBsmtSF with 1stFlrSF. This implies multicollinearity. Multicollinearity destabilizes ordinary linear regression coefficients and is a key reason I chose to include regularized and tree-based models, which handle correlated predictors much more gracefully.

6.5 Price by neighborhood

Iowa State University, in Ames, Iowa. Source: Wikimedia Commons.

homes %>%
  group_by(Neighborhood) %>%
  summarise(median_price = median(SalePrice), .groups = "drop") %>%
  ggplot(aes(x = reorder(Neighborhood, median_price), y = median_price)) +
  geom_col(fill = "#7fcdbb") + #Code Assisted by Claude
  coord_flip() +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "Median sale price by neighborhood",
       x = NULL, y = "Median sale price")

Location matters enormously. Median prices range from under $90,000 in the most affordable neighborhoods (MeadowV, IDOTRR, BrDale) to over $300,000 in the most desirable ones (StoneBr, NoRidge, NridgHt). This is more than a 3x difference driven purely by location. With 25 distinct neighborhoods, this single categorical variable captures a large amount of spatial price variation and will be one-hot encoded so the models can use it.

7 Data Splitting and Cross-Validation

A model that simply memorizes the data it was trained on is worthless; what we care about is performance on new and unseen homes. To estimate that honestly, I split the data into two parts before beginning my modeling:

A training set (70% of homes) used to fit and tune models, and
A testing set (30%) that is locked away and used exactly once, at the very end, to report final performance.

The split is stratified on SalePrice, meaning the full range of prices (cheap to expensive) is represented proportionally in both sets. If I don’t stratify properly, a random split could by chance place too many expensive homes in one set and bias the evaluation.

# Model the log of SalePrice. We transform the outcome here, BEFORE splitting,
# so that all RMSE/R-squared values are computed on the log scale 
# and there is no risk of scale mismatch at test time.
# The original `homes` (raw dollars) is kept above purely for the EDA plots.
homes_model <- homes %>% mutate(SalePrice = log(SalePrice))

set.seed(131)
homes_split <- initial_split(homes_model, prop = 0.70, strata = SalePrice)
homes_train <- training(homes_split)
homes_test  <- testing(homes_split)

c(train = nrow(homes_train), test = nrow(homes_test))

## train  test 
##  1017   439

To tune model hyperparameters without touching the test set, I use k-fold cross-validation on the training data. The idea is simple but powerful: the training set is separated into k equal “folds.” The model is trained on k−1 folds and evaluated on the held-out fold; this repeats k times so every fold serves as the validation set once, and the k error estimates are averaged. This gives a stable estimate of how well a model will generalize and lets me compare hyperparameter settings fairly. I use 5 folds, also stratified on SalePrice.

set.seed(131)
homes_folds <- vfold_cv(homes_train, v = 5, strata = SalePrice)
homes_folds

8 Building the Recipe

A tidymodels recipe is a reusable specification of all the preprocessing the predictors need. My recipe predicts SalePrice (already log-transformed in the splitting step above) from every other variable and performs the following steps, in order:

Impute missing numeric values with the median and missing categorical values with the most frequent category.
step_novel / step_other: Protect against categories seen only at prediction time, and pool rare categorical levels (those appearing in <1% of homes) into a single "other" level. This shrinks the number of dummy columns and prevents unstable coefficients from extremely rare levels.
Normalize (center and scale) all numeric predictors. This is essential for the elastic net, whose penalty is scale-sensitive, and harmless for the tree-based models.
One-hot encode every categorical predictor into 0/1 dummy variables, as recommended by the course for high-cardinality variables like Neighborhood.
step_zv: Remove any predictor with zero variance (e.g., a dummy that is always 0 after splitting).

homes_recipe <- recipe(SalePrice ~ ., data = homes_train) %>%
  step_impute_median(all_numeric_predictors()) %>% #Code Assisted by Claude
  step_impute_mode(all_nominal_predictors()) %>% #Code Assisted by Claude
  step_normalize(all_numeric_predictors()) %>% #Code Assisted by Claude
  step_novel(all_nominal_predictors()) %>% #Code Assisted by Claude
  step_other(all_nominal_predictors(), threshold = 0.01) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  step_zv(all_predictors()) #Code Assisted by Claude

# Confirm the recipe prepares cleanly and see how many predictors result
prepped <- prep(homes_recipe)
ncol(bake(prepped, new_data = NULL)) - 1

## [1] 247

After preprocessing there are 247 predictor columns. As Professor Coburn confirmed, this dimensionality is completely manageable for regularized regression and tree ensembles. Therefore, no dimension reduction (e.g., PCA) is required.

9 Model Fitting

Ames, Iowa. Source: Wikimedia Commons.

I fit four families of models that span the bias–variance spectrum, from a simple linear model to flexible tree ensembles.

# 1. Ordinary linear regression (baseline, no tuning)
lm_spec <- linear_reg() %>%
  set_engine("lm")

# 2. Elastic net: linear model with combined L1/L2 penalty (tune penalty & mixture)
en_spec <- linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")

# 3. Random forest (tune mtry and min_n)
rf_spec <- rand_forest(mtry = tune(), trees = 500, min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")

# 4. Gradient-boosted trees via XGBoost (tune trees, depth, learn rate)
bt_spec <- boost_tree(trees = tune(), tree_depth = tune(), learn_rate = tune()) %>% #Code Assisted by Claude
  set_engine("xgboost") %>%
  set_mode("regression")

# Pair each model with the recipe in a workflow
lm_wf <- workflow() %>% add_recipe(homes_recipe) %>% add_model(lm_spec)
en_wf <- workflow() %>% add_recipe(homes_recipe) %>% add_model(en_spec)
rf_wf <- workflow() %>% add_recipe(homes_recipe) %>% add_model(rf_spec)
bt_wf <- workflow() %>% add_recipe(homes_recipe) %>% add_model(bt_spec)

9.1 Linear regression (baseline)

The plain linear model has no tuning parameters, so I evaluate it directly with cross-validation.

lm_res <- fit_resamples(
  lm_wf, resamples = homes_folds,
  metrics = metric_set(rmse, rsq),
  control = control_resamples(save_pred = TRUE) #Code Assisted by Claude
)
collect_metrics(lm_res)

9.2 Elastic net

The elastic net adds a penalty to the regression that shrinks coefficients toward zero, controlling overfitting and constraining multicollinearity. The penalty controls the overall strength of shrinkage and mixture blends ridge (L2) and lasso (L1) behavior. I will now tune both over a grid.

en_grid <- grid_regular(
  penalty(range = c(-3, 0)),       # 10^-3 to 10^0
  mixture(range = c(0, 1)),
  levels = c(penalty = 10, mixture = 5)
)

en_res <- tune_grid(
  en_wf, resamples = homes_folds, grid = en_grid,
  metrics = metric_set(rmse, rsq)
)

autoplot(en_res) + labs(title = "Elastic net tuning (cross-validated RMSE)")

show_best(en_res, metric = "rmse", n = 5)

The tuning plot shows RMSE as a function of penalty for each mixture value. Cross-validated RMSE is minimized at a small penalty with a mostly-ridge mixture, achieving a log-scale RMSE of about 0.116;. This is already a clear improvement over the plain linear model.

9.3 Random forest

A random forest averages hundreds of decorrelated decision trees. They are each grown on a bootstrap sample using a random subset of predictors at every split. I tune mtry (the number of predictors considered per split) and min_n (the minimum node size).

rf_grid <- grid_regular(
  mtry(range = c(10, 80)),
  min_n(range = c(2, 20)),
  levels = 4
)

rf_res <- tune_grid(
  rf_wf, resamples = homes_folds, grid = rf_grid,
  metrics = metric_set(rmse, rsq)
)

autoplot(rf_res) + labs(title = "Random forest tuning (cross-validated RMSE)")

show_best(rf_res, metric = "rmse", n = 5)

The random forest reaches a cross-validated RMSE of roughly 0.138. It is robust and requires almost no preprocessing but on this dataset, it is slightly less accurate than the penalized linear model. Hence, this is a good reminder that more complex models are not automatically better.

9.4 Gradient-boosted trees (XGBoost)

Boosting builds trees sequentially, with each new tree correcting the errors of the ensemble so far. It is among the most accurate off-the-shelf methods for tabular data. I tune the number of trees, their depth, and the learning rate.

bt_grid <- grid_regular(
  trees(range = c(300, 1000)),
  tree_depth(range = c(2, 5)),
  learn_rate(range = c(-2, -1)),   # 10^-2 to 10^-1
  levels = 3
)

bt_res <- tune_grid(
  bt_wf, resamples = homes_folds, grid = bt_grid,
  metrics = metric_set(rmse, rsq)
)

autoplot(bt_res) + labs(title = "XGBoost tuning (cross-validated RMSE)")

show_best(bt_res, metric = "rmse", n = 5)

With a moderate number of shallow trees and a small learning rate, XGBoost achieves a cross-validated RMSE of about 0.122. It is essentially tied with the elastic net for the best performance and well ahead of the random forest and plain linear model.

10 Model Selection and Performance

I will now collect the best cross-validated RMSE from each of the four models to compare them on equal footing.

cv_results <- bind_rows(
  collect_metrics(lm_res) %>% filter(.metric == "rmse") %>%
    transmute(model = "Linear regression", cv_rmse = mean),
  show_best(en_res, metric = "rmse", n = 1) %>%
    transmute(model = "Elastic net", cv_rmse = mean),
  show_best(rf_res, metric = "rmse", n = 1) %>%
    transmute(model = "Random forest", cv_rmse = mean),
  show_best(bt_res, metric = "rmse", n = 1) %>%
    transmute(model = "XGBoost", cv_rmse = mean)
) %>%
  arrange(cv_rmse)

cv_results %>%
  kable(digits = 4, caption = "Cross-validated RMSE (log scale) by model") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped","hover"))

Cross-validated RMSE (log scale) by model
model	cv_rmse
Elastic net	0.1156
XGBoost	0.1220
Linear regression	0.1305
Random forest	0.1376

ggplot(cv_results, aes(x = reorder(model, -cv_rmse), y = cv_rmse, fill = model)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = round(cv_rmse, 4)), hjust = -0.1) + #Code Assisted by Claude
  coord_flip(ylim = c(0, max(cv_results$cv_rmse) * 1.15)) +
  labs(title = "Cross-validated RMSE by model (lower is better)",
       x = NULL, y = "CV RMSE (log scale)")

The two top performers are the elastic net and XGBoost, both around 0.11-0.12 on the log scale and basically tied for the best cross-validated RMSE. They are comfortably ahead of the random forest and the plain linear model. Following the rubric’s guidance to evaluate only the best one or two models on the test set, I finalize both and assess them on the held-out 30% of homes that no model has seen.

# Finalize the two best workflows with their optimal hyperparameters
best_en <- select_best(en_res, metric = "rmse")
best_bt <- select_best(bt_res, metric = "rmse")

en_final_fit <- finalize_workflow(en_wf, best_en) %>% last_fit(homes_split) #Code Assisted by Claude
bt_final_fit <- finalize_workflow(bt_wf, best_bt) %>% last_fit(homes_split) #Code Assisted by Claude

test_results <- bind_rows(
  collect_metrics(en_final_fit) %>% mutate(model = "Elastic net"),
  collect_metrics(bt_final_fit) %>% mutate(model = "XGBoost")
) %>%
  select(model, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate)

test_results %>%
  kable(digits = 4, caption = "Final performance on the held-out test set (log scale)") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped","hover"))

Final performance on the held-out test set (log scale)
model	rmse	rsq
Elastic net	0.1099	0.9197
XGBoost	0.1140	0.9128

On the test set, both models achieve a log-scale RMSE of approximately 0.112 and an R² of roughly 0.91-0.92, meaning they explain about 91-92% of the variation in (log) sale price on homes they were never trained on. The close agreement between the cross-validated and test RMSE is evidence that the models are not overfitting. Translated back to dollars, an RMSE of ~0.12 on the log scale corresponds to a typical prediction error of roughly $19,000. This is quite accurate for homes whose prices span from under $50,000 to over $700,000.

en_preds <- collect_predictions(en_final_fit) %>% mutate(model = "Elastic net") #Code Assisted by Claude
bt_preds <- collect_predictions(bt_final_fit) %>% mutate(model = "XGBoost")

bind_rows(en_preds, bt_preds) %>%
  ggplot(aes(x = SalePrice, y = .pred)) +
  geom_point(alpha = 0.4, color = "#253494") + #Code Assisted by Claude
  geom_abline(slope = 1, intercept = 0, color = "#e34a33", linewidth = 1) + #Code Assisted by Claude
  facet_wrap(~ model) + #Code Assisted by Claude
  labs(title = "Predicted vs. actual log sale price (test set)",
       x = "Actual log(SalePrice)", y = "Predicted log(SalePrice)")

In the predicted-versus-actual plots, points cluster tightly around the red 45° line (where a perfect prediction would fall), confirming the models track true prices closely across the entire price range. They only modestly scatter at the extreme high end.

10.1 Which features matter most?

Because the random forest and XGBoost models provide variable-importance scores, I use them to answer the project’s descriptive question: which features drive price?

bt_final_fit %>%
  extract_fit_parsnip() %>% #Code Assisted by Claude
  vip(num_features = 15, geom = "col", aesthetics = list(fill = "#225ea8")) + #Code Assisted by Claude
  labs(title = "Top 15 most important features (XGBoost)")

The variable-importance ranking confirms what the EDA suggested: overall quality (OverallQual), above-ground living area (GrLivArea), total basement / first-floor square footage, garage capacity, and the year built / remodeled dominate the model’s predictions. There are also a few influential neighborhood and external-quality indicators. In other words, the size, quality, age, and location of a home account for the vast majority of its price. I find this conclusion both satisfying and intuitive.

11 Conclusion

Ames, Iowa. Source: Wikimedia Commons.

This project set out to predict residential sale prices in Ames, Iowa and to identify the features that most influence them. After cleaning the data, recoding structural missingness, and log-transforming the right-skewed response, I fit and tuned four families of regression models using stratified 5-fold cross-validation. The two strongest models [a regularized elastic net and a gradient-boosted tree (XGBoost)] were nearly indistinguishable. They both achieved a cross-validated and test-set RMSE of about 0.11 on the log scale and explained close to 92% of the variation in sale price. The typical errors were around $19,000. The random forest trailed slightly. And then unsurprisingly, the plain linear regression was the weakest, constrained by multicollinearity that the penalized and ensemble methods handled with ease. The agreement between cross-validation and test performance indicates the final models generalize well rather than overfitting.

Perhaps the most interesting result is how competitive the simple elastic net was against the much more flexible boosting model. On this well-behaved, mostly-additive dataset, a carefully regularized linear model captures nearly all of the signal. Its interpretability is a genuine advantage for real-estate applications. The variable- importance analysis reinforced a clear and intuitive story: a home’s quality, size, age, and location explain the overwhelming majority of its value. For future work, I would explore richer feature engineering (for example, interaction terms, total-square-footage features, and an explicit age variable), stacking the elastic net and XGBoost into an ensemble (a technique known to top the Kaggle leaderboard for this dataset) and validating the model on more recent sales or homes from other cities to test how well it transfers beyond Ames. Overall, the project demonstrates a complete, honest machine-learning workflow and produces a model accurate enough to serve as a credible automated home-valuation tool. ```

Predicting Residential Home Sale Prices in Ames, Iowa

PSTAT 131 Final Project

Shiv Dutta

2026-06-09

1 Introduction

2 Data Citation and Source

3 Loading the Data

4 Tidying and Cleaning the Data

5 Missing Data

6 Exploratory Data Analysis

6.1 The response variable: SalePrice

6.2 Overall quality vs. price

6.3 Living area vs. price

6.4 Correlation among numeric predictors

6.5 Price by neighborhood

7 Data Splitting and Cross-Validation

8 Building the Recipe

9 Model Fitting

9.1 Linear regression (baseline)

9.2 Elastic net

9.3 Random forest

9.4 Gradient-boosted trees (XGBoost)

10 Model Selection and Performance

10.1 Which features matter most?

11 Conclusion