Buying or selling a home is among the largest financial decisions
most people ever make. A central question in real estate is fairly
simple: what is a house actually worth? In this project, I will
build and compare a series of statistical machine learning models that
predict the sale price of a residential home from its
physical characteristics, location, and condition. Because the quantity
I am predicting (SalePrice) is a continuous number measured
in U.S. dollars, this is a regression problem. This is
not a classification problem, in which the goal would
be to predict a discrete category such as “expensive” versus “cheap.”
Throughout the report, I measure predictive accuracy using root
mean squared error (RMSE), the square root of the average
squared difference between predicted and observed values. A smaller RMSE
means a model’s predictions land closer to the truth.
The data comes from the Ames Housing dataset, a classic dataset compiled by Dean De Cock describing residential sales in Ames, Iowa between 2006 and 2010. Each row is a single home sale, and the columns record 79 explanatory variables: everything from above- ground living area and basement size to neighborhood, overall quality, and the year the home was built. The dataset is rich and realistic: it mixes continuous measurements, counts, ordered quality ratings, and unordered categories, and it contains a substantial amount of missing data, much of which is structurally meaningful. For example, a missing “pool quality” simply means the home has no pool. Working with data like this requires careful cleaning, thoughtful feature engineering, and honest model evaluation, all of which are important skills in applied machine learning.
This project has two complementary goals. The first is a
predictive one: to build a model that estimates sale
prices as accurately as possible on homes it has never seen, mirroring
the “automated valuation models” used by companies such as Zillow and
Redfin. The second goal is descriptive: to understand
which features matter most in determining price. To those ends I fit
four very different families of models: ordinary linear regression, a
regularized elastic net, a random
forest, and a gradient-boosted tree (XGBoost).
And then, I tune them using cross-validation, and compare their
performance. Following standard practice for this dataset, I model the
logarithm of SalePrice so that prediction
errors are penalized proportionally rather than in absolute dollars, and
I report RMSE on that same log scale.
The dataset is publicly available through Kaggle’s House Prices: Advanced Regression Techniques competition and originates from the following peer-reviewed source:
De Cock, D. (2011). Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project. Journal of Statistics Education, 19(3). https://doi.org/10.1080/10691898.2011.11889627
Dataset link: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data
A complete description of every variable is provided in the
accompanying codebook (codebook.doc)
submitted with this project.
## [1] 1460 81
The raw training set contains 1460 home sales and 81 columns: an
Id index, 79 predictor variables, and the response variable
SalePrice. Here is a quick look at a few key columns
confirms the mix of numeric and categorical information:
homes_raw %>%
select(SalePrice, OverallQual, GrLivArea, Neighborhood, YearBuilt, TotalBsmtSF) %>%
head(6)Before any analysis, I am going to perform a small number of cleaning steps.
Dropping the index. The Id column is
just a row identifier and carries no predictive information, so I will
remove it.
Removing extreme outliers. De Cock specifically recommends removing a handful of very large homes (above 4,000 square feet of above-ground living area) that sold under unusual “partial” conditions and behave as leverage points. There are only a few homes like this and removing them will produce a far more stable model.
Recoding structural missingness. For many
categorical variables, an NA does not mean the value is
unknown; it means the feature is absent. For example,
PoolQC is NA precisely when a home has no
pool. For these variables, I recode NA to the explicit
category "None" so the model can learn from the absence of
the feature.
# Variables where NA means "feature is absent"
none_vars <- c("PoolQC","MiscFeature","Alley","Fence","FireplaceQu",
"GarageType","GarageFinish","GarageQual","GarageCond",
"BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1",
"BsmtFinType2","MasVnrType")
homes <- homes_raw %>%
select(-Id) %>%
filter(GrLivArea < 4000) %>% # De Cock-recommended outlier removal
mutate(across(all_of(none_vars), ~ replace_na(as.character(.x), "None"))) %>%
mutate(across(where(is.character), as.factor))
cat("Rows removed as outliers:", nrow(homes_raw) - nrow(homes), "\n")## Rows removed as outliers: 4
## Final dataset: 1456 rows, 80 columns
After cleaning, the working dataset has 1456 homes. Any remaining
missing values (in genuinely numeric fields such as
LotFrontage, and the single missing Electrical
value) are handled inside the modeling recipe through
imputation. This is described later so that imputation is learned only
from training data and never leaks information from the testing set.
Understanding the pattern of missingness is very important. The table below shows the variables with the most missing values in the raw data, expressed as a percentage of all of the rows.
missing_summary <- homes_raw %>%
summarise(across(everything(), ~ sum(is.na(.x)))) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "n_missing") %>%
filter(n_missing > 0) %>%
mutate(pct_missing = round(100 * n_missing / nrow(homes_raw), 1)) %>%
arrange(desc(n_missing))
missing_summary %>%
kable(caption = "Variables containing missing values (raw data)") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped","hover"))| Variable | n_missing | pct_missing |
|---|---|---|
| PoolQC | 1453 | 99.5 |
| MiscFeature | 1406 | 96.3 |
| Alley | 1369 | 93.8 |
| Fence | 1179 | 80.8 |
| FireplaceQu | 690 | 47.3 |
| LotFrontage | 259 | 17.7 |
| GarageType | 81 | 5.5 |
| GarageYrBlt | 81 | 5.5 |
| GarageFinish | 81 | 5.5 |
| GarageQual | 81 | 5.5 |
| GarageCond | 81 | 5.5 |
| BsmtExposure | 38 | 2.6 |
| BsmtFinType2 | 38 | 2.6 |
| BsmtQual | 37 | 2.5 |
| BsmtCond | 37 | 2.5 |
| BsmtFinType1 | 37 | 2.5 |
| MasVnrType | 8 | 0.5 |
| MasVnrArea | 8 | 0.5 |
| Electrical | 1 | 0.1 |
missing_summary %>%
slice_max(pct_missing, n = 15) %>%
ggplot(aes(x = reorder(Variable, pct_missing), y = pct_missing)) +
geom_col(fill = "#2c7fb8") + #Code Assisted by Claude
coord_flip() + #Code Assisted by Claude
labs(title = "Top 15 variables by percent missing",
x = NULL, y = "Percent missing")The missingness is highly concentrated and, importantly, mostly
structural. The four variables with the most missing data
[PoolQC (~99.5%), MiscFeature (~96%),
Alley (~94%), and Fence (~81%)] are missing
simply because the overwhelming majority of homes have no pool, no
special miscellaneous feature, no alley access, and no fence. The same
logic applies to the basement and garage quality variables. As described
above, I recoded all of these to an explicit "None"
category rather than discarding them, preserving the genuinely useful
signal that “this home has no garage/pool/fence.” Only a small set of
variables represent true missingness: LotFrontage (~18%),
which I impute with the median, and a single missing
Electrical value, which I impute with the most common
category. Because every variable is either recoded or imputed, no
observations are dropped due to missing data.
I will now explore the data visually to understand the response variable, its relationships with key predictors, and the correlation structure among numeric features.
p1 <- ggplot(homes, aes(SalePrice)) +
geom_histogram(bins = 50, fill = "#41b6c4", color = "white") +
scale_x_continuous(labels = scales::dollar) + #Code Assisted by Claude
labs(title = "Sale price (raw)", x = "Sale price", y = "Count")
p2 <- ggplot(homes, aes(log(SalePrice))) +
geom_histogram(bins = 50, fill = "#225ea8", color = "white") +
labs(title = "Sale price (log scale)", x = "log(Sale price)", y = "Count")
# Summary statistics referenced in the narration below (computed, not hard-coded)
sp_median <- median(homes$SalePrice)
sp_mean <- mean(homes$SalePrice)
skew <- function(x) { x <- x[!is.na(x)]; m <- mean(x); mean((x - m)^3) / mean((x - m)^2)^1.5 }
skew_raw <- skew(homes$SalePrice)
skew_log <- skew(log(homes$SalePrice))
p1 + p2The raw distribution of SalePrice is strongly
right-skewed: most homes sell for between roughly $100,000 and $250,000,
but a long tail of expensive homes stretches past $500,000 (median ≈
$163,000, mean ≈ $180,151). This skew is problematic for many models,
which implicitly assume errors are symmetric. Taking the natural
logarithm (right image) produces a distribution that is very nearly
symmetric and bell-shaped. the skewness drops from about 1.6 to roughly
0.1. This is exactly why I model log(SalePrice): it
stabilizes the variance, reduces the influence of the most expensive
homes, and means the model is penalized for percentage errors rather
than raw-dollar errors. This is the more appropriate measure of accuracy
when home prices vary by an order of magnitude.
ggplot(homes, aes(x = factor(OverallQual), y = SalePrice, fill = factor(OverallQual))) +
geom_boxplot(show.legend = FALSE) +
scale_y_continuous(labels = scales::dollar) +
scale_fill_viridis_d() + #Code Assisted by Claude
labs(title = "Sale price by overall quality rating",
x = "Overall quality (1 = very poor, 10 = very excellent)",
y = "Sale price")OverallQual : an expert 1-10 rating of the home’s
material and finish has an almost monotonic relationship with price.
Median sale price climbs from roughly $50,000 for the lowest-quality
homes to well over $400,000 for the highest, and each step up the
quality scale corresponds to a clear jump in price. This single ordinal
variable turns out to be the strongest predictor in the entire data set,
which matches real-estate intuition: the actual quality of construction
and finish drives value.
ggplot(homes, aes(x = GrLivArea, y = SalePrice)) +
geom_point(alpha = 0.4, color = "#253494") + #Code Assisted by Claude
geom_smooth(method = "lm", se = FALSE, color = "#e34a33") + #Code Assisted by Claude
scale_y_continuous(labels = scales::dollar) +
labs(title = "Sale price vs. above-ground living area",
x = "Above-ground living area (sq ft)", y = "Sale price")There is a strong, roughly linear positive relationship between
above-ground living area (GrLivArea) and sale price: bigger
homes cost more, with a correlation of about 0.72. The relationship is
tight enough that living area alone explains a large portion of price
variation. Although the scatter widens at larger sizes, large homes vary
more in price because factors like quality and neighborhood increasingly
also come into play.
num_vars <- homes %>%
select(where(is.numeric)) %>%
mutate(logSalePrice = log(SalePrice)) %>%
select(-SalePrice)
cor_mat <- cor(num_vars, use = "pairwise.complete.obs")
# show the predictors most correlated with log price
top_cor <- sort(abs(cor_mat[, "logSalePrice"]), decreasing = TRUE)
keep <- names(top_cor)[1:15]
corrplot(cor_mat[keep, keep], method = "color", type = "upper",
tl.col = "black", tl.cex = 0.8, addCoef.col = "grey30",
number.cex = 0.55, title = "Correlations among top numeric predictors",
mar = c(0,0,1.5,0))The correlation heatmap (restricted to the 15 numeric variables most
associated with log price, for readability) reveals two important
things. First, the predictors most strongly correlated with price are
OverallQual (0.82 with log price), GrLivArea
(0.72), GarageCars/GarageArea (0.68/0.66),
TotalBsmtSF and 1stFlrSF (0.64/0.61), and the
YearBuilt (0.59). Second, several predictors are strongly
correlated with each other: GarageCars with
GarageArea, and TotalBsmtSF with
1stFlrSF. This implies multicollinearity. Multicollinearity
destabilizes ordinary linear regression coefficients and is a key reason
I chose to include regularized and tree-based models, which handle
correlated predictors much more gracefully.
homes %>%
group_by(Neighborhood) %>%
summarise(median_price = median(SalePrice), .groups = "drop") %>%
ggplot(aes(x = reorder(Neighborhood, median_price), y = median_price)) +
geom_col(fill = "#7fcdbb") + #Code Assisted by Claude
coord_flip() +
scale_y_continuous(labels = scales::dollar) +
labs(title = "Median sale price by neighborhood",
x = NULL, y = "Median sale price")Location matters enormously. Median prices range from under $90,000 in the most affordable neighborhoods (MeadowV, IDOTRR, BrDale) to over $300,000 in the most desirable ones (StoneBr, NoRidge, NridgHt). This is more than a 3x difference driven purely by location. With 25 distinct neighborhoods, this single categorical variable captures a large amount of spatial price variation and will be one-hot encoded so the models can use it.
A model that simply memorizes the data it was trained on is worthless; what we care about is performance on new and unseen homes. To estimate that honestly, I split the data into two parts before beginning my modeling:
The split is stratified on SalePrice,
meaning the full range of prices (cheap to expensive) is represented
proportionally in both sets. If I don’t stratify properly, a random
split could by chance place too many expensive homes in one set and bias
the evaluation.
# Model the log of SalePrice. We transform the outcome here, BEFORE splitting,
# so that all RMSE/R-squared values are computed on the log scale
# and there is no risk of scale mismatch at test time.
# The original `homes` (raw dollars) is kept above purely for the EDA plots.
homes_model <- homes %>% mutate(SalePrice = log(SalePrice))
set.seed(131)
homes_split <- initial_split(homes_model, prop = 0.70, strata = SalePrice)
homes_train <- training(homes_split)
homes_test <- testing(homes_split)
c(train = nrow(homes_train), test = nrow(homes_test))## train test
## 1017 439
To tune model hyperparameters without touching the test set, I use
k-fold cross-validation on the training data. The idea
is simple but powerful: the training set is separated into k
equal “folds.” The model is trained on k−1 folds and evaluated
on the held-out fold; this repeats k times so every fold serves
as the validation set once, and the k error estimates are
averaged. This gives a stable estimate of how well a model will
generalize and lets me compare hyperparameter settings fairly. I use
5 folds, also stratified on SalePrice.
A tidymodels recipe is a reusable specification of
all the preprocessing the predictors need. My recipe predicts
SalePrice (already log-transformed in the splitting step
above) from every other variable and performs the following steps, in
order:
step_novel / step_other:
Protect against categories seen only at prediction time, and pool rare
categorical levels (those appearing in <1% of homes) into a single
"other" level. This shrinks the number of dummy columns and
prevents unstable coefficients from extremely rare levels.Neighborhood.step_zv: Remove any predictor with
zero variance (e.g., a dummy that is always 0 after splitting).homes_recipe <- recipe(SalePrice ~ ., data = homes_train) %>%
step_impute_median(all_numeric_predictors()) %>% #Code Assisted by Claude
step_impute_mode(all_nominal_predictors()) %>% #Code Assisted by Claude
step_normalize(all_numeric_predictors()) %>% #Code Assisted by Claude
step_novel(all_nominal_predictors()) %>% #Code Assisted by Claude
step_other(all_nominal_predictors(), threshold = 0.01) %>%
step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
step_zv(all_predictors()) #Code Assisted by Claude
# Confirm the recipe prepares cleanly and see how many predictors result
prepped <- prep(homes_recipe)
ncol(bake(prepped, new_data = NULL)) - 1## [1] 247
After preprocessing there are 247 predictor columns. As Professor Coburn confirmed, this dimensionality is completely manageable for regularized regression and tree ensembles. Therefore, no dimension reduction (e.g., PCA) is required.
I fit four families of models that span the bias–variance spectrum, from a simple linear model to flexible tree ensembles.
# 1. Ordinary linear regression (baseline, no tuning)
lm_spec <- linear_reg() %>%
set_engine("lm")
# 2. Elastic net: linear model with combined L1/L2 penalty (tune penalty & mixture)
en_spec <- linear_reg(penalty = tune(), mixture = tune()) %>%
set_engine("glmnet")
# 3. Random forest (tune mtry and min_n)
rf_spec <- rand_forest(mtry = tune(), trees = 500, min_n = tune()) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("regression")
# 4. Gradient-boosted trees via XGBoost (tune trees, depth, learn rate)
bt_spec <- boost_tree(trees = tune(), tree_depth = tune(), learn_rate = tune()) %>% #Code Assisted by Claude
set_engine("xgboost") %>%
set_mode("regression")
# Pair each model with the recipe in a workflow
lm_wf <- workflow() %>% add_recipe(homes_recipe) %>% add_model(lm_spec)
en_wf <- workflow() %>% add_recipe(homes_recipe) %>% add_model(en_spec)
rf_wf <- workflow() %>% add_recipe(homes_recipe) %>% add_model(rf_spec)
bt_wf <- workflow() %>% add_recipe(homes_recipe) %>% add_model(bt_spec)The plain linear model has no tuning parameters, so I evaluate it directly with cross-validation.
lm_res <- fit_resamples(
lm_wf, resamples = homes_folds,
metrics = metric_set(rmse, rsq),
control = control_resamples(save_pred = TRUE) #Code Assisted by Claude
)
collect_metrics(lm_res)The elastic net adds a penalty to the regression
that shrinks coefficients toward zero, controlling overfitting and
constraining multicollinearity. The penalty controls the
overall strength of shrinkage and mixture blends ridge (L2)
and lasso (L1) behavior. I will now tune both over a grid.
en_grid <- grid_regular(
penalty(range = c(-3, 0)), # 10^-3 to 10^0
mixture(range = c(0, 1)),
levels = c(penalty = 10, mixture = 5)
)
en_res <- tune_grid(
en_wf, resamples = homes_folds, grid = en_grid,
metrics = metric_set(rmse, rsq)
)
autoplot(en_res) + labs(title = "Elastic net tuning (cross-validated RMSE)")The tuning plot shows RMSE as a function of penalty for each mixture value. Cross-validated RMSE is minimized at a small penalty with a mostly-ridge mixture, achieving a log-scale RMSE of about 0.116;. This is already a clear improvement over the plain linear model.
A random forest averages hundreds of decorrelated
decision trees. They are each grown on a bootstrap sample using a random
subset of predictors at every split. I tune mtry (the
number of predictors considered per split) and min_n (the
minimum node size).
rf_grid <- grid_regular(
mtry(range = c(10, 80)),
min_n(range = c(2, 20)),
levels = 4
)
rf_res <- tune_grid(
rf_wf, resamples = homes_folds, grid = rf_grid,
metrics = metric_set(rmse, rsq)
)
autoplot(rf_res) + labs(title = "Random forest tuning (cross-validated RMSE)")The random forest reaches a cross-validated RMSE of roughly 0.138. It is robust and requires almost no preprocessing but on this dataset, it is slightly less accurate than the penalized linear model. Hence, this is a good reminder that more complex models are not automatically better.
Boosting builds trees sequentially, with each new tree correcting the errors of the ensemble so far. It is among the most accurate off-the-shelf methods for tabular data. I tune the number of trees, their depth, and the learning rate.
bt_grid <- grid_regular(
trees(range = c(300, 1000)),
tree_depth(range = c(2, 5)),
learn_rate(range = c(-2, -1)), # 10^-2 to 10^-1
levels = 3
)
bt_res <- tune_grid(
bt_wf, resamples = homes_folds, grid = bt_grid,
metrics = metric_set(rmse, rsq)
)
autoplot(bt_res) + labs(title = "XGBoost tuning (cross-validated RMSE)")With a moderate number of shallow trees and a small learning rate, XGBoost achieves a cross-validated RMSE of about 0.122. It is essentially tied with the elastic net for the best performance and well ahead of the random forest and plain linear model.
I will now collect the best cross-validated RMSE from each of the four models to compare them on equal footing.
cv_results <- bind_rows(
collect_metrics(lm_res) %>% filter(.metric == "rmse") %>%
transmute(model = "Linear regression", cv_rmse = mean),
show_best(en_res, metric = "rmse", n = 1) %>%
transmute(model = "Elastic net", cv_rmse = mean),
show_best(rf_res, metric = "rmse", n = 1) %>%
transmute(model = "Random forest", cv_rmse = mean),
show_best(bt_res, metric = "rmse", n = 1) %>%
transmute(model = "XGBoost", cv_rmse = mean)
) %>%
arrange(cv_rmse)
cv_results %>%
kable(digits = 4, caption = "Cross-validated RMSE (log scale) by model") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped","hover"))| model | cv_rmse |
|---|---|
| Elastic net | 0.1156 |
| XGBoost | 0.1220 |
| Linear regression | 0.1305 |
| Random forest | 0.1376 |
ggplot(cv_results, aes(x = reorder(model, -cv_rmse), y = cv_rmse, fill = model)) +
geom_col(show.legend = FALSE) +
geom_text(aes(label = round(cv_rmse, 4)), hjust = -0.1) + #Code Assisted by Claude
coord_flip(ylim = c(0, max(cv_results$cv_rmse) * 1.15)) +
labs(title = "Cross-validated RMSE by model (lower is better)",
x = NULL, y = "CV RMSE (log scale)")The two top performers are the elastic net and XGBoost, both around 0.11-0.12 on the log scale and basically tied for the best cross-validated RMSE. They are comfortably ahead of the random forest and the plain linear model. Following the rubric’s guidance to evaluate only the best one or two models on the test set, I finalize both and assess them on the held-out 30% of homes that no model has seen.
# Finalize the two best workflows with their optimal hyperparameters
best_en <- select_best(en_res, metric = "rmse")
best_bt <- select_best(bt_res, metric = "rmse")
en_final_fit <- finalize_workflow(en_wf, best_en) %>% last_fit(homes_split) #Code Assisted by Claude
bt_final_fit <- finalize_workflow(bt_wf, best_bt) %>% last_fit(homes_split) #Code Assisted by Claude
test_results <- bind_rows(
collect_metrics(en_final_fit) %>% mutate(model = "Elastic net"),
collect_metrics(bt_final_fit) %>% mutate(model = "XGBoost")
) %>%
select(model, .metric, .estimate) %>%
pivot_wider(names_from = .metric, values_from = .estimate)
test_results %>%
kable(digits = 4, caption = "Final performance on the held-out test set (log scale)") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped","hover"))| model | rmse | rsq |
|---|---|---|
| Elastic net | 0.1099 | 0.9197 |
| XGBoost | 0.1140 | 0.9128 |
On the test set, both models achieve a log-scale RMSE of approximately 0.112 and an R² of roughly 0.91-0.92, meaning they explain about 91-92% of the variation in (log) sale price on homes they were never trained on. The close agreement between the cross-validated and test RMSE is evidence that the models are not overfitting. Translated back to dollars, an RMSE of ~0.12 on the log scale corresponds to a typical prediction error of roughly $19,000. This is quite accurate for homes whose prices span from under $50,000 to over $700,000.
en_preds <- collect_predictions(en_final_fit) %>% mutate(model = "Elastic net") #Code Assisted by Claude
bt_preds <- collect_predictions(bt_final_fit) %>% mutate(model = "XGBoost")
bind_rows(en_preds, bt_preds) %>%
ggplot(aes(x = SalePrice, y = .pred)) +
geom_point(alpha = 0.4, color = "#253494") + #Code Assisted by Claude
geom_abline(slope = 1, intercept = 0, color = "#e34a33", linewidth = 1) + #Code Assisted by Claude
facet_wrap(~ model) + #Code Assisted by Claude
labs(title = "Predicted vs. actual log sale price (test set)",
x = "Actual log(SalePrice)", y = "Predicted log(SalePrice)")In the predicted-versus-actual plots, points cluster tightly around the red 45° line (where a perfect prediction would fall), confirming the models track true prices closely across the entire price range. They only modestly scatter at the extreme high end.
Because the random forest and XGBoost models provide variable-importance scores, I use them to answer the project’s descriptive question: which features drive price?
bt_final_fit %>%
extract_fit_parsnip() %>% #Code Assisted by Claude
vip(num_features = 15, geom = "col", aesthetics = list(fill = "#225ea8")) + #Code Assisted by Claude
labs(title = "Top 15 most important features (XGBoost)")The variable-importance ranking confirms what the EDA suggested:
overall quality (OverallQual),
above-ground living area (GrLivArea),
total basement / first-floor square footage,
garage capacity, and the year built /
remodeled dominate the model’s predictions. There are also a
few influential neighborhood and external-quality indicators. In other
words, the size, quality, age, and location of a home account for the
vast majority of its price. I find this conclusion both satisfying and
intuitive.
This project set out to predict residential sale prices in Ames, Iowa and to identify the features that most influence them. After cleaning the data, recoding structural missingness, and log-transforming the right-skewed response, I fit and tuned four families of regression models using stratified 5-fold cross-validation. The two strongest models [a regularized elastic net and a gradient-boosted tree (XGBoost)] were nearly indistinguishable. They both achieved a cross-validated and test-set RMSE of about 0.11 on the log scale and explained close to 92% of the variation in sale price. The typical errors were around $19,000. The random forest trailed slightly. And then unsurprisingly, the plain linear regression was the weakest, constrained by multicollinearity that the penalized and ensemble methods handled with ease. The agreement between cross-validation and test performance indicates the final models generalize well rather than overfitting.
Perhaps the most interesting result is how competitive the simple elastic net was against the much more flexible boosting model. On this well-behaved, mostly-additive dataset, a carefully regularized linear model captures nearly all of the signal. Its interpretability is a genuine advantage for real-estate applications. The variable- importance analysis reinforced a clear and intuitive story: a home’s quality, size, age, and location explain the overwhelming majority of its value. For future work, I would explore richer feature engineering (for example, interaction terms, total-square-footage features, and an explicit age variable), stacking the elastic net and XGBoost into an ensemble (a technique known to top the Kaggle leaderboard for this dataset) and validating the model on more recent sales or homes from other cities to test how well it transfers beyond Ames. Overall, the project demonstrates a complete, honest machine-learning workflow and produces a model accurate enough to serve as a credible automated home-valuation tool. ```