Hi all! 👋 This is my first project organized by R markdown. The goal of this project is to predict the price of each house(\(Y\)). I used {ranger} package for forecasting house prices, which is a fast implementation of random forests. I did some works before modeling, including imputations, label encoding, and factorizing variables. Also, I’ll try to improve forecast performances with other models later. Now, let’s go dive into details! 🔥
library(tidyverse)
library(vroom)
library(mlr)
library(randomForest)
library(scales)
library(rlist) # list.save(), list.load()
loaded_package <- c("tidyverse", "vroom", "mlr", "randomForest", "scales", "rlist")
.version <- map(loaded_package, packageVersion)
names(.version) <- loaded_package
.version
## $tidyverse
## [1] '1.3.0'
##
## $vroom
## [1] '1.3.2'
##
## $mlr
## [1] '2.18.0'
##
## $randomForest
## [1] '4.6.14'
##
## $scales
## [1] '1.1.1'
##
## $rlist
## [1] '0.4.6.1'
## Rows: 1,460
## Columns: 81
## Delimiter: ","
## chr [43]: MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConfig, LandSlope,...
## dbl [38]: Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, YearBuilt, YearR...
##
## Use `spec()` to retrieve the guessed column specification
## Pass a specification to the `col_types` argument to quiet this message
## user system elapsed
## 0.10 0.01 0.14
## Rows: 1,459
## Columns: 80
## Delimiter: ","
## chr [42]: MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConfig, LandSlope,...
## dbl [37]: Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, YearBuilt, YearR...
## lgl [ 1]: PoolQC
##
## Use `spec()` to retrieve the guessed column specification
## Pass a specification to the `col_types` argument to quiet this message
## user system elapsed
## 0.03 0.00 0.06
## There are 1460 observations and 81 variables in training dataset.
## There are 1459 observations and 80 variables in test dataset.
And then, let’s prepare a submission file.
I’m going to do some works before modeling. First of all, I checked the distribution of a target variable \(Y\)(or dependent variable, “SalePrice”) and missing values. Second, I did re-encode labels for categorical variables, which have typos. then, I did imputations in various ways for the rest of the variables with missing values. Fourth, I did factorize the remaining character variables. Finally, I modified some variables which violate variable naming conventions in R.
Let’s check details!😄 before diving, bind training and test set to check entire dataset.
ggplot(data = house %>% filter(!is.na(SalePrice))) +
geom_histogram(aes(x = SalePrice), fill = "blue", alpha = 1/2, binwidth = 10000) +
scale_x_continuous(labels = dollar_format()) +
labs(
title = "SalePrice(Y) is skewed to the right"
) +
theme(
plot.title = element_text(hjust = 0.5, size = 15, face = "bold"),
)
I’m going to check the proportions of missing values per columns. If you want to know the number of missing values per columns, use sum() function in map_dfr().
na_prop <- house %>%
select(-SalePrice) %>%
map(is.na) %>%
map_dfr(mean) %>%
pivot_longer(cols = everything(), names_to = "variables", values_to = "prop") %>%
filter(prop > 0) %>%
arrange(desc(prop))
na_prop %>%
ggplot(aes(x = fct_reorder(variables, prop), y = prop, fill = variables)) +
geom_bar(stat = "identity") +
coord_flip() +
theme(legend.position = "none") +
labs(
x = "Explanatory variables",
y = "The proportions of NA values per column"
) +
scale_y_continuous(breaks = seq(0, 1, by = 0.1)) +
theme(axis.text.y = element_text(size = 10))
There are many variables with missing values.😩 But, we can do it! let’s go.
First, I’m going to do re-encode labels for categorical variables with typos. The details of the each variables are below. Refer this to get the details of other variables.
I specify the levels when factorizing ordered factors.
house$PoolQC[is.na(house$PoolQC)] <- "None"
house$MiscFeature[is.na(house$MiscFeature)] <- "None"
house$Alley[is.na(house$Alley)] <- "No"
house$Fence[is.na(house$Fence)] <- "No"
house$FireplaceQu[is.na(house$FireplaceQu)] <- "No"
house$GarageType[is.na(house$GarageType)] <- "No"
house$GarageFinish[is.na(house$GarageFinish)] <- "No"
house$GarageQual[is.na(house$GarageQual)] <- "No"
house$GarageCond[is.na(house$GarageCond)] <- "No"
house$BsmtExposure[is.na(house$BsmtExposure)] <- "NoBs"
house$BsmtCond[is.na(house$BsmtCond)] <- "NoBs"
house$BsmtQual[is.na(house$BsmtQual)] <- "NoBs"
house$BsmtFinType1[is.na(house$BsmtFinType1)] <- "NoBs"
house$BsmtFinType2[is.na(house$BsmtFinType2)] <- "NoBs"
# To specify the levels of ordered factors
PoolQC_lev <- c("None", "Fa", "TA", "Gd", "Ex")
Fence_lev <- c("No", "MnWw", "GdWo", "MnPrv", "GdPrv")
FireplaceQu_lev <- c("No", "Po", "Fa", "TA", "Gd", "Ex")
GarageFinish_lev <- c("No", "Unf", "RFn", "Fin")
GarageQual_lev <- c("No", "Po", "Fa", "TA", "Gd", "Ex")
GarageCond_lev <- c("No", "Po", "Fa", "TA", "Gd", "Ex")
BsmtExposure_lev <- c("NoBs", "No", "Mn", "Av", "Gd")
BsmtCond_lev <- c("NoBs", "Po", "Fa", "TA", "Gd", "Ex")
BsmtQual_lev <- c("NoBs", "Po", "Fa", "TA", "Gd", "Ex")
BsmtFinType1_lev <- c("NoBs", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ")
BsmtFinType2_lev <- c("NoBs", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ")
house2 <- house %>%
mutate(PoolQC = parse_factor(PoolQC, levels = PoolQC_lev, ordered = TRUE),
MiscFeature = parse_factor(MiscFeature),
Alley = parse_factor(Alley),
Fence = parse_factor(Fence, levels = Fence_lev, ordered = TRUE),
FireplaceQu = parse_factor(FireplaceQu, levels = FireplaceQu_lev, ordered = TRUE),
GarageType = parse_factor(GarageType),
GarageFinish = parse_factor(GarageFinish, levels = GarageFinish_lev, ordered = TRUE),
GarageQual = parse_factor(GarageQual, levels = GarageQual_lev, ordered = TRUE),
GarageCond = parse_factor(GarageCond, levels = GarageCond_lev, ordered = TRUE),
BsmtExposure = parse_factor(BsmtExposure, levels = BsmtExposure_lev, ordered = TRUE),
BsmtCond = parse_factor(BsmtCond, levels = BsmtCond_lev, ordered = TRUE),
BsmtQual = parse_factor(BsmtQual, levels = BsmtQual_lev, ordered = TRUE),
BsmtFinType1 = parse_factor(BsmtFinType1, levels = BsmtFinType1_lev, ordered = TRUE),
BsmtFinType2 = parse_factor(BsmtFinType2, levels = BsmtFinType2_lev, ordered = TRUE))
Now, I’m going to examine and impute the remaining variables which have the missing values. It was a very boring job. 😑
## # A tibble: 1 x 6
## GarageType GarageFinish GarageCars GarageArea GarageQual GarageCond
## <fct> <ord> <dbl> <dbl> <ord> <ord>
## 1 Detchd No NA NA No No
Since this observation doesn’t seem to have Garage, I’m going to impute 0 in GarageCars and GarageArea respectively.
house3 <- house2 %>%
mutate(GarageArea = ifelse(is.na(GarageArea), 0, GarageArea),
GarageCars = ifelse(is.na(GarageCars), 0, GarageCars))
ggplot(house3, aes(x = Neighborhood, y = LotFrontage)) +
stat_boxplot(geom = 'errorbar') +
geom_boxplot(fill = "skyblue", alpha = 0.8) +
coord_flip() +
labs(
x = "Neighborhood"
)
Let’s take the median per neighbor.
LotFront_med <- house3 %>%
group_by(Neighborhood) %>%
summarize(
LotFrontage_med = median(LotFrontage, na.rm = TRUE)
) %>%
arrange(desc(LotFrontage_med))
Lotfrontage_new <- house3 %>%
select(Neighborhood, LotFrontage) %>%
left_join(LotFront_med, by = "Neighborhood") %>%
mutate(LotFrontage_new =
ifelse(is.na(LotFrontage), LotFrontage_med, LotFrontage)) %>%
pull(LotFrontage_new)
house4 <- house3 %>%
mutate(LotFrontage = Lotfrontage_new)
house4 %>%
filter(is.na(MasVnrType), !is.na(MasVnrArea)) %>%
select(MasVnrType, MasVnrArea, everything())
## # A tibble: 1 x 79
## MasVnrType MasVnrArea MSSubClass MSZoning LotFrontage LotArea Street Alley
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <fct>
## 1 <NA> 198 20 RL 124 27697 Pave No
## # ... with 71 more variables: LotShape <chr>, LandContour <chr>,
## # Utilities <chr>, LotConfig <chr>, LandSlope <chr>, Neighborhood <chr>,
## # Condition1 <chr>, Condition2 <chr>, BldgType <chr>, HouseStyle <chr>,
## # OverallQual <dbl>, OverallCond <dbl>, YearBuilt <dbl>, YearRemodAdd <dbl>,
## # RoofStyle <chr>, RoofMatl <chr>, Exterior1st <chr>, Exterior2nd <chr>,
## # ExterQual <chr>, ExterCond <chr>, Foundation <chr>, BsmtQual <ord>,
## # BsmtCond <ord>, BsmtExposure <ord>, BsmtFinType1 <ord>, BsmtFinSF1 <dbl>,
## # BsmtFinType2 <ord>, BsmtFinSF2 <dbl>, BsmtUnfSF <dbl>, TotalBsmtSF <dbl>,
## # Heating <chr>, HeatingQC <chr>, CentralAir <chr>, Electrical <chr>,
## # `1stFlrSF` <dbl>, `2ndFlrSF` <dbl>, LowQualFinSF <dbl>, GrLivArea <dbl>,
## # BsmtFullBath <dbl>, BsmtHalfBath <dbl>, FullBath <dbl>, HalfBath <dbl>,
## # BedroomAbvGr <dbl>, KitchenAbvGr <dbl>, KitchenQual <chr>,
## # TotRmsAbvGrd <dbl>, Functional <chr>, Fireplaces <dbl>, FireplaceQu <ord>,
## # GarageType <fct>, GarageFinish <ord>, GarageCars <dbl>, GarageArea <dbl>,
## # GarageQual <ord>, GarageCond <ord>, PavedDrive <chr>, WoodDeckSF <dbl>,
## # OpenPorchSF <dbl>, EnclosedPorch <dbl>, `3SsnPorch` <dbl>,
## # ScreenPorch <dbl>, PoolArea <dbl>, PoolQC <ord>, Fence <ord>,
## # MiscFeature <fct>, MiscVal <dbl>, MoSold <dbl>, YrSold <dbl>,
## # SaleType <chr>, SaleCondition <chr>, SalePrice <dbl>
Lastly, I impute “None” to NA values of MasVnrtype because the values of MasVnrArea are also NA. and then I impute 0 to NA values of MasVnrArea. Since I’m going to repeatedly do the works that impute the mode, I created a function, which is named mode_vector().
mode_vector <- function(tb, .col){
.col <- enquo(.col) # prevent the evaluation of R code
# Defusing prevents the evaluation of R code, but you can still force evaluation inside a defused expression # with the forcing operators !! and !!!.
tb %>%
group_by(!!.col) %>%
count() %>%
arrange(desc(n))
}
mode_MasType <- mode_vector(house4, MasVnrType) %>%
filter(MasVnrType != "None") %>%
pull(MasVnrType)
MasVnrType_lev <- c("Stone", "None", "CBlock", "BrkFace", "BrkCmn")
house5 <- house4 %>%
select(MasVnrType, MasVnrArea, everything()) %>%
mutate(MasVnrType = case_when(
is.na(MasVnrType) & !is.na(MasVnrArea) ~ mode_MasType[[1]],
is.na(MasVnrType) ~ "None",
TRUE ~ MasVnrType),
MasVnrArea = ifelse(is.na(MasVnrArea), 0, MasVnrArea),
MasVnrType = parse_factor(MasVnrType, levels = MasVnrType_lev))
## # A tibble: 4 x 79
## MasVnrType MasVnrArea MSSubClass MSZoning LotFrontage LotArea Street Alley
## <fct> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <fct>
## 1 None 0 30 <NA> 109 21780 Grvl No
## 2 None 0 20 <NA> 80 14584 Pave No
## 3 None 0 70 <NA> 60 56600 Pave No
## 4 None 0 20 <NA> 125 31250 Pave No
## # ... with 71 more variables: LotShape <chr>, LandContour <chr>,
## # Utilities <chr>, LotConfig <chr>, LandSlope <chr>, Neighborhood <chr>,
## # Condition1 <chr>, Condition2 <chr>, BldgType <chr>, HouseStyle <chr>,
## # OverallQual <dbl>, OverallCond <dbl>, YearBuilt <dbl>, YearRemodAdd <dbl>,
## # RoofStyle <chr>, RoofMatl <chr>, Exterior1st <chr>, Exterior2nd <chr>,
## # ExterQual <chr>, ExterCond <chr>, Foundation <chr>, BsmtQual <ord>,
## # BsmtCond <ord>, BsmtExposure <ord>, BsmtFinType1 <ord>, BsmtFinSF1 <dbl>,
## # BsmtFinType2 <ord>, BsmtFinSF2 <dbl>, BsmtUnfSF <dbl>, TotalBsmtSF <dbl>,
## # Heating <chr>, HeatingQC <chr>, CentralAir <chr>, Electrical <chr>,
## # `1stFlrSF` <dbl>, `2ndFlrSF` <dbl>, LowQualFinSF <dbl>, GrLivArea <dbl>,
## # BsmtFullBath <dbl>, BsmtHalfBath <dbl>, FullBath <dbl>, HalfBath <dbl>,
## # BedroomAbvGr <dbl>, KitchenAbvGr <dbl>, KitchenQual <chr>,
## # TotRmsAbvGrd <dbl>, Functional <chr>, Fireplaces <dbl>, FireplaceQu <ord>,
## # GarageType <fct>, GarageFinish <ord>, GarageCars <dbl>, GarageArea <dbl>,
## # GarageQual <ord>, GarageCond <ord>, PavedDrive <chr>, WoodDeckSF <dbl>,
## # OpenPorchSF <dbl>, EnclosedPorch <dbl>, `3SsnPorch` <dbl>,
## # ScreenPorch <dbl>, PoolArea <dbl>, PoolQC <ord>, Fence <ord>,
## # MiscFeature <fct>, MiscVal <dbl>, MoSold <dbl>, YrSold <dbl>,
## # SaleType <chr>, SaleCondition <chr>, SalePrice <dbl>
There is a typo in MSZoning. C is written in C (all). I’m going to fix these values.
mode_MSZoning <- mode_vector(house5, MSZoning) %>%
pull(MSZoning)
mode_MSZoning # There is a typo in MSZoning. C is written in C (all). I'm going to fix these values.
## [1] "RL" "RM" "FV" "RH" "C (all)" NA
MSZoning_lev <- c("RM", "RP", "RL", "RH", "I", "FV", "C", "A") # To factorize at the end.
house6 <- house5 %>%
mutate(MSZoning = case_when(
is.na(MSZoning) ~ mode_MSZoning[[1]],
MSZoning == "C (all)" ~ "C",
TRUE ~ MSZoning),
MSZoning = parse_factor(MSZoning, levels = MSZoning_lev))
## # A tibble: 2 x 79
## Utilities MasVnrType MasVnrArea MSSubClass MSZoning LotFrontage LotArea Street
## <chr> <fct> <dbl> <dbl> <fct> <dbl> <dbl> <chr>
## 1 <NA> None 0 30 RL 109 21780 Grvl
## 2 <NA> None 0 20 RL 64 31220 Pave
## # ... with 71 more variables: Alley <fct>, LotShape <chr>, LandContour <chr>,
## # LotConfig <chr>, LandSlope <chr>, Neighborhood <chr>, Condition1 <chr>,
## # Condition2 <chr>, BldgType <chr>, HouseStyle <chr>, OverallQual <dbl>,
## # OverallCond <dbl>, YearBuilt <dbl>, YearRemodAdd <dbl>, RoofStyle <chr>,
## # RoofMatl <chr>, Exterior1st <chr>, Exterior2nd <chr>, ExterQual <chr>,
## # ExterCond <chr>, Foundation <chr>, BsmtQual <ord>, BsmtCond <ord>,
## # BsmtExposure <ord>, BsmtFinType1 <ord>, BsmtFinSF1 <dbl>,
## # BsmtFinType2 <ord>, BsmtFinSF2 <dbl>, BsmtUnfSF <dbl>, TotalBsmtSF <dbl>,
## # Heating <chr>, HeatingQC <chr>, CentralAir <chr>, Electrical <chr>,
## # `1stFlrSF` <dbl>, `2ndFlrSF` <dbl>, LowQualFinSF <dbl>, GrLivArea <dbl>,
## # BsmtFullBath <dbl>, BsmtHalfBath <dbl>, FullBath <dbl>, HalfBath <dbl>,
## # BedroomAbvGr <dbl>, KitchenAbvGr <dbl>, KitchenQual <chr>,
## # TotRmsAbvGrd <dbl>, Functional <chr>, Fireplaces <dbl>, FireplaceQu <ord>,
## # GarageType <fct>, GarageFinish <ord>, GarageCars <dbl>, GarageArea <dbl>,
## # GarageQual <ord>, GarageCond <ord>, PavedDrive <chr>, WoodDeckSF <dbl>,
## # OpenPorchSF <dbl>, EnclosedPorch <dbl>, `3SsnPorch` <dbl>,
## # ScreenPorch <dbl>, PoolArea <dbl>, PoolQC <ord>, Fence <ord>,
## # MiscFeature <fct>, MiscVal <dbl>, MoSold <dbl>, YrSold <dbl>,
## # SaleType <chr>, SaleCondition <chr>, SalePrice <dbl>
## # A tibble: 3 x 2
## # Groups: Utilities [3]
## Utilities n
## <chr> <int>
## 1 AllPub 2916
## 2 <NA> 2
## 3 NoSeWa 1
Only one data point is “NoSeWa”. Let’s find where it is. It’s in training set, which means all houses in the test set has “AllPub”
## [1] 945
This makes the variable “Utilities” useless for prediction. Consequently, I’m going to get rid of it.
We need to inspect the data points which has NA in BsmtFullBath and BsmtHalfBath and BsmtFinSF1
house7 %>%
filter(is.na(BsmtFullBath) | is.na(BsmtHalfBath) | is.na(BsmtFinSF1) ) %>%
select(BsmtFullBath, BsmtHalfBath, starts_with("BsmtFin"), BsmtUnfSF, ends_with("BsmtSF"))
## # A tibble: 2 x 8
## BsmtFullBath BsmtHalfBath BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2
## <dbl> <dbl> <ord> <dbl> <ord> <dbl>
## 1 NA NA NoBs NA NoBs NA
## 2 NA NA NoBs 0 NoBs 0
## # ... with 2 more variables: BsmtUnfSF <dbl>, TotalBsmtSF <dbl>
That is, “NoBs” in the result means we can impute 0 to NA values. we can impute zero to the data points.
house8 <- house7 %>%
mutate(BsmtFullBath = ifelse(is.na(BsmtFullBath), 0, BsmtFullBath),
BsmtHalfBath = ifelse(is.na(BsmtHalfBath), 0, BsmtHalfBath),
BsmtFinSF1 = ifelse(is.na(BsmtFinSF1), 0, BsmtFinSF1),
BsmtFinSF2 = ifelse(is.na(BsmtFinSF2), 0, BsmtFinSF2),
BsmtUnfSF = ifelse(is.na(BsmtUnfSF), 0, BsmtUnfSF),
TotalBsmtSF = ifelse(is.na(TotalBsmtSF), 0, TotalBsmtSF),
BsmtFullBath = factor(BsmtFullBath, ordered = TRUE),
BsmtHalfBath = factor(BsmtHalfBath, ordered = TRUE))
## # A tibble: 2 x 78
## Functional MasVnrType MasVnrArea MSSubClass MSZoning LotFrontage LotArea
## <chr> <fct> <dbl> <dbl> <fct> <dbl> <dbl>
## 1 <NA> None 0 20 RL 80 14584
## 2 <NA> None 0 50 RM 60 10320
## # ... with 71 more variables: Street <chr>, Alley <fct>, LotShape <chr>,
## # LandContour <chr>, LotConfig <chr>, LandSlope <chr>, Neighborhood <chr>,
## # Condition1 <chr>, Condition2 <chr>, BldgType <chr>, HouseStyle <chr>,
## # OverallQual <dbl>, OverallCond <dbl>, YearBuilt <dbl>, YearRemodAdd <dbl>,
## # RoofStyle <chr>, RoofMatl <chr>, Exterior1st <chr>, Exterior2nd <chr>,
## # ExterQual <chr>, ExterCond <chr>, Foundation <chr>, BsmtQual <ord>,
## # BsmtCond <ord>, BsmtExposure <ord>, BsmtFinType1 <ord>, BsmtFinSF1 <dbl>,
## # BsmtFinType2 <ord>, BsmtFinSF2 <dbl>, BsmtUnfSF <dbl>, TotalBsmtSF <dbl>,
## # Heating <chr>, HeatingQC <chr>, CentralAir <chr>, Electrical <chr>,
## # `1stFlrSF` <dbl>, `2ndFlrSF` <dbl>, LowQualFinSF <dbl>, GrLivArea <dbl>,
## # BsmtFullBath <ord>, BsmtHalfBath <ord>, FullBath <dbl>, HalfBath <dbl>,
## # BedroomAbvGr <dbl>, KitchenAbvGr <dbl>, KitchenQual <chr>,
## # TotRmsAbvGrd <dbl>, Fireplaces <dbl>, FireplaceQu <ord>, GarageType <fct>,
## # GarageFinish <ord>, GarageCars <dbl>, GarageArea <dbl>, GarageQual <ord>,
## # GarageCond <ord>, PavedDrive <chr>, WoodDeckSF <dbl>, OpenPorchSF <dbl>,
## # EnclosedPorch <dbl>, `3SsnPorch` <dbl>, ScreenPorch <dbl>, PoolArea <dbl>,
## # PoolQC <ord>, Fence <ord>, MiscFeature <fct>, MiscVal <dbl>, MoSold <dbl>,
## # YrSold <dbl>, SaleType <chr>, SaleCondition <chr>, SalePrice <dbl>
## # A tibble: 8 x 2
## # Groups: Functional [8]
## Functional n
## <chr> <int>
## 1 Typ 2717
## 2 Min2 70
## 3 Min1 65
## 4 Mod 35
## 5 Maj1 19
## 6 Maj2 9
## 7 Sev 2
## 8 <NA> 2
impute mode
Functional_lev <- c("Sal", "Sev", "Maj2", "Maj1", "Mod", "Min2", "Min1", "Typ")
house9 <- house8 %>%
mutate(Functional = ifelse(is.na(Functional), mode_Func[[1]], Functional),
Functional = parse_factor(Functional, levels = Functional_lev, ordered = TRUE))
house9 %>%
filter(is.na(Exterior1st) | is.na(Exterior2nd)) %>%
select(Exterior1st, Exterior2nd, everything())
## # A tibble: 1 x 78
## Exterior1st Exterior2nd MasVnrType MasVnrArea MSSubClass MSZoning LotFrontage
## <chr> <chr> <fct> <dbl> <dbl> <fct> <dbl>
## 1 <NA> <NA> None 0 30 RL 85
## # ... with 71 more variables: LotArea <dbl>, Street <chr>, Alley <fct>,
## # LotShape <chr>, LandContour <chr>, LotConfig <chr>, LandSlope <chr>,
## # Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>, BldgType <chr>,
## # HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>, YearBuilt <dbl>,
## # YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>, ExterQual <chr>,
## # ExterCond <chr>, Foundation <chr>, BsmtQual <ord>, BsmtCond <ord>,
## # BsmtExposure <ord>, BsmtFinType1 <ord>, BsmtFinSF1 <dbl>,
## # BsmtFinType2 <ord>, BsmtFinSF2 <dbl>, BsmtUnfSF <dbl>, TotalBsmtSF <dbl>,
## # Heating <chr>, HeatingQC <chr>, CentralAir <chr>, Electrical <chr>,
## # `1stFlrSF` <dbl>, `2ndFlrSF` <dbl>, LowQualFinSF <dbl>, GrLivArea <dbl>,
## # BsmtFullBath <ord>, BsmtHalfBath <ord>, FullBath <dbl>, HalfBath <dbl>,
## # BedroomAbvGr <dbl>, KitchenAbvGr <dbl>, KitchenQual <chr>,
## # TotRmsAbvGrd <dbl>, Functional <ord>, Fireplaces <dbl>, FireplaceQu <ord>,
## # GarageType <fct>, GarageFinish <ord>, GarageCars <dbl>, GarageArea <dbl>,
## # GarageQual <ord>, GarageCond <ord>, PavedDrive <chr>, WoodDeckSF <dbl>,
## # OpenPorchSF <dbl>, EnclosedPorch <dbl>, `3SsnPorch` <dbl>,
## # ScreenPorch <dbl>, PoolArea <dbl>, PoolQC <ord>, Fence <ord>,
## # MiscFeature <fct>, MiscVal <dbl>, MoSold <dbl>, YrSold <dbl>,
## # SaleType <chr>, SaleCondition <chr>, SalePrice <dbl>
## # A tibble: 16 x 2
## # Groups: Exterior1st [16]
## Exterior1st n
## <chr> <int>
## 1 VinylSd 1025
## 2 MetalSd 450
## 3 HdBoard 442
## 4 Wd Sdng 411
## 5 Plywood 221
## 6 CemntBd 126
## 7 BrkFace 87
## 8 WdShing 56
## 9 AsbShng 44
## 10 Stucco 43
## 11 BrkComm 6
## 12 AsphShn 2
## 13 CBlock 2
## 14 Stone 2
## 15 ImStucc 1
## 16 <NA> 1
## # A tibble: 17 x 2
## # Groups: Exterior2nd [17]
## Exterior2nd n
## <chr> <int>
## 1 VinylSd 1014
## 2 MetalSd 447
## 3 HdBoard 406
## 4 Wd Sdng 391
## 5 Plywood 270
## 6 CmentBd 126
## 7 Wd Shng 81
## 8 BrkFace 47
## 9 Stucco 47
## 10 AsbShng 38
## 11 Brk Cmn 22
## 12 ImStucc 15
## 13 Stone 6
## 14 AsphShn 4
## 15 CBlock 3
## 16 Other 1
## 17 <NA> 1
mode_1st <- mode_vector(house9, Exterior1st) %>%
pull(Exterior1st)
mode_2nd <- mode_vector(house9, Exterior2nd) %>%
pull(Exterior2nd)
Impute mode and, as you can see above, there are typos in Exterior2nd. CemntBd is written in CmentBD and WdShing is written in WD Shng. BrkComm is written in Brk Cmn. I’m going to modify these values.
Exterior_lev <- c("WdShing", "Wd Sdng", "VinylSd", "Stucco", "Stone", "PreCast", "Plywood", "Other", "MetalSd",
"ImStucc", "HdBoard", "CemntBd", "CBlock", "BrkFace", "BrkComm", "AsphShn", "AsbShng")
house10 <- house9 %>%
mutate(Exterior1st = ifelse(is.na(Exterior1st), mode_1st[[1]], Exterior1st),
Exterior2nd = case_when(
is.na(Exterior2nd) ~ mode_2nd[[1]],
Exterior2nd == "CmentBd" ~ "CemntBd",
Exterior2nd == "Wd Shng" ~ "WdShing",
Exterior2nd == "Brk Cmn" ~ "BrkComm",
TRUE ~ Exterior2nd),
Exterior1st = parse_factor(Exterior1st, levels = Exterior_lev),
Exterior2nd = parse_factor(Exterior2nd, levels = Exterior_lev))
## # A tibble: 1 x 78
## Electrical MasVnrType MasVnrArea MSSubClass MSZoning LotFrontage LotArea
## <chr> <fct> <dbl> <dbl> <fct> <dbl> <dbl>
## 1 <NA> None 0 80 RL 73 9735
## # ... with 71 more variables: Street <chr>, Alley <fct>, LotShape <chr>,
## # LandContour <chr>, LotConfig <chr>, LandSlope <chr>, Neighborhood <chr>,
## # Condition1 <chr>, Condition2 <chr>, BldgType <chr>, HouseStyle <chr>,
## # OverallQual <dbl>, OverallCond <dbl>, YearBuilt <dbl>, YearRemodAdd <dbl>,
## # RoofStyle <chr>, RoofMatl <chr>, Exterior1st <fct>, Exterior2nd <fct>,
## # ExterQual <chr>, ExterCond <chr>, Foundation <chr>, BsmtQual <ord>,
## # BsmtCond <ord>, BsmtExposure <ord>, BsmtFinType1 <ord>, BsmtFinSF1 <dbl>,
## # BsmtFinType2 <ord>, BsmtFinSF2 <dbl>, BsmtUnfSF <dbl>, TotalBsmtSF <dbl>,
## # Heating <chr>, HeatingQC <chr>, CentralAir <chr>, `1stFlrSF` <dbl>,
## # `2ndFlrSF` <dbl>, LowQualFinSF <dbl>, GrLivArea <dbl>, BsmtFullBath <ord>,
## # BsmtHalfBath <ord>, FullBath <dbl>, HalfBath <dbl>, BedroomAbvGr <dbl>,
## # KitchenAbvGr <dbl>, KitchenQual <chr>, TotRmsAbvGrd <dbl>,
## # Functional <ord>, Fireplaces <dbl>, FireplaceQu <ord>, GarageType <fct>,
## # GarageFinish <ord>, GarageCars <dbl>, GarageArea <dbl>, GarageQual <ord>,
## # GarageCond <ord>, PavedDrive <chr>, WoodDeckSF <dbl>, OpenPorchSF <dbl>,
## # EnclosedPorch <dbl>, `3SsnPorch` <dbl>, ScreenPorch <dbl>, PoolArea <dbl>,
## # PoolQC <ord>, Fence <ord>, MiscFeature <fct>, MiscVal <dbl>, MoSold <dbl>,
## # YrSold <dbl>, SaleType <chr>, SaleCondition <chr>, SalePrice <dbl>
## # A tibble: 6 x 2
## # Groups: Electrical [6]
## Electrical n
## <chr> <int>
## 1 SBrkr 2671
## 2 FuseA 188
## 3 FuseF 50
## 4 FuseP 8
## 5 Mix 1
## 6 <NA> 1
Factorizing. it seems to be ordinal.
Electrical_lev <- c("Mix", str_c("Fuse", c("P", "F", "A")), "SBrkr")
house11 <- house10 %>%
mutate(Electrical = ifelse(is.na(Electrical), mode_Elec[[1]], Electrical),
Electrical = parse_factor(Electrical, levels = Electrical_lev, ordered = TRUE))
## # A tibble: 1 x 78
## KitchenAbvGr KitchenQual MasVnrType MasVnrArea MSSubClass MSZoning LotFrontage
## <dbl> <chr> <fct> <dbl> <dbl> <fct> <dbl>
## 1 1 <NA> None 0 50 RL 72
## # ... with 71 more variables: LotArea <dbl>, Street <chr>, Alley <fct>,
## # LotShape <chr>, LandContour <chr>, LotConfig <chr>, LandSlope <chr>,
## # Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>, BldgType <chr>,
## # HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>, YearBuilt <dbl>,
## # YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>, Exterior1st <fct>,
## # Exterior2nd <fct>, ExterQual <chr>, ExterCond <chr>, Foundation <chr>,
## # BsmtQual <ord>, BsmtCond <ord>, BsmtExposure <ord>, BsmtFinType1 <ord>,
## # BsmtFinSF1 <dbl>, BsmtFinType2 <ord>, BsmtFinSF2 <dbl>, BsmtUnfSF <dbl>,
## # TotalBsmtSF <dbl>, Heating <chr>, HeatingQC <chr>, CentralAir <chr>,
## # Electrical <ord>, `1stFlrSF` <dbl>, `2ndFlrSF` <dbl>, LowQualFinSF <dbl>,
## # GrLivArea <dbl>, BsmtFullBath <ord>, BsmtHalfBath <ord>, FullBath <dbl>,
## # HalfBath <dbl>, BedroomAbvGr <dbl>, TotRmsAbvGrd <dbl>, Functional <ord>,
## # Fireplaces <dbl>, FireplaceQu <ord>, GarageType <fct>, GarageFinish <ord>,
## # GarageCars <dbl>, GarageArea <dbl>, GarageQual <ord>, GarageCond <ord>,
## # PavedDrive <chr>, WoodDeckSF <dbl>, OpenPorchSF <dbl>, EnclosedPorch <dbl>,
## # `3SsnPorch` <dbl>, ScreenPorch <dbl>, PoolArea <dbl>, PoolQC <ord>,
## # Fence <ord>, MiscFeature <fct>, MiscVal <dbl>, MoSold <dbl>, YrSold <dbl>,
## # SaleType <chr>, SaleCondition <chr>, SalePrice <dbl>
## # A tibble: 4 x 2
## # Groups: KitchenAbvGr [4]
## KitchenAbvGr n
## <dbl> <int>
## 1 1 2785
## 2 2 129
## 3 0 3
## 4 3 2
KitchenQual_lev <- c("Po", "Fa", "TA", "Gd", "Ex")
house12 <- house11 %>%
mutate(KitchenQual = ifelse(is.na(KitchenQual), mode_KitchenQual[[1]], KitchenQual),
KitchenQual = parse_factor(KitchenQual, levels = KitchenQual_lev, ordered = TRUE))
## # A tibble: 1 x 78
## SaleType MasVnrType MasVnrArea MSSubClass MSZoning LotFrontage LotArea Street
## <chr> <fct> <dbl> <dbl> <fct> <dbl> <dbl> <chr>
## 1 <NA> BrkFace 340 20 RL 85 13770 Pave
## # ... with 70 more variables: Alley <fct>, LotShape <chr>, LandContour <chr>,
## # LotConfig <chr>, LandSlope <chr>, Neighborhood <chr>, Condition1 <chr>,
## # Condition2 <chr>, BldgType <chr>, HouseStyle <chr>, OverallQual <dbl>,
## # OverallCond <dbl>, YearBuilt <dbl>, YearRemodAdd <dbl>, RoofStyle <chr>,
## # RoofMatl <chr>, Exterior1st <fct>, Exterior2nd <fct>, ExterQual <chr>,
## # ExterCond <chr>, Foundation <chr>, BsmtQual <ord>, BsmtCond <ord>,
## # BsmtExposure <ord>, BsmtFinType1 <ord>, BsmtFinSF1 <dbl>,
## # BsmtFinType2 <ord>, BsmtFinSF2 <dbl>, BsmtUnfSF <dbl>, TotalBsmtSF <dbl>,
## # Heating <chr>, HeatingQC <chr>, CentralAir <chr>, Electrical <ord>,
## # `1stFlrSF` <dbl>, `2ndFlrSF` <dbl>, LowQualFinSF <dbl>, GrLivArea <dbl>,
## # BsmtFullBath <ord>, BsmtHalfBath <ord>, FullBath <dbl>, HalfBath <dbl>,
## # BedroomAbvGr <dbl>, KitchenAbvGr <dbl>, KitchenQual <ord>,
## # TotRmsAbvGrd <dbl>, Functional <ord>, Fireplaces <dbl>, FireplaceQu <ord>,
## # GarageType <fct>, GarageFinish <ord>, GarageCars <dbl>, GarageArea <dbl>,
## # GarageQual <ord>, GarageCond <ord>, PavedDrive <chr>, WoodDeckSF <dbl>,
## # OpenPorchSF <dbl>, EnclosedPorch <dbl>, `3SsnPorch` <dbl>,
## # ScreenPorch <dbl>, PoolArea <dbl>, PoolQC <ord>, Fence <ord>,
## # MiscFeature <fct>, MiscVal <dbl>, MoSold <dbl>, YrSold <dbl>,
## # SaleCondition <chr>, SalePrice <dbl>
## # A tibble: 10 x 2
## # Groups: SaleType [10]
## SaleType n
## <chr> <int>
## 1 WD 2525
## 2 New 239
## 3 COD 87
## 4 ConLD 26
## 5 CWD 12
## 6 ConLI 9
## 7 ConLw 8
## 8 Oth 7
## 9 Con 5
## 10 <NA> 1
SaleType_lev <- c("Oth", "ConLD", "ConLI", "ConLw", "Con", "COD", "New", "VWD", "CWD", "WD")
house13 <- house12 %>%
mutate(SaleType = ifelse(is.na(SaleType), mode_SaleType[[1]], SaleType),
SaleType = parse_factor(SaleType, levels = SaleType_lev))
one more check the proportions of missing values per columns!
na_prop <- house13 %>%
select(-SalePrice) %>%
map(is.na) %>%
map_dfr(mean) %>%
pivot_longer(cols = everything(), names_to = "variables", values_to = "prop") %>%
filter(prop > 0) %>%
arrange(desc(prop))
na_prop
## # A tibble: 0 x 2
## # ... with 2 variables: variables <chr>, prop <dbl>
check the remaining character variables
map(house13, is.character) %>%
as_tibble %>%
pivot_longer(cols = everything(), names_to = "variables", values_to = "character") %>%
filter(character == 1) %>%
pull(variables)
## [1] "Street" "LotShape" "LandContour" "LotConfig"
## [5] "LandSlope" "Neighborhood" "Condition1" "Condition2"
## [9] "BldgType" "HouseStyle" "RoofStyle" "RoofMatl"
## [13] "ExterQual" "ExterCond" "Foundation" "Heating"
## [17] "HeatingQC" "CentralAir" "PavedDrive" "SaleCondition"
specify labels of factors which is seems to be ordinal. and then, I’m going to use across() in dplyr 1.0.0, in order to factorize the remaining character variables after parsing ordinal factors. It’s really useful! 😄
LotShape_lev <- c(str_c("IR", 3:1), "Reg")
LandSlope_lev <- c("Sev", "Mod", "Gtl")
Exter_lev <- c("Po", "Fa", "TA", "Gd", "Ex")
HeatingQC_lev <- c("Po", "Fa", "TA", "Gd", "Ex")
house14 <- house13 %>%
mutate(LotShape = parse_factor(LotShape, levels = LotShape_lev, ordered = TRUE),
LandSlope = parse_factor(LandSlope, levels = LandSlope_lev, ordered = TRUE),
ExterQual = parse_factor(ExterQual, levels = Exter_lev, ordered = TRUE),
ExterCond = parse_factor(ExterCond, levels = Exter_lev, ordered = TRUE),
HeatingQC = parse_factor(HeatingQC, levels = HeatingQC_lev, ordered = TRUE)) %>%
mutate(across(is.character, parse_factor))
we must follow R’s variable naming conventions, when modeling with the {mlr} pacakge. you know, some of the original column names cause problems with R’s variable naming conventions, so the following process is required:
idx_str <- str_detect(colnames(house14), "^\\d") # It starts with number.
colnames(house14)[idx_str] <- str_c("a", colnames(house14)[idx_str])
head(house14)
## # A tibble: 6 x 78
## MasVnrType MasVnrArea MSSubClass MSZoning LotFrontage LotArea Street Alley
## <fct> <dbl> <dbl> <fct> <dbl> <dbl> <fct> <fct>
## 1 BrkFace 196 60 RL 65 8450 Pave No
## 2 None 0 20 RL 80 9600 Pave No
## 3 BrkFace 162 60 RL 68 11250 Pave No
## 4 None 0 70 RL 60 9550 Pave No
## 5 BrkFace 350 60 RL 84 14260 Pave No
## 6 None 0 50 RL 85 14115 Pave No
## # ... with 70 more variables: LotShape <ord>, LandContour <fct>,
## # LotConfig <fct>, LandSlope <ord>, Neighborhood <fct>, Condition1 <fct>,
## # Condition2 <fct>, BldgType <fct>, HouseStyle <fct>, OverallQual <dbl>,
## # OverallCond <dbl>, YearBuilt <dbl>, YearRemodAdd <dbl>, RoofStyle <fct>,
## # RoofMatl <fct>, Exterior1st <fct>, Exterior2nd <fct>, ExterQual <ord>,
## # ExterCond <ord>, Foundation <fct>, BsmtQual <ord>, BsmtCond <ord>,
## # BsmtExposure <ord>, BsmtFinType1 <ord>, BsmtFinSF1 <dbl>,
## # BsmtFinType2 <ord>, BsmtFinSF2 <dbl>, BsmtUnfSF <dbl>, TotalBsmtSF <dbl>,
## # Heating <fct>, HeatingQC <ord>, CentralAir <fct>, Electrical <ord>,
## # a1stFlrSF <dbl>, a2ndFlrSF <dbl>, LowQualFinSF <dbl>, GrLivArea <dbl>,
## # BsmtFullBath <ord>, BsmtHalfBath <ord>, FullBath <dbl>, HalfBath <dbl>,
## # BedroomAbvGr <dbl>, KitchenAbvGr <dbl>, KitchenQual <ord>,
## # TotRmsAbvGrd <dbl>, Functional <ord>, Fireplaces <dbl>, FireplaceQu <ord>,
## # GarageType <fct>, GarageFinish <ord>, GarageCars <dbl>, GarageArea <dbl>,
## # GarageQual <ord>, GarageCond <ord>, PavedDrive <fct>, WoodDeckSF <dbl>,
## # OpenPorchSF <dbl>, EnclosedPorch <dbl>, a3SsnPorch <dbl>,
## # ScreenPorch <dbl>, PoolArea <dbl>, PoolQC <ord>, Fence <ord>,
## # MiscFeature <fct>, MiscVal <dbl>, MoSold <dbl>, YrSold <dbl>,
## # SaleType <fct>, SaleCondition <fct>, SalePrice <dbl>
For the columns with names beginning with a number, I added “a” to the beginning.
Let go modeling with {mlr} and {ranger}!🏃 First, we need to make a supervised task object.
train <- house14 %>%
filter(!is.na(SalePrice))
test <- house14 %>%
filter(is.na(SalePrice)) %>%
select(-SalePrice)
train_Task <- makeRegrTask(data = train, target = "SalePrice")
I normlized the variables before modelling.
train_Task <- normalizeFeatures(train_Task, method = "standardize")
test_Task <- normalizeFeatures(test, method = "standardize")
make ranger learner object.
# getParamSet("regr.ranger"): If you want to know a parameter set in ranger.
.ranger <- makeLearner("regr.ranger", predict.type = "response")
And then, set tunable parameters and a grid in order to get the optimal hyperparameters.
ranger_params <- makeParamSet(
makeIntegerParam("num.trees", lower = 450, upper = 550),
makeIntegerParam("mtry", lower = 3, upper = 15),
makeIntegerParam("min.node.size", lower = 5, upper = 20)
)
Most of the modeling steps would be similar to followed above, but this time I’ve done random search instead of grid search for parameter tuning. random search is faster than grid search, but what you need to know is sometimes it turns out to be less efficient. because, In grid search, the algorithm tunes over every possible combination of parameters provided. In a random search, we specify the number of iterations and it randomly passes over the parameter combinations. In this process, it might miss out some important combination of parameters which could have returned maximum accuracy. That’s all you need to know. let’s do random search for 100 iterations. This time, I will load and use the .RData that I have tuned in advance. If you run the comment part in the code below, you will get a .RData file. 😄
rancontrol <- makeTuneControlRandom(maxit = 100L)
# set 3 fold cross validation
set_cv <- makeResampleDesc("CV", iters = 5L)
# hypertuning
set.seed(1)
# ranger_tune <- tuneParams(learner = .ranger, resampling = set_cv, task = train_Task,
# par.set = ranger_params, control = rancontrol, measures = rmsle)
# list.save(ranger_tune, "./data/ranger_tune.RData")
ranger_tune <- list.load("./data/ranger_tune.RData")
In the training step, I chose RMSLE to evaluate the model, which is the same as the Kaggle competition criterion. The following is optimal hyperparameters based on RMSLE:
## $num.trees
## [1] 492
##
## $mtry
## [1] 14
##
## $min.node.size
## [1] 7
It’s RMSLE based on the best model.
## rmsle.test.mean
## 0.1383685
Finally, we can use the optimal hyperparameters for tranining model.
ranger_tree <- setHyperPars(.ranger, par.vals = ranger_tune$x)
set.seed(1)
mod_ranger <- mlr::train(ranger_tree, train_Task)
getLearnerModel(mod_ranger)
## Ranger result
##
## Call:
## ranger::ranger(formula = NULL, dependent.variable.name = tn, data = getTaskData(.task, .subset), case.weights = .weights, keep.inbag = keep.inbag, ...)
##
## Type: Regression
## Number of trees: 492
## Sample size: 1460
## Number of independent variables: 77
## Mtry: 14
## Target node size: 7
## Variable importance mode: impurity
## Splitrule: variance
## OOB prediction error (MSE): 747145968
## R squared (OOB): 0.8816142
Of couse, you can get the variable importances from {ranger}. I visualized the variable importances of the Top 10 based on “MeanDecreaseGini”. This is also the reason for adding “impurity” to optimal hyperparameters(the list object, ranger_tune) in advance.
imp <- getFeatureImportance(mod_ranger)
imp$res %>%
gather(key = "variable", value = "importance") %>%
arrange(desc(importance)) %>%
slice(1:10) %>%
ggplot() +
geom_bar(aes(x = fct_reorder(variable, importance), y = importance, fill = importance),
stat = "identity") +
scale_y_continuous(breaks = NULL) +
coord_flip() +
labs(x = "Variables", y = "Importances") +
theme(legend.position = "none")
make predictions for test set.💻
## Warning in predict.WrappedModel(mod_ranger, newdata = test_Task): Provided data
## for prediction is not a pure data.frame but from class tbl_df, hence it will be
## converted.
make a submission file.
ranger_submit <- data.frame(ID = test_ID, SalePrice = pred_ranger$data$response)
# write_excel_csv(ranger_submit, "./data/pred.csv") # RMSLE = 0.14430
you can get the result, if you upload the submission file on the Kaggle Website - House Prices - Submit Predictions. Anyway RMSLE on test set was 0.14430. Although the measure may seem complicated, you can make a meaningful interpretation! 💡
First, put \(log()\) aside and, that is, Let’s see \(\frac{p_i+1}{a_i+1}\). we can ignore the +1 of the numerator and denomitator, this just prevents the case of a zero value into the log. then, \(p_i/a_i\) is a ratio of predicted(\(p_i\)) and actual values(\(a_i\)). Good :-). Now log() is the problem, but it’s simple. Just take the exponential function, which is the inverse function of the log:
## [1] 1.155231
That is, \(exp(RMLSE)\) is the average ratio of predicted and actual values. For example, If \(RMSLE\) is 0, which probably won’t happen🌞, it means your model is predicting all the actual values correctly.
The above result means that the predicted value is overestimating the actual value by 15%! 💡