1 Introduction


Hi all! 👋 This is my first project organized by R markdown. The goal of this project is to predict the price of each house(\(Y\)). I used {ranger} package for forecasting house prices, which is a fast implementation of random forests. I did some works before modeling, including imputations, label encoding, and factorizing variables. Also, I’ll try to improve forecast performances with other models later. Now, let’s go dive into details! 🔥


2 Preparations


2.1 Loading libraries

library(tidyverse)
library(vroom)
library(mlr)
library(randomForest)
library(scales)
library(rlist) # list.save(), list.load()
loaded_package <- c("tidyverse", "vroom", "mlr", "randomForest", "scales", "rlist")
.version <- map(loaded_package, packageVersion)
names(.version) <- loaded_package
.version
## $tidyverse
## [1] '1.3.0'
## 
## $vroom
## [1] '1.3.2'
## 
## $mlr
## [1] '2.18.0'
## 
## $randomForest
## [1] '4.6.14'
## 
## $scales
## [1] '1.1.1'
## 
## $rlist
## [1] '0.4.6.1'

2.2 Loading data

system.time(
  train <- vroom("./data/train.csv"))
## Rows: 1,460
## Columns: 81
## Delimiter: ","
## chr [43]: MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConfig, LandSlope,...
## dbl [38]: Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, YearBuilt, YearR...
## 
## Use `spec()` to retrieve the guessed column specification
## Pass a specification to the `col_types` argument to quiet this message
##    user  system elapsed 
##    0.10    0.01    0.14
system.time(
  test <- vroom("./data/test.csv"))
## Rows: 1,459
## Columns: 80
## Delimiter: ","
## chr [42]: MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConfig, LandSlope,...
## dbl [37]: Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, YearBuilt, YearR...
## lgl [ 1]: PoolQC
## 
## Use `spec()` to retrieve the guessed column specification
## Pass a specification to the `col_types` argument to quiet this message
##    user  system elapsed 
##    0.03    0.00    0.06
## There are 1460 observations and 81 variables in training dataset.
## There are 1459 observations and 80 variables in test dataset.

And then, let’s prepare a submission file.

test_ID <- test$Id
test$Id <- NULL
test$SalePrice <- NA
train$Id <- NULL


3 Data wrangling


I’m going to do some works before modeling. First of all, I checked the distribution of a target variable \(Y\)(or dependent variable, “SalePrice”) and missing values. Second, I did re-encode labels for categorical variables, which have typos. then, I did imputations in various ways for the rest of the variables with missing values. Fourth, I did factorize the remaining character variables. Finally, I modified some variables which violate variable naming conventions in R.

Let’s check details!😄 before diving, bind training and test set to check entire dataset.

house <- bind_rows(train, test)

3.1 The distribution of \(Y\)

ggplot(data = house %>% filter(!is.na(SalePrice))) +
  geom_histogram(aes(x = SalePrice), fill = "blue", alpha = 1/2, binwidth = 10000) +
  scale_x_continuous(labels = dollar_format()) +
  labs(
    title = "SalePrice(Y) is skewed to the right"
  ) +
  theme(
    plot.title = element_text(hjust = 0.5, size = 15, face = "bold"), 
  )

3.2 Missing values

I’m going to check the proportions of missing values per columns. If you want to know the number of missing values per columns, use sum() function in map_dfr().

na_prop <- house %>% 
  select(-SalePrice) %>% 
  map(is.na) %>% 
  map_dfr(mean) %>%
  pivot_longer(cols = everything(), names_to = "variables", values_to = "prop") %>% 
  filter(prop > 0) %>% 
  arrange(desc(prop))

na_prop %>%
  ggplot(aes(x = fct_reorder(variables, prop), y = prop, fill = variables)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme(legend.position = "none") +
  labs(
    x = "Explanatory variables",
    y = "The proportions of NA values per column"
  ) +
  scale_y_continuous(breaks = seq(0, 1, by = 0.1)) +
  theme(axis.text.y = element_text(size = 10))

There are many variables with missing values.😩 But, we can do it! let’s go.

3.3 Labels encoding and imputation

First, I’m going to do re-encode labels for categorical variables with typos. The details of the each variables are below. Refer this to get the details of other variables.

  • Pool QC : Pool Quality. It’s ordinal.
  • MiscFeature : Miscellaneous feature not covered in other categories
  • Alley : Type of alley access to property
  • Fence : Fence quality. It’s ordinal.
  • FireplaceQu : Fireplace quality. It’s ordinal.
  • GarageType
  • GarageFinish : It’s seem to be ordinal.
  • GarageQual : It’s seem to be ordinal.
  • GarageCond : Garage condition. ordinal
  • GarageArea : Size of garage in square feet
  • BsmtExposure : Refers to walkout or garden level walls. ordinal
  • BsmtCond : Evaluates the general condition of the basement. ordinal
  • BsmtQual: Evaluates the height of the basement. ordinal
  • BsmtFinType1: Rating of basement finished area. ordinal
  • BsmtFinType2: Rating of basement finished area (if multiple types)

I specify the levels when factorizing ordered factors.

house$PoolQC[is.na(house$PoolQC)] <- "None"
house$MiscFeature[is.na(house$MiscFeature)] <- "None"
house$Alley[is.na(house$Alley)] <- "No"
house$Fence[is.na(house$Fence)] <- "No"
house$FireplaceQu[is.na(house$FireplaceQu)] <- "No"
house$GarageType[is.na(house$GarageType)] <- "No"
house$GarageFinish[is.na(house$GarageFinish)] <- "No"
house$GarageQual[is.na(house$GarageQual)] <- "No"
house$GarageCond[is.na(house$GarageCond)] <- "No"
house$BsmtExposure[is.na(house$BsmtExposure)] <- "NoBs"
house$BsmtCond[is.na(house$BsmtCond)] <- "NoBs"
house$BsmtQual[is.na(house$BsmtQual)] <- "NoBs"
house$BsmtFinType1[is.na(house$BsmtFinType1)] <- "NoBs"
house$BsmtFinType2[is.na(house$BsmtFinType2)] <- "NoBs"

# To specify the levels of ordered factors
PoolQC_lev <- c("None", "Fa", "TA", "Gd", "Ex")
Fence_lev <- c("No", "MnWw", "GdWo", "MnPrv", "GdPrv")
FireplaceQu_lev <- c("No", "Po", "Fa", "TA", "Gd", "Ex")
GarageFinish_lev <- c("No", "Unf", "RFn", "Fin")
GarageQual_lev <- c("No", "Po", "Fa", "TA", "Gd", "Ex")
GarageCond_lev <- c("No", "Po", "Fa", "TA", "Gd", "Ex")
BsmtExposure_lev <- c("NoBs", "No", "Mn", "Av", "Gd")
BsmtCond_lev <- c("NoBs", "Po", "Fa", "TA", "Gd", "Ex")
BsmtQual_lev <- c("NoBs", "Po", "Fa", "TA", "Gd", "Ex")
BsmtFinType1_lev <- c("NoBs", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ")
BsmtFinType2_lev <- c("NoBs", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ")

house2 <- house %>%
  mutate(PoolQC = parse_factor(PoolQC, levels = PoolQC_lev, ordered = TRUE),
         MiscFeature = parse_factor(MiscFeature),
         Alley = parse_factor(Alley),
         Fence = parse_factor(Fence, levels = Fence_lev, ordered = TRUE),
         FireplaceQu = parse_factor(FireplaceQu, levels = FireplaceQu_lev, ordered = TRUE),
         GarageType = parse_factor(GarageType),
         GarageFinish = parse_factor(GarageFinish, levels = GarageFinish_lev, ordered = TRUE),
         GarageQual = parse_factor(GarageQual, levels = GarageQual_lev, ordered = TRUE),
         GarageCond = parse_factor(GarageCond, levels = GarageCond_lev, ordered = TRUE),
         BsmtExposure = parse_factor(BsmtExposure, levels = BsmtExposure_lev, ordered = TRUE),
         BsmtCond = parse_factor(BsmtCond, levels = BsmtCond_lev, ordered = TRUE),
         BsmtQual = parse_factor(BsmtQual, levels = BsmtQual_lev, ordered = TRUE),
         BsmtFinType1 = parse_factor(BsmtFinType1, levels = BsmtFinType1_lev, ordered = TRUE),
         BsmtFinType2 = parse_factor(BsmtFinType2, levels = BsmtFinType2_lev, ordered = TRUE))

Now, I’m going to examine and impute the remaining variables which have the missing values. It was a very boring job. 😑


3.3.1 Garage series

  • GarageYrBlt : If there is a garage,
  • GarageArea : Size of garage in square feet maybe GarageYrBlt is similar to YearBuilt or YearRemodAdd. so I’m going to remove this column.
house2$GarageYrBlt <- NULL


house2 %>% 
  filter(is.na(GarageArea)) %>% 
  select(GarageType:GarageCond)
## # A tibble: 1 x 6
##   GarageType GarageFinish GarageCars GarageArea GarageQual GarageCond
##   <fct>      <ord>             <dbl>      <dbl> <ord>      <ord>     
## 1 Detchd     No                   NA         NA No         No

Since this observation doesn’t seem to have Garage, I’m going to impute 0 in GarageCars and GarageArea respectively.

house3 <- house2 %>% 
  mutate(GarageArea = ifelse(is.na(GarageArea), 0, GarageArea),
         GarageCars = ifelse(is.na(GarageCars), 0, GarageCars))


3.3.2 LotFrontage

  • LotFrontage: Linear feet of street connected to property I’m going to impute NA values of LotFrontage using the Neighborhood column.
ggplot(house3, aes(x = Neighborhood, y = LotFrontage)) +
  stat_boxplot(geom = 'errorbar') +
  geom_boxplot(fill = "skyblue", alpha = 0.8) +
  coord_flip() +
  labs(
    x = "Neighborhood"
  )

Let’s take the median per neighbor.

LotFront_med <- house3 %>%
  group_by(Neighborhood) %>%
  summarize(
    LotFrontage_med = median(LotFrontage, na.rm = TRUE)
  ) %>%
  arrange(desc(LotFrontage_med))

Lotfrontage_new <- house3 %>% 
  select(Neighborhood, LotFrontage) %>% 
  left_join(LotFront_med, by = "Neighborhood") %>% 
  mutate(LotFrontage_new = 
           ifelse(is.na(LotFrontage), LotFrontage_med, LotFrontage)) %>% 
  pull(LotFrontage_new)

house4 <- house3 %>% 
  mutate(LotFrontage = Lotfrontage_new)


3.3.3 MasVnr series

  • MasVnrType: Masonry veneer type
  • MasVnrArea: Masonry veneer area in square feet First of all, I find the one that should have MasVnrType. I will impute mode of MasVnrType without None type.
house4 %>% 
  filter(is.na(MasVnrType), !is.na(MasVnrArea)) %>% 
  select(MasVnrType, MasVnrArea, everything())
## # A tibble: 1 x 79
##   MasVnrType MasVnrArea MSSubClass MSZoning LotFrontage LotArea Street Alley
##   <chr>           <dbl>      <dbl> <chr>          <dbl>   <dbl> <chr>  <fct>
## 1 <NA>              198         20 RL               124   27697 Pave   No   
## # ... with 71 more variables: LotShape <chr>, LandContour <chr>,
## #   Utilities <chr>, LotConfig <chr>, LandSlope <chr>, Neighborhood <chr>,
## #   Condition1 <chr>, Condition2 <chr>, BldgType <chr>, HouseStyle <chr>,
## #   OverallQual <dbl>, OverallCond <dbl>, YearBuilt <dbl>, YearRemodAdd <dbl>,
## #   RoofStyle <chr>, RoofMatl <chr>, Exterior1st <chr>, Exterior2nd <chr>,
## #   ExterQual <chr>, ExterCond <chr>, Foundation <chr>, BsmtQual <ord>,
## #   BsmtCond <ord>, BsmtExposure <ord>, BsmtFinType1 <ord>, BsmtFinSF1 <dbl>,
## #   BsmtFinType2 <ord>, BsmtFinSF2 <dbl>, BsmtUnfSF <dbl>, TotalBsmtSF <dbl>,
## #   Heating <chr>, HeatingQC <chr>, CentralAir <chr>, Electrical <chr>,
## #   `1stFlrSF` <dbl>, `2ndFlrSF` <dbl>, LowQualFinSF <dbl>, GrLivArea <dbl>,
## #   BsmtFullBath <dbl>, BsmtHalfBath <dbl>, FullBath <dbl>, HalfBath <dbl>,
## #   BedroomAbvGr <dbl>, KitchenAbvGr <dbl>, KitchenQual <chr>,
## #   TotRmsAbvGrd <dbl>, Functional <chr>, Fireplaces <dbl>, FireplaceQu <ord>,
## #   GarageType <fct>, GarageFinish <ord>, GarageCars <dbl>, GarageArea <dbl>,
## #   GarageQual <ord>, GarageCond <ord>, PavedDrive <chr>, WoodDeckSF <dbl>,
## #   OpenPorchSF <dbl>, EnclosedPorch <dbl>, `3SsnPorch` <dbl>,
## #   ScreenPorch <dbl>, PoolArea <dbl>, PoolQC <ord>, Fence <ord>,
## #   MiscFeature <fct>, MiscVal <dbl>, MoSold <dbl>, YrSold <dbl>,
## #   SaleType <chr>, SaleCondition <chr>, SalePrice <dbl>

Lastly, I impute “None” to NA values of MasVnrtype because the values of MasVnrArea are also NA. and then I impute 0 to NA values of MasVnrArea. Since I’m going to repeatedly do the works that impute the mode, I created a function, which is named mode_vector().

mode_vector <- function(tb, .col){
  .col <- enquo(.col) # prevent the evaluation of R code
  # Defusing prevents the evaluation of R code, but you can still force evaluation inside a defused expression    # with the forcing operators !! and !!!.
  tb %>%
    group_by(!!.col) %>% 
    count() %>% 
    arrange(desc(n))
}

mode_MasType <- mode_vector(house4, MasVnrType) %>%
  filter(MasVnrType != "None") %>% 
  pull(MasVnrType)

MasVnrType_lev <- c("Stone", "None", "CBlock", "BrkFace", "BrkCmn")
house5 <- house4 %>% 
  select(MasVnrType, MasVnrArea, everything()) %>% 
  mutate(MasVnrType = case_when(
    is.na(MasVnrType) & !is.na(MasVnrArea) ~ mode_MasType[[1]],
    is.na(MasVnrType) ~ "None",
    TRUE ~ MasVnrType),
    MasVnrArea = ifelse(is.na(MasVnrArea), 0, MasVnrArea),
    MasVnrType = parse_factor(MasVnrType, levels = MasVnrType_lev))


3.3.4 MSZoning

  • MSZoning: Identifies the general zoning classification of the sale.
house5 %>% 
  filter(is.na(MSZoning))
## # A tibble: 4 x 79
##   MasVnrType MasVnrArea MSSubClass MSZoning LotFrontage LotArea Street Alley
##   <fct>           <dbl>      <dbl> <chr>          <dbl>   <dbl> <chr>  <fct>
## 1 None                0         30 <NA>             109   21780 Grvl   No   
## 2 None                0         20 <NA>              80   14584 Pave   No   
## 3 None                0         70 <NA>              60   56600 Pave   No   
## 4 None                0         20 <NA>             125   31250 Pave   No   
## # ... with 71 more variables: LotShape <chr>, LandContour <chr>,
## #   Utilities <chr>, LotConfig <chr>, LandSlope <chr>, Neighborhood <chr>,
## #   Condition1 <chr>, Condition2 <chr>, BldgType <chr>, HouseStyle <chr>,
## #   OverallQual <dbl>, OverallCond <dbl>, YearBuilt <dbl>, YearRemodAdd <dbl>,
## #   RoofStyle <chr>, RoofMatl <chr>, Exterior1st <chr>, Exterior2nd <chr>,
## #   ExterQual <chr>, ExterCond <chr>, Foundation <chr>, BsmtQual <ord>,
## #   BsmtCond <ord>, BsmtExposure <ord>, BsmtFinType1 <ord>, BsmtFinSF1 <dbl>,
## #   BsmtFinType2 <ord>, BsmtFinSF2 <dbl>, BsmtUnfSF <dbl>, TotalBsmtSF <dbl>,
## #   Heating <chr>, HeatingQC <chr>, CentralAir <chr>, Electrical <chr>,
## #   `1stFlrSF` <dbl>, `2ndFlrSF` <dbl>, LowQualFinSF <dbl>, GrLivArea <dbl>,
## #   BsmtFullBath <dbl>, BsmtHalfBath <dbl>, FullBath <dbl>, HalfBath <dbl>,
## #   BedroomAbvGr <dbl>, KitchenAbvGr <dbl>, KitchenQual <chr>,
## #   TotRmsAbvGrd <dbl>, Functional <chr>, Fireplaces <dbl>, FireplaceQu <ord>,
## #   GarageType <fct>, GarageFinish <ord>, GarageCars <dbl>, GarageArea <dbl>,
## #   GarageQual <ord>, GarageCond <ord>, PavedDrive <chr>, WoodDeckSF <dbl>,
## #   OpenPorchSF <dbl>, EnclosedPorch <dbl>, `3SsnPorch` <dbl>,
## #   ScreenPorch <dbl>, PoolArea <dbl>, PoolQC <ord>, Fence <ord>,
## #   MiscFeature <fct>, MiscVal <dbl>, MoSold <dbl>, YrSold <dbl>,
## #   SaleType <chr>, SaleCondition <chr>, SalePrice <dbl>

There is a typo in MSZoning. C is written in C (all). I’m going to fix these values.

mode_MSZoning <- mode_vector(house5, MSZoning) %>% 
  pull(MSZoning)
mode_MSZoning # There is a typo in MSZoning. C is written in C (all). I'm going to fix these values.
## [1] "RL"      "RM"      "FV"      "RH"      "C (all)" NA
MSZoning_lev <- c("RM", "RP", "RL", "RH", "I", "FV", "C", "A") # To factorize at the end.
house6 <- house5 %>% 
  mutate(MSZoning = case_when(
    is.na(MSZoning) ~ mode_MSZoning[[1]],
    MSZoning == "C (all)" ~ "C",
    TRUE ~ MSZoning),
    MSZoning = parse_factor(MSZoning, levels = MSZoning_lev))


3.3.5 Utilities

  • Utilities: Type of utilities available
house6 %>% 
  filter(is.na(Utilities)) %>% 
  select(Utilities, everything())
## # A tibble: 2 x 79
##   Utilities MasVnrType MasVnrArea MSSubClass MSZoning LotFrontage LotArea Street
##   <chr>     <fct>           <dbl>      <dbl> <fct>          <dbl>   <dbl> <chr> 
## 1 <NA>      None                0         30 RL               109   21780 Grvl  
## 2 <NA>      None                0         20 RL                64   31220 Pave  
## # ... with 71 more variables: Alley <fct>, LotShape <chr>, LandContour <chr>,
## #   LotConfig <chr>, LandSlope <chr>, Neighborhood <chr>, Condition1 <chr>,
## #   Condition2 <chr>, BldgType <chr>, HouseStyle <chr>, OverallQual <dbl>,
## #   OverallCond <dbl>, YearBuilt <dbl>, YearRemodAdd <dbl>, RoofStyle <chr>,
## #   RoofMatl <chr>, Exterior1st <chr>, Exterior2nd <chr>, ExterQual <chr>,
## #   ExterCond <chr>, Foundation <chr>, BsmtQual <ord>, BsmtCond <ord>,
## #   BsmtExposure <ord>, BsmtFinType1 <ord>, BsmtFinSF1 <dbl>,
## #   BsmtFinType2 <ord>, BsmtFinSF2 <dbl>, BsmtUnfSF <dbl>, TotalBsmtSF <dbl>,
## #   Heating <chr>, HeatingQC <chr>, CentralAir <chr>, Electrical <chr>,
## #   `1stFlrSF` <dbl>, `2ndFlrSF` <dbl>, LowQualFinSF <dbl>, GrLivArea <dbl>,
## #   BsmtFullBath <dbl>, BsmtHalfBath <dbl>, FullBath <dbl>, HalfBath <dbl>,
## #   BedroomAbvGr <dbl>, KitchenAbvGr <dbl>, KitchenQual <chr>,
## #   TotRmsAbvGrd <dbl>, Functional <chr>, Fireplaces <dbl>, FireplaceQu <ord>,
## #   GarageType <fct>, GarageFinish <ord>, GarageCars <dbl>, GarageArea <dbl>,
## #   GarageQual <ord>, GarageCond <ord>, PavedDrive <chr>, WoodDeckSF <dbl>,
## #   OpenPorchSF <dbl>, EnclosedPorch <dbl>, `3SsnPorch` <dbl>,
## #   ScreenPorch <dbl>, PoolArea <dbl>, PoolQC <ord>, Fence <ord>,
## #   MiscFeature <fct>, MiscVal <dbl>, MoSold <dbl>, YrSold <dbl>,
## #   SaleType <chr>, SaleCondition <chr>, SalePrice <dbl>
mode_vector(house6, Utilities)
## # A tibble: 3 x 2
## # Groups:   Utilities [3]
##   Utilities     n
##   <chr>     <int>
## 1 AllPub     2916
## 2 <NA>          2
## 3 NoSeWa        1

Only one data point is “NoSeWa”. Let’s find where it is. It’s in training set, which means all houses in the test set has “AllPub”

which(house6$Utilities == "NoSeWa")
## [1] 945

This makes the variable “Utilities” useless for prediction. Consequently, I’m going to get rid of it.

house7 <- house6 %>% 
  select(everything(), -Utilities)


3.3.6 Bsmt series

  • BsmtFullBath: Basement full bathrooms, ordered.
  • BsmtHalfBath: Basement half bathrooms, ordered.
  • BsmtFinSF1: Type 1 finished square feet
  • BsmtFinSF2: Type 2 finished square feet
  • BsmtUnfSF: Unfinished square feet of basement area
  • TotalBsmtSF: Total square feet of basement area

We need to inspect the data points which has NA in BsmtFullBath and BsmtHalfBath and BsmtFinSF1

house7 %>% 
  filter(is.na(BsmtFullBath) | is.na(BsmtHalfBath) | is.na(BsmtFinSF1) ) %>% 
  select(BsmtFullBath, BsmtHalfBath, starts_with("BsmtFin"), BsmtUnfSF, ends_with("BsmtSF"))
## # A tibble: 2 x 8
##   BsmtFullBath BsmtHalfBath BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2
##          <dbl>        <dbl> <ord>             <dbl> <ord>             <dbl>
## 1           NA           NA NoBs                 NA NoBs                 NA
## 2           NA           NA NoBs                  0 NoBs                  0
## # ... with 2 more variables: BsmtUnfSF <dbl>, TotalBsmtSF <dbl>

That is, “NoBs” in the result means we can impute 0 to NA values. we can impute zero to the data points.

house8 <- house7 %>% 
  mutate(BsmtFullBath = ifelse(is.na(BsmtFullBath), 0, BsmtFullBath),
         BsmtHalfBath = ifelse(is.na(BsmtHalfBath), 0, BsmtHalfBath),
         BsmtFinSF1 = ifelse(is.na(BsmtFinSF1), 0, BsmtFinSF1),
         BsmtFinSF2 = ifelse(is.na(BsmtFinSF2), 0, BsmtFinSF2),
         BsmtUnfSF = ifelse(is.na(BsmtUnfSF), 0, BsmtUnfSF),
         TotalBsmtSF = ifelse(is.na(TotalBsmtSF), 0, TotalBsmtSF),
         BsmtFullBath = factor(BsmtFullBath, ordered = TRUE),
         BsmtHalfBath = factor(BsmtHalfBath, ordered = TRUE))


3.3.7 Functional

  • Functional: Home functionality (Assume typical unless deductions are warranted)
house8 %>%
  filter(is.na(Functional)) %>% 
  select(Functional, everything())
## # A tibble: 2 x 78
##   Functional MasVnrType MasVnrArea MSSubClass MSZoning LotFrontage LotArea
##   <chr>      <fct>           <dbl>      <dbl> <fct>          <dbl>   <dbl>
## 1 <NA>       None                0         20 RL                80   14584
## 2 <NA>       None                0         50 RM                60   10320
## # ... with 71 more variables: Street <chr>, Alley <fct>, LotShape <chr>,
## #   LandContour <chr>, LotConfig <chr>, LandSlope <chr>, Neighborhood <chr>,
## #   Condition1 <chr>, Condition2 <chr>, BldgType <chr>, HouseStyle <chr>,
## #   OverallQual <dbl>, OverallCond <dbl>, YearBuilt <dbl>, YearRemodAdd <dbl>,
## #   RoofStyle <chr>, RoofMatl <chr>, Exterior1st <chr>, Exterior2nd <chr>,
## #   ExterQual <chr>, ExterCond <chr>, Foundation <chr>, BsmtQual <ord>,
## #   BsmtCond <ord>, BsmtExposure <ord>, BsmtFinType1 <ord>, BsmtFinSF1 <dbl>,
## #   BsmtFinType2 <ord>, BsmtFinSF2 <dbl>, BsmtUnfSF <dbl>, TotalBsmtSF <dbl>,
## #   Heating <chr>, HeatingQC <chr>, CentralAir <chr>, Electrical <chr>,
## #   `1stFlrSF` <dbl>, `2ndFlrSF` <dbl>, LowQualFinSF <dbl>, GrLivArea <dbl>,
## #   BsmtFullBath <ord>, BsmtHalfBath <ord>, FullBath <dbl>, HalfBath <dbl>,
## #   BedroomAbvGr <dbl>, KitchenAbvGr <dbl>, KitchenQual <chr>,
## #   TotRmsAbvGrd <dbl>, Fireplaces <dbl>, FireplaceQu <ord>, GarageType <fct>,
## #   GarageFinish <ord>, GarageCars <dbl>, GarageArea <dbl>, GarageQual <ord>,
## #   GarageCond <ord>, PavedDrive <chr>, WoodDeckSF <dbl>, OpenPorchSF <dbl>,
## #   EnclosedPorch <dbl>, `3SsnPorch` <dbl>, ScreenPorch <dbl>, PoolArea <dbl>,
## #   PoolQC <ord>, Fence <ord>, MiscFeature <fct>, MiscVal <dbl>, MoSold <dbl>,
## #   YrSold <dbl>, SaleType <chr>, SaleCondition <chr>, SalePrice <dbl>
mode_vector(house8, Functional)
## # A tibble: 8 x 2
## # Groups:   Functional [8]
##   Functional     n
##   <chr>      <int>
## 1 Typ         2717
## 2 Min2          70
## 3 Min1          65
## 4 Mod           35
## 5 Maj1          19
## 6 Maj2           9
## 7 Sev            2
## 8 <NA>           2
mode_Func <- mode_vector(house8, Functional) %>% 
  pull(Functional)

impute mode

Functional_lev <- c("Sal", "Sev", "Maj2", "Maj1", "Mod", "Min2", "Min1", "Typ") 
house9 <- house8 %>% 
  mutate(Functional = ifelse(is.na(Functional), mode_Func[[1]], Functional),
         Functional = parse_factor(Functional, levels = Functional_lev, ordered = TRUE))


3.3.8 Exterior series

  • Exterior1st : Exterior covering on house
  • Exterior2nd : Exterior covering on house (if more than one material)
house9 %>% 
  filter(is.na(Exterior1st) | is.na(Exterior2nd)) %>% 
  select(Exterior1st, Exterior2nd, everything())
## # A tibble: 1 x 78
##   Exterior1st Exterior2nd MasVnrType MasVnrArea MSSubClass MSZoning LotFrontage
##   <chr>       <chr>       <fct>           <dbl>      <dbl> <fct>          <dbl>
## 1 <NA>        <NA>        None                0         30 RL                85
## # ... with 71 more variables: LotArea <dbl>, Street <chr>, Alley <fct>,
## #   LotShape <chr>, LandContour <chr>, LotConfig <chr>, LandSlope <chr>,
## #   Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>, BldgType <chr>,
## #   HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>, YearBuilt <dbl>,
## #   YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>, ExterQual <chr>,
## #   ExterCond <chr>, Foundation <chr>, BsmtQual <ord>, BsmtCond <ord>,
## #   BsmtExposure <ord>, BsmtFinType1 <ord>, BsmtFinSF1 <dbl>,
## #   BsmtFinType2 <ord>, BsmtFinSF2 <dbl>, BsmtUnfSF <dbl>, TotalBsmtSF <dbl>,
## #   Heating <chr>, HeatingQC <chr>, CentralAir <chr>, Electrical <chr>,
## #   `1stFlrSF` <dbl>, `2ndFlrSF` <dbl>, LowQualFinSF <dbl>, GrLivArea <dbl>,
## #   BsmtFullBath <ord>, BsmtHalfBath <ord>, FullBath <dbl>, HalfBath <dbl>,
## #   BedroomAbvGr <dbl>, KitchenAbvGr <dbl>, KitchenQual <chr>,
## #   TotRmsAbvGrd <dbl>, Functional <ord>, Fireplaces <dbl>, FireplaceQu <ord>,
## #   GarageType <fct>, GarageFinish <ord>, GarageCars <dbl>, GarageArea <dbl>,
## #   GarageQual <ord>, GarageCond <ord>, PavedDrive <chr>, WoodDeckSF <dbl>,
## #   OpenPorchSF <dbl>, EnclosedPorch <dbl>, `3SsnPorch` <dbl>,
## #   ScreenPorch <dbl>, PoolArea <dbl>, PoolQC <ord>, Fence <ord>,
## #   MiscFeature <fct>, MiscVal <dbl>, MoSold <dbl>, YrSold <dbl>,
## #   SaleType <chr>, SaleCondition <chr>, SalePrice <dbl>
mode_vector(house9, Exterior1st)
## # A tibble: 16 x 2
## # Groups:   Exterior1st [16]
##    Exterior1st     n
##    <chr>       <int>
##  1 VinylSd      1025
##  2 MetalSd       450
##  3 HdBoard       442
##  4 Wd Sdng       411
##  5 Plywood       221
##  6 CemntBd       126
##  7 BrkFace        87
##  8 WdShing        56
##  9 AsbShng        44
## 10 Stucco         43
## 11 BrkComm         6
## 12 AsphShn         2
## 13 CBlock          2
## 14 Stone           2
## 15 ImStucc         1
## 16 <NA>            1
mode_vector(house9, Exterior2nd)
## # A tibble: 17 x 2
## # Groups:   Exterior2nd [17]
##    Exterior2nd     n
##    <chr>       <int>
##  1 VinylSd      1014
##  2 MetalSd       447
##  3 HdBoard       406
##  4 Wd Sdng       391
##  5 Plywood       270
##  6 CmentBd       126
##  7 Wd Shng        81
##  8 BrkFace        47
##  9 Stucco         47
## 10 AsbShng        38
## 11 Brk Cmn        22
## 12 ImStucc        15
## 13 Stone           6
## 14 AsphShn         4
## 15 CBlock          3
## 16 Other           1
## 17 <NA>            1
mode_1st <- mode_vector(house9, Exterior1st) %>% 
  pull(Exterior1st)
mode_2nd <- mode_vector(house9, Exterior2nd) %>% 
  pull(Exterior2nd)

Impute mode and, as you can see above, there are typos in Exterior2nd. CemntBd is written in CmentBD and WdShing is written in WD Shng. BrkComm is written in Brk Cmn. I’m going to modify these values.

Exterior_lev <- c("WdShing", "Wd Sdng", "VinylSd", "Stucco", "Stone", "PreCast", "Plywood", "Other", "MetalSd",
                  "ImStucc", "HdBoard", "CemntBd", "CBlock", "BrkFace", "BrkComm", "AsphShn", "AsbShng")
house10 <- house9 %>% 
  mutate(Exterior1st = ifelse(is.na(Exterior1st), mode_1st[[1]], Exterior1st),
         Exterior2nd = case_when(
           is.na(Exterior2nd) ~ mode_2nd[[1]],
           Exterior2nd == "CmentBd" ~ "CemntBd",
           Exterior2nd == "Wd Shng" ~ "WdShing",
           Exterior2nd == "Brk Cmn" ~ "BrkComm",
           TRUE ~ Exterior2nd),
         Exterior1st = parse_factor(Exterior1st, levels = Exterior_lev),
         Exterior2nd = parse_factor(Exterior2nd, levels = Exterior_lev))


3.3.9 Electrical

  • Electrical: Electrical system
house10 %>% 
  filter(is.na(Electrical)) %>% 
  select(Electrical, everything())
## # A tibble: 1 x 78
##   Electrical MasVnrType MasVnrArea MSSubClass MSZoning LotFrontage LotArea
##   <chr>      <fct>           <dbl>      <dbl> <fct>          <dbl>   <dbl>
## 1 <NA>       None                0         80 RL                73    9735
## # ... with 71 more variables: Street <chr>, Alley <fct>, LotShape <chr>,
## #   LandContour <chr>, LotConfig <chr>, LandSlope <chr>, Neighborhood <chr>,
## #   Condition1 <chr>, Condition2 <chr>, BldgType <chr>, HouseStyle <chr>,
## #   OverallQual <dbl>, OverallCond <dbl>, YearBuilt <dbl>, YearRemodAdd <dbl>,
## #   RoofStyle <chr>, RoofMatl <chr>, Exterior1st <fct>, Exterior2nd <fct>,
## #   ExterQual <chr>, ExterCond <chr>, Foundation <chr>, BsmtQual <ord>,
## #   BsmtCond <ord>, BsmtExposure <ord>, BsmtFinType1 <ord>, BsmtFinSF1 <dbl>,
## #   BsmtFinType2 <ord>, BsmtFinSF2 <dbl>, BsmtUnfSF <dbl>, TotalBsmtSF <dbl>,
## #   Heating <chr>, HeatingQC <chr>, CentralAir <chr>, `1stFlrSF` <dbl>,
## #   `2ndFlrSF` <dbl>, LowQualFinSF <dbl>, GrLivArea <dbl>, BsmtFullBath <ord>,
## #   BsmtHalfBath <ord>, FullBath <dbl>, HalfBath <dbl>, BedroomAbvGr <dbl>,
## #   KitchenAbvGr <dbl>, KitchenQual <chr>, TotRmsAbvGrd <dbl>,
## #   Functional <ord>, Fireplaces <dbl>, FireplaceQu <ord>, GarageType <fct>,
## #   GarageFinish <ord>, GarageCars <dbl>, GarageArea <dbl>, GarageQual <ord>,
## #   GarageCond <ord>, PavedDrive <chr>, WoodDeckSF <dbl>, OpenPorchSF <dbl>,
## #   EnclosedPorch <dbl>, `3SsnPorch` <dbl>, ScreenPorch <dbl>, PoolArea <dbl>,
## #   PoolQC <ord>, Fence <ord>, MiscFeature <fct>, MiscVal <dbl>, MoSold <dbl>,
## #   YrSold <dbl>, SaleType <chr>, SaleCondition <chr>, SalePrice <dbl>
mode_vector(house10, Electrical)
## # A tibble: 6 x 2
## # Groups:   Electrical [6]
##   Electrical     n
##   <chr>      <int>
## 1 SBrkr       2671
## 2 FuseA        188
## 3 FuseF         50
## 4 FuseP          8
## 5 Mix            1
## 6 <NA>           1
mode_Elec <- mode_vector(house10, Electrical) %>% 
  pull(Electrical)

Factorizing. it seems to be ordinal.

Electrical_lev <- c("Mix", str_c("Fuse", c("P", "F", "A")), "SBrkr")
house11 <- house10 %>% 
  mutate(Electrical = ifelse(is.na(Electrical), mode_Elec[[1]], Electrical),
         Electrical = parse_factor(Electrical, levels = Electrical_lev, ordered = TRUE))


3.3.10 KitchenQual

  • KitchenQual: Kitchen quality
house11 %>% 
  filter(is.na(KitchenQual)) %>% 
  select(starts_with("Kitchen"), everything())
## # A tibble: 1 x 78
##   KitchenAbvGr KitchenQual MasVnrType MasVnrArea MSSubClass MSZoning LotFrontage
##          <dbl> <chr>       <fct>           <dbl>      <dbl> <fct>          <dbl>
## 1            1 <NA>        None                0         50 RL                72
## # ... with 71 more variables: LotArea <dbl>, Street <chr>, Alley <fct>,
## #   LotShape <chr>, LandContour <chr>, LotConfig <chr>, LandSlope <chr>,
## #   Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>, BldgType <chr>,
## #   HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>, YearBuilt <dbl>,
## #   YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>, Exterior1st <fct>,
## #   Exterior2nd <fct>, ExterQual <chr>, ExterCond <chr>, Foundation <chr>,
## #   BsmtQual <ord>, BsmtCond <ord>, BsmtExposure <ord>, BsmtFinType1 <ord>,
## #   BsmtFinSF1 <dbl>, BsmtFinType2 <ord>, BsmtFinSF2 <dbl>, BsmtUnfSF <dbl>,
## #   TotalBsmtSF <dbl>, Heating <chr>, HeatingQC <chr>, CentralAir <chr>,
## #   Electrical <ord>, `1stFlrSF` <dbl>, `2ndFlrSF` <dbl>, LowQualFinSF <dbl>,
## #   GrLivArea <dbl>, BsmtFullBath <ord>, BsmtHalfBath <ord>, FullBath <dbl>,
## #   HalfBath <dbl>, BedroomAbvGr <dbl>, TotRmsAbvGrd <dbl>, Functional <ord>,
## #   Fireplaces <dbl>, FireplaceQu <ord>, GarageType <fct>, GarageFinish <ord>,
## #   GarageCars <dbl>, GarageArea <dbl>, GarageQual <ord>, GarageCond <ord>,
## #   PavedDrive <chr>, WoodDeckSF <dbl>, OpenPorchSF <dbl>, EnclosedPorch <dbl>,
## #   `3SsnPorch` <dbl>, ScreenPorch <dbl>, PoolArea <dbl>, PoolQC <ord>,
## #   Fence <ord>, MiscFeature <fct>, MiscVal <dbl>, MoSold <dbl>, YrSold <dbl>,
## #   SaleType <chr>, SaleCondition <chr>, SalePrice <dbl>
mode_vector(house11, KitchenAbvGr)
## # A tibble: 4 x 2
## # Groups:   KitchenAbvGr [4]
##   KitchenAbvGr     n
##          <dbl> <int>
## 1            1  2785
## 2            2   129
## 3            0     3
## 4            3     2
mode_KitchenQual <- mode_vector(house11, KitchenQual) %>% 
  pull(KitchenQual)
KitchenQual_lev <- c("Po", "Fa", "TA", "Gd", "Ex")
house12 <- house11 %>% 
  mutate(KitchenQual = ifelse(is.na(KitchenQual), mode_KitchenQual[[1]], KitchenQual),
         KitchenQual = parse_factor(KitchenQual, levels = KitchenQual_lev, ordered = TRUE))


3.3.11 SaleType

  • SaleType: Type of sale
house12 %>% 
  filter(is.na(SaleType )) %>% 
  select(SaleType , everything())
## # A tibble: 1 x 78
##   SaleType MasVnrType MasVnrArea MSSubClass MSZoning LotFrontage LotArea Street
##   <chr>    <fct>           <dbl>      <dbl> <fct>          <dbl>   <dbl> <chr> 
## 1 <NA>     BrkFace           340         20 RL                85   13770 Pave  
## # ... with 70 more variables: Alley <fct>, LotShape <chr>, LandContour <chr>,
## #   LotConfig <chr>, LandSlope <chr>, Neighborhood <chr>, Condition1 <chr>,
## #   Condition2 <chr>, BldgType <chr>, HouseStyle <chr>, OverallQual <dbl>,
## #   OverallCond <dbl>, YearBuilt <dbl>, YearRemodAdd <dbl>, RoofStyle <chr>,
## #   RoofMatl <chr>, Exterior1st <fct>, Exterior2nd <fct>, ExterQual <chr>,
## #   ExterCond <chr>, Foundation <chr>, BsmtQual <ord>, BsmtCond <ord>,
## #   BsmtExposure <ord>, BsmtFinType1 <ord>, BsmtFinSF1 <dbl>,
## #   BsmtFinType2 <ord>, BsmtFinSF2 <dbl>, BsmtUnfSF <dbl>, TotalBsmtSF <dbl>,
## #   Heating <chr>, HeatingQC <chr>, CentralAir <chr>, Electrical <ord>,
## #   `1stFlrSF` <dbl>, `2ndFlrSF` <dbl>, LowQualFinSF <dbl>, GrLivArea <dbl>,
## #   BsmtFullBath <ord>, BsmtHalfBath <ord>, FullBath <dbl>, HalfBath <dbl>,
## #   BedroomAbvGr <dbl>, KitchenAbvGr <dbl>, KitchenQual <ord>,
## #   TotRmsAbvGrd <dbl>, Functional <ord>, Fireplaces <dbl>, FireplaceQu <ord>,
## #   GarageType <fct>, GarageFinish <ord>, GarageCars <dbl>, GarageArea <dbl>,
## #   GarageQual <ord>, GarageCond <ord>, PavedDrive <chr>, WoodDeckSF <dbl>,
## #   OpenPorchSF <dbl>, EnclosedPorch <dbl>, `3SsnPorch` <dbl>,
## #   ScreenPorch <dbl>, PoolArea <dbl>, PoolQC <ord>, Fence <ord>,
## #   MiscFeature <fct>, MiscVal <dbl>, MoSold <dbl>, YrSold <dbl>,
## #   SaleCondition <chr>, SalePrice <dbl>
mode_vector(house12, SaleType)
## # A tibble: 10 x 2
## # Groups:   SaleType [10]
##    SaleType     n
##    <chr>    <int>
##  1 WD        2525
##  2 New        239
##  3 COD         87
##  4 ConLD       26
##  5 CWD         12
##  6 ConLI        9
##  7 ConLw        8
##  8 Oth          7
##  9 Con          5
## 10 <NA>         1
mode_SaleType <- mode_vector(house12, SaleType) %>% 
  pull(SaleType)
SaleType_lev <- c("Oth", "ConLD", "ConLI", "ConLw", "Con", "COD", "New", "VWD", "CWD", "WD")
house13 <- house12 %>% 
  mutate(SaleType = ifelse(is.na(SaleType), mode_SaleType[[1]], SaleType),
         SaleType = parse_factor(SaleType, levels = SaleType_lev))

one more check the proportions of missing values per columns!

na_prop <- house13 %>% 
  select(-SalePrice) %>% 
  map(is.na) %>% 
  map_dfr(mean) %>%
  pivot_longer(cols = everything(), names_to = "variables", values_to = "prop") %>% 
  filter(prop > 0) %>% 
  arrange(desc(prop))
na_prop
## # A tibble: 0 x 2
## # ... with 2 variables: variables <chr>, prop <dbl>


3.4 Factorize remaining character variables

check the remaining character variables

map(house13, is.character) %>% 
  as_tibble %>% 
  pivot_longer(cols = everything(), names_to = "variables", values_to = "character") %>% 
  filter(character == 1) %>% 
  pull(variables)
##  [1] "Street"        "LotShape"      "LandContour"   "LotConfig"    
##  [5] "LandSlope"     "Neighborhood"  "Condition1"    "Condition2"   
##  [9] "BldgType"      "HouseStyle"    "RoofStyle"     "RoofMatl"     
## [13] "ExterQual"     "ExterCond"     "Foundation"    "Heating"      
## [17] "HeatingQC"     "CentralAir"    "PavedDrive"    "SaleCondition"
  • Street: Type of road access to property
  • LotShape : General shape of property. ordinal
  • LandContour: Flatness of the property
  • LotConfig: Lot configuration
  • LandSlope : Slope of property. ordinal
  • Neighborhood: Physical locations within Ames city limits
  • Condition1 : Proximity to various conditions
  • Condition2 : Proximity to various conditions (if more than one is present)
  • BldgType : Type of dwelling
  • HouseStyle : Style of dwelling
  • RoofStyle : Type of roof
  • RoofMatl : Roof material
  • ExterQual : Evaluates the quality of the material on the exterior. ordinal
  • ExterCond : Evaluates the present condition of the material on the exterior. ordinal
  • Foundation : Type of foundation
  • Heating : Type of heating
  • HeatingQC : Heating quality and condition. ordinal
  • CentralAir : Central air conditioning
  • PavedDrive : Paved driveway
  • SaleCondition : Condition of sale

specify labels of factors which is seems to be ordinal. and then, I’m going to use across() in dplyr 1.0.0, in order to factorize the remaining character variables after parsing ordinal factors. It’s really useful! 😄

LotShape_lev <- c(str_c("IR", 3:1), "Reg")
LandSlope_lev <- c("Sev", "Mod", "Gtl")
Exter_lev <- c("Po", "Fa", "TA", "Gd", "Ex")
HeatingQC_lev <- c("Po", "Fa", "TA", "Gd", "Ex")
house14 <- house13 %>%
  mutate(LotShape = parse_factor(LotShape, levels = LotShape_lev, ordered = TRUE),
         LandSlope = parse_factor(LandSlope, levels = LandSlope_lev, ordered = TRUE),
         ExterQual = parse_factor(ExterQual, levels = Exter_lev, ordered = TRUE),
         ExterCond = parse_factor(ExterCond, levels = Exter_lev, ordered = TRUE),
         HeatingQC = parse_factor(HeatingQC, levels = HeatingQC_lev, ordered = TRUE)) %>% 
  mutate(across(is.character, parse_factor))

3.5 Variable naming conventions in R

we must follow R’s variable naming conventions, when modeling with the {mlr} pacakge. you know, some of the original column names cause problems with R’s variable naming conventions, so the following process is required:

idx_str <- str_detect(colnames(house14), "^\\d") # It starts with number. 
colnames(house14)[idx_str] <- str_c("a", colnames(house14)[idx_str])
head(house14)
## # A tibble: 6 x 78
##   MasVnrType MasVnrArea MSSubClass MSZoning LotFrontage LotArea Street Alley
##   <fct>           <dbl>      <dbl> <fct>          <dbl>   <dbl> <fct>  <fct>
## 1 BrkFace           196         60 RL                65    8450 Pave   No   
## 2 None                0         20 RL                80    9600 Pave   No   
## 3 BrkFace           162         60 RL                68   11250 Pave   No   
## 4 None                0         70 RL                60    9550 Pave   No   
## 5 BrkFace           350         60 RL                84   14260 Pave   No   
## 6 None                0         50 RL                85   14115 Pave   No   
## # ... with 70 more variables: LotShape <ord>, LandContour <fct>,
## #   LotConfig <fct>, LandSlope <ord>, Neighborhood <fct>, Condition1 <fct>,
## #   Condition2 <fct>, BldgType <fct>, HouseStyle <fct>, OverallQual <dbl>,
## #   OverallCond <dbl>, YearBuilt <dbl>, YearRemodAdd <dbl>, RoofStyle <fct>,
## #   RoofMatl <fct>, Exterior1st <fct>, Exterior2nd <fct>, ExterQual <ord>,
## #   ExterCond <ord>, Foundation <fct>, BsmtQual <ord>, BsmtCond <ord>,
## #   BsmtExposure <ord>, BsmtFinType1 <ord>, BsmtFinSF1 <dbl>,
## #   BsmtFinType2 <ord>, BsmtFinSF2 <dbl>, BsmtUnfSF <dbl>, TotalBsmtSF <dbl>,
## #   Heating <fct>, HeatingQC <ord>, CentralAir <fct>, Electrical <ord>,
## #   a1stFlrSF <dbl>, a2ndFlrSF <dbl>, LowQualFinSF <dbl>, GrLivArea <dbl>,
## #   BsmtFullBath <ord>, BsmtHalfBath <ord>, FullBath <dbl>, HalfBath <dbl>,
## #   BedroomAbvGr <dbl>, KitchenAbvGr <dbl>, KitchenQual <ord>,
## #   TotRmsAbvGrd <dbl>, Functional <ord>, Fireplaces <dbl>, FireplaceQu <ord>,
## #   GarageType <fct>, GarageFinish <ord>, GarageCars <dbl>, GarageArea <dbl>,
## #   GarageQual <ord>, GarageCond <ord>, PavedDrive <fct>, WoodDeckSF <dbl>,
## #   OpenPorchSF <dbl>, EnclosedPorch <dbl>, a3SsnPorch <dbl>,
## #   ScreenPorch <dbl>, PoolArea <dbl>, PoolQC <ord>, Fence <ord>,
## #   MiscFeature <fct>, MiscVal <dbl>, MoSold <dbl>, YrSold <dbl>,
## #   SaleType <fct>, SaleCondition <fct>, SalePrice <dbl>

For the columns with names beginning with a number, I added “a” to the beginning.


4 Modeling


Let go modeling with {mlr} and {ranger}!🏃 First, we need to make a supervised task object.

train <- house14 %>%
  filter(!is.na(SalePrice))
test <- house14 %>%
  filter(is.na(SalePrice)) %>%
  select(-SalePrice)
train_Task <- makeRegrTask(data = train, target = "SalePrice")

I normlized the variables before modelling.

train_Task <- normalizeFeatures(train_Task, method = "standardize")
test_Task <- normalizeFeatures(test, method = "standardize")

make ranger learner object.

# getParamSet("regr.ranger"): If you want to know a parameter set in ranger.
.ranger <- makeLearner("regr.ranger", predict.type = "response")

And then, set tunable parameters and a grid in order to get the optimal hyperparameters.

ranger_params <- makeParamSet(
  makeIntegerParam("num.trees", lower = 450, upper = 550),
  makeIntegerParam("mtry", lower = 3, upper = 15),
  makeIntegerParam("min.node.size", lower = 5, upper = 20)
)

Most of the modeling steps would be similar to followed above, but this time I’ve done random search instead of grid search for parameter tuning. random search is faster than grid search, but what you need to know is sometimes it turns out to be less efficient. because, In grid search, the algorithm tunes over every possible combination of parameters provided. In a random search, we specify the number of iterations and it randomly passes over the parameter combinations. In this process, it might miss out some important combination of parameters which could have returned maximum accuracy. That’s all you need to know. let’s do random search for 100 iterations. This time, I will load and use the .RData that I have tuned in advance. If you run the comment part in the code below, you will get a .RData file. 😄

rancontrol <- makeTuneControlRandom(maxit = 100L)
# set 3 fold cross validation
set_cv <- makeResampleDesc("CV", iters = 5L)
# hypertuning
set.seed(1)
# ranger_tune <- tuneParams(learner = .ranger, resampling = set_cv, task = train_Task,
#                           par.set = ranger_params, control = rancontrol, measures = rmsle)
# list.save(ranger_tune, "./data/ranger_tune.RData")
ranger_tune <- list.load("./data/ranger_tune.RData")

In the training step, I chose RMSLE to evaluate the model, which is the same as the Kaggle competition criterion. The following is optimal hyperparameters based on RMSLE:

ranger_tune$x 
## $num.trees
## [1] 492
## 
## $mtry
## [1] 14
## 
## $min.node.size
## [1] 7
ranger_tune$x$importance <- "impurity" # To get variable importances

It’s RMSLE based on the best model.

ranger_tune$y # RMSLE of the best model
## rmsle.test.mean 
##       0.1383685

Finally, we can use the optimal hyperparameters for tranining model.

ranger_tree <- setHyperPars(.ranger, par.vals = ranger_tune$x)
set.seed(1)
mod_ranger <- mlr::train(ranger_tree, train_Task)
getLearnerModel(mod_ranger)
## Ranger result
## 
## Call:
##  ranger::ranger(formula = NULL, dependent.variable.name = tn,      data = getTaskData(.task, .subset), case.weights = .weights,      keep.inbag = keep.inbag, ...) 
## 
## Type:                             Regression 
## Number of trees:                  492 
## Sample size:                      1460 
## Number of independent variables:  77 
## Mtry:                             14 
## Target node size:                 7 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       747145968 
## R squared (OOB):                  0.8816142

Of couse, you can get the variable importances from {ranger}. I visualized the variable importances of the Top 10 based on “MeanDecreaseGini”. This is also the reason for adding “impurity” to optimal hyperparameters(the list object, ranger_tune) in advance.

imp <- getFeatureImportance(mod_ranger)
imp$res %>%
  gather(key = "variable", value = "importance") %>%
  arrange(desc(importance)) %>%
  slice(1:10) %>%
  ggplot() +
  geom_bar(aes(x = fct_reorder(variable, importance), y = importance, fill = importance), 
           stat = "identity") +
  scale_y_continuous(breaks = NULL) +
  coord_flip() +
  labs(x = "Variables", y = "Importances") +
  theme(legend.position = "none")


5 Making predictions and a submission file


make predictions for test set.💻

pred_ranger <- predict(mod_ranger, newdata = test_Task)
## Warning in predict.WrappedModel(mod_ranger, newdata = test_Task): Provided data
## for prediction is not a pure data.frame but from class tbl_df, hence it will be
## converted.

make a submission file.

ranger_submit <- data.frame(ID = test_ID, SalePrice = pred_ranger$data$response)
# write_excel_csv(ranger_submit, "./data/pred.csv") # RMSLE = 0.14430

you can get the result, if you upload the submission file on the Kaggle Website - House Prices - Submit Predictions. Anyway RMSLE on test set was 0.14430. Although the measure may seem complicated, you can make a meaningful interpretation! 💡


\(RMSLE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} (log \frac{p_i+1}{a_i+1})^2} = 0.1443\)


First, put \(log()\) aside and, that is, Let’s see \(\frac{p_i+1}{a_i+1}\). we can ignore the +1 of the numerator and denomitator, this just prevents the case of a zero value into the log. then, \(p_i/a_i\) is a ratio of predicted(\(p_i\)) and actual values(\(a_i\)). Good :-). Now log() is the problem, but it’s simple. Just take the exponential function, which is the inverse function of the log:

exp(0.1443)
## [1] 1.155231

That is, \(exp(RMLSE)\) is the average ratio of predicted and actual values. For example, If \(RMSLE\) is 0, which probably won’t happen🌞, it means your model is predicting all the actual values correctly.


\(exp(RMSLE) = e^{0.1443} = 1.155\)


The above result means that the predicted value is overestimating the actual value by 15%! 💡