How to use different algorithms using the caret package in R

Hello fellow readers, this is my first article, so please bear with me. There will be errors in grammar (not in the code), so apologies in advance.

I'll be working on the House Prices data set, which is a competition on Kaggle, and using the caret package in R to apply different algorithms from one package instead of a different package for each, along with hyper-parameter tuning.

The Kaggle description of the dataset reads as follows:

"Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home."

The dataset for this competition has around 76 columns, each with its share of missing values, because let's face it, a data set without missing values is like life without soul.

The problem we will tackle is predicting the sale price of residential homes in Ames, Iowa. We have to use regression techniques to predict the SalePrice of each property. This is a supervised regression machine learning problem. It's supervised because we have both the features (the data describing each house) and the target (SalePrice) that we want to predict.

For regression problems we can use a variety of algorithms, such as linear regression, random forest, kNN and so on. In R we have different packages for all these algorithms. The general idea of this article is "why use different packages for different algorithms when you have one for all?". The caret package contains more than 175 algorithms to work with. Instead of trying to remember a different package for each algorithm, caret lets you use one simple function to create all your models. Sounds pretty simple, eh? Well, it's not that simple.

Roadmap:

Before jumping straight into coding, let's set some guidelines on how we'll approach the problem statement.

- Identifying the missing values and anomalies.
- Preparing the data for machine learning algorithms.
- Training the model with different algorithms in caret.
- Predicting the output with the respective models.

We have the question: "Predicting the sale price of the properties based on the data given". To start on the problem, we need the data. Generally, most of the time is spent cleaning the data and exploring it, to find the relations between columns and to decide whether we need to make new columns out of the existing ones. We will not go too deep into the exploring part, as the main theme of this article is how to use the caret package. But I'll post a new article if you need a primer on how to approach exploring datasets.

The data is available on Kaggle for download. The file format is CSV (comma-separated values), the usual file format when working with data. The following code loads the data in RStudio and displays the structure of the data.

    # Set the working directory (where you have saved the file)
    setwd("E:/Kaggle/Housing Price/")
    # Read the train and test data
    train … 20])
    col_miss … 20])

We see that there are 5 columns, "Alley", "FireplaceQu", "PoolQC", "Fence" and "MiscFeature", which have more than 20% missing data. It is good practice to impute missing values with reasonable values, but if we explicitly impute values of our own choosing, we may be manipulating the data to our preference and are likely to end up with a wrong model. We should instead remove the columns whose share of missing values exceeds a certain threshold.

    # Removing the columns with more than 20% missing value in both train and test data.

Now that the data sets are merged, let's create dummy variables. We will use the dummyVars function of the caret package.

    library(caret)
    # Build the encoder from every column, then apply it to get a numeric matrix
    housedumnew <- dummyVars(~., data = housedum)
    housedumpred <- predict(housedumnew, housedum)

Now we separate the training and testing data sets after creating the dummy variables.

    trainnewpred <- housedumpred
    testnewpred <- housedumpred

Hopefully you have a general idea of how PCA works. Even if you don't, look up the theoretical side of PCA.

    # PCA on the scaled training predictors
    prin_train <- prcomp(trainnewpred, scale. = TRUE)
    # Standard deviation and variance of each principal component
    std_dev <- prin_train$sdev
    pr_var <- std_dev^2
    # Proportion of variance explained by each component
    prop_varex <- pr_var / sum(pr_var)

Let's plot to understand how many variables we ought to take for model creation.
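As a rough illustration of the PCA plotting step in this article, here is a minimal sketch in base R. It runs on a small synthetic matrix (`toy_train`, a hypothetical stand-in for the dummy-encoded training data, not the Kaggle file), and the 95% cut-off is an assumption chosen for the example, not a value taken from the article.

```r
# Synthetic stand-in for the encoded training matrix (assumption, not real data):
# 3 independent columns plus 7 noisy copies of them, so a few components dominate.
set.seed(42)
n <- 200
base_cols <- matrix(rnorm(n * 3), ncol = 3)
noise <- matrix(rnorm(n * 7, sd = 0.3), ncol = 7)
toy_train <- cbind(base_cols, base_cols[, c(1, 2, 3, 1, 2, 3, 1)] + noise)
colnames(toy_train) <- paste0("x", seq_len(ncol(toy_train)))

# Same steps as in the article: PCA on scaled data, then variance per component.
prin_toy <- prcomp(toy_train, scale. = TRUE)
std_dev <- prin_toy$sdev
pr_var <- std_dev^2
prop_varex <- pr_var / sum(pr_var)
cum_varex <- cumsum(prop_varex)

# Scree plot: share of variance explained by each component.
plot(prop_varex, type = "b",
     xlab = "Principal component",
     ylab = "Proportion of variance explained")

# Cumulative plot with the assumed 95% threshold marked.
threshold <- 0.95
plot(cum_varex, type = "b",
     xlab = "Principal component",
     ylab = "Cumulative proportion of variance explained")
abline(h = threshold, lty = 2)

# Smallest number of components whose cumulative share reaches the threshold.
n_comp <- which(cum_varex >= threshold)[1]
print(n_comp)
```

On the real data the same code would take `trainnewpred` in place of `toy_train`. One thing to watch for: `prcomp` with `scale. = TRUE` stops with an error if any column is constant, which can happen with dummy variables for rare levels, so such columns may need to be dropped first.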