Practicing Machine Learning Techniques in R with MLR Package
Introduction
  In R, we often use multiple packages to do various machine learning tasks. For example: we impute missing values using one package, then build a model with another, and finally evaluate performance with a third.
  The problem is that every package has its own set of specific parameters. While working with many packages, we end up spending a lot of time figuring out which parameters are important. Don’t you think?
  To solve this problem, I researched and came across an R package named MLR, which is absolutely incredible at performing machine learning tasks. This package includes all of the ML algorithms which we use frequently.
  In this tutorial, I’ve taken up a classification problem and tried improving its accuracy using machine learning. I haven’t explained the ML algorithms theoretically; the focus is kept on their implementation. By the end of this article, you are expected to become proficient at implementing several ML algorithms in R. But only if you practice alongside.
  Note: This article is meant for beginners and early starters with machine learning in R. Basic statistics knowledge is required.

Table of Contents

  • Getting Data
  • Exploring Data
  • Missing Value Imputation
  • Feature Engineering

    • Outlier Removal by Capping
    • New Features

  • Machine Learning

    • Feature Importance
    • QDA
    • Logistic Regression

      • Cross Validation

    • Decision Tree

      • Cross Validation
      • Parameter Tuning using Grid Search

    • Random Forest
    • SVM
    • GBM (Gradient Boosting)

      • Cross Validation
      • Parameter Tuning using Random Search (Faster)

    • XGBoost (Extreme Gradient Boosting)
    • Feature Selection

Machine Learning with MLR Package
  Until now, R didn’t have any package / library similar to Scikit-Learn from Python, wherein you could get all the functions required to do machine learning. But, since February 2016, R users have had the mlr package, with which they can perform most of their ML tasks.
  Let’s now understand the basic concept of how this package works. If you get it right here, understanding the whole package is a mere cakewalk.
  The entire structure of this package relies on this premise:
  Create a Task. Make a Learner. Train Them.
  Creating a task means loading data into the package. Making a learner means choosing an algorithm (learner) which learns from the task (or data). Finally, train them.
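  To see that premise in one screenful, here is a minimal sketch using R’s built-in iris data as a stand-in (the data set and the rpart learner are illustrative choices, not part of this tutorial’s problem):
  > library(mlr)
> task <- makeClassifTask(data = iris, target = "Species") #1. create a task
> lrn <- makeLearner("classif.rpart") #2. make a learner
> mod <- train(lrn, task) #3. train the learner on the task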
  The MLR package has several algorithms in its bouquet. These algorithms are categorized into regression, classification, clustering, survival, multilabel and cost-sensitive classification. Let’s look at some of the available algorithms for classification problems:
  > listLearners("classif")[c("class","package")]
  class                           package
1 classif.avNNet                   nnet
2 classif.bartMachine            bartMachine
3 classif.binomial                 stats
4 classif.boosting               adabag,rpart
5 classif.cforest                  party
6 classif.ctree                    party
7 classif.extraTrees             extraTrees
8 classif.knn                      class
9 classif.lda                      MASS
10 classif.logreg                  stats
11 classif.lvq1                    class
12 classif.multinom                 nnet
13 classif.neuralnet             neuralnet
14 classif.nnet                     nnet
15 classif.plsdaCaret              caret
16 classif.probit                  stats
17 classif.qda                      MASS
18 classif.randomForest         randomForest
19 classif.randomForestSRC      randomForestSRC
20 classif.randomForestSRCSyn   randomForestSRC
21 classif.rpart                    rpart
22 classif.xgboost                 xgboost
  And, there are many more. Let’s start working now!
1. Getting Data
  For this tutorial, I’ve taken up one of the popular ML problems from DataHack: Download Data.
  After you’ve downloaded the data, let’s quickly get done with the initial commands, such as setting the working directory and loading the data.
  > path <- "~/Data/Playground/MLR_Package"
> setwd(path)
  #load libraries and data
> install.packages("mlr")
> library(mlr)
> train <- read.csv("train_loan.csv", na.strings = c(""," ",NA))
> test <- read.csv("test_Y3wMUE5.csv", na.strings = c(""," ",NA))
2. Exploring Data
  Once the data is loaded, you can access it using:
  > summarizeColumns(train)
  name              type    na    mean           disp      median     mad    min   max   nlevs
LoanAmount       integer  22   146.4121622  85.5873252   128.0   47.4432    9    700    0
Loan_Amount_Term integer  14   342.0000000  65.1204099   360.0    0.0000   12    480    0
Credit_History   integer  50   0.8421986    0.3648783     1.0     0.0000   0      1     0
Property_Area    factor   0      NA         0.6205212     NA        NA    179    233    3
Loan_Status      factor   0      NA         0.3127036     NA        NA    192    422    2
  This function gives a much more comprehensive view of the data set than the base str() function. Shown above are the last 5 rows of the result. Similarly, you can do the same for the test data:
  > summarizeColumns(test)
  From these outputs, we can make the following inferences:

  • In the data, we have 12 variables, out of which Loan_Status is the dependent variable and the rest are independent variables.
  • Train data has 614 observations. Test data has 367 observations.
  • In train and test data, 6 variables have missing values (can be seen in the na column).
  • ApplicantIncome and CoapplicantIncome are highly skewed variables. How do we know that? Look at their min, max and median values. We’ll have to normalize these variables (see the quick check after this list).
  • LoanAmount, ApplicantIncome and CoapplicantIncome have outlier values, which should be treated.
  • Credit_History is an integer type variable. But, being binary in nature, we should convert it to a factor.
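  For a quick numeric check of that skew, you can use the skewness() function from the e1071 package (an extra dependency assumed here, not used elsewhere in this tutorial); values well above 0 confirm a long right tail:
  > library(e1071)
> skewness(train$ApplicantIncome, na.rm = TRUE)
> skewness(train$CoapplicantIncome, na.rm = TRUE)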
  Also, you can check the presence of skewness in variables mentioned above using a simple histogram.
  > hist(train$ApplicantIncome, breaks = 300, main = "Applicant Income Chart",xlab = "ApplicantIncome")
  > hist(train$CoapplicantIncome, breaks = 100,main = "Coapplicant Income Chart",xlab = "CoapplicantIncome")
  As you can see in the charts above, skewness is nothing but the concentration of the majority of the data on one side of the chart. What we see is a right-skewed graph. To visualize outliers, we can use a boxplot:
  > boxplot(train$ApplicantIncome)
  Similarly, you can create a boxplot for CoapplicantIncome and LoanAmount as well.
  Let’s change the class of Credit_History to factor. Remember, the factor class is always used for categorical variables.
  > train$Credit_History <- as.factor(train$Credit_History)
> test$Credit_History <- as.factor(test$Credit_History)
  To check the changes, you can do:
  > class(train$Credit_History)
[1] "factor"
  You can further scrutinize the data using:
  > summary(train)
> summary(test)
  We find that the variable Dependents has a level 3+, which should be treated as well. It’s quite simple to rename the levels of a factor variable. It can be done as:
  #rename level of Dependents
> levels(train$Dependents)[4] <- "3"
> levels(test$Dependents)[4] <- "3"
3. Missing Value Imputation
  Not just beginners, even experienced R analysts struggle with missing value imputation. The MLR package offers a nice and convenient way to impute missing values using multiple methods. Now that we are done with the much-needed modifications in the data, let’s impute the missing values.
  In our case, we’ll use basic mean and mode imputation. You can also use any ML algorithm to impute these values, but that comes at the cost of computation.
  #impute missing values by mean and mode
> imp <- impute(train, classes = list(factor = imputeMode(), integer = imputeMean()), dummy.classes = c("integer","factor"), dummy.type = "numeric")
> imp1 <- impute(test, classes = list(factor = imputeMode(), integer = imputeMean()), dummy.classes = c("integer","factor"), dummy.type = "numeric")
  This function is convenient because you don’t have to specify each variable name to impute; it selects variables on the basis of their classes. It also creates new dummy variables for missing values. Sometimes, these (dummy) features contain a trend which can be captured using this function. dummy.classes says for which classes a dummy variable should be created; dummy.type says what the class of the new dummy variables should be.
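  To see exactly which dummy columns were created (mlr names them with a .dummy suffix), a quick base-R check:
  > grep("dummy", names(imp$data), value = TRUE)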
  The $data attribute of the imp object contains the imputed data.
  > imp_train <- imp$data
> imp_test <- imp1$data
  Now, we have the complete data. You can check the new variables using:
  > summarizeColumns(imp_train)
> summarizeColumns(imp_test)
  Did you notice a disparity between both data sets? No? See again. The answer is that the Married.dummy variable exists only in imp_train and not in imp_test. Therefore, we’ll have to remove it before the modeling stage.
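  A quick way to spot such disparities (plain base R, not mlr-specific):
  > setdiff(names(imp_train), names(imp_test))
[1] "Married.dummy"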
  Optional: You might be excited or curious to try imputing missing values using an ML algorithm. In fact, there are some algorithms which don’t require you to impute missing values: you can simply supply them missing data, and they take care of missing values on their own. Let’s see which algorithms these are:
  > listLearners("classif", check.packages = TRUE, properties = "missings")[c("class","package")]
class                           package
1 classif.bartMachine          bartMachine
2 classif.boosting            adabag,rpart
3 classif.cforest                party
4 classif.ctree                  party
5 classif.gbm                     gbm
6 classif.naiveBayes             e1071
7 classif.randomForestSRC   randomForestSRC
8 classif.rpart                  rpart
  However, it is always advisable to treat missing values separately. Let’s see how you can treat missing values using rpart:
  > rpart_imp <- impute(train, target = "Loan_Status",
      classes = list(numeric = imputeLearner(makeLearner("regr.rpart")),
                     factor = imputeLearner(makeLearner("classif.rpart"))),
      dummy.classes = c("numeric","factor"),
      dummy.type = "numeric")
4. Feature Engineering
  Feature engineering is the most interesting part of predictive modeling. It has two aspects: feature transformation and feature creation. We’ll try to work on both aspects here.
  At first, let’s remove outliers from variables like ApplicantIncome, CoapplicantIncome and LoanAmount. There are many techniques to remove outliers. Here, we’ll cap all the large values in these variables at a threshold value, as shown below:
  #for train data set
> cd <- capLargeValues(imp_train, target = "Loan_Status",cols = c("ApplicantIncome"),threshold = 40000)
> cd <- capLargeValues(cd, target = "Loan_Status",cols = c("CoapplicantIncome"),threshold = 21000)
> cd <- capLargeValues(cd, target = "Loan_Status",cols = c("LoanAmount"),threshold = 520)
  #rename the train data as cd_train
> cd_train <- cd
  #add a dummy Loan_Status column in test data
> imp_test$Loan_Status <- sample(0:1,size = 367,replace = T)
  > cde <- capLargeValues(imp_test, target = "Loan_Status",cols = c("ApplicantIncome"),threshold = 33000)
> cde <- capLargeValues(cde, target = "Loan_Status",cols = c("CoapplicantIncome"),threshold = 16000)
> cde <- capLargeValues(cde, target = "Loan_Status",cols = c("LoanAmount"),threshold = 470)
  #renaming test data
> cd_test <- cde
  I’ve chosen the threshold values at my discretion, after analyzing the variable distributions. To check the effects, you can do summary(cd_train$ApplicantIncome) and see that the maximum value is capped at 40000.
  In both data sets, we see that all dummy variables are numeric in nature. Being binary in form, they should be categorical. Let’s convert their classes to factor. This time, we’ll use simple for and if loops.
  #convert numeric to factor - train
> for (f in names(cd_train[, c(14:20)])) {
    if (class(cd_train[, c(14:20)][[f]]) == "numeric") {
      levels <- unique(cd_train[, c(14:20)][[f]])
      cd_train[, c(14:20)][[f]] <- factor(cd_train[, c(14:20)][[f]], levels = levels)
    }
  }
  #convert numeric to factor - test
> for (f in names(cd_test[, c(13:18)])) {
    if (class(cd_test[, c(13:18)][[f]]) == "numeric") {
      levels <- unique(cd_test[, c(13:18)][[f]])
      cd_test[, c(13:18)][[f]] <- factor(cd_test[, c(13:18)][[f]], levels = levels)
    }
  }
  These loops say: ‘for every column name which falls in column numbers 14 to 20 of the cd_train / cd_test data frame, if the class of that variable is numeric, take the unique values from that column as levels and convert it into a factor (categorical) variable.’
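  For reference, the same conversion can be written more compactly with lapply (a sketch over the same column ranges as above):
  > cd_train[, 14:20] <- lapply(cd_train[, 14:20], function(x) if (is.numeric(x)) as.factor(x) else x)
> cd_test[, 13:18] <- lapply(cd_test[, 13:18], function(x) if (is.numeric(x)) as.factor(x) else x)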
  Let’s create some new features now.
  #Total_Income
> cd_train$Total_Income <- cd_train$ApplicantIncome + cd_train$CoapplicantIncome
> cd_test$Total_Income <- cd_test$ApplicantIncome + cd_test$CoapplicantIncome
  #Income by loan
> cd_train$Income_by_loan <- cd_train$Total_Income/cd_train$LoanAmount
> cd_test$Income_by_loan <- cd_test$Total_Income/cd_test$LoanAmount
  #change variable class
> cd_train$Loan_Amount_Term <- as.numeric(cd_train$Loan_Amount_Term)
> cd_test$Loan_Amount_Term <- as.numeric(cd_test$Loan_Amount_Term)
  #Loan amount by term
> cd_train$Loan_amount_by_term <- cd_train$LoanAmount/cd_train$Loan_Amount_Term
> cd_test$Loan_amount_by_term <- cd_test$LoanAmount/cd_test$Loan_Amount_Term
  While creating new features (if they are numeric), we must check their correlation with existing variables, as there is often a high chance of correlation. Let’s see if our new variables happen to be correlated:
  #splitting the data based on class
> az <- split(names(cd_train), sapply(cd_train, function(x){ class(x)}))
  #creating a data frame of numeric variables
> xs <- cd_train[az$numeric]
  #check correlation
> cor(xs)
  As we see, there exists a very high correlation between Total_Income and ApplicantIncome. It means that the new variable isn’t providing any new information. Thus, this variable is not helpful for modeling.
  Now we can remove the variable.
  > cd_train$Total_Income <- NULL
> cd_test$Total_Income <- NULL
  There is still enough potential left to create new variables. Before proceeding, I want you to think deeper about this problem and try creating newer variables. After making so many modifications to the data, let’s check it again:
  > summarizeColumns(cd_train)
> summarizeColumns(cd_test)
5. Machine Learning
  Up to here, we’ve performed all the important transformation steps except normalizing the skewed variables. That will be done after we create the task.
  As explained in the beginning, for mlr a task is nothing but the data set on which a learner learns. Since it’s a classification problem, we’ll create a classification task. The task type solely depends on the type of problem at hand.
  #create a task
> trainTask <- makeClassifTask(data = cd_train,target = "Loan_Status")
> testTask <- makeClassifTask(data = cd_test, target = "Loan_Status")
  Let’s check trainTask:
  > trainTask
Supervised task: cd_train
Type: classif
Target: Loan_Status
Observations: 614
Features:
numerics factors ordered
13         8       0
Missings: FALSE
Has weights: FALSE
Has blocking: FALSE
Classes: 2
N   Y
192 422
Positive class: N
  As you can see, it provides a description of the cd_train data. However, an evident problem is that it is considering the positive class to be N, whereas it should be Y. Let’s modify it:
  > trainTask <- makeClassifTask(data = cd_train,target = "Loan_Status", positive = "Y")
  For a deeper view, you can check your task data using str(getTaskData(trainTask)).
  Now, we will normalize the data. For this step, we’ll use the normalizeFeatures function from the mlr package. By default, this function normalizes all the numeric features in the data. Thankfully, the only 3 variables which we have to normalize are numeric; the rest of the variables have classes other than numeric.
  #normalize the variables
> trainTask <- normalizeFeatures(trainTask,method = "standardize")
> testTask <- normalizeFeatures(testTask,method = "standardize")
  Before we start applying algorithms, we should remove the variables which are not required.
  > trainTask <- dropFeatures(task = trainTask,features = c("Loan_ID","Married.dummy"))
  The MLR package has an in-built function which returns the important variables in the data. Let’s see which variables are important. Later, we can use this knowledge to subset our input predictors for model improvement. While running this code, R might prompt you to install the FSelector package, which you should do.
  #Feature importance
> im_feat <- generateFilterValuesData(trainTask, method = c("information.gain","chi.squared"))
> plotFilterValues(im_feat,n.show = 20)

  #to launch its shiny application
> plotFilterValuesGGVIS(im_feat)
  If you are still wondering about information.gain, let me provide a simple explanation. Information gain is generally used in the context of decision trees: every node split in a decision tree is based on information gain. In general, it tries to find the variables which carry the maximum information, using which the target class is easier to predict.
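  To demystify the measure, here is a hand-rolled sketch of information gain for a single factor feature. entropy() and info_gain() are hypothetical helpers written for illustration; mlr’s "information.gain" filter (via FSelector) computes the same idea for every feature at once:
  > entropy <- function(y) {
    p <- prop.table(table(y))
    p <- p[p > 0] #drop empty levels so log2(0) never appears
    -sum(p * log2(p))
  }
> info_gain <- function(x, y) {
    w <- prop.table(table(x)) #weight of each feature level
    cond <- sapply(split(y, x), entropy) #H(target | feature = level)
    entropy(y) - sum(w * cond) #gain = H(target) - H(target | feature)
  }
> info_gain(cd_train$Credit_History, cd_train$Loan_Status)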
  Let’s start modeling now. I won’t explain these algorithms in detail, but I’ve provided links to helpful resources. We’ll take up the simpler algorithms first and end this tutorial with the more complex ones.
  With MLR, we can choose and set algorithms using makeLearner. The learner will train on trainTask and make predictions on testTask.
  1. Quadratic Discriminant Analysis (QDA).
  In general, qda is a parametric algorithm. Parametric means that it makes certain assumptions about the data. If the data actually follows those assumptions, such algorithms sometimes outperform several non-parametric algorithms. Read More.
  #load qda
> qda.learner <- makeLearner("classif.qda", predict.type = "response")
  #train model
> qmodel <- train(qda.learner, trainTask)
  #predict on test data
> qpredict <- predict(qmodel, testTask)
  #create submission file
> submit <- data.frame(Loan_ID = test$Loan_ID, Loan_Status = qpredict$data$response)
> write.csv(submit, "submit1.csv",row.names = F)
  Upload this submission file and check your leaderboard rank (it wouldn’t be good). Our accuracy is ~71.5%. I understand this submission might not put you at the top of the leaderboard, but there’s a long way to go. So, let’s proceed.
  2. Logistic Regression
  This time, let’s also check the cross-validation accuracy. A higher CV accuracy indicates that our model does not suffer from high variance and generalizes well on unseen data.
  #logistic regression
> logistic.learner <- makeLearner("classif.logreg",predict.type = "response")
  #cross validation (cv) accuracy
> cv.logistic <- crossval(learner = logistic.learner,task = trainTask,iters = 3,stratify = TRUE,measures = acc,show.info = F)
  Similarly, you can perform CV for any learner. Isn’t it incredibly easy? Here, I’ve used stratified sampling with 3-fold CV. I’d always recommend stratified sampling in classification problems, since it maintains the proportion of the target classes across the folds. We can check the CV accuracy by:
  #cross validation accuracy
> cv.logistic$aggr
acc.test.mean
0.7947553

  This is the average accuracy calculated over the 3 folds. To see the accuracy of each respective fold, we can do this:
  > cv.logistic$measures.test
  iter    acc
1  1    0.8439024
2  2    0.7707317
3  3    0.7598039
  Now, we’ll train the model and check the prediction accuracy on test data.
  #train model
> fmodel <- train(logistic.learner,trainTask)
> getLearnerModel(fmodel)
  #predict on test data
> fpmodel <- predict(fmodel, testTask)
  #create submission file
> submit <- data.frame(Loan_ID = test$Loan_ID, Loan_Status = fpmodel$data$response)
> write.csv(submit, "submit2.csv",row.names = F)
  Woah! This algorithm gave us a significant boost in accuracy. Moreover, this is a stable model, since our CV score and leaderboard score match closely. This submission returns an accuracy of 79.16%. Good, we are improving now. Let’s move on to the next algorithm.
3. Decision Tree
  A decision tree is said to capture non-linear relations better than a logistic regression model. Let’s see if we can improve our model further. This time we’ll hypertune the tree parameters to achieve optimal results. To get the list of parameters for any algorithm, simply write (in this case, rpart):
  > getParamSet("classif.rpart")
  This will return a long list of tunable and non-tunable parameters. Let’s build a decision tree now. Make sure you have installed the rpart package before creating the tree learner:
  #make tree learner
> makeatree <- makeLearner("classif.rpart", predict.type = "response")
  #set 3 fold cross validation
> set_cv <- makeResampleDesc("CV",iters = 3L)
  I’m doing a 3-fold CV because we have little data. Now, let’s set the tunable parameters:
  #Search for hyperparameters
> gs <- makeParamSet(
    makeIntegerParam("minsplit", lower = 10, upper = 50),
    makeIntegerParam("minbucket", lower = 5, upper = 50),
    makeNumericParam("cp", lower = 0.001, upper = 0.2)
  )
  As you can see, I’ve set 3 parameters. minsplit represents the minimum number of observations in a node for a split to take place. minbucket is the minimum number of observations to keep in terminal nodes. cp is the complexity parameter: the smaller it is, the more specific the relations the tree will learn from the data, which might result in overfitting.
  #do a grid search
> gscontrol <- makeTuneControlGrid()
  #hypertune the parameters
> stune <- tuneParams(learner = makeatree, resampling = set_cv, task = trainTask, par.set = gs, control = gscontrol, measures = acc)
  You may go and take a walk until the parameter tuning completes. Maybe go catch some Pokémon! It took 15 minutes to run on my machine (Windows, Intel i5, 8 GB RAM).
  #check best parameter
> stune$x
# $minsplit
# [1] 37
#
# $minbucket
# [1] 15
#
# $cp
# [1] 0.001
  It returns a list of best parameters. You can check the CV accuracy with:
  #cross validation result
> stune$y
0.8127132
  Using the setHyperPars function, we can directly set the best parameters as the modeling parameters of the algorithm.
  #using hyperparameters for modeling
> t.tree <- setHyperPars(makeatree, par.vals = stune$x)
  #train the model
> t.rpart <- train(t.tree, trainTask)
> getLearnerModel(t.rpart)
  #make predictions
> tpmodel <- predict(t.rpart, testTask)
  #create a submission file
> submit <- data.frame(Loan_ID = test$Loan_ID, Loan_Status = tpmodel$data$response)
> write.csv(submit, "submit3.csv",row.names = F)
  The decision tree is doing no better than logistic regression. This algorithm returned an accuracy of 79.14%, practically the same as logistic regression. So, one tree isn’t enough. Let’s build a forest now.
4. Random Forest
  Random Forest is a powerful algorithm known to produce astonishing results. Its predictions derive from an ensemble of trees: it averages the predictions given by each tree and produces a generalized result. From here on, most of the steps are similar to those followed above, but this time I’ve done random search instead of grid search for parameter tuning, because it’s faster.
  > getParamSet("classif.randomForest")
  #create a learner
  > rf <- makeLearner("classif.randomForest", predict.type = "response", par.vals = list(ntree = 200, mtry = 3))
> rf$par.vals <- c(rf$par.vals, list(importance = TRUE)) #append, so the ntree and mtry set above are kept
  #set tunable parameters
#define the hyperparameter search space
> rf_param <- makeParamSet(
    makeIntegerParam("ntree", lower = 50, upper = 500),
    makeIntegerParam("mtry", lower = 3, upper = 10),
    makeIntegerParam("nodesize", lower = 10, upper = 50)
  )
  #let's do random search for 50 iterations
> rancontrol <- makeTuneControlRandom(maxit = 50L)
  Though random search is faster than grid search, it can sometimes turn out to be less efficient. In grid search, the algorithm tunes over every possible combination of the parameters provided. In random search, we specify the number of iterations and it randomly passes over parameter combinations. In this process, it might miss some important combination which could have returned maximum accuracy, who knows. The quick arithmetic below gives a sense of the trade-off.
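  As a rough sense of scale (my arithmetic, not mlr output; mlr’s grid control actually discretizes each range with a resolution setting, so a full integer sweep is an upper bound):
  > length(50:500) * length(3:10) * length(10:50) #every integer combination in rf_param
[1] 147928
  That is roughly 148,000 candidate models (each resampled 3 times under CV) for an exhaustive grid, versus just 50 for our random search.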
  #set 3 fold cross validation
> set_cv <- makeResampleDesc("CV",iters = 3L)
  #hypertuning
> rf_tune <- tuneParams(learner = rf, resampling = set_cv, task = trainTask, par.set = rf_param, control = rancontrol, measures = acc)
  Now, we have the final parameters. Let’s check the list of parameters and CV accuracy.
  #cv accuracy
> rf_tune$y
acc.test.mean
0.8192571
  #best parameters
> rf_tune$x
$ntree
[1] 168
  $mtry
[1] 6
  $nodesize
[1] 29
  Let’s build the random forest model now and check its accuracy.
  #using hyperparameters for modeling
> rf.tree <- setHyperPars(rf, par.vals = rf_tune$x)
  #train a model
> rforest <- train(rf.tree, trainTask)
> getLearnerModel(rforest)
  #make predictions
> rfmodel <- predict(rforest, testTask)
  #submission file
> submit <- data.frame(Loan_ID = test$Loan_ID, Loan_Status = rfmodel$data$response)
> write.csv(submit, "submit4.csv",row.names = F)
  No new story to cheer about. This model too returned an accuracy of 79.14%. So, try using grid search instead of random search, and tell me in the comments if your model improved.
5. SVM
  Support Vector Machine (SVM) is also a supervised learning algorithm, used for both regression and classification problems. In general, it creates a hyperplane in n-dimensional space to separate the data based on the target class. Let’s step away from tree algorithms for a while and see if this algorithm can bring us some improvement.
  Since most of the steps are similar to those performed above, understanding this code shouldn’t be a challenge for you anymore.
  #load svm
> getParamSet("classif.ksvm") #do install kernlab package
> ksvm <- makeLearner("classif.ksvm", predict.type = "response")
  #Set parameters
> pssvm <- makeParamSet(
    makeDiscreteParam("C", values = 2^c(-8,-4,-2,0)), #cost parameter
    makeDiscreteParam("sigma", values = 2^c(-8,-4,0,4)) #RBF kernel parameter
  )
  #specify search function
> ctrl <- makeTuneControlGrid()
  #tune model
> res <- tuneParams(ksvm, task = trainTask, resampling = set_cv, par.set = pssvm, control = ctrl,measures = acc)
  #CV accuracy
> res$y
acc.test.mean
0.8062092
  #set the model with best params
> t.svm <- setHyperPars(ksvm, par.vals = res$x)
  #train
> par.svm <- train(t.svm, trainTask)
  #test
> predict.svm <- predict(par.svm, testTask)
  #submission file
> submit <- data.frame(Loan_ID = test$Loan_ID, Loan_Status = predict.svm$data$response)
> write.csv(submit, "submit5.csv",row.names = F)
  This model returns an accuracy of 77.08%. Not bad, but less than our highest score. Don’t feel hopeless here. This is core machine learning: ML doesn’t work unless it gets some good variables. Maybe you should think longer about the feature engineering aspect and create more useful variables. Let’s do boosting now.
  6. GBM
  Now you are entering the territory of boosting algorithms. GBM performs sequential modeling, i.e., after one round of predictions it checks for incorrect predictions, assigns them relatively more weight, and predicts again, until they are predicted correctly.
  #load GBM
> getParamSet("classif.gbm")
> g.gbm <- makeLearner("classif.gbm", predict.type = "response")
  #specify tuning method
> rancontrol <- makeTuneControlRandom(maxit = 50L)
  #3 fold cross validation
> set_cv <- makeResampleDesc("CV",iters = 3L)
  #parameters
> gbm_par <- makeParamSet(
    makeDiscreteParam("distribution", values = "bernoulli"),
    makeIntegerParam("n.trees", lower = 100, upper = 1000), #number of trees
    makeIntegerParam("interaction.depth", lower = 2, upper = 10), #depth of tree
    makeIntegerParam("n.minobsinnode", lower = 10, upper = 80),
    makeNumericParam("shrinkage", lower = 0.01, upper = 1)
  )
  n.minobsinnode refers to the minimum number of observations in a tree node. shrinkage is the regularization parameter which dictates how fast / slow the algorithm should learn.
  #tune parameters
> tune_gbm <- tuneParams(learner = g.gbm, task = trainTask,resampling = set_cv,measures = acc,par.set = gbm_par,control = rancontrol)
  #check CV accuracy
> tune_gbm$y
  #set parameters
> final_gbm <- setHyperPars(learner = g.gbm, par.vals = tune_gbm$x)
  #train
> to.gbm <- train(final_gbm, trainTask)
  #test
> pr.gbm <- predict(to.gbm, testTask)
  #submission file
> submit <- data.frame(Loan_ID = test$Loan_ID, Loan_Status = pr.gbm$data$response)
> write.csv(submit, "submit6.csv",row.names = F)
  The accuracy of this model is 78.47%. GBM performed better than SVM, but couldn’t exceed the random forest’s accuracy. Finally, let’s test XGBoost as well.
7. XGBoost
  XGBoost is considered to be better than GBM because of its in-built properties, including first and second order gradients, parallel processing and the ability to prune trees. The general implementation of xgboost requires you to convert the data into a matrix. With mlr, that is not required.
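  For contrast, here is roughly what the plain xgboost package needs without mlr, sketched with this tutorial’s objects (the one-hot encoding step and the round count are my illustrative choices, not from the original article):
  > library(xgboost)
> X <- model.matrix(Loan_Status ~ . - 1, data = cd_train[, setdiff(names(cd_train), "Loan_ID")]) #one-hot encode factors into a numeric matrix
> y <- as.numeric(cd_train$Loan_Status == "Y") #0/1 label vector
> dtrain <- xgb.DMatrix(data = X, label = y)
> bst <- xgb.train(params = list(objective = "binary:logistic"), data = dtrain, nrounds = 50)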
  As I said in the beginning, a benefit of using this (MLR) package is that you can follow the same set of commands for implementing different algorithms.
  #load xgboost
> set.seed(1001)
> getParamSet("classif.xgboost")
  #make learner with initial parameters
> xg_set <- makeLearner("classif.xgboost", predict.type = "response")
> xg_set$par.vals <- list(
    objective = "binary:logistic",
    eval_metric = "error",
    nrounds = 250
  )
  #define parameters for tuning
> xg_ps <- makeParamSet(
    makeIntegerParam("nrounds", lower = 200, upper = 600),
    makeIntegerParam("max_depth", lower = 3, upper = 20),
    makeNumericParam("lambda", lower = 0.55, upper = 0.60),
    makeNumericParam("eta", lower = 0.001, upper = 0.5),
    makeNumericParam("subsample", lower = 0.10, upper = 0.80),
    makeNumericParam("min_child_weight", lower = 1, upper = 5),
    makeNumericParam("colsample_bytree", lower = 0.2, upper = 0.8)
  )
  #define search function
> rancontrol <- makeTuneControlRandom(maxit = 100L) #do 100 iterations
  #3 fold cross validation
> set_cv <- makeResampleDesc("CV",iters = 3L)
  #tune parameters
> xg_tune <- tuneParams(learner = xg_set, task = trainTask, resampling = set_cv,measures = acc,par.set = xg_ps, control = rancontrol)
  #set parameters
> xg_new <- setHyperPars(learner = xg_set, par.vals = xg_tune$x)
  #train model
> xgmodel <- train(xg_new, trainTask)
  #test model
> predict.xg <- predict(xgmodel, testTask)
  #submission file
> submit <- data.frame(Loan_ID = test$Loan_ID, Loan_Status = predict.xg$data$response)
> write.csv(submit, "submit7.csv",row.names = F)
  Terrible, XGBoost. This model returns an accuracy of 68.5%, even lower than qda. What could have happened? Overfitting. The model returned a CV accuracy of ~80%, but the leaderboard score declined drastically because it couldn’t predict correctly on unseen data.
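  One quick way to see the gap for yourself is to compare the tuned model’s cross-validated accuracy with its accuracy on the training task; a large spread between the two is the classic overfitting symptom (a sketch using mlr’s performance function):
  > xg_tune$y #cross-validated accuracy (~0.80)
> performance(predict(xgmodel, trainTask), measures = acc) #accuracy on the training data itself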
What can you do next? Feature Selection?
  For improvement, let’s do this. Until here, we’ve used trainTask for model building. Now let’s use our knowledge of the important variables: take the top 6 important variables and train the models on them. You can expect some improvement. To create a task selecting the important variables, do this:
  #selecting top 6 important features
> top_task <- filterFeatures(trainTask, method = "rf.importance", abs = 6)
  So, I’ve asked this function to get me the top 6 important features using random forest importance. Now, replace trainTask with top_task in the models above, and tell me in the comments if you got any improvement.
  Also, try to create more features. The current leaderboard winner is at ~81% accuracy. If you have followed me till here, don’t give up now.
End Notes
  The motive of this article was to get you started with machine learning techniques. These techniques are commonly used in industry today, so make sure you understand them well and don’t use them as black-box approaches. I’ve provided links to resources.
  What happened above happens a lot in real life: you try many algorithms but don’t see an improvement in accuracy. But you shouldn’t give up. As a beginner, you should try exploring other ways to achieve accuracy. Remember, no matter how many wrong attempts you make, you just have to be right once.
  You might have to install packages while loading these models, but that’s one time only. If you followed this article completely, you are ready to build models. All you have to do is learn the theory behind them.
  Did you find this article helpful? Did you try the improvement methods listed above? Which algorithm gave you the maximum accuracy? Share your observations / experience in the comments below.
  Source: analyticsvidhya
