Mahitha Kumar

How to pick the best ML model?

Machine learning models are built on training data and then used to make predictions that address business problems. There are many models (such as SVM, decision tree, random forest, logistic regression, and naive Bayes) that can be built in machine learning.


Choosing the best model is sometimes challenging because we need to find both the right model and its optimal parameters. Consider applying an SVM model to a particular data set: to optimize this model we have to decide on parameters such as the kernel, C, and gamma. Similarly, every other model has its own parameters to optimize, and the optimized models give the best predictions. This process of optimizing the hyperparameters is called hyperparameter tuning.
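
To see why tuning matters, here is a minimal sketch (assuming the scikit-learn digits data set, which we will use throughout this post) that fits the same SVM with two different hyperparameter choices and compares their held-out accuracy:

from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42)

# The same model type with two different hyperparameter settings
# can score quite differently, which is why tuning matters.
for params in [{'C': 1, 'kernel': 'rbf', 'gamma': 'auto'},
               {'C': 10, 'kernel': 'linear', 'gamma': 'auto'}]:
    model = svm.SVC(**params).fit(X_train, y_train)
    print(params, model.score(X_test, y_test))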




Setting up these parameters and building optimized models is one step; selecting the best model from among them is another. Let us try to learn a way to select the best model through the following example.


In this example we shall use the digits data set from scikit-learn and build Logistic Regression, SVM, Decision Tree Classifier and Random Forest Classifier models. Let us tune a few hyperparameters to get the optimized models and find the best of those.


Let us get the data with the following code:


from sklearn import datasets
digits = datasets.load_digits()
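
As a quick orientation (a small check, not part of the tuning itself), the loaded object holds 1,797 images of handwritten digits, each flattened into 64 pixel features:

print(digits.data.shape)    # (1797, 64): 1,797 samples with 64 features each
print(digits.target.shape)  # (1797,): the digit labels 0-9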

Now let us import the libraries we need to build the above models.



from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

Now let us create the configuration dictionary required for the different model setups, where we specify the model types and a few parameters of those models for the digits data set.

model_parameters = {
    'logistic_regression': {
        'model': LogisticRegression(solver='liblinear', multi_class='auto'),
        'params': {
            'C': [1, 5, 10]
        }
    },
    'svm': {
        'model': svm.SVC(gamma='auto'),
        'params': {
            'C': [1, 10, 20],
            'kernel': ['rbf', 'linear', 'poly']
        }
    },
    'decision_tree': {
        'model': DecisionTreeClassifier(),
        'params': {
            'criterion': ['gini', 'entropy']
        }
    },
    'random_forest': {
        'model': RandomForestClassifier(),
        'params': {
            'n_estimators': [1, 10, 20],
            'criterion': ['gini', 'entropy'],
            # booleans, not the strings 'True'/'False',
            # so both branches are actually tried
            'bootstrap': [True, False]
        }
    },
}
  

In the above dictionary we are specifying a few parameters for each model. Let us see a few details of those.

Logistic regression: Logistic regression predicts a categorical dependent variable (binary by default, extended here to the ten digit classes) using the parameters below.

  • Solver chooses the algorithm used to minimize the cost function. Here we use 'liblinear' as the solver since it is a library for large linear classification and performs well with high dimensionality.

  • Multi_class is set to 'auto', which lets scikit-learn pick between one-vs-rest and multinomial schemes based on the data and the solver.

  • C is the inverse of the regularization strength: it decides how much you want to penalize misclassified points, and a lower value means stronger regularization. By default the value is 1. We can optimize this by trying different values; I have used 1, 5 and 10.

Keep in mind that there are many other parameters for logistic regression that can be optimized, but I have used only a few to simplify the explanation.
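
To make this concrete, here is a small sketch (an illustration, not part of the final pipeline) that tries each candidate C by hand; GridSearchCV will automate exactly this kind of loop later:

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

digits = datasets.load_digits()

# Score each candidate C with 5-fold cross-validation.
for C in [1, 5, 10]:
    lr = LogisticRegression(solver='liblinear', multi_class='auto', C=C)
    scores = cross_val_score(lr, digits.data, digits.target, cv=5)
    print(f"C={C}: mean accuracy {scores.mean():.3f}")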


SVM (Support Vector Machine): SVM separates the data with as clear a margin as possible. There are many hyperparameters for SVM.


  • Kernel specifies how the data needs to be transformed. The options include rbf (radial basis function), linear, and polynomial.

  • C decides how much you want to penalize misclassified points. By default the value is 1. We can try 1, 10 and 20 and see which one gives a better score.

  • Gamma defines how far the influence of a single training example reaches: a low value means the influence reaches far, while a high value means it stays close to the example, as the sketch below illustrates.
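
Here is a minimal sketch of that behaviour on the digits data: with a large gamma the model can memorize the training set yet generalize poorly.

from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=42)

# A large gamma keeps each example's influence very local: the training
# score can be near perfect while the test score drops.
for gamma in [0.001, 1.0]:
    model = svm.SVC(kernel='rbf', gamma=gamma).fit(X_train, y_train)
    print(f"gamma={gamma}: train {model.score(X_train, y_train):.3f}, "
          f"test {model.score(X_test, y_test):.3f}")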

Decision Tree: For a decision tree there are many parameters by which we can specify the split criterion (gini or entropy), how to split, how far to expand (max_depth), randomness, etc. If we do not specify the values, the decision tree classifier will pick the defaults. Here I have used both criteria, gini and entropy.
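
As a small aside (an illustration only; max_depth=5 is an assumed example value, not part of our search grid), "how far to expand" can be seen directly: an unconstrained tree grows until its leaves are pure, while max_depth caps it.

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

digits = datasets.load_digits()

# Without constraints the tree keeps splitting until the leaves are pure;
# max_depth stops it early.
full = DecisionTreeClassifier(random_state=0).fit(digits.data, digits.target)
capped = DecisionTreeClassifier(max_depth=5, random_state=0).fit(digits.data, digits.target)
print(full.get_depth(), capped.get_depth())  # the unconstrained tree is much deeper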


Random Forest: This is an estimator which fits a number of decision trees on various subsets of the data and averages their predictions, which improves accuracy. This classifier has parameters similar to a decision tree's, plus a few more. We can specify the number of trees (n_estimators), which criterion to use (gini or entropy), how far to go (max_depth), and bootstrap (whether to use bootstrap samples to build the trees), which can be True or False. Here I have used 1, 10 and 20 for n_estimators, gini and entropy for the criterion, and True and False for bootstrap.
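
For intuition (again just an illustrative loop over the same candidate values), the number of trees usually matters most at the low end:

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

digits = datasets.load_digits()

# A forest of one tree is just a (randomized) decision tree;
# adding trees typically improves the averaged prediction.
for n in [1, 10, 20]:
    rf = RandomForestClassifier(n_estimators=n, random_state=0)
    scores = cross_val_score(rf, digits.data, digits.target, cv=5)
    print(f"n_estimators={n}: mean accuracy {scores.mean():.3f}")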


After specifying this dictionary we can iterate over each model and get the accuracy scores for all the above models. We need GridSearchCV for this, and it can be imported from sklearn.model_selection (code snippet below).



from sklearn.model_selection import GridSearchCV
import pandas as pd

accuracy = []

for model_name, mp in model_parameters.items():
    # Exhaustively try every parameter combination with 5-fold cross-validation.
    clf = GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)
    clf.fit(digits.data, digits.target)
    accuracy.append({
        'model': model_name,
        'best_accuracy': clf.best_score_,
        'best_params': clf.best_params_
    })

df = pd.DataFrame(accuracy, columns=['model', 'best_accuracy', 'best_params'])
df.sort_values(by=['best_accuracy'], ascending=False)

In the above snippet I have used cv=5, i.e. 5-fold cross-validation: the data is split into five folds, each fold serves once as the validation set while the model is trained on the other four, and the five scores are averaged. After GridSearchCV runs we collect the model name, best score and best params.
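
If the cv=5 mechanics feel abstract, this small sketch shows the same five scores that GridSearchCV averages internally for a single parameter setting:

from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

digits = datasets.load_digits()

# Five train/validation splits -> five scores; their mean is what
# GridSearchCV reports as best_score_ for the winning setting.
scores = cross_val_score(svm.SVC(gamma='auto'), digits.data, digits.target, cv=5)
print(scores)
print(scores.mean())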


If we look at the results, the best of the above models for the digits data set is SVM, which gives about 96% accuracy with best params C=1 and kernel='poly'.
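
To actually use the winner, one option (a sketch; the exact winning parameters can vary slightly between scikit-learn versions) is to refit it on the full data set and predict:

from sklearn import datasets, svm

digits = datasets.load_digits()

# Refit the winning configuration on all the data, then predict.
best_model = svm.SVC(C=1, kernel='poly', gamma='auto')
best_model.fit(digits.data, digits.target)
print(best_model.predict(digits.data[:5]))  # predictions for the first five images

Alternatively, each fitted GridSearchCV object exposes a best_estimator_ (it is refit on the full data by default), so you could keep those objects around and use the winner directly.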


Conclusion:

This way of calculating accuracy scores while optimizing the models gives you a data-based decision on which model to apply to your business problem. If multiple models have the same accuracy, you can pick the model that requires less computational resource and time.
