Is your machine learning model taking too long to train, and do you wonder whether its accuracy could be better? XGBoost may be the solution for you. Let us look at how it can help.
XGBoost stands for eXtreme Gradient Boosting. It is a powerful supervised learning algorithm that performs parallel tree boosting, predicting the target by combining the results of multiple weak models. It offers great speed and accuracy.
The XGBoost library implements the gradient boosting decision tree algorithm. It is a software library that you can download and install on your machine, then access from a variety of interfaces. Specifically, XGBoost supports the following main interfaces:
Command Line Interface (CLI).
C++ (the language in which the library is written).
Python interface as well as a model in scikit-learn.
R interface as well as a model in the caret package.
Julia.
Java and JVM languages like Scala, and platforms like Hadoop.
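The snippet below is a minimal sketch of the Python interface using the scikit-learn wrapper; the dataset (scikit-learn's breast cancer data) and the default settings are purely illustrative.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Load an example dataset and split it into train and test sets
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a gradient boosted tree ensemble with default hyperparameters
model = XGBClassifier()
model.fit(X_train, y_train)

# Evaluate on the held-out test set
preds = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, preds))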
To get the most out of XGBoost, we can specify certain parameters called hyperparameters. Let us look at these hyperparameters in detail.
Hyperparameters
Hyperparameters are values set before training that determine the learning process of an algorithm. Because XGBoost is such a flexible and powerful algorithm, it involves many design decisions and therefore exposes a large range of hyperparameters. In tree-based models, hyperparameters include things like the maximum depth of a tree, the number of trees to grow, the number of variables to consider when building each tree, the minimum number of samples on a leaf and the fraction of observations used to build each tree. The following are the types of hyperparameters we usually tune to enhance the XGBoost algorithm.
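In the native Python API, hyperparameters are passed to training as a dictionary of parameter names and values. The sketch below reuses the train/test split from the earlier snippet; the specific values are placeholders for illustration, not recommendations.

import xgboost as xgb

# Wrap the training data in XGBoost's internal DMatrix format
dtrain = xgb.DMatrix(X_train, label=y_train)

# Hyperparameters are supplied as a plain dictionary
params = {
    "max_depth": 4,                  # maximum depth of each tree
    "eta": 0.1,                      # learning rate (shrinkage)
    "subsample": 0.8,                # fraction of rows sampled per tree
    "objective": "binary:logistic",  # loss function to minimize
}

# num_boost_round controls how many boosting rounds (trees) are grown
booster = xgb.train(params, dtrain, num_boost_round=100)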
1. General Hyperparameters
These parameters guide the overall functioning of the XGBoost model.
a. booster: It selects the type of model to run at each iteration. gbtree and dart are tree-based models, while gblinear is a linear model. gbtree is used by default.
b. verbosity: It controls the verbosity of printed messages. The default value is 1. Valid values are 0 (silent), 1 (warning), 2 (info) and 3 (debug).
c. nthread: This specifies the number of parallel threads used to run XGBoost. If it is not set, XGBoost will run on all cores automatically, i.e. by default it takes the maximum number of threads available.
There are other general parameters like
d. disable_default_eval_metric [default=0]
e. num_pbuffer [set automatically by XGBoost, no need to be set by user]
f. num_feature [set automatically by XGBoost, no need to be set by user]
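As a sketch, the general hyperparameters from this section can be added to the same parameter dictionary used above; the values shown are only examples, and if nthread is omitted all available cores are used.

general_params = {
    "booster": "gbtree",   # gbtree (default) or dart for trees, gblinear for linear models
    "verbosity": 1,        # 0 = silent, 1 = warning, 2 = info, 3 = debug
    "nthread": 4,          # number of parallel threads; defaults to all cores
}

# Merge with the earlier params and train as before
booster = xgb.train({**params, **general_params}, dtrain, num_boost_round=100)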
2. Booster Hyperparameters
There are two types of boosters: the tree booster and the linear booster.
a. eta [default=0.3, alias: learning_rate]: It is the step size shrinkage used in updates to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative. It makes the model more robust by shrinking the weights at each step. Range: [0,1]. Typical final values are 0.01-0.2.
b. gamma [default=0, alias: min_split_loss]: A node is split only when the resulting split gives a positive reduction in the loss function. gamma specifies the minimum loss reduction required to make a split, so it makes the algorithm more conservative. The optimal value can vary depending on the loss function and should be tuned. The larger gamma is, the more conservative the algorithm will be. Range: [0,∞].
c. max_depth [default=6]: The maximum depth of a tree, same as in GBM. It is used to control over-fitting, as a higher depth allows the model to learn relations that are very specific to a particular sample. Increasing this value makes the model more complex and more likely to overfit. We should be careful when setting a large value of max_depth because XGBoost aggressively consumes memory when training a deep tree. Range: [0,∞] (0 is only accepted with the lossguide growing policy when tree_method is set to hist). It should be tuned using CV (cross-validation). Typical values: 3-10.
d. min_child_weight [default=1]: It defines the minimum sum of instance weights of all observations required in a child. This is similar to min_samples_leaf in GBM, but not exactly: it refers to the minimum "sum of weights" of observations, while GBM uses the minimum "number of observations". It is used to control over-fitting: higher values prevent the model from learning relations which might be highly specific to the particular sample selected for a tree, but values that are too high can lead to under-fitting. Hence, it should be tuned using CV (see the sketch at the end of this section). The larger min_child_weight is, the more conservative the algorithm will be. Range: [0,∞].
e. max_delta_step [default=0]: The maximum delta step we allow each tree's weight estimation to be. If the value is set to 0, there is no constraint. If it is set to a positive value, it helps make the update step more conservative. This parameter is usually not needed, but it might help in logistic regression when the classes are extremely imbalanced. Setting it to a value of 1-10 might help control the update. Range: [0,∞].
f. subsample [default=1]: It denotes the fraction of observations (training instances) to be randomly sampled for each tree. Setting it to 0.5 means that XGBoost randomly samples half of the training data prior to growing trees, which helps prevent overfitting. Subsampling occurs once in every boosting iteration. Lower values make the algorithm more conservative and prevent overfitting, but values that are too small might lead to under-fitting. Typical values: 0.5-1. Range: (0,1].
g. colsample_bytree, colsample_bylevel, colsample_bynode [default=1]: This is a family of parameters for subsampling of columns.
All colsample_by parameters have a range of (0, 1], the default value of 1, and specify the fraction of columns to be subsampled.
colsample_bytree is the subsample ratio of columns when constructing each tree. Subsampling occurs once for every tree constructed.
colsample_bylevel is the subsample ratio of columns for each level. Subsampling occurs once for every new depth level reached in a tree. Columns are subsampled from the set of columns chosen for the current tree.
colsample_bynode is the subsample ratio of columns for each node (split). Subsampling occurs once every time a new split is evaluated. Columns are subsampled from the set of columns chosen for the current level.
colsample_by* parameters work cumulatively. For instance, the combination {'colsample_bytree':0.5, 'colsample_bylevel':0.5, 'colsample_bynode':0.5} with 64 features will leave 8 features to choose from at each split.
h. lambda [default=1, alias: reg_lambda]: L2 regularization term on weights (analogous to Ridge regression). It handles the regularization part of XGBoost. Increasing this value makes the model more conservative.
i. alpha [default=0, alias: reg_alpha]: L1 regularization term on weights (analogous to Lasso regression). It can be useful in cases of very high dimensionality, making the algorithm run faster. Increasing this value makes the model more conservative.
j. tree_method [default=auto]: The tree construction algorithm used in XGBoost. XGBoost supports approx, hist and gpu_hist for distributed training, and experimental support for external memory is available for approx and gpu_hist. Choices: auto, exact, approx, hist, gpu_hist.
auto: Use a heuristic to choose the fastest method. For small to medium datasets, the exact greedy algorithm (exact) is used. For very large datasets, the approximate algorithm (approx) is chosen. Because the old behavior was to always use exact greedy on a single machine, the user gets a notification message when the approximate algorithm is chosen.
exact: Exact greedy algorithm.
approx: Approximate greedy algorithm using quantile sketch and gradient histogram.
hist: Fast histogram-optimized approximate greedy algorithm. It uses some performance improvements such as bin caching.
gpu_hist: GPU implementation of hist algorithm.
k. scale_pos_weight [default=1]: It controls the balance of positive and negative weights, which is useful for imbalanced classes. In cases of high class imbalance, setting this parameter helps the model converge faster. A typical value to consider: sum(negative instances) / sum(positive instances).
l. max_leaves [default=0]: The maximum number of nodes to be added. It is only relevant when grow_policy=lossguide is set.
There are a few other booster parameters like sketch_eps, updater, refresh_leaf, process_type, grow_policy, max_bin, predictor and num_parallel_tree. A short sketch combining the main tree-booster hyperparameters is shown below.
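The sketch below is only illustrative: the values are typical starting points rather than tuned results, and dtrain is the DMatrix built in the earlier snippet. Cross-validation via xgb.cv is shown because several of the parameters above should be tuned with CV.

import xgboost as xgb

booster_params = {
    "eta": 0.1,                 # learning rate / shrinkage
    "gamma": 0,                 # minimum loss reduction required to split
    "max_depth": 6,             # maximum tree depth
    "min_child_weight": 1,      # minimum sum of instance weights in a child
    "subsample": 0.8,           # row sampling per tree
    "colsample_bytree": 0.8,    # column sampling per tree
    "lambda": 1,                # L2 regularization
    "alpha": 0,                 # L1 regularization
    "tree_method": "hist",      # fast histogram-based algorithm
    "objective": "binary:logistic",
}

# Cross-validation is the usual way to tune parameters such as
# max_depth and min_child_weight
cv_results = xgb.cv(
    booster_params,
    dtrain,
    num_boost_round=200,
    nfold=5,
    metrics="auc",
    early_stopping_rounds=10,
)
print(cv_results.tail())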
3. Learning Task Hyperparameters
These parameters define the optimization objective and the metric to be calculated at each step. They specify the learning task and the corresponding learning objective.
a. objective [default=reg:squarederror]: It defines the loss function to be minimized. The most commonly used values are:
reg:squarederror : regression with squared loss.
reg:squaredlogerror: regression with squared log loss, ½[log(pred + 1) − log(label + 1)]². All input labels are required to be greater than -1.
reg:logistic : logistic regression
binary:logistic : logistic regression for binary classification, output probability
binary:logitraw: logistic regression for binary classification, output score before logistic transformation
binary:hinge : hinge loss for binary classification. This makes predictions of 0 or 1, rather than producing probabilities.
multi:softmax : set XGBoost to do multiclass classification using the softmax objective; you also need to set num_class (the number of classes)
multi:softprob : same as softmax, but outputs a vector of size ndata * nclass, which can be further reshaped to an ndata * nclass matrix. The result contains the predicted probability of each data point belonging to each class.
b. eval_metric [default according to objective]: The metric to be used for validation data. The defaults are rmse for regression, error for classification and mean average precision for ranking. We can add multiple evaluation metrics; Python users must pass the metrics as a list of parameter pairs instead of a map (see the sketch after this section). The most common values are given below:
rmse : root mean square error
mae : mean absolute error
logloss : negative log-likelihood
error : Binary classification error rate (0.5 threshold). It is calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
merror : Multiclass classification error rate. It is calculated as #(wrong cases)/#(all cases).
mlogloss : Multiclass logloss
auc: Area under the curve
aucpr : Area under the PR curve
c. seed [default=0]: The random number seed. It can be used to generate reproducible results and is also useful for parameter tuning. This parameter is ignored in the R package; use set.seed() instead.
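The sketch below shows the learning task hyperparameters in use. Following the note about multiple evaluation metrics, the parameters are passed as a list of (name, value) pairs rather than a dictionary; the objective and metrics shown are just examples, and dtrain, X_test and y_test reuse the data from the earlier snippets.

import xgboost as xgb

# A list of parameter pairs allows 'eval_metric' to appear more than once
task_params = [
    ("objective", "binary:logistic"),
    ("eval_metric", "auc"),
    ("eval_metric", "logloss"),
    ("seed", 42),               # for reproducible results
]

dtest = xgb.DMatrix(X_test, label=y_test)

# Both metrics are reported for the train and validation sets at each round
booster = xgb.train(
    task_params,
    dtrain,
    num_boost_round=100,
    evals=[(dtrain, "train"), (dtest, "validation")],
)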
4. Command Line Hyperparameters
The fourth type consists of the command line parameters, which are used only in the console (CLI) version of XGBoost. A sample configuration file is sketched after the list below.
num_round: The number of rounds for boosting.
data: The path of the training data.
test:data: The path of the test data used for prediction.
save_period [default=0]: The period to save the model. Setting save_period=10 means that XGBoost will save the model every 10 rounds; setting it to 0 means no model is saved during training.
task [default=train], options: train, pred, eval, dump.
train: training using data.
pred: making predictions for test:data.
eval: evaluating statistics specified by eval[name]=filename.
dump: dumping the learned model into text format.
model_in [default=NULL]: The path to the input model, needed for the test, eval and dump tasks. If it is specified for training, XGBoost will continue training from the input model.
model_out [default=NULL]: The path to the output model after training finishes. If not specified, XGBoost will output files with names such as 0003.model, where 0003 is the number of boosting rounds.
model_dir [default=models/]: The output directory for the models saved during training.
fmap: Feature map, used for dumping the model.
dump_format [default=text], options: text, json: The format of the model dump file.
name_dump [default=dump.txt]: The name of the model dump file.
name_pred [default=pred.txt]: The name of the prediction file, used in pred mode.
pred_margin [default=0]: Predict the margin instead of the transformed probability.
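For the console version, these parameters are usually written in a plain-text configuration file (one name = value entry per line) and passed to the xgboost executable. The file below is only an illustrative sketch modelled on the style of the examples shipped with XGBoost; the file names and data paths are placeholders.

# train.conf - illustrative configuration for the CLI version
booster = gbtree
objective = binary:logistic

# booster hyperparameters
eta = 0.1
max_depth = 6

# command line hyperparameters
task = train
num_round = 100
save_period = 0
data = "train.txt"
eval[test] = "test.txt"
model_out = "final.model"

It would then be run as: xgboost train.conf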
Conclusion:
XGBoost is a very powerful machine learning algorithm that can achieve high accuracy when its wide range of hyperparameters is tuned appropriately. This is why XGBoost has become one of the dominant models in today's data science world.