XG Boost -The base and Mathematics behind it

XG Boost is one of the best machine learning algorithms known, which is high on speed and performance due to various factors like parallelization, cache optimization, out of memory computation and regularization, auto pruning, also missing values treatment.

Before going into mathematics and intuition of XG Boost it is necessary that reader has an idea of regularization, ensemble, bagging, boosting techniques like Ada boost and Gradient Boost.

If you want an idea of it all description is given in easy language below.

If you already have an idea of it just jump to XG Boost and get started.

Ensemble-Ensemble technique is one which combines various algorithms to form one optimized predictive algorithm. Making small decision trees instead of one from random batch of data and aggregate them into one final, strong predictor.

Bagging – Random Forest, All trees have equal say in final outcome. It is parallel learning model, all models are independent. The tress are expanded till whole length nut still are weak learners.

Boosting- Ada boost, It is sequential learning model, the model built afterwards are smarter and have more say in the final outcome. The trees are not expanded fully and are just one root of 2 leaf nodes known as stumps.

Ada boost- It is derived from Adaptive Boosting. It is a training process by which only those features are selected in model by the predictive power is improved, reducing dimensionality and improving execution time as irrelevant features are removed gradually. This model learns through updating weights in sequence, i.e. models are dependent on each other. Boosting happens by increasing weights of misclassified dataset and decreasing weights of correctly classified dataset.

In next model the misclassified from last model is tried to be classified correctly and hence adaptive learning happens in each sequential model. End model is one built by all previous learnings.

AdaBoost is sensitive to noisy data and outliers


1) The initial weights are one divided by the number of records in the dataset. example 1/3 for above example.

2) The weights are increased for misclassified or wrongly predicted record, example record 2 above and decreased for the rightly predicted ones so that their sum remains same. It is done so that the algorithm concentrates and emphasizes more on the data which is difficult to predict or classify.

3) In iteration 2 weights are again updated based on results and wrongly predicted ones are tried to be predicted perfectly.

4) In Final iteration Learning happens and weights are updated.

Amount of say or weight in model is determined by formula

Amount of say =1/2 log( 1-Total Error)/Total Error

When stump is good then total error is near to Zero, amount of say will be near to 1, when stump is horrible the total error is near to one, Amount of say will be large negative value in this case.

If total error is 0.5 that is half misclassified the Amount of say will be 0.

New Sample Weight for wrongly Classified=Sample weight * e amount of say

New Sample Weight for correctly classified=Sample weight * e -amount of say

Increase of weight means the misclassified sample is added more than once in new data set giving it more weightage or priority, it means if this sample will be misclassified again the penalty will be high as the samples are more than once in new data.

GradientBoost-It is a machine learning technique for regression as well as for classification problems and uses sequential learning through ensemble techniques. It uses optimization of Loss Function for better learning, i.e . Difference between Actual and predicted will be tried to bring to zero, i.e to the most close value to actual. Trees are better grown and not stumps like in Adaboost. Leaf nodes range from 8-32.


1) It calculates the average of the target column for regression problem, example 160 in above case. For classification problem probability will come under picture.

2) Calculates residual i.e. Actual –Predicted values

3) Residual becomes the target column and model is fit on it.

4) Residual values are then predicted with model, predictions will change.

Iteration 1: New Predicted Value =Average Value(step 1) + learning rate(0.1)* New residual Value.

Final Value= Base Value(step 1) + learning rate(0.1)* residual Value(iter 1) + learning rate(0.1)* residual Value(iter 2)……….+ learning rate(0.1)* residual Value(iter n)


1) Ridge –Ridge regression works when most of the features in a model are useful. Its shrinks the parameters but do not remove any of them.

Suppose data is less to train the model i.e. blue dots are train data. If we have only 2 points we try to make a linear regression line y=mx+b , m is slope and b is intercept here.

Now if we see test data then we see that there is a lot of variance and the line crossing over train data was an overfitting model.

Now if we add bias and decrease the slope, we will see it better fits the test data, and not an overfit for train data. Here we see how a ridge makes a model better.

Ridge Regression Line = ƛ * m2 + Sum of squared residuals

If ƛ=0 Line = Line of least squares.

If ƛ=1, Line introduces bias and variance is decreased, slope is lesser, i.e. line is less sensitive to x.

With increase of ƛ the slope gets smaller and smaller till it reaches near to zero but not equal to zero. Hence getting lesser sensitive to x.

We use 10 fold cross validations to determine which value of ƛ gives least variation.

We always need more data than parameters to determine the model, but if the data set is small we use Ridge regression, which uses cross validation and reduces variance by making predictions less sensitive to the training data which is done by adding penalty i.e. ƛ * m2 and lambda is determined by using cross validation.

Ridge regression works well if all parameters are important in prediction.

2) Lasso-It is similar to Ridge regression. When model contains lots of useless features then Lasso regression is used. It removes useless features and make the model that is easy to interpret.

Lasso lambda is used to regularize and to avoid overfitting if records of data are less. It uses absolute value of errors instead of squared values as in ridge regression.

Lasso Regression Line = ƛ * |m | + Sum of squared residuals

If ƛ=0 Line = Line of least squares.

If ƛ=1, Line introduces bias and variance is decreased, slope is lesser, i.e. line is less sensitive to x.

With increase of ƛ the slope gets smaller and smaller till it reaches to zero, i.e almost insensitive to x data. Which means all the features which are unimportant in predicting the values in a model are removed totally.

It is hence better than Ridge regression in the sense that it reduces variance and removes unimportant features from predicting.

3) Elastic Net: When there are millions of features and it is nearly impossible to know all of them we use Elastic Net model, which is a mixture of both Lasso and Ridge models and uses their best features to give an optimized model.

It is especially good when there is correlation between parameters. Lasso regression just picks one of the correlated parameter and eliminates the others which are less correlated. Ridge regression shrinks the parameters of all the correlated features together but do not remove them.

By combining both penalties of Lasso and Ridge it gives the best of both.

Elastic Net: = ƛ1 * (|v1 | + |v2 | + ………… |vn| ) + Sum of squared residuals

+ ƛ2 *( V12 + V22 + …..Vn2) + Sum of squared residuals

V1, V2 are variables

If ƛ1 , ƛ2 both are zero it is least square method.

If ƛ1 >0 ƛ2 =0 , It is Ridge regression

If ƛ1 =0 ƛ2 >0 , It is Lasso regression

If both If ƛ1 >0 ƛ2 >0 , It is elastic regression which will take best of both.

XG Boost

XGBoost is a sequential Ensemble model, it is extreme gradient boosting model. It is extension of gradient boost. Now understand the Maths behind it.


1) We make a base model first with average of target field, and calculating loss.

2) Next Model is made on loss errors and tries to minimize the errors.

We need to understand

a) ƛ(Lamda) as regularization parameter.

b) ε (Learning rate) which is measure of how fast we want to converge values to actual.

c) ɤ (gamma), which is a value which will decide pruning criteria. Lower the gamma lower aggressive approach to prune the tree, higher the gamma , highly aggressive approach to prune the tree.

Whenever gamma value is less than gain value then split will happen otherwise split will not happen

d) Similarity Score= Sum of square of residual/ Number of rows + ƛ

For ƛ=0 SS= (-5+-2+4+3+5 )2 / 5 =25/5+0


For ƛ=1 SS= (-5+-2+4+3+5 )2 / 5+1 =25/6


e) Gain= Similarity Score after split – Similarity Score of node above(branch before split)

Gain at Level 1 For ƛ=0 = 24.5 + 48 – 5= 67.5

Gain at Level 1 For ƛ=1 = 16.3 +36 – 4.16 = 48.14

Lesser gain at each level means that to control the tree to overfit, ƛ value can be increased.

Impact of outlier on prediction will come down significantly if ƛ is more.

Hence regularization is used to control Overfitting.

As the similarity score and gain value will come down it will

Suppose here ɤ value given is 60, then again split will happen and gain is calculated else if ɤ value given is 70, no further split will happen. That is how auto-pruning will happen.

f) New Prediction= Old Prediction + Learning Rate( ε ) * Output

Let’s take example of row 3 --179 cms since age is more than 16 its gain is 67.5

For ƛ=0 , 175+ (0.1) * 67.5= 175+ 6.75= 181.75

New Residual = 179-181.75= -2.75

For ƛ=1 , 175+ (0.1) * 48.14= 175+ 4.814= 179.814

New Residual = 179-179.814= -0.814

It is done in same way for all records. And iterations are done when the values are near to actual ones.

There are lot other things that make XG Boost the best is its features like:

Hardware Optimization: This algorithm is designed to make use of hardware resources efficiently. This is done by cache awareness by allocating internal buffers in each thread to store gradient statistics. Out-of-core computing optimize available disk space while handling big data-frames that do not fit into memory.

Missing Values Handling: Missing values are handled in XG Boost in the best possible way, by using various techniques. And Cross Validation is done at each step to be sure.


XG Boost is by far deemed as the best algorithm for Supervised learning. CatBoost and Light GBM are very near in race and its a matter of time when a new algorithm will come and will be better than XG Boost.

This algorithm is almost perfect in terms of performance and predictive capabilities, however one needs to study it in depth to know about it and its techniques of internal functioning.

If there are a lot of features which are categorical in nature then CatBoost of Light GBM is recommended as XG Boost may convert them to One Hot encoding resulting in sparse dataset. which is not good for decision trees.

However in Classification and Regression models it is still reigning the Machine learning World with its extra Ordinary capabilities to deal with data.

Thanks for reading!

68 views0 comments

© Numpy Ninja.