In machine/deep learning main motive of optimizers is to reduce the cost/loss by updating weights and biases and to improve model performance.
Optimization algorithms have following goals :
To reach global minima . This is feasible if the function is convex, which means only one local or global minima.
When the objective function is not convex as in most deep learning problems, the objective is to find the lowest possible value of the objective function within its neighborhood.
There are various kinds of Optimizers and each has its own pros and cons. Lets start with the most basic optimizer called Batch Gradient Descent.
Batch Gradient Descent uses the entire training set data to calculate the gradient of the cost function.
As whole data is taken at on single time it reached global minima without any noise, but is suitable for small data sets only.
Because this method calculates the gradient for the entire data set in one go, the calculation is very slow.
Also it needs high configuration of system to go for the whole dataset at a single time.
Batch gradient descent can converge to a global minimum for convex functions and to a local minimum for non-convex functions.
Stochastic Gradient Descent
If we compare Batch Gradient Descent gradients are calculated with all data at one time, Stochastic Gradient Descent updates the gradient of each data sample with each update, i.e. single dataset is fed into the system and weights are updated.
As there may be similar samples for large datasets, and BGD calculates the gradient. There will be redundancy, and SGD is updated only once, there is no redundancy, it is faster, and new samples can be added as and when more data comes.
Disadvantages: As SGD is updated more frequently, the cost function will have severe oscillations. BGD can converge to a local minimum because whole data is taken into consideration at one time ,while there may be oscillation for SGD as data is updated frequently and it may jump to a better local minima.
When we decrease the learning rate slightly, the convergence of SGD and BGD is the same.
For non-convex functions, it is important to avoid trapping at the local minimum or saddle point, because the error around the saddle point is the same, the gradients of all dimensions are close to 0, and SGD is easily trapped here.
One disadvantage of the SGD method is that its update direction depends entirely on the current data, so its update is very unstable. A simple way to solve this problem is to introduce momentum.
Mini Batch Stochastic Gradient Descent
To overcome the problem of SGD and BGD , MBGD uses a small batch of samples, i.e., a number of samples to calculate each time. In this way, it reduces the variance when the parameters are updated, and which in turn makes convergence more stable. It can make full use of the matrix operations which are highly optimized in the deep learning library for more efficient gradient calculations.
The difference from SGD is that each cycle does not act on each sample, but a batch with a number of samples.
Mini-batch gradient descent may or may not give good convergence. All depends on the mini batch data taken up.
If we decrease the learning rate, the convergence rate will be slow. If learning rate is too large, the loss function will oscillate and deviate at the minimum value.
Mini Batch Stochastic Gradient Descent with Momentum
If we draw Vdw equation on a graph we will see the smoother curves, hence removing noise from the Stochastic Mini batch gradient descent.
In formula above we see that we can decide to give weightage to which data, coming at a particular time depending on the value of beta.
Hence the learning will be wholesome from previous data points retaining their direction thus removing noise.
Beta is taken as 0.95 as optimal value according to researchers.
Momentum simulates the inertia of an object when it is moving, i.e. the way of the previous update is retained to a some percentage during the update, while the current update gradient is used to optimize the final update direction. Stability is increased to a certain extent, so learning is faster, and also have the ability to get rid of local optimization.
Adagrad is an algorithm for gradient-based optimization that adapts to the learning rate of the parameters. If we use low learning rates for parameters associated with most frequently occurring features, and high learning rates for parameters associated with infrequent features. We can get a good model.
It is most suited for sparse data, i.e. where most has a number of features not used very frequently.
The same update rate for all parameters may not solve the purpose as, some parameters may have reached the stage where only fine-tuning is needed for them to give results while some parameters need to be adjusted a lot due to the small number of samples associated with them.
The formula of Adagrad is as under:
Adagrad , an algorithm that adaptively assigns different learning rates to various parameters among them. The problem is that for each parameter, as its total distance updated increases, its learning rate also slows.
Adagrad eliminates the need to manually tune the learning rate.
As the learning rate is decreasing drastically so in deep neural networks, at a particular time the weight updation will be so small that it will stop moving towards global minima.
There are three main problems associated with the Adagrad algorithm
The learning rate is keeps decreasing.
The learning rate of the late training period is very small.
It requires a lot of manual setting a global initial learning rate.
Adadelta is an extension of Adagrad and it also tries to reduce Adagrad’s aggressively, reducing the learning rate:
It is done by restricting the past accumulated gradient to some fixed size of w. Running average at time t will depend on the previous average and the current gradient.
In Adadelta we do not need to set the default learning rate as we take the ratio of the running average of the previous time steps to the current gradient.
The RMSProp algorithm full form is called Root Mean Square Prop, which is an adaptive learning rate optimization algorithm proposed by Geoff Hinton.
RMSProp tries to resolve Adagrad’s radically diminishing learning rates problem by using a moving average of the squared gradient. It utilizes the magnitude of the recent gradient descents to normalize the gradient as shown in the equation below.
While Adagrad accumulates all previous gradient squares, RMSprop just calculates the corresponding average value, so it can eliminate the problem of quickly dropping of learning rate of the Adagrad .
In RMSProp learning rate gets adjusted automatically and it chooses a different learning rate for each parameter.
This algorithm divides the learning rate by the average of the exponential decay of squared gradients.
Adaptive Moment Estimation (Adam) computes adaptive learning rates for each parameter. It remembers and stores an exponentially decaying average of past squared gradients like Adadelta and RMSprop.
Adam also keeps an exponentially decaying average of past gradients, similar to momentum.
Adam is a combination of both Momentumand RMSprop,Momentum brings in smoothening effect and reduces noice while RMSProp brings in Learning rate updation in a proper way whuch is free from disadvantages of Adagrad.
Vdw, Vdb, Sdw, Sdb all are set to 0 initially.
On iteration t compute dL/dw and dL/db. Adam uses smoothening effect of momentum and advantages of perfectly updated learning rate of RMSProp to its advantage.
Adam implements the exponential moving average of the gradients to scale the learning rate instead of a simple average as in Adagrad.
Adam is computationally efficient and has very less memory requirement.
Adam is so far the best optimizer that is known.
Things to consider before selecting an optimizer
RMSprop, Adadelta, Adam have similar effects in many cases.
If the data is sparse, use methods, like Adagrad, Adadelta, RMSprop, Adam.
Adam added momentum and learning rate on the basis of RMSprop.
As the gradient becomes sparse, Adam will perform better than RMSprop.
If we need faster convergence or for deeper and complex neural networks an adaptive algorithm is needed.
SGD if used without momentum can reach a minimum value, but will takes longer than other algorithms and may be trapped in the saddle point.
I hope the Optimizers concept is by far clear, its the beauty of mathematics and playing around with equations which researchers spent a lot of time on. For all Optimizers now minibatch is always used.
Thanks for reading and happy modelling!