The day before yesterday I built my own optimizer.
Before this, I just knew that the optimizer is the thing that tells our model which direction to go in and how far to go (the learning rate).
I also knew that Adam() is the one you want, no matter what the problem, Adam has got your back, haha!
But this blog is gonna be about SGD, so let's start with what SGD is.
The formal definition of SGD (Stochastic Gradient Descent) goes like this: "Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data). Especially in high-dimensional optimization problems this reduces the computational burden, achieving faster iterations in trade for a lower convergence rate." (source: Wikipedia)
Dang, it feels hard just by looking at the definition, but trust me, it's not that hard.
"Stochastic" stands for a system or process linked with random probability, and "gradient" means slope, so "gradient descent" means going down the slope of a function.
So the whole meaning of SGD goes like this: pick some random samples, look at the slope of the loss function at the current weights, and take a step downhill.
The whole process leans on the universal approximation theorem, because according to that theorem (roughly) any function can be approximated if we have the proper weights for it.
SGD tries to find those weights with the help of slopes (gradients).
As you may have studied in earlier machine learning courses, there are some things you have to do with the weights and the biases.
So say we have some function and we assign some random weights and a bias to it (this is the important part), like
y = sum(weights * x) + bias
where weights * x multiplies each weight with its input feature, and we sum those up and add the bias.
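To make that formula concrete, here's a tiny sketch in torch (the numbers are made up, just for illustration):

```python
import torch

# A toy linear prediction: y = sum(weights * x) + bias
x = torch.tensor([1.0, 2.0, 3.0])         # input features (made up)
weights = torch.tensor([0.5, -1.0, 2.0])  # one weight per feature (made up)
bias = torch.tensor(0.25)

# elementwise multiply, sum up, add the bias
y = (weights * x).sum() + bias
print(y.item())  # 0.5 - 2.0 + 6.0 + 0.25 = 4.75
```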
An optimizer consists of 3 main things:
learning rate: with how much intensity it is gonna take a step
step: how far the parameters move on each update, calculated as
step = gradient * learning rate
zero_grad: to make the gradients 0 again, because torch accumulates the gradient of each backward pass and keeps adding them up if we don't reset them
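That accumulation behavior is easy to see for yourself; here's a small sketch:

```python
import torch

# torch adds new gradients onto the old ones unless we reset them
w = torch.tensor([2.0], requires_grad=True)

loss = (w * 3).sum()
loss.backward()
g1 = w.grad.clone()  # d(3w)/dw = 3, so grad is tensor([3.])

loss = (w * 3).sum()
loss.backward()      # no reset in between...
g2 = w.grad.clone()  # the new 3 got ADDED to the old 3 -> tensor([6.])

w.grad = None        # this reset is exactly what zero_grad does
print(g1.item(), g2.item())
```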
steps for finding the grads:
1. iterate through the x, y in dl
2. make the predictions with the model
3. calculate the loss using the loss function
4. calculate the loss gradients using loss.backward()
5. update the parameters using parameters -= parameters.grad * lr
(the optimizer takes care of step 5, plus zeroing the grads)
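The steps above can be sketched as a manual training loop. Note that dl, the model, and the loss function here are stand-ins I made up (a tiny y = 2x dataset and a plain linear model), just to show the shape of the loop:

```python
import torch

torch.manual_seed(0)

# made-up dataset: learn y = 2x; dl stands in for a real dataloader
xs = torch.tensor([[1.0], [2.0], [3.0]])
ys = torch.tensor([[2.0], [4.0], [6.0]])
dl = [(xs, ys)]

w = torch.randn(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr = 0.05

for epoch in range(200):
    for x, y in dl:                      # step 1: iterate through the dl
        pred = x * w + b                 # step 2: make predictions
        loss = ((pred - y) ** 2).mean()  # step 3: calculate the loss (MSE)
        loss.backward()                  # step 4: calculate the gradients
        with torch.no_grad():
            for p in (w, b):             # step 5: update the parameters
                p -= p.grad * lr
                p.grad = None            # reset so grads don't pile up

print(loss.item())  # should be close to 0 by now
```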
the optimizer code:

```python
class BasicOptim:
    def __init__(self, params, lr):
        self.params, self.lr = list(params), lr

    def step(self, *args, **kwargs):
        for p in self.params:
            p.data -= p.grad.data * self.lr

    def zero_grad(self, *args, **kwargs):
        for p in self.params:
            p.grad = None
```
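And here's the same kind of toy fit (a made-up y = 2x dataset) using BasicOptim in place of the manual update; the class is repeated so the snippet runs on its own:

```python
import torch

class BasicOptim:
    def __init__(self, params, lr):
        self.params, self.lr = list(params), lr

    def step(self, *args, **kwargs):
        for p in self.params:
            p.data -= p.grad.data * self.lr

    def zero_grad(self, *args, **kwargs):
        for p in self.params:
            p.grad = None

torch.manual_seed(0)
xs = torch.tensor([[1.0], [2.0], [3.0]])  # made-up inputs
ys = torch.tensor([[2.0], [4.0], [6.0]])  # targets for y = 2x

w = torch.randn(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
opt = BasicOptim([w, b], lr=0.05)

for epoch in range(200):
    pred = xs * w + b
    loss = ((pred - ys) ** 2).mean()
    loss.backward()
    opt.step()       # parameters -= grad * lr
    opt.zero_grad()  # reset the grads for the next pass

print(w.item())  # should be near 2.0
```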