The day before yesterday I built my own optimizer.
Before this, I just knew that the optimizer is the thing that tells our model which direction to go in and how far to go (the learning rate).
I also knew that Adam() is the one you want, no matter what the problem, Adam has got your back, haha!
But this blog is gonna be about SGD, so let's start with what SGD is.
The formal definition of SGD (Stochastic Gradient Descent) goes like this: "Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data). Especially in high-dimensional optimization problems this reduces the computational burden, achieving faster iterations in trade for a lower convergence rate." (source: Wikipedia)
Dang, it feels hard just by looking at the definition, but trust me, it's not that hard.
"Stochastic" stands for a system or process linked with random probability, and "gradient" means slope, so "gradient descent" means going down the slope of a function.
So the whole meaning of SGD goes like this: pick some random samples, look at the slope of the loss function at the current weights, and take a step downhill.
The whole process leans on the universal approximation theorem, because according to that theorem (roughly) any function can be approximated if we have the proper weights for it.
SGD tries to find those weights with the help of slopes (gradients).
As you may have studied in earlier machine learning courses, there are some things you have to do with the weights and the biases.
So say we have some function and we assign some random weights and a bias to it (this is the important part), like
y = sum(weights * x) + bias
where weights * x multiplies each weight with its input feature, and we sum those up and add the bias.
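To make that formula concrete, here's a tiny sketch in torch (the numbers are made up, just for illustration):

```python
import torch

# A toy linear prediction: y = sum(weights * x) + bias
x = torch.tensor([1.0, 2.0, 3.0])         # input features (made up)
weights = torch.tensor([0.5, -1.0, 2.0])  # one weight per feature (made up)
bias = torch.tensor(0.25)

# elementwise multiply, sum up, add the bias
y = (weights * x).sum() + bias
print(y.item())  # 0.5 - 2.0 + 6.0 + 0.25 = 4.75
```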
An optimizer consists of 3 main things:
learning rate: with how much intensity it is gonna take a step
step: how far the parameters move on each update, calculated as
step = gradient * learning rate
zero_grad: to make the gradients 0 again, because torch accumulates the gradient of each backward pass and keeps adding them up if we don't reset them
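That accumulation behavior is easy to see for yourself; here's a small sketch:

```python
import torch

# torch adds new gradients onto the old ones unless we reset them
w = torch.tensor([2.0], requires_grad=True)

loss = (w * 3).sum()
loss.backward()
g1 = w.grad.clone()  # d(3w)/dw = 3, so grad is tensor([3.])

loss = (w * 3).sum()
loss.backward()      # no reset in between...
g2 = w.grad.clone()  # the new 3 got ADDED to the old 3 -> tensor([6.])

w.grad = None        # this reset is exactly what zero_grad does
print(g1.item(), g2.item())
```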
steps for finding the grads:
1. iterate through the x, y in dl
2. make the predictions with the model
3. calculate the loss using the loss function
4. calculate the loss gradients using loss.backward()
5. update the parameters using parameters -= parameters.grad * lr
(the optimizer takes care of step 5, plus zeroing the grads)
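The steps above can be sketched as a manual training loop. Note that dl, the model, and the loss function here are stand-ins I made up (a tiny y = 2x dataset and a plain linear model), just to show the shape of the loop:

```python
import torch

torch.manual_seed(0)

# made-up dataset: learn y = 2x; dl stands in for a real dataloader
xs = torch.tensor([[1.0], [2.0], [3.0]])
ys = torch.tensor([[2.0], [4.0], [6.0]])
dl = [(xs, ys)]

w = torch.randn(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr = 0.05

for epoch in range(200):
    for x, y in dl:                      # step 1: iterate through the dl
        pred = x * w + b                 # step 2: make predictions
        loss = ((pred - y) ** 2).mean()  # step 3: calculate the loss (MSE)
        loss.backward()                  # step 4: calculate the gradients
        with torch.no_grad():
            for p in (w, b):             # step 5: update the parameters
                p -= p.grad * lr
                p.grad = None            # reset so grads don't pile up

print(loss.item())  # should be close to 0 by now
```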
the optimizer code:

```python
class BasicOptim:
    def __init__(self, params, lr):
        self.params, self.lr = list(params), lr

    def step(self, *args, **kwargs):
        for p in self.params:
            p.data -= p.grad.data * self.lr

    def zero_grad(self, *args, **kwargs):
        for p in self.params:
            p.grad = None
```
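And here's the same kind of toy fit (a made-up y = 2x dataset) using BasicOptim in place of the manual update; the class is repeated so the snippet runs on its own:

```python
import torch

class BasicOptim:
    def __init__(self, params, lr):
        self.params, self.lr = list(params), lr

    def step(self, *args, **kwargs):
        for p in self.params:
            p.data -= p.grad.data * self.lr

    def zero_grad(self, *args, **kwargs):
        for p in self.params:
            p.grad = None

torch.manual_seed(0)
xs = torch.tensor([[1.0], [2.0], [3.0]])  # made-up inputs
ys = torch.tensor([[2.0], [4.0], [6.0]])  # targets for y = 2x

w = torch.randn(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
opt = BasicOptim([w, b], lr=0.05)

for epoch in range(200):
    pred = xs * w + b
    loss = ((pred - ys) ** 2).mean()
    loss.backward()
    opt.step()       # parameters -= grad * lr
    opt.zero_grad()  # reset the grads for the next pass

print(w.item())  # should be near 2.0
```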