Training ML models without keeping your machine running!

Som
Dec 1, 2021
Photo by Lukas Blazek on Unsplash

If you are learning machine learning, you have probably noticed that training models is one of the more tedious parts of the process. With a larger architecture, it can take a long time just to train and save the model.

If you train on Google Colab, you must have seen the runtime disconnect after a period of inactivity (probably less than half an hour).

The alternative is Kaggle. Kaggle does not give you a way to store data in your Google Drive, but you can upload any dataset you want to Kaggle Datasets and use it from there.

The best part about Kaggle is that the notebook keeps running even without your internet connection, once you have set it up to run and to store the results you want (in this case, the model via model.save and some outputs as CSV).
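As a rough sketch of the CSV half of this, the snippet below writes a small training log into the notebook's output folder. It assumes the folder is /kaggle/working (with a local fallback so it also runs off Kaggle); the write_history helper, the column names, and the loss values are made up for illustration.

```python
import csv
from pathlib import Path

# Assumption: files written to /kaggle/working are kept as the notebook's output.
# Fall back to a local "outputs" folder when running outside Kaggle.
KAGGLE_WORKING = Path("/kaggle/working")
OUTPUT_DIR = KAGGLE_WORKING if KAGGLE_WORKING.is_dir() else Path("outputs")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

FIELDNAMES = ["epoch", "train_loss", "val_loss"]


def write_history(rows, path):
    """Write one CSV row per epoch (hypothetical helper, not a Kaggle API)."""
    with path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
    return path


# Made-up numbers, purely to show the file layout.
history = [
    {"epoch": 1, "train_loss": 0.9, "val_loss": 1.0},
    {"epoch": 2, "train_loss": 0.7, "val_loss": 0.8},
]
history_path = write_history(history, OUTPUT_DIR / "history.csv")
```

The model file itself would be written to the same folder with your framework's own save call, for example model.save in Keras.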

Once the notebook has finished and stored the data, you can access the saved model and the other outputs through the notebook's output files (which Kaggle saves when you run and save). Let's see what the requirements are.

Requirements:

  • A Kaggle account.
  • A dataset that you want to work on, stored on Kaggle.
  • And yes, some knowledge of PyTorch 🔥 or TensorFlow for training and saving the model.

Steps:

  1. Create a new code notebook in Kaggle.
  2. Add your data using the Add data button in the top right corner of the notebook.
  3. Change the runtime to GPU in the Accelerator section of the notebook's right-side panel (a GPU trains your model faster).
  4. Make sure you are running on a GPU by running !nvidia-smi in a code cell.
  5. Write the code that builds and saves your model.
  6. Make sure there is no error in your code (if the code hits an error when the cells are executed in sequence, the saving process might fail).
  7. Make sure the notebook's run time doesn't go beyond 9 hrs.
  8. Once all the code you want is in place, store the model using the model.save function and do one last check that everything works.
  9. Click Save a version in the top right corner of your screen and make sure you choose Save and commit changes instead of quick save (quick save will not save your output, while save and commit will).
  10. After clicking save and commit, you will see a message in the notification area at the bottom right of the notebook saying that your file is being executed. You are now good to go: you can close your browser and even turn off your computer while it runs.
  11. When the run finishes, Kaggle will notify you that the notebook has finished running.
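Steps 4 and 7 can also be checked from code. The sketch below is only one way to do it: gpu_visible runs the same nvidia-smi command as step 4, and train_within_budget stops the loop before it would overrun a time budget. Both function names, the train_one_epoch callback, and the 10% safety margin are assumptions for illustration; the 9 hour figure is the limit quoted in this post.

```python
import shutil
import subprocess
import time

KAGGLE_LIMIT_HOURS = 9  # limit quoted in this post; check your own quota


def gpu_visible() -> bool:
    """Return True if nvidia-smi is installed and exits cleanly."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, check=False
        )
    except OSError:
        return False
    return result.returncode == 0


def train_within_budget(train_one_epoch, max_epochs, budget_seconds, safety_margin=0.1):
    """Run epochs until max_epochs is reached or the next epoch might overrun.

    train_one_epoch is a hypothetical callback that takes the epoch index.
    Returns the number of epochs that actually completed.
    """
    start = time.monotonic()
    deadline = budget_seconds * (1 - safety_margin)
    longest_epoch = 0.0
    completed = 0
    for epoch in range(max_epochs):
        elapsed = time.monotonic() - start
        # Stop early if another epoch as long as the slowest one would not fit.
        if elapsed + longest_epoch >= deadline:
            break
        epoch_start = time.monotonic()
        train_one_epoch(epoch)
        longest_epoch = max(longest_epoch, time.monotonic() - epoch_start)
        completed += 1
    return completed
```

In a real notebook you would call something like train_within_budget(train_one_epoch, 50, KAGGLE_LIMIT_HOURS * 3600) and save the model right after it returns, so a slow run still leaves you with an output file.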

Why this is needed:

In Kaggle competitions, where we build models for real-world problems, we often work with many GBs of data, and model building is surely going to take time.

Although Google Colab offers GPUs like the Tesla T4, it does not offer them for long sessions. Kaggle offers a Tesla P100 GPU; it may not be as fast as the Tesla T4, but it is fast enough to get our work done, and it is available for a long time (9 hrs).

In competitions, you will have to create a model training notebook and a model inference notebook. Saving the model in the first and loading it in the second works really well with the steps above.
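Here is a minimal PyTorch sketch of that split, assuming you save only the weights (state_dict) rather than the whole model object. The tiny build_model network and the model.pt file name are hypothetical, and both halves are shown in one script so it runs on its own. On Kaggle, the inference notebook would add the training notebook's output as input data and read the file from a folder under /kaggle/input instead.

```python
from pathlib import Path

import torch
from torch import nn

# Assumption: files written to /kaggle/working are kept as the notebook's output.
KAGGLE_WORKING = Path("/kaggle/working")
OUTPUT_DIR = KAGGLE_WORKING if KAGGLE_WORKING.is_dir() else Path("outputs")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)


def build_model() -> nn.Module:
    """Hypothetical tiny network; both notebooks must build the same architecture."""
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))


# --- training notebook ---
model = build_model()
# ... your training loop would go here ...
weights_path = OUTPUT_DIR / "model.pt"
torch.save(model.state_dict(), weights_path)

# --- inference notebook ---
# Rebuild the same architecture, then load the saved weights into it.
restored = build_model()
restored.load_state_dict(torch.load(weights_path, map_location="cpu"))
restored.eval()
with torch.no_grad():
    predictions = restored(torch.zeros(1, 4))
```

Keras users would do the same thing with model.save in the training notebook and tf.keras.models.load_model in the inference notebook.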

Notebook link:

You can refer to the notebook above for the model building and training steps. The notebook's runtime is about 7 hrs.

Thanks for reading my blog :) Follow for more, and say hi to me in the comments; it encourages me to write more blogs :) Have a good day :)
