Introduction
This project is about scaling Vision Transformers (ViTs) and Convolutional Neural Networks (ConvNets) for full self-driving.
The goal was to learn how to do distributed, fault-tolerant, and scalable training of large models on a GPU cluster of ~32 nodes, or roughly 256 A100-80GB GPUs, using TensorFlow and PyTorch together with SLURM.
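As a rough illustration of the setup, the sketch below shows how a PyTorch process group might be initialized from SLURM's environment variables. This is not the project's actual launch code: the placeholder model, the assumption that MASTER_ADDR/MASTER_PORT are exported by the sbatch script, and the omission of checkpoint/resume logic are all simplifications.

```python
# Minimal sketch: initialize a PyTorch process group from SLURM's environment.
# Assumes MASTER_ADDR and MASTER_PORT are exported by the sbatch script.
import os
import torch
import torch.distributed as dist

def init_distributed():
    rank = int(os.environ["SLURM_PROCID"])         # global rank of this task
    world_size = int(os.environ["SLURM_NTASKS"])   # total number of tasks
    local_rank = int(os.environ["SLURM_LOCALID"])  # GPU index on this node

    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    return rank, local_rank

rank, local_rank = init_distributed()
model = torch.nn.Linear(10, 10).cuda(local_rank)  # placeholder model
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```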
Dataset
I used the BDD100K dataset for training the model. BDD100K is a diverse driving video dataset with over 100K videos, totalling around 2TB of data. For efficiency, I used a TFRecord-based pipeline, which makes serialization straightforward and distributed deserialization efficient.
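The read path might look roughly like the following sketch; the feature schema ("image", "label") and the sharding scheme are assumptions for illustration, not the project's actual record format.

```python
# Illustrative TFRecord read path; field names and shapes are assumed.
import tensorflow as tf

feature_spec = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    example = tf.io.parse_single_example(serialized, feature_spec)
    image = tf.io.decode_jpeg(example["image"], channels=3)
    return image, example["label"]

def make_dataset(file_pattern, worker_index, num_workers, batch_size=32):
    files = tf.data.Dataset.list_files(file_pattern, shuffle=False)
    # Each worker reads a disjoint subset of shards, so deserialization is
    # distributed without duplicating samples across workers.
    files = files.shard(num_workers, worker_index)
    ds = files.interleave(tf.data.TFRecordDataset,
                          num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    return ds.shuffle(1024).batch(batch_size).prefetch(tf.data.AUTOTUNE)
```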
The dataset was stored in AWS S3 buckets and accessed via s3fs. Ray was used to parallelize the data preprocessing step.
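A minimal sketch of that fan-out is shown below, assuming a hypothetical bucket path and eliding the per-video work with a comment:

```python
# Sketch only: the bucket path is hypothetical and the per-video work is elided.
import ray
import s3fs

ray.init(address="auto")  # attach to an existing Ray cluster

@ray.remote
def preprocess_video(key):
    fs = s3fs.S3FileSystem()
    with fs.open(key, "rb") as f:
        raw = f.read()
    # ... decode frames, resize, and write out a TFRecord shard ...
    return key

keys = s3fs.S3FileSystem().ls("my-bucket/bdd100k/videos")  # hypothetical path
done = ray.get([preprocess_video.remote(k) for k in keys])
```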
Model
I used the ConvNeXt
model as the base model for this project. It outperformed the ViT baselines with little hyperparameter tuning, and was considerably more lightweight in terms of both FLOPs and memory usage.
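For reference, a ConvNeXt backbone can be pulled from torchvision and given a task-specific head; the variant and output size below are placeholders, not the configuration used in this project.

```python
# Illustrative ConvNeXt backbone with a swapped classification head.
import torch
import torchvision

model = torchvision.models.convnext_tiny(weights="IMAGENET1K_V1")
in_features = model.classifier[2].in_features
model.classifier[2] = torch.nn.Linear(in_features, 10)  # placeholder head size

with torch.no_grad():
    out = model(torch.randn(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 10])
```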
For containerization, I used Docker and Singularity to build the container images. The images were then pushed to the Singularity cloud build service, where they could be cached and pulled whenever a training run commenced.
Result
Below is a sample trajectory predicted by the model on a validation example: