Revolutionizing Object Segmentation: FastSAM & FasterSAM

Efficient and Lightweight Models for Real-Time Object Segmentation on Mobile Devices

segment anything!

In this article, we will take an overview tour of recent breakthroughs in object segmentation models, focusing on FastSAM and FasterSAM (MobileSAM). Building upon the research paper “Segment Anything” from Meta AI Research, these models have paved the way for accurate and efficient object segmentation, pushing the boundaries of what is possible in computer vision. We will explore their key components, their impressive capabilities, and how they address the computational challenges of the original SAM.

Some background information about the SAM model.

🔥 Both of these models build on the Segment Anything research paper by Meta AI Research: https://arxiv.org/abs/2304.02643

SAM does exactly what its name says: it segments objects in images. The model uses a large vision transformer as the image encoder plus a prompt encoder, so it can segment all kinds of objects. You provide an image together with a prompt (for example a point, a box, or text).

Image ---> image vector via the image encoder  \\
                                                 Decoder model (consumes the vector representations, outputs a mask)
Text  ---> text vector via the prompt encoder  //

Biggest compute cost --> the image encoder

The bigger the vision transformer, the better the performance.
Segment anything model data flow diagram
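
To make this data flow concrete, here is a minimal sketch using Meta's official segment-anything package (the checkpoint path and prompt coordinates below are placeholders): the expensive ViT image encoder runs once per image inside set_image, while each prompt only passes through the lightweight prompt encoder and mask decoder.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load the heavyweight ViT-H SAM model (checkpoint path is a placeholder)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

image = np.zeros((1024, 1024, 3), dtype=np.uint8)  # stand-in for a real RGB image
predictor.set_image(image)  # heavy step: runs the ViT image encoder once

# Cheap step: prompt encoder + mask decoder, can be repeated for many prompts
masks, scores, _ = predictor.predict(
    point_coords=np.array([[512, 512]]),  # a single foreground point prompt
    point_labels=np.array([1]),
)
```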

Faster Segment Anything paper (Kyung Hee University)

FasterSAM (MobileSAM) is a paper from Kyung Hee University that addresses the high computation requirements of SAM, which make it unsuitable for edge devices such as mobile phones. The authors replace the heavyweight image encoder with a lightweight one and distill the knowledge from the heavyweight encoder into it, optimizing the output of the lightweight encoder to match that of the heavyweight encoder. The resulting MobileSAM can run on CPUs and is about 5 times faster and 7 times smaller than the concurrent FastSAM. They achieve this by replacing ViT-H (heavyweight) with a tiny vision transformer (ViT-Tiny).

GitHub code link:

Coupled knowledge distillation vs. semi-coupled knowledge distillation:

The lightweight image encoder

The encoder is built from 4 stages that progressively reduce the resolution:
**1st stage:** convolution blocks with inverted residuals.
**Remaining stages:** transformer blocks. Downsampling between stages uses stride = 2 (effectively skipping pixels in the image). A small sketch of these two building blocks follows below.
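
Purely as an illustration (this is not the paper's actual code), here is a small PyTorch sketch of the two kinds of building blocks: an inverted-residual convolution block like those in the first stage, and a stride-2 convolution that halves the resolution between stages.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNet-style inverted residual block: expand -> depthwise conv -> project."""
    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),              # expand
            nn.GELU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                      groups=hidden),                                 # depthwise conv
            nn.GELU(),
            nn.Conv2d(hidden, channels, kernel_size=1),               # project back
        )

    def forward(self, x):
        return x + self.block(x)  # residual connection

# A stride-2 convolution halves the spatial resolution between stages
downsample = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 64, 256, 256)
y = downsample(InvertedResidual(64)(x))
print(y.shape)  # torch.Size([1, 128, 128, 128])
```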

The paper discusses two coupled flavours of distillation, fully coupled and semi-coupled knowledge distillation, and ultimately adopts a decoupled scheme: instead of backpropagating through the whole SAM pipeline, the lightweight encoder is trained directly to reproduce the image embeddings of the heavyweight ViT-H encoder.

decoupled distillation diagram
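
As a rough sketch of what this decoupled distillation objective could look like (hypothetical function names, not MobileSAM's actual training code): the frozen ViT-H teacher's image embedding is treated as the regression target for the lightweight student encoder.

```python
import torch
import torch.nn.functional as F

def distillation_step(student_encoder, teacher_embedding, image, optimizer):
    """One decoupled-distillation step: match the teacher's image embedding.

    teacher_embedding is assumed to be precomputed with the frozen ViT-H encoder,
    so only the lightweight student runs during training.
    """
    student_embedding = student_encoder(image)
    loss = F.mse_loss(student_embedding, teacher_embedding)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```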

Training dataset used:

They used 1% of the SA-1B training dataset and pre-computed the teacher image embeddings to streamline the forward pass. MobileSAM achieves performance on par with the original SAM model, measured using mIoU (mean Intersection over Union, i.e. how much overlap there is between the ground-truth mask and MobileSAM's output).
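
For reference, here is a minimal way to compute IoU for a single pair of binary masks; mIoU is just this value averaged over many masks.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two boolean masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(intersection) / union if union > 0 else 1.0
```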

Batch size:

A larger batch size generally results in better performance. With a small batch, the gradient estimate is noisy and a gradient descent step may point in the wrong direction. A larger batch provides a more representative sample of the training dataset, so each update step moves in a more reliable direction.
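
A toy illustration of that point (independent of SAM itself): estimating the gradient of a simple quadratic loss from batches of different sizes shows how the noise in the estimate shrinks as the batch grows.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=100_000)  # stand-in "dataset"
w = 0.0  # current parameter; loss = mean((w - x)^2), true gradient = 2 * (w - mean(x))

def grad_estimate(batch_size: int) -> float:
    batch = rng.choice(data, size=batch_size, replace=False)
    return 2.0 * (w - batch.mean())

for bs in (4, 64, 1024):
    grads = [grad_estimate(bs) for _ in range(1000)]
    print(f"batch={bs:5d}  mean grad={np.mean(grads):+.3f}  std={np.std(grads):.3f}")
# Larger batches -> lower std, i.e. gradient steps that more reliably point the right way.
```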

Segment anything vs segment everything:

Segment Anything refers to the SAM model’s ability to accurately identify and isolate specific objects in an image when given a prompt. It can segment any object as long as the prompt is set correctly. On the other hand, “segment everything” refers to object proposal generation, where the system suggests potential objects in an image without needing a prompt. SAM can perform both tasks, but its main focus is on “segment anything.” FastSAM is a follow-up method that generates object proposals without prompts, utilizing a mapping algorithm to select the correct masks for prompt-based segmentation. When evaluating and comparing the models, the main focus is usually on the “segment anything” mode, which is the core task of SAM.
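
In code terms (again assuming Meta's official segment-anything package), the “segment everything” mode corresponds to the automatic mask generator, which proposes masks for the whole image without any prompt:

```python
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder checkpoint path
mask_generator = SamAutomaticMaskGenerator(sam)

image = np.zeros((1024, 1024, 3), dtype=np.uint8)  # stand-in for a real RGB image
masks = mask_generator.generate(image)  # "segment everything": no prompt needed
# Each entry is a dict with the binary mask ("segmentation") plus metadata such as area and bbox.
print(len(masks))
```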

The conclusion from this paper: SAM is made mobile-friendly by replacing the heavyweight image encoder with a lightweight one. Existing SAM-based projects can swap the heavyweight SAM for the lightweight SAM in a plug-and-play fashion with almost zero effort.

Live demo on FasterSAM:

FastSAM paper (Wuhan AI Research)

Abstract

They reformulate the SAM task (segment generation plus prompting) and propose that a regular CNN detector with an instance segmentation branch can do this task well. They convert the task into a standard instance segmentation setup, training on just 1/50th of the SA-1B (Segment Anything 1B) dataset released by the SAM authors. The result is roughly 50x faster than SAM, delivering segmentation at much higher FPS than the original model.
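
FastSAM's prompting can be thought of as a post-hoc selection step: the CNN first segments everything, then the prompt picks the right mask. A hedged sketch of what such a point-prompt mapping could look like (hypothetical helper, not FastSAM's actual code):

```python
import numpy as np

def select_mask_for_point(masks: list[np.ndarray], point_xy: tuple[int, int]) -> np.ndarray:
    """Pick the proposal mask containing the prompt point; prefer the smallest such mask."""
    x, y = point_xy
    candidates = [m for m in masks if m[y, x]]     # masks whose pixel covers the point
    if not candidates:
        raise ValueError("no proposal contains the prompt point")
    return min(candidates, key=lambda m: m.sum())  # smallest area = most specific object
```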

Fast SAM architecture

NMS (non-maximum suppression): looks at the confidence scores of overlapping detections and keeps only the best bounding box, discarding the redundant ones. It has to be run every time an image is processed, which makes it computationally expensive.
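
For intuition, here is a minimal greedy NMS implementation over axis-aligned boxes (a generic sketch, not the exact routine used by FastSAM's detector):

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_threshold: float = 0.5) -> list[int]:
    """Greedy NMS: boxes are (N, 4) as [x1, y1, x2, y2]; returns indices of kept boxes."""
    order = scores.argsort()[::-1]  # highest-confidence boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        # IoU of the kept box against all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou < iou_threshold]  # drop boxes overlapping too much with the kept one
    return keep
```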

Live demo on Hugging Face Spaces: https://huggingface.co/spaces/An-619/FastSAM

GitHub code link:

🤗 We have focused more on the Faster Segment Anything (MobileSAM) paper, since it carries out more of the comparison study, benchmarking itself against Fast Segment Anything 🙂

Conclusion

We have summarized the papers working on fast versions of SAM (the Segment Anything Model) from Meta AI Research. Both models keep the same three components: an image encoder, a mask decoder, and a prompt encoder. Both repositories focus on the same idea of replacing the image encoder with a smaller model, both provide a Hugging Face demo, and both deliver a substantial speed-up relative to the original SAM.

References:

Faster Segment Anything: Towards Lightweight SAM (Segment Anything Model) for Mobile Applications — Kyung Hee University

Fast Segment Anything — Wuhan AI Research

Special thanks to Hu-po for making this video about SAM, FastSAM and FasterSAM.

The video below explains it in more depth.

If you find this article intriguing, please consider giving me a follow on Medium. 😃

WRITER at MLearning.ai // Code Interpreter 88 uses // 800+ AI tools
