Paper Reading Help

RepViT: Revisiting Mobile CNN From ViT Perspective

Abstract

  • The architectural disparities between lightweight ViTs and lightweight CNNs have not been adequately examined.

  • We incrementally enhance the mobile-friendliness of a standard lightweight CNN, specifically MobileNetV3, by integrating the efficient architectural choices of lightweight ViTs.

  • This ends up with a new family of pure lightweight CNNs, namely RepViT.

Introduction

Efficient Design Principles

image_20240201_110200.png

ViT

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks.

We split an image into patches and provide the sequence of linear embeddings of these patches as an input to a Transformer. Image patches are treated the same way as tokens (words) in an NLP application. We train the model on image classification in supervised fashion.
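As a concrete illustration of the patchify step, a minimal sketch (module and parameter names are illustrative, not the paper's reference code): a stride-p pxp convolution is equivalent to splitting the image into pxp patches and linearly embedding each one.

```python
# A minimal sketch of ViT-style patchification, assuming a 224x224 RGB input
# and 16x16 patches. Names (PatchEmbed, embed_dim) are illustrative.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A stride-p pxp convolution splits the image into pxp patches and
        # applies a shared linear projection to each one.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (B, 196, 768) token sequence

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```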

image_20240201.png

Image Classification

Swin Transformer

image_20240201_173400.png
image_20240201_174100.png
image_20240201_180600.png
Pyramid Vision Transformer

image_20240201_185400.png

Semantic Segmentation

Masked-attention Mask Transformer for Universal Image Segmentation

https://arxiv.org/pdf/2112.01527.pdf

  • Unify panoptic, instance, and semantic segmentation

  • image_20240201_225400.png

Masks are learned through training.
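The core mechanism is masked attention: each object query attends only within the foreground region of its own mask predicted by the previous decoder layer. A rough sketch of that idea, assuming sigmoid mask logits binarized at 0.5 (an illustrative choice, not pinned to the paper's reference code):

```python
# A rough sketch of masked cross-attention; shapes and the fallback rule
# for empty masks are illustrative.
import torch
import torch.nn.functional as F

def masked_cross_attention(q, k, v, mask_logits, thresh=0.5):
    # q: (B, Nq, d) object queries; k, v: (B, HW, d) pixel features
    # mask_logits: (B, Nq, HW) per-query mask predicted by the previous layer
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (B, Nq, HW)
    keep = mask_logits.sigmoid() > thresh                   # foreground region
    empty = ~keep.any(dim=-1, keepdim=True)                 # all-background queries
    # Restrict attention to each query's own foreground; if a query's mask is
    # empty, fall back to unmasked attention for that query.
    scores = scores.masked_fill(~keep & ~empty, float("-inf"))
    return F.softmax(scores, dim=-1) @ v                    # (B, Nq, d)
```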

SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers

https://arxiv.org/pdf/2105.15203.pdf

image_20240202_093400.png

Object Detection

End-to-End Object Detection with Transformers
image_20240202_102800.png
MViTv2: Improved Multi-scale Vision Transformers for Classification and Detection
image_20240202_110000.png
image_20240202_110300.png

Enhancing Computational Efficiency of Vision Transformers for Mobile Devices through Effective Design Principles

Mobile-Former: Bridging MobileNet and Transformer

image_20240202_205700.png
image_20240202_210500.png

Rethinking Vision Transformers for MobileNet Size and Speed

  • proposed EfficientFormerV2

  • Token Mixers

  • Search Space Refinement

  • MHSA Improvements

  • Attention on Higher Resolution

  • Dual-path attention downsampling (CNN + attention computed in parallel and summed; see the sketch below)
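A loose sketch of the dual-path downsampling idea from the last bullet: a local convolutional branch and a global branch downsample in parallel and are summed. The module choices below (depth-wise conv path; average pooling with a 1x1 conv standing in for the attention branch) are assumptions for illustration, not EfficientFormerV2's exact design.

```python
# Illustrative dual-path downsampling; assumes even spatial dimensions so
# both branches produce matching output sizes.
import torch.nn as nn

class DualPathDownsample(nn.Module):
    def __init__(self, dim_in, dim_out):
        super().__init__()
        # Local path: stride-2 depth-wise conv + point-wise projection
        self.local = nn.Sequential(
            nn.Conv2d(dim_in, dim_in, 3, stride=2, padding=1, groups=dim_in),
            nn.Conv2d(dim_in, dim_out, 1),
        )
        # Global path: downsample first, then mix; a real implementation
        # would use multi-head self-attention here instead of a 1x1 conv.
        self.pool = nn.AvgPool2d(2)
        self.attn = nn.Conv2d(dim_in, dim_out, 1)

    def forward(self, x):
        return self.local(x) + self.attn(self.pool(x))
```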

MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer

image_20240202_221300.png
  • No need for patch embedding

EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers

image_20240203_1005.png
image_20240203_1007.png

Separable Self-attention for Mobile Vision Transformers: MobileViTv2

image_20240203_103900.png
image_20240203_105900.png
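MobileViTv2's separable self-attention replaces the quadratic token-to-token attention map with a single learned context score per token, making the cost linear in the number of tokens. A compact sketch, with illustrative layer names:

```python
# Separable self-attention sketch: context scores collapse the query branch
# to one scalar per token, so no NxN attention matrix is formed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeparableSelfAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.to_i = nn.Linear(dim, 1)    # context-score branch
        self.to_k = nn.Linear(dim, dim)  # key branch
        self.to_v = nn.Linear(dim, dim)  # value branch
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                               # x: (B, N, d)
        cs = F.softmax(self.to_i(x), dim=1)             # (B, N, 1) over tokens
        cv = (cs * self.to_k(x)).sum(1, keepdim=True)   # (B, 1, d) context vector
        return self.out(F.relu(self.to_v(x)) * cv)      # broadcast to all tokens
```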

Summary

  • In this work, we revisit the design of lightweight CNNs by incorporating the architectural choices of lightweight ViTs.

  • Our research aims to narrow the divide between lightweight CNNs and lightweight ViTs, and to highlight the potential of the former for deployment on mobile devices compared to the latter.

We begin with a standard lightweight CNN, i.e., MobileNetV3-L, and gradually “modernize” its architecture by incorporating the efficient architectural designs of lightweight ViTs.

RepViT has a MetaFormer structure, but is composed entirely of convolutions.

Generic Architecture

Methodology

Preliminary

  • We utilize the iPhone 12 as the test device and Core ML Tools as the compiler.

  • We measure the actual on-device latency of models as the benchmark metric (see the conversion sketch after this list).

  • We employ GeLU activations in the MobileNetV3-L model.
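A minimal sketch of this measurement setup, assuming a recent coremltools: trace the PyTorch model and convert it to a Core ML package; the latency itself is then measured on-device (e.g., via Xcode's performance tooling), outside Python. The model choice and file name here are illustrative.

```python
# Trace a PyTorch model and convert it with Core ML Tools for on-device
# benchmarking; follows the public coremltools conversion API.
import torch
import torchvision
import coremltools as ct

model = torchvision.models.mobilenet_v3_large(weights=None).eval()
example = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(model, example)

mlmodel = ct.convert(traced, inputs=[ct.TensorType(shape=example.shape)])
mlmodel.save("mobilenetv3_l.mlpackage")
```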

image_20240203_232700.png

Details about the data-augmentation tricks mentioned above.

Block Design

image_20240204_000500.png
image_20240204_000600.png

The 1x1 expansion convolution increases the number of channels in the feature maps, while the 1x1 projection convolution reduces the dimensionality back down. Together they form the channel mixer.

The depth-wise convolution serves as the token mixer.

The squeeze-and-excitation module is also moved up to be placed after the depth-wise filters, as it depends on spatial information interaction.
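A simplified sketch of this block layout: a depth-wise 3x3 conv as token mixer with SE placed right after it, followed by a 1x1 expansion / 1x1 projection pair as channel mixer. The structural re-parameterization branches of the real RepViT block are omitted, and module names are illustrative.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    def __init__(self, dim, r=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim // r, 1), nn.ReLU(),
            nn.Conv2d(dim // r, dim, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)

class RepViTBlockSketch(nn.Module):
    def __init__(self, dim, expansion=2):  # reduced expansion ratio (see below)
        super().__init__()
        hidden = dim * expansion
        # Token mixer: spatial interaction via depth-wise conv, then SE
        self.token_mixer = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.BatchNorm2d(dim),
            SqueezeExcite(dim),
        )
        # Channel mixer: 1x1 expansion, then 1x1 projection
        self.channel_mixer = nn.Sequential(
            nn.Conv2d(dim, hidden, 1), nn.BatchNorm2d(hidden), nn.GELU(),
            nn.Conv2d(hidden, dim, 1), nn.BatchNorm2d(dim),
        )

    def forward(self, x):
        x = x + self.token_mixer(x)
        return x + self.channel_mixer(x)
```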

Reducing the expansion ratio and increasing width

Macro design

image_20240204_110300.png

The patchifying operation results in ViTs’ substandard optimizability and sensitivity to training recipes. To address these issues, prior work suggests using a small number of stacked stride-2 3x3 convolutions as an alternative architectural choice for the stem, known as early convolutions.
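A small sketch of such an early-convolutions stem: two stacked stride-2 3x3 convolutions reduce the resolution by 4x in place of a single large-stride patchify conv. Channel widths here are illustrative.

```python
import torch.nn as nn

def early_conv_stem(in_chans=3, dim=48):
    # Two stride-2 3x3 convs: overall 4x spatial downsampling
    return nn.Sequential(
        nn.Conv2d(in_chans, dim // 2, 3, stride=2, padding=1),
        nn.BatchNorm2d(dim // 2),
        nn.GELU(),
        nn.Conv2d(dim // 2, dim, 3, stride=2, padding=1),
        nn.BatchNorm2d(dim),
    )
```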

Deeper down-sampling layers.

image_20240204_114100.png
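A hedged sketch of a deeper downsampling layer in this spirit: spatial reduction via a stride-2 depth-wise conv, channel change via a 1x1 conv, and an FFN to deepen the transition. The exact ordering and the re-parameterized branches of RepViT's actual layer are simplified away.

```python
import torch.nn as nn

def downsample_layer(dim_in, dim_out):
    return nn.Sequential(
        nn.Conv2d(dim_in, dim_in, 3, stride=2, padding=1, groups=dim_in),  # spatial
        nn.BatchNorm2d(dim_in),
        nn.Conv2d(dim_in, dim_out, 1),                                     # channels
        nn.BatchNorm2d(dim_out),
        nn.Conv2d(dim_out, dim_out * 2, 1), nn.GELU(),                     # FFN expand
        nn.Conv2d(dim_out * 2, dim_out, 1),                                # FFN project
    )
```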

Simple classifier.

image_20240204_114700.png
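The simple classifier reduces the head to global average pooling followed by a single linear layer, in contrast to MobileNetV3's heavier head (an extra 1x1 conv and hidden FC before the final classifier). A minimal sketch:

```python
import torch.nn as nn

def simple_classifier(dim, num_classes=1000):
    return nn.Sequential(
        nn.AdaptiveAvgPool2d(1),  # global average pooling
        nn.Flatten(),
        nn.Linear(dim, num_classes),
    )
```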

Overall stage ratio

Stage ratio represents the ratio of the number of blocks in different stages, thereby indicating the distribution of computation across them.

The original stage ratio of MobileNetV3-L is 1:2:5:2. We instead follow [19] to employ a more optimal stage ratio of 1:1:7:1, and then increase the network depth to 2:2:14:2, achieving a deeper layout [19, 24]. This step increases the top-1 accuracy to 76.9% with a latency of 0.91 ms.
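Spelling out the arithmetic above: the 1:1:7:1 ratio is scaled by 2 to reach the deeper 2:2:14:2 block layout.

```python
stage_ratio = (1, 1, 7, 1)
depths = tuple(2 * r for r in stage_ratio)  # blocks per stage
assert depths == (2, 2, 14, 2)
```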

Micro design

Kernel size selection

Large kernel-sized convolutions are not friendly for mobile devices, due to their computational complexity and memory access costs.

Squeeze-and-excitation layer placement

Network architecture

We develop multiple RepViT variants, including RepViT-M0.9/M1.0/M1.1/M1.5/M2.3. The suffix “-MX” means that the latency of the model is X ms. RepViT-M0.9 is the outcome of the “modernizing” process applied to MobileNetV3-L. The different variants are distinguished by the number of channels and the number of blocks within each stage.

Last modified: 10 March 2024