SP-ViT: Learning 2D Spatial Priors for Vision Transformers


Yuxuan Zhou* (University of Mannheim), Wangmeng Xiang (The Hong Kong Polytechnic University), Chao Li (Alibaba), Biao Wang (Alibaba), Xihan Wei (Alibaba), Lei Zhang (The Hong Kong Polytechnic University), Margret Keuper (University of Mannheim), Xian-Sheng Hua (DAMO Academy, Alibaba Group)
The 33rd British Machine Vision Conference

Abstract

Recently, transformers have shown great potential in image classification and established state-of-the-art results on the ImageNet benchmark. However, compared to CNNs, transformers converge slowly and are prone to overfitting in low-data regimes due to the lack of spatial inductive biases. Such spatial inductive biases can be especially beneficial since the 2D structure of an input image is not well preserved in transformers. In this work, we present Spatial Prior–enhanced Self-Attention (SP-SA), a novel variant of vanilla Self-Attention (SA) tailored for vision transformers. Spatial Priors (SPs) are our proposed family of inductive biases that highlight certain groups of spatial relations. Unlike convolutional inductive biases, which are forced to focus exclusively on hard-coded local regions, our proposed SPs are learned by the model itself and take a variety of spatial relations into account. Specifically, the attention score is calculated with emphasis on certain kinds of spatial relations at each head, and such learned spatial foci can be complementary to each other. Based on SP-SA we propose the SP-ViT family, which consistently outperforms other ViT models with similar GFlops or parameters. Our largest model, SP-ViT-L, achieves 86.3% Top-1 accuracy while reducing the number of parameters by almost 50% compared to the previous state-of-the-art model (150M for SP-ViT-L↑384 vs. 271M for CaiT-M-36↑384) among all ImageNet-1K models trained at 224 × 224 and fine-tuned at 384 × 384 resolution without extra data. Code can be found at https://github.com/ZhouYuxuanYX/SP-ViT.
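
To make the mechanism concrete, the following is a minimal PyTorch sketch of how a learned spatial prior could be injected into multi-head self-attention: each head owns a learnable table of biases indexed by the relative 2D offset between patch tokens, so different heads can learn to emphasise different spatial relations. The class name SpatialPriorSelfAttention and parameters such as spatial_prior and grid_size are illustrative assumptions, not the paper's reference implementation; the actual SP-SA formulation is given in the paper and the linked repository.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialPriorSelfAttention(nn.Module):
    """Self-attention with a learnable 2D spatial prior added to the logits (hypothetical sketch)."""

    def __init__(self, dim, num_heads, grid_size):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.grid_size = grid_size  # patches per side, e.g. 14 for a 224x224 image with 16x16 patches

        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

        # One learnable bias per head and per relative offset: (2g - 1)^2 possible offsets.
        num_rel = (2 * grid_size - 1) ** 2
        self.spatial_prior = nn.Parameter(torch.zeros(num_heads, num_rel))

        # Precompute the relative-offset index for every pair of patch tokens.
        coords = torch.stack(torch.meshgrid(
            torch.arange(grid_size), torch.arange(grid_size), indexing="ij"
        )).flatten(1)                                     # (2, N)
        rel = coords[:, :, None] - coords[:, None, :]     # (2, N, N) signed offsets
        rel = rel + (grid_size - 1)                       # shift to non-negative
        idx = rel[0] * (2 * grid_size - 1) + rel[1]       # (N, N) flat offset index
        self.register_buffer("rel_idx", idx)

    def forward(self, x):
        # x: (B, N, C) patch tokens with N == grid_size ** 2 (class token omitted for simplicity)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)              # each (B, H, N, d)

        attn = (q @ k.transpose(-2, -1)) * self.scale     # (B, H, N, N)
        bias = self.spatial_prior[:, self.rel_idx]        # (H, N, N), one prior per head
        attn = F.softmax(attn + bias, dim=-1)

        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

In this sketch, a 224 × 224 input with 16 × 16 patches corresponds to grid_size = 14 and 196 patch tokens; a class token, if present, would need its prior handled separately (e.g. a zero bias).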


Citation

@inproceedings{Zhou_2022_BMVC,
author    = {Yuxuan Zhou and Wangmeng Xiang and Chao Li and Biao Wang and Xihan Wei and Lei Zhang and Margret Keuper and Xian-Sheng Hua},
title     = {SP-ViT: Learning 2D Spatial Priors for Vision Transformers},
booktitle = {33rd British Machine Vision Conference 2022, {BMVC} 2022, London, UK, November 21-24, 2022},
publisher = {{BMVA} Press},
year      = {2022},
url       = {https://bmvc2022.mpi-inf.mpg.de/0564.pdf}
}


Copyright © 2022 The British Machine Vision Association and Society for Pattern Recognition
