LcT: Locally-Enhanced Cross-Window Vision Transformer

Canhui Wei (Southwest University), Huiwei Wang (Southwest University)*
The 33rd British Machine Vision Conference


Since the introduction of vision transformers (ViTs), ViT-based models have outperformed convolutional neural networks (CNNs) on various vision tasks. However, competitive ViTs are parameter-heavy and computationally expensive, which restricts their applicability on resource-limited devices. In this paper, we propose LcT, a lightweight hybrid architecture that leverages the advantages of both CNNs and ViTs. It achieves a favourable speed-accuracy trade-off with few parameters and low computational cost. LcT is built on two primary operations: 1) it divides the feature map into non-overlapping windows and uses the self-attention mechanism to capture cross-window information; 2) it captures local information within each window using a stack of local convolutional layers. Together, these two operations complete global information interaction. Our experimental results indicate that LcT is effective and efficient on classification and downstream vision tasks.
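The two operations described above can be sketched at the shape level. The following NumPy snippet is an illustrative toy only, not the paper's implementation: the helper names (`window_partition`, `cross_window_attention`, `local_enhance`), the mean-pooling of each window into a single token, and the parameter-free 3x3 mean filter standing in for the stack of local convolutional layers are all assumptions made for illustration.

```python
import numpy as np

def window_partition(x, ws):
    # x: (H, W, C) feature map -> (num_windows, ws*ws, C)
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, C)

def cross_window_attention(windows):
    # windows: (N, ws*ws, C). Pool each window to one token, then let
    # the N window tokens attend to each other (cross-window mixing).
    tokens = windows.mean(axis=1)                    # (N, C)
    scale = tokens.shape[-1] ** -0.5
    attn = tokens @ tokens.T * scale                 # (N, N)
    attn = np.exp(attn - attn.max(axis=-1, keepdims=True))
    attn = attn / attn.sum(axis=-1, keepdims=True)
    mixed = attn @ tokens                            # (N, C)
    # broadcast each window's mixed token back over its ws*ws positions
    return windows + mixed[:, None, :]

def local_enhance(windows, ws):
    # parameter-free 3x3 mean filter inside each window, a stand-in
    # for the learned local convolutional layers
    N, L, C = windows.shape
    x = windows.reshape(N, ws, ws, C)
    pad = np.pad(x, ((0, 0), (1, 1), (1, 1), (0, 0)), mode="edge")
    out = np.zeros_like(x)
    for i in range(ws):
        for j in range(ws):
            out[:, i, j] = pad[:, i:i + 3, j:j + 3].mean(axis=(1, 2))
    return out.reshape(N, L, C)

# toy 8x8 feature map with 4 channels, window size 4 -> 4 windows
x = np.random.randn(8, 8, 4)
w = window_partition(x, 4)          # (4, 16, 4)
w = cross_window_attention(w)       # cross-window information
w = local_enhance(w, 4)             # local information within windows
print(w.shape)                      # (4, 16, 4)
```

The sketch only shows how the two procedures compose: attention mixes information across windows, while the convolution-like step refines each window locally.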



@inproceedings{Wei_2022_BMVC,
  author    = {Canhui Wei and Huiwei Wang},
  title     = {LcT: Locally-Enhanced Cross-Window Vision Transformer},
  booktitle = {33rd British Machine Vision Conference 2022, {BMVC} 2022, London, UK, November 21-24, 2022},
  publisher = {{BMVA} Press},
  year      = {2022},
  url       = {}
}

Copyright © 2022 The British Machine Vision Association and Society for Pattern Recognition
