Beyond the CLS Token: Image Reranking using Pretrained Vision Transformers

Chao Zhang (Toshiba Europe Limited),* Stephan Liwicki (Toshiba Europe Limited), Roberto Cipolla (University of Cambridge)
The 33rd British Machine Vision Conference


We propose to leverage the structural similarity of pretrained vision transformers for image retrieval reranking. Vision transformers have become the dominant architecture in many computer vision tasks. However, relying on a global representation (the $\textbf{CLS}$ token) limits interpretability. Since not all patches are equally important for image retrieval, our key idea is to exploit the trained model to obtain optimal spatial weights for the patch tokens. To understand the relationship between global and local representations of vision transformers, we compare multiple transformer architectures against ResNet using similarity as an indicative measure. Our analysis reveals that the use of convolutions inside vision transformers is vital for learning patch embeddings suitable for structural similarity. We also find that local patch similarity equipped with an optimal transport solver can improve on image retrieval that uses global similarity alone. Our evaluations with off-the-shelf pretrained vision transformers suggest that structural similarity not only improves retrieval performance without re-training, but also provides visualization cues for interpretable image similarity. Evaluations on three benchmarks show that our proposed structural approach outperforms the state of the art for interpretable reranking.
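As an illustration of the reranking idea described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes the CLS and patch tokens have already been extracted from a pretrained vision transformer as NumPy arrays. The function names, the uniform patch weights (a placeholder for the model-derived spatial weights mentioned above), the entropy-regularised Sinkhorn solver, and the mixing parameter `alpha` are all hypothetical choices made for this example.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def sinkhorn_plan(cost, a, b, reg=0.1, n_iters=200):
    """Entropy-regularised optimal transport plan via Sinkhorn iterations."""
    K = np.exp(-cost / reg)  # Gibbs kernel of the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)  # rescale to match column marginals
        u = a / (K @ v)    # rescale to match row marginals
    return u[:, None] * K * v[None, :]


def structural_similarity(patches_q, patches_g, reg=0.1):
    """Transport-weighted cosine similarity between two sets of patch tokens."""
    q = l2_normalize(patches_q)
    g = l2_normalize(patches_g)
    sim = q @ g.T  # (n_q, n_g) patch-to-patch cosine similarity
    # Assumption: uniform patch weights stand in for learned spatial weights.
    a = np.full(q.shape[0], 1.0 / q.shape[0])
    b = np.full(g.shape[0], 1.0 / g.shape[0])
    plan = sinkhorn_plan(1.0 - sim, a, b, reg=reg)
    return float(np.sum(plan * sim))


def rerank(cls_q, patches_q, cls_gallery, patches_gallery, top_k=10, alpha=0.5):
    """Rank by CLS cosine similarity, then re-score the top_k with patch similarity."""
    global_sim = l2_normalize(cls_gallery) @ l2_normalize(cls_q)
    order = np.argsort(-global_sim)
    head, tail = order[:top_k], order[top_k:]
    combined = np.array([
        alpha * global_sim[i]
        + (1.0 - alpha) * structural_similarity(patches_q, patches_gallery[i])
        for i in head
    ])
    # Only the shortlist is reordered; the tail keeps its global ranking.
    return np.concatenate([head[np.argsort(-combined)], tail])
```

Calling `rerank(cls_q, patches_q, cls_gallery, patches_gallery)` returns gallery indices ordered from most to least similar. The transport plan inside `structural_similarity` can also be reshaped onto the patch grid as a rough visualization cue for which regions were matched.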



author    = {Chao Zhang and Stephan Liwicki and Roberto Cipolla},
title     = {Beyond the CLS Token: Image Reranking using Pretrained Vision Transformers},
booktitle = {33rd British Machine Vision Conference 2022, {BMVC} 2022, London, UK, November 21-24, 2022},
publisher = {{BMVA} Press},
year      = {2022},
url       = {}

Copyright © 2022 The British Machine Vision Association and Society for Pattern Recognition
