Hugs Are Better Than Handshakes: Unsupervised Cross-Modal Transformer Hashing with Multi-granularity Alignment


Jinpeng Wang* (Tsinghua University), Ziyun Zeng (Tsinghua University), Bin Chen (Harbin Institute of Technology, Shenzhen), Yuting Wang (Tsinghua University), Dongliang Liao (Data Quality Team, WeChat, Tencent Inc., China), Gongfu Li (Tencent Inc.), Yiru Wang (Tencent Inc.), Shu-Tao Xia (Tsinghua University)
The 33rd British Machine Vision Conference

Abstract

The goal of unsupervised cross-modal hashing (UCMH) is to map different modalities into a semantics-preserving Hamming space without label supervision. Existing deep approaches mainly adopt classic convolutional neural networks and multi-layer perceptrons to encode images and texts, which are inadequate for semantic extraction and struggle to generate high-quality hash codes. Motivated by recent advances in transformers, we present the first investigation of transformer-based UCMH, which learns to generate hash codes from global representation (i.e., [CLS]) tokens. We propose hugging, a multi-granularity alignment framework for transformer-based UCMH learning. In particular, during training, apart from directly aligning hash codes derived from global tokens, hugging further performs fine-grained alignment over content token sequences, fully exploiting the structural semantics captured by transformer architectures. Unifying global and fine-grained alignment enables complete cross-modal learning, helping to bridge heterogeneous modality gaps and providing solid self-supervision. As an instantiation of the proposed hugging framework, we build a simple HuggingHash model with a contrastive hashing objective and demonstrate its comprehensive merits on three benchmark datasets. Moreover, we adapt several state-of-the-art hashing methods to the hugging framework, verifying that it is general and practical in benefiting transformer-based UCMH.
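
The abstract describes the two alignment granularities in words; the following is a minimal PyTorch-style sketch of the idea, not the authors' released code. The code length, temperature, token counts, and the max-matching rule used for fine-grained alignment are illustrative assumptions.

# Minimal sketch of multi-granularity alignment for transformer-based UCMH.
# Not the official HuggingHash implementation; shapes and the token-matching
# rule are assumptions for illustration.
import torch
import torch.nn.functional as F

def global_alignment_loss(img_cls, txt_cls, tau=0.07):
    """Contrastive (InfoNCE-style) alignment of continuous hash codes
    derived from the global [CLS] tokens of the two modality encoders."""
    # tanh relaxes the non-differentiable sign() used at retrieval time.
    img_code = F.normalize(torch.tanh(img_cls), dim=-1)  # (B, K)
    txt_code = F.normalize(torch.tanh(txt_cls), dim=-1)  # (B, K)
    logits = img_code @ txt_code.t() / tau               # (B, B) similarities
    targets = torch.arange(logits.size(0))               # diagonal = positives
    # Symmetric loss: image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def fine_grained_alignment_loss(img_tokens, txt_tokens):
    """Fine-grained alignment over content token sequences: each token is
    matched to its most similar cross-modal token (an assumed heuristic)."""
    img_tokens = F.normalize(img_tokens, dim=-1)  # (B, Ni, D)
    txt_tokens = F.normalize(txt_tokens, dim=-1)  # (B, Nt, D)
    sim = torch.einsum('bid,bjd->bij', img_tokens, txt_tokens)  # (B, Ni, Nt)
    # Encourage every token to find a strong match in the other modality.
    i2t = sim.max(dim=2).values.mean()
    t2i = sim.max(dim=1).values.mean()
    return -(i2t + t2i) / 2

if __name__ == "__main__":
    B, K, Ni, Nt, D = 8, 64, 49, 32, 256  # batch, bits, token counts, dim
    g = global_alignment_loss(torch.randn(B, K), torch.randn(B, K))
    f = fine_grained_alignment_loss(torch.randn(B, Ni, D),
                                    torch.randn(B, Nt, D))
    print(f"global: {g.item():.4f}, fine-grained: {f.item():.4f}")

At retrieval time, the continuous codes would be binarized with sign() to obtain Hamming-space hash codes, as the abstract describes.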

Citation

@inproceedings{Wang_2022_BMVC,
author    = {Jinpeng Wang and Ziyun Zeng and Bin Chen and Yuting Wang and Dongliang Liao and Gongfu Li and Yiru Wang and Shu-Tao Xia},
title     = {Hugs Are Better Than Handshakes: Unsupervised Cross-Modal Transformer Hashing with Multi-granularity Alignment},
booktitle = {33rd British Machine Vision Conference 2022, {BMVC} 2022, London, UK, November 21-24, 2022},
publisher = {{BMVA} Press},
year      = {2022},
url       = {https://bmvc2022.mpi-inf.mpg.de/1035.pdf}
}


