Probing Visual-Audio Representation for Video Highlight Detection via Hard-Pairs Guided Contrastive Learning


Shuaicheng Li (SenseTime Research), Feng Zhang (Fudan University), Kunlin Yang (SenseTime Group Limited), Lingbo Liu (Hong Kong Polytechnic University), Shinan Liu (SenseTime Group Limited), Jun Hou (SenseTime Group Limited), Shuai Yi (SenseTime Group Limited)
The 33rd British Machine Vision Conference

Abstract

Video highlight detection (VHD) is a crucial yet challenging problem that aims to identify the interesting moments in untrimmed videos. The key to this task lies in effective video representations that jointly pursue two goals: 1) cross-modal representation learning and 2) fine-grained feature discrimination. For goal 1), the dominant VHD models adopt cross-attention-based transformers to learn audio-visual information and inter-modality alignment. They assume that multi-modal signals are synchronized, which may not hold in practice due to spurious noise and appearance shift in untrimmed videos. To relieve this problem, we propose a cross-modality co-occurrence encoding that considers not only single-modality visual/audio correlations but also asynchronous cross-modal correlations. We also explore additional global contextual information abstracted from local regions to further promote inter-modality learning. For goal 2), to enlarge the discriminative power of the feature embedding, we propose a hard-pairs guided contrastive learning (HPCL) scheme that reflects intrinsic semantic representations. HPCL employs a hard-pairs sampling strategy to mine hard segment samples, which improves feature discrimination and provides significant gradient information. Extensive experiments conducted on two benchmarks demonstrate the effectiveness and superiority of our proposed methods compared with other state-of-the-art methods.
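
To make the idea of hard-pair mining for contrastive learning more concrete, the following is a minimal, illustrative sketch and not the paper's HPCL implementation. It assumes segment embeddings with binary highlight labels, and it swaps in a generic InfoNCE-style loss in which each anchor is paired with its least similar same-label segment (hard positive) and its most similar different-label segments (hard negatives). The function name `hard_pair_contrastive_loss` and the parameters `num_hard_negatives` and `temperature` are hypothetical and chosen only for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors (assumed non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def hard_pair_contrastive_loss(embeddings, labels,
                               num_hard_negatives=2, temperature=0.1):
    """Generic InfoNCE-style loss over mined hard pairs (illustrative only).

    embeddings: list of equal-length float vectors, one per video segment.
    labels:     list of 0/1 highlight labels, one per segment.
    """
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        # Similarities to other segments with the same label (positives).
        pos = [cosine(anchor, e)
               for j, (e, l) in enumerate(zip(embeddings, labels))
               if j != i and l == label]
        # Similarities to segments with the other label (negatives).
        neg = [cosine(anchor, e)
               for e, l in zip(embeddings, labels) if l != label]
        if not pos or not neg:
            continue  # this anchor has no valid pair to contrast

        hard_pos = min(pos)                                    # least similar positive
        hard_negs = sorted(neg, reverse=True)[:num_hard_negatives]  # most similar negatives

        logits = [hard_pos / temperature] + [s / temperature for s in hard_negs]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denom - logits[0])  # -log softmax of the positive

    return sum(losses) / len(losses) if losses else 0.0


if __name__ == "__main__":
    segs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(hard_pair_contrastive_loss(segs, [1, 1, 0, 0]))  # labels match clusters
    print(hard_pair_contrastive_loss(segs, [1, 0, 1, 0]))  # labels cut across clusters
```

In this sketch the loss is small when same-label segments already cluster together and large when the hardest pairs are confused, which is the sense in which hard pairs contribute most of the gradient signal.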

Citation

@inproceedings{Li_2022_BMVC,
author    = {Shuaicheng Li and Feng Zhang and Kunlin Yang and Lingbo Liu and Shinan Liu and Jun Hou and Shuai Yi},
title     = {Probing Visual-Audio Representation for Video Highlight Detection via Hard-Pairs Guided Contrastive Learning},
booktitle = {33rd British Machine Vision Conference 2022, {BMVC} 2022, London, UK, November 21-24, 2022},
publisher = {{BMVA} Press},
year      = {2022},
url       = {https://bmvc2022.mpi-inf.mpg.de/0709.pdf}
}


Copyright © 2022 The British Machine Vision Association and Society for Pattern Recognition
The British Machine Vision Conference is organised by The British Machine Vision Association and Society for Pattern Recognition. The Association is a Company limited by guarantee, No.2543446, and a non-profit-making body, registered in England and Wales as Charity No.1002307 (Registered Office: Dept. of Computer Science, Durham University, South Road, Durham, DH1 3LE, UK).
