Polishing Network for Decoding of Higher-Quality Diverse Image Captions

Yue Zheng (Tsinghua University),* Ya-Li Li (Tsinghua University), Shengjin Wang (Tsinghua University)
The 33rd British Machine Vision Conference


Diverse image caption generation has attracted increasing attention in recent research. Existing methods usually adopt a single-pass decoding process, in which the word sampled at each time step is never modified afterwards, so a single mistaken word can affect the whole subsequent sequence. Moreover, decoders in single-pass approaches only have access to the previously generated words and therefore cannot compose a sentence with an understanding of its whole content. Inspired by the multi-pass process by which humans generate descriptions, we propose a novel framework with a Polishing Network (PN) for decoding diverse image captions. PN refines the raw descriptions generated by an original diverse image caption generation model. The refined sentences correct some of the incorrect words and phrases in the raw descriptions while still describing similar content. We also propose a novel approach for training PN: the raw-refined caption pairs used as training samples are obtained by sampling both the input and output words of the original model during decoding. Experimental results show that the proposed approach generates high-quality diverse image captions and achieves a better quality-diversity trade-off. Compared with several existing methods for diverse image caption generation, the proposed method achieves state-of-the-art performance, with oracle BLEU-4/CIDEr scores of 0.534/1.709 at sample size 20 on the MS COCO dataset.
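As a reading aid for the oracle scores quoted in the abstract, the following is a minimal sketch of how a best-of-N ("oracle") score is commonly computed: for each image, keep the highest-scoring caption among the N sampled ones, then average over images. The abstract does not spell out the evaluation protocol, so this is an assumption about the usual convention rather than the authors' code. The names `toy_overlap` and `oracle_score` are hypothetical, and `toy_overlap` is only a stand-in for BLEU-4 or CIDEr, which in practice come from a caption evaluation toolkit.

```python
from typing import Callable, Dict, List


def toy_overlap(candidate: str, references: List[str]) -> float:
    """Hypothetical stand-in metric (NOT BLEU-4 or CIDEr):
    best unigram Jaccard overlap between the candidate and any reference."""
    cand_words = set(candidate.lower().split())
    best = 0.0
    for ref in references:
        ref_words = set(ref.lower().split())
        union = cand_words | ref_words
        if union:
            best = max(best, len(cand_words & ref_words) / len(union))
    return best


def oracle_score(
    samples: Dict[str, List[str]],
    references: Dict[str, List[str]],
    metric: Callable[[str, List[str]], float] = toy_overlap,
) -> float:
    """Average, over images, of the best metric value among that image's samples."""
    per_image_best = [
        max(metric(caption, references[image_id]) for caption in captions)
        for image_id, captions in samples.items()
    ]
    return sum(per_image_best) / len(per_image_best)
```

With a sample size of 20, each list in `samples` would hold 20 captions per image. Because only the best caption per image counts, an oracle score measures how good the sample set can be at its best, which is why it is usually reported alongside diversity measures rather than on its own.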



author    = {Yue Zheng and Ya-Li Li and Shengjin Wang},
title     = {Polishing Network for Decoding of Higher-Quality Diverse Image Captions},
booktitle = {33rd British Machine Vision Conference 2022, {BMVC} 2022, London, UK, November 21-24, 2022},
publisher = {{BMVA} Press},
year      = {2022},
url       = {https://bmvc2022.mpi-inf.mpg.de/0601.pdf}

Copyright © 2022 The British Machine Vision Association and Society for Pattern Recognition
