Describe Your Facial Expressions by Linking Image Encoders and Large Language Models

Yujian Yuan (Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Jiabei Zeng (Institute of Computing Technology, Chinese Academy of Sciences),* Shiguang Shan (Institute of Computing Technology, Chinese Academy of Sciences)
The 34th British Machine Vision Conference


This paper presents a novel task: describing the facial expression shown in a face image in natural language, capturing nuances of facial actions and emotional states beyond traditional emotion categories or facial action units (AUs). To build a facial expression captioning model, we propose a three-stage training framework that trains a vision-to-language model on synthesized image-text pairs using BLIP-2 pre-training techniques. Because training image-text pairs for facial expression captioning are lacking, we propose a strategy that synthesizes and combines captions using GPT-3.5 and existing annotations of either emotion categories or AUs. Experiments show that our method generates captions that describe details of facial actions and emotions, as well as the inferential relationship between them, even when those emotions are not present in the training data. We also show that the vision-to-language task improves the performance of the intermediate visual features on both AU detection and emotion classification. The code and trained models are available at:


author    = {Yujian Yuan and Jiabei Zeng and Shiguang Shan},
title     = {Describe Your Facial Expressions by Linking Image Encoders and Large Language Models},
booktitle = {34th British Machine Vision Conference 2023, {BMVC} 2023, Aberdeen, UK, November 20-24, 2023},
publisher = {BMVA},
year      = {2023},
url       = {}

Copyright © 2023 The British Machine Vision Association and Society for Pattern Recognition
