Masked Vision-Language Transformers for Scene Text Recognition

Jie Wu (Westone Information Industry INC.)*, Ying Peng (Westone Information Industry INC.), Shengming Zhang (Westone Information Industry INC.), Weigang Qi (Westone Information Industry INC.), Zhang Jian (Westone Information Industry INC.)
The 33rd British Machine Vision Conference


Scene text recognition (STR) enables computers to recognize and read text in a variety of real-world scenes. Recent STR models benefit from considering linguistic information in addition to visual cues. We propose Masked Vision-Language Transformers (MVLT), a novel model that captures both explicit and implicit linguistic information. Its encoder is a Vision Transformer, and its decoder is a multi-modal Transformer. MVLT is trained in two stages: in the first stage, we design an STR-tailored pretraining method based on a masking strategy; in the second stage, we fine-tune the model and adopt an iterative correction method to improve performance. MVLT attains superior results compared to state-of-the-art STR models on several benchmarks. Our code and model are available at
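As a rough illustration of the two ideas named in the abstract, the sketch below shows one way character-level masking and an iterative correction loop could look. It is not the authors' implementation: the mask token `[MASK]`, the helper names `mask_characters` and `iterative_correction`, the masking ratio, the stopping rule, and the `predict` callable (a stand-in for a trained recognizer that maps a textual hint to a refined prediction) are all assumptions made for this example.

```python
import random
from typing import Callable, List

MASK = "[MASK]"  # hypothetical mask token, not taken from the paper


def mask_characters(label: str, ratio: float, rng: random.Random) -> List[str]:
    """Replace roughly `ratio` of the characters in `label` with MASK."""
    tokens = list(label)
    if not tokens:
        return tokens
    # Mask at least one character, and never more than the label contains.
    n_mask = min(len(tokens), max(1, round(len(tokens) * ratio)))
    for i in rng.sample(range(len(tokens)), n_mask):
        tokens[i] = MASK
    return tokens


def iterative_correction(predict: Callable[[str], str], max_iters: int = 3) -> str:
    """Feed each prediction back as the textual hint until it stops changing.

    `predict` is an assumed interface standing in for a trained model.
    """
    text = ""  # start without any linguistic hint
    for _ in range(max_iters):
        refined = predict(text)
        if refined == text:
            break  # prediction has stabilised
        text = refined
    return text
```

In this sketch, a pretraining step would ask the model to recover the masked positions, and inference would call `iterative_correction` with the fine-tuned model wrapped as `predict`. Whether MVLT masks characters, image patches, or both, and how many correction rounds it uses, is not stated in the abstract.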



