Segmentation Assisted U-shaped Multi-scale Transformer for Crowd Counting

Yifei Qian (University of St Andrews)*, Liangfei Zhang (University of St Andrews), Xiaopeng Hong (Harbin Institute of Technology), Carl Donovan (University of St Andrews), Ognjen Arandjelovic (University of St Andrews)
The 33rd British Machine Vision Conference


Vision-based crowd counting has made remarkable progress in recent years thanks to the development of CNNs. However, the field has hit a bottleneck because CNNs, by their nature, are limited by locally attentive receptive fields and are incapable of modeling long-range dependencies. To address this problem, we introduce a multi-scale transformer-based crowd counting network, termed Crowd U-Transformer (CUT), which extracts and aggregates semantic and spatial features from multiple levels. In this design, we use crowd segmentation as an attention module to obtain fine-grained features. We also propose a loss function that focuses more closely on counting performance in the foreground area. Experimental results on four widely used benchmarks show that our method achieves state-of-the-art performance.
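The abstract does not give the exact form of the segmentation attention module or the foreground-focused loss. As a rough illustration of the two ideas only, the following NumPy sketch gates a feature map with a predicted foreground probability map and weights a density-map regression error more heavily on foreground pixels. The function names, the sigmoid gating, the squared-error form, and the `fg_weight` parameter are all assumptions made for this sketch, not the formulation used in CUT.

```python
import numpy as np


def segmentation_attention(features, seg_logits):
    """Gate a (C, H, W) feature map with an (H, W) crowd segmentation map.

    Hypothetical helper: the sigmoid gating is an assumption, not the
    paper's stated design.
    """
    seg_prob = 1.0 / (1.0 + np.exp(-seg_logits))  # foreground probability in (0, 1)
    gated = features * seg_prob[None, :, :]       # broadcast the map over channels
    return gated, seg_prob


def foreground_weighted_l2(pred_density, gt_density, fg_mask, fg_weight=2.0):
    """Mean squared error with foreground pixels up-weighted.

    Hypothetical loss: the weighting scheme and fg_weight are assumptions.
    """
    weights = np.where(fg_mask > 0, fg_weight, 1.0)
    return float(np.mean(weights * (pred_density - gt_density) ** 2))


if __name__ == "__main__":
    feats = np.ones((2, 2, 2))
    gated, prob = segmentation_attention(feats, np.zeros((2, 2)))
    gt = np.array([[1.0, 0.0], [0.0, 0.0]])
    loss = foreground_weighted_l2(np.zeros((2, 2)), gt, fg_mask=gt > 0)
    print(gated.shape, loss)
```

In this toy setting, a higher `fg_weight` makes errors inside crowd regions dominate the loss, which is one simple way to emphasize counting accuracy in the foreground.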



