Beyond OCR+VQA: Involving OCR into the Flow for Robust and Accurate TextVQA

Published in Proceedings of the 29th ACM International Conference on Multimedia, 2021

Text-based visual question answering (TextVQA) requires analyzing both the visual contents and the texts in an image to answer a question, which is more practical than general visual question answering (VQA). Existing efforts tend to regard optical character recognition (OCR) as a pre-processing step and then combine it with a VQA framework, which makes the performance of multimodal reasoning and question answering highly dependent on the accuracy of OCR. In this work, we address this issue from two perspectives. First, we take advantage of multimodal cues to complete the semantic information of texts: a visually enhanced text embedding is proposed to enable understanding of texts without accurately recognizing them. Second, we further leverage rich contextual information to correct the answer texts even when the OCR module does not recognize them correctly. In addition, the visual objects are endowed with semantic representations so that objects and OCR tokens lie in the same semantic space. Equipped with these techniques, the cumulative error propagation caused by poor OCR performance is effectively suppressed. Extensive experiments on the TextVQA and ST-VQA datasets demonstrate that our approach achieves state-of-the-art performance in terms of accuracy and robustness.
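To illustrate the idea of a visually enhanced text embedding, the sketch below fuses each OCR token's word embedding with its visual appearance feature through a learned gate, so a token can still be represented meaningfully when its recognized text is noisy. The module name, feature dimensions, and gating fusion are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class VisuallyEnhancedTextEmbedding(nn.Module):
    """Hypothetical sketch: fuse an OCR token's word embedding with its
    visual appearance feature, so the token remains informative even when
    OCR misrecognizes it. Dimensions and gating are assumptions."""

    def __init__(self, text_dim=300, visual_dim=2048, hidden_dim=768):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        # Gate decides how much to trust the (possibly misrecognized) text.
        self.gate = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim),
                                  nn.Sigmoid())

    def forward(self, text_emb, visual_feat):
        t = self.text_proj(text_emb)       # (N, hidden_dim) text branch
        v = self.visual_proj(visual_feat)  # (N, hidden_dim) visual branch
        g = self.gate(torch.cat([t, v], dim=-1))
        return g * t + (1 - g) * v         # visually enhanced token embedding

# Example: 5 OCR tokens with word vectors and region appearance features.
tokens = VisuallyEnhancedTextEmbedding()(torch.randn(5, 300), torch.randn(5, 2048))
print(tokens.shape)  # torch.Size([5, 768])
```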

Download paper here