The rapid development of Multimodal Large Language Models (MLLMs), such as GPT-4o, marks a significant step toward artificial general intelligence. Existing methods typically align vision encoders with LLMs via supervised fine-tuning (SFT), but this often deteriorates their ability to handle multiple languages as training progresses. We empirically observe that imbalanced, largely English-centric SFT datasets degrade performance on non-English languages because multilingual token alignment fails. To address this, we propose PARROT, a novel approach that leverages textual guidance for visual token alignment at the language level. PARROT conditions visual tokens on diverse language inputs and uses a Mixture-of-Experts (MoE) module to align multilingual tokens. By computing cross-attention between the initial visual features and textual embeddings, we select the most relevant experts, converting visual tokens into language-specific representations. Additionally, we introduce the Massive Multilingual Multimodal Benchmark (MMMB), a new benchmark comprising 6 languages, 15 categories, and 12,000 questions, to assess multilingual capabilities. PARROT achieves state-of-the-art performance on both multilingual benchmarks and a wide range of multimodal tasks. Code and dataset are available at: https://github.com/AIDC-AI/Parrot.
Figure 1: The overall architecture of Parrot. It converts English-biased features to language-specific features based on the multilingual MoE module, aiming to improve the multilingual capabilities. The training details within each stage are presented on the right.
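The architecture described above can be sketched in a few lines: visual tokens attend to the textual embeddings of the prompt, the pooled result routes among per-language experts, and the gated mixture yields language-specific visual features. This is a minimal NumPy illustration, not the actual implementation — the dimensions, the single-linear-layer experts, the mean-pooled router, and the names `parrot_moe`/`W_router` are all assumptions for the sake of the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, n_vis, n_experts = 32, 16, 6  # hidden dim, visual tokens, one expert per language

# Hypothetical parameters: one linear "expert" per language and a router matrix.
experts = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_experts)]
W_router = rng.standard_normal((d, n_experts)) / np.sqrt(d)

def parrot_moe(H_v, H_t):
    """Convert English-biased visual tokens into language-specific ones.

    H_v: (n_vis, d) visual features from the projector.
    H_t: (n_txt, d) text embeddings of the (possibly non-English) prompt.
    """
    # Cross-attention: each visual token attends to the textual embeddings.
    attn = softmax(H_v @ H_t.T / np.sqrt(d), axis=-1)   # (n_vis, n_txt)
    ctx = attn @ H_t                                    # (n_vis, d) text-conditioned features
    # Router: pool the context and compute a distribution over language experts.
    gate = softmax(ctx.mean(axis=0) @ W_router)         # (n_experts,)
    # Gated mixture of the language experts applied to the visual tokens.
    out = sum(g * (H_v @ E) for g, E in zip(gate, experts))
    return out, gate

H_v = rng.standard_normal((n_vis, d))
H_t = rng.standard_normal((8, d))
out, gate = parrot_moe(H_v, H_t)
assert out.shape == (n_vis, d) and np.isclose(gate.sum(), 1.0)
```

In the paper the expert selection is driven by the language of the input prompt, so at inference time tokens from, say, an Arabic question are mapped through the expert most attuned to Arabic.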
Several multilingual benchmarks already exist for MLLMs (e.g., Multi30K, M3Exam, MMBench, and LLaVA-Bench), but they have notable limitations:
Figure 2: Some failure cases in existing multilingual benchmarks. Left: code reasoning is strongly tied to English. Middle: logical reasoning is too challenging. Right: lack of relevance between image and text.
We selected six languages for inclusion: English (en), Chinese (zh), Portuguese (pt), Arabic (ar), Turkish (tr), and Russian (ru). These languages represent a diverse range of linguistic families, and we list the detailed information and some multilingual cases in Figure 3. In terms of dataset requirements and consistency, our benchmark incorporates datasets in two main respects:
Figure 3: Overview of MMMB. It incorporates 6 languages, 15 categories, and 12,000 questions.
To give an intuitive sense of Parrot's multilingual capability, we provide a comprehensive case study accompanied by illustrative visuals.
@inproceedings{sun2025parrot,
title={Parrot: Multilingual Visual Instruction Tuning},
author={Sun, Hai-Long and Zhou, Da-Wei and Li, Yang and Lu, Shiyin and Yi, Chao and Chen, Qing-Guo and Xu, Zhao and Luo, Weihua and Zhang, Kaifu and Zhan, De-Chuan and others},
booktitle={ICML},
year={2025}
}