Parrot:

Multilingual Visual Instruction Tuning


1School of Artificial Intelligence, Nanjing University 2National Key Laboratory for Novel Software Technology, Nanjing University
3AI Business, Alibaba Group

Abstract

The rapid development of Multimodal Large Language Models (MLLMs), such as GPT-4o, marks a significant step toward artificial general intelligence. Existing methods typically align vision encoders with LLMs via supervised finetuning (SFT), but this often deteriorates their ability to handle multiple languages as training progresses. We empirically observe that imbalanced SFT datasets, largely English-centric, degrade performance on non-English languages due to the failure in multilingual token alignment. To address this, we propose PARROT, a novel approach that leverages textual guidance for visual token alignment at the language level. PARROT conditions visual tokens on diverse language inputs and uses Mixture-of-Experts (MoE) to align multilingual tokens. By computing cross-attention between initial visual features and textual embeddings, we select the most relevant experts, converting visual tokens into language-specific representations. Additionally, we introduce the Massive Multilingual Multimodal Benchmark (MMMB), a new benchmark comprising 6 languages, 15 categories, and 12,000 questions, to assess multilingual capabilities. PARROT achieves state-of-the-art performance on both the multilingual benchmarks and a wide range of multimodal tasks. Code and dataset are available at: https://github.com/AIDC-AI/Parrot.


Technical Description


• Architecture

Teaser

Figure 1: The overall architecture of Parrot. The multilingual MoE module converts English-biased visual features into language-specific features, improving multilingual capabilities. Training details for each stage are shown on the right.
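To make the mechanism concrete, below is a minimal PyTorch-style sketch of the language-level MoE: cross-attention between the initial visual tokens and the prompt's text embeddings yields a language-aware context, a router turns that context into expert weights, and a small set of per-language experts re-projects the visual tokens. The module and parameter names (e.g., LanguageMoE, q_proj, the residual combination) are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn


class LanguageMoE(nn.Module):
    """Language-level MoE: re-projects English-biased visual tokens into
    language-specific representations (illustrative sketch only)."""

    def __init__(self, dim: int, num_experts: int = 6):
        super().__init__()
        # One lightweight MLP expert per supported language (en, zh, pt, ar, tr, ru).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        )
        # Projections for the cross-attention router between vision and text.
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.router = nn.Linear(dim, num_experts)

    def forward(self, visual_tokens, text_embeds):
        # visual_tokens: (B, Nv, D) features from the vision projector.
        # text_embeds:   (B, Nt, D) word embeddings of the multilingual prompt.
        q = self.q_proj(visual_tokens)
        k = self.k_proj(text_embeds)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        context = attn @ text_embeds                                    # (B, Nv, D) language-aware context
        gate = torch.softmax(self.router(context.mean(dim=1)), dim=-1)  # (B, E) expert weights
        expert_out = torch.stack([e(visual_tokens) for e in self.experts], dim=1)  # (B, E, Nv, D)
        mixed = (gate[:, :, None, None] * expert_out).sum(dim=1)        # weighted expert mixture
        return visual_tokens + mixed                                    # residual keeps the original features

A soft mixture over all experts is used here for simplicity; a hard top-1 routing variant would be a drop-in change to the gating step.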


  • Stage 1: Modality Alignment. In this stage, we keep both the vision encoder and the LLM weights frozen, focusing solely on optimizing the projectors to align the visual features $\mathbf{H}_v$ with the pre-trained LLM word embeddings. This stage can be likened to training a visual tokenizer that is compatible with the frozen LLM. To enhance the diversity of images, we sample a portion of the LAION and CC12M datasets and construct in-house caption data with GPT-4V.
  • Stage 2: Instruction Tuning for Multilingual Alignment. We keep the vision encoder weights frozen while continuing to train the projector, the MoE module, and the LLM. Thanks to the design of the MoE module, Parrot can quickly learn to align visual representations across multiple languages from a small amount of multilingual image-text data. As shown in Table 1, we use only about 10K training samples per language in stage 2, which is particularly beneficial given the scarcity of data in low-resource languages. A minimal sketch of the per-stage freezing schedule is given after this list.
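The following is a minimal sketch of the per-stage freezing schedule described above, assuming a LLaVA-style module layout; the attribute names (vision_encoder, projector, moe, llm) are hypothetical placeholders rather than the released code.

def set_trainable(model, stage):
    """Freeze / unfreeze modules per training stage. Attribute names such as
    `vision_encoder`, `projector`, `moe`, and `llm` are assumptions for this sketch."""
    # The vision encoder stays frozen in both stages.
    for p in model.vision_encoder.parameters():
        p.requires_grad = False
    # The projector is optimized in both stages.
    for p in model.projector.parameters():
        p.requires_grad = True
    # Stage 1 (modality alignment): only the projector is trained.
    # Stage 2 (instruction tuning): the MoE module and the LLM are also trained.
    train_rest = (stage == 2)
    for module in (model.moe, model.llm):
        for p in module.parameters():
            p.requires_grad = train_rest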


Table1

MMMB: A Massive Multilingual Multimodal Benchmark


• Limitations of Existing Benchmarks

There are several existing multilingual benchmarks for MLLMs (e.g., Multi30K, M3Exam, MMBench, and LLaVA-Bench), but they suffer from the following limitations:

  1. Outdated Benchmarks. Multi30K is designed for image-text retrieval, and performance on it has nearly reached its upper bound because the problems are relatively easy.
  2. Non-Standardized Evaluations. Other benchmarks, such as LLaVA-Bench, rely on GPT-4 for evaluation. Depending on GPT-4 as a de facto "ground truth" may hinder reproducibility; moreover, since LLaVA uses a deprecated version (GPT-4-0314), evaluating with different versions could lead to unfair comparisons. In addition, because M3Exam does not provide consistent test samples across languages, it is impossible to tell whether poor performance stems from the difficulty of the problems or from the model's lack of multilingual capability.
  3. Limited Languages. MMBench and LLaVA-Bench cover only English and Chinese, and therefore cannot measure multilingual capabilities across a broad spectrum of languages.


bad-cases

Figure 2: Bad cases from existing multilingual benchmarks. Left: code reasoning is strongly tied to English. Middle: the logical reasoning is too challenging. Right: lack of relevance between the image and the text.



• Construction of the Multilingual Benchmark

We selected six languages for inclusion: English (en), Chinese (zh), Portuguese (pt), Arabic (ar), Turkish (tr), and Russian (ru). These languages represent a diverse range of linguistic families; detailed information and multilingual examples are shown in Figure 3. To satisfy the dataset requirements and ensure consistency, our benchmark incorporates data in two respects:

  1. Since MMBench officially includes English and Chinese versions, we extend it to the other four languages.
  2. For the massive multilingual multimodal benchmark, denoted MMMB, we select and clean suitable data from the ScienceQA, MME, and SEED-Bench datasets following established guidelines. These data are then converted into a Visual Question Answering (VQA) format, resulting in a total of 12,000 samples across the six languages (see the schema sketch after this list).
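As a concrete illustration of the resulting format, the sketch below shows one possible VQA-style sample schema and a per-language accuracy computation; the field and function names are assumptions for illustration, not the released data schema or evaluation script.

from dataclasses import dataclass
from typing import Dict, List

LANGUAGES = ("en", "zh", "pt", "ar", "tr", "ru")


@dataclass
class MMMBSample:
    """One VQA-format benchmark entry; field names are illustrative, not the released schema."""
    image_path: str
    question: str       # question text written in `language`
    options: List[str]  # multiple-choice options, e.g. ["A. ...", "B. ...", ...]
    answer: str         # ground-truth option letter, e.g. "B"
    category: str       # one of the 15 categories
    language: str       # one of LANGUAGES; the same question is kept consistent across languages


def accuracy_by_language(samples: List[MMMBSample], predictions: List[str]) -> Dict[str, float]:
    """Exact-match accuracy of predicted option letters, grouped by language."""
    correct, total = {}, {}
    for sample, pred in zip(samples, predictions):
        total[sample.language] = total.get(sample.language, 0) + 1
        correct[sample.language] = correct.get(sample.language, 0) + int(pred.strip().upper() == sample.answer)
    return {lang: correct[lang] / total[lang] for lang in total}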


MMMB

Figure 3: Overview of MMMB. It incorporates 6 languages, 15 categories, and 12,000 questions.


• Evaluations

benchmark
experiments

Demonstrations


To provide an intuitive understanding of Parrot's multilingual capability, we present a comprehensive case study accompanied by illustrative visuals.



• Example-1: Djokovic

case1

• Example-2: Deer and Swan

case2

• Example-3: English

case_en

• Example-4: Chinese

case_zh

• Example-5: Portuguese

case_pt

• Example-6: Arabic

case_ar

• Example-7: Turkish

case_tr

• Example-8: Russian

case_ru

BibTeX

@inproceedings{sun2025parrot,
  title={Parrot: Multilingual Visual Instruction Tuning},
  author={Sun, Hai-Long and Zhou, Da-Wei and Li, Yang and Lu, Shiyin and Yi, Chao and Chen, Qing-Guo and Xu, Zhao and Luo, Weihua and Zhang, Kaifu and Zhan, De-Chuan and others},
  booktitle={ICML},
  year={2025}
}