FLAVA: A Foundational Language And Vision Alignment Model

Amanpreet Singh*
Ronghang Hu*
Vedanuj Goswami*
Guillaume Couairon
Wojciech Galuba
Marcus Rohrbach
Douwe Kiela
Facebook AI Research
* Equal Contribution
CVPR 2022

[Paper]
[Code]
[Model]
[Data]


State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal (with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising direction would be to use a single holistic universal model, as a "foundation", that targets all modalities at once -- a true vision and language foundation model should be good at vision tasks, language tasks, and cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate impressive performance on a wide range of 35 tasks spanning these target modalities.


FLAVA Model

FLAVA, a language and vision alignment model, learns strong representations from multimodal data (image-text pairs) and unimodal data (unpaired images and text), and can be applied to target a broad scope of tasks from three domains (visual recognition, language understanding, and multimodal reasoning) under a common transformer model architecture.
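
To make this layout concrete, below is a simplified sketch of the three-encoder design described above: an image encoder and a text encoder that can each be used on their own, plus a multimodal encoder that fuses their projected outputs. This is an illustrative assumption about the structure, not the released implementation; all class names, helper functions, and sizes are hypothetical.

# Hypothetical sketch of a FLAVA-style layout: two unimodal transformer
# encoders plus a multimodal encoder over their fused outputs.
# Names and sizes are illustrative, not the released implementation.
import torch
import torch.nn as nn

def _transformer_encoder(dim, heads, depth):
    # Plain PyTorch transformer stack standing in for the paper's ViT-style encoders.
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)

class FlavaStyleModel(nn.Module):
    def __init__(self, dim=768, heads=12, unimodal_depth=12, multimodal_depth=6):
        super().__init__()
        self.image_encoder = _transformer_encoder(dim, heads, unimodal_depth)        # vision tasks
        self.text_encoder = _transformer_encoder(dim, heads, unimodal_depth)         # language tasks
        self.multimodal_encoder = _transformer_encoder(dim, heads, multimodal_depth) # multimodal reasoning
        # Linear projections into the multimodal encoder's input space.
        self.image_proj = nn.Linear(dim, dim)
        self.text_proj = nn.Linear(dim, dim)

    def forward(self, image_patches, text_embeddings):
        # image_patches: (B, N_img, dim) patch embeddings
        # text_embeddings: (B, N_txt, dim) token embeddings
        h_img = self.image_encoder(image_patches)
        h_txt = self.text_encoder(text_embeddings)
        fused = torch.cat([self.image_proj(h_img), self.text_proj(h_txt)], dim=1)
        h_mm = self.multimodal_encoder(fused)
        return h_img, h_txt, h_mm

In the full model, the unimodal encoders are also pretrained on unpaired images and text (masked image and language modeling), while paired image-text data drives the contrastive, image-text matching, and masked multimodal objectives; see the paper for the exact losses and encoder configurations.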


Source Code

Source code is available at this link, and pretrained models can also be found on Hugging Face. Follow Amanpreet on Twitter for the latest updates.
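
As a quick-start reference, below is a minimal usage sketch through the Hugging Face transformers integration. It assumes the facebook/flava-full checkpoint and the FlavaProcessor / FlavaModel classes; the output attribute names follow the library documentation at the time of writing and may differ across versions, and the image path is a placeholder.

# Minimal sketch: extracting FLAVA's unimodal and multimodal embeddings via the
# Hugging Face transformers integration; verify names against your installed version.
import torch
from PIL import Image
from transformers import FlavaProcessor, FlavaModel

processor = FlavaProcessor.from_pretrained("facebook/flava-full")
model = FlavaModel.from_pretrained("facebook/flava-full")

image = Image.open("example.jpg")  # placeholder path to any RGB image
inputs = processor(text=["a photo of a cat"], images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.image_embeddings.shape)       # per-patch image representations
print(outputs.text_embeddings.shape)        # per-token text representations
print(outputs.multimodal_embeddings.shape)  # fused representations for multimodal tasks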


Paper and Bibtex

[Paper] [ArXiv] [Bibtex]
@inproceedings{singh2022flava,
  author    = {Amanpreet Singh and Ronghang Hu and Vedanuj Goswami and Guillaume Couairon and Wojciech Galuba and Marcus Rohrbach and Douwe Kiela},
  title     = {{FLAVA}: A Foundational Language And Vision Alignment Model},
  booktitle = {CVPR},
  year      = {2022}
}


Acknowledgements

We thank Devi Parikh for her support and advice on this project. We are grateful to Dmytro Okhonko, Hu Xu, Armen Aghajanyan, Po-Yao Huang, Min Xu, and Aleksandra Piktus for joint explorations of multimodal data. We thank Ning Zhang, Madian Khabsa, Sasha Sheng, and Naman Goyal for useful technical discussions; Karan Desai for providing access to RedCaps; Vaibhav Singh and others on the Google TPU team for help running experiments on TPUs; Shubho Sengupta, Armand Joulin, Brian O’Horo, and Arthur Menezes for compute and storage support; and Ryan Jiang, Kushal Tirumala, and Russ Howes for help running experiments.