Bridging Vision and Language Spaces with Assignment Prediction

Abstract

This paper introduces VLAP, a novel approach that bridges pretrained vision models and large language models (LLMs) to make frozen LLMs a model of the non-linguistic visual world. VLAP transforms the embedding space of pretrained vision models into the LLMs’ embedding space using a single linear layer, which is trained with an optimal transport-based assignment prediction objective. Specifically, we harness well-established word embeddings to bridge the two modality embedding spaces. We simultaneously assign the visual and text representations to a set of word embeddings within pretrained LLMs through optimal transport. We predict the assignment of one modality from the representation of the other modality's data, enforcing consistent assignments for paired multimodal data. This allows the two modality representations to contain the same information, grounding the frozen LLMs’ word embedding space in visual data. Moreover, a robust semantic taxonomy of LLMs can be preserved with visual data, since LLMs interpret and reason about linguistic information from correlations between word embeddings. Experimental results show that VLAP achieves substantial improvements over previous linear transformation-based methods across a range of vision-language tasks, including image captioning, visual question answering, and cross-modal retrieval. We also demonstrate that the learned visual representations hold a semantic taxonomy of LLMs, making visual semantic arithmetic possible.
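The core training idea in the abstract — project vision features into the LLM embedding space with a linear layer, softly assign both modalities to frozen word embeddings via optimal transport, and have each modality predict the other's assignment — can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation: the function names (`sinkhorn`, `assignment_prediction_loss`), the Sinkhorn-Knopp normalization, the temperature `tau`, and all array shapes are assumptions for exposition.

```python
import numpy as np

def sinkhorn(scores, eps=0.05, n_iters=3):
    """Sinkhorn-Knopp iterations: turn a similarity matrix (B, K) into a
    soft assignment whose rows sum to 1 and whose columns are balanced.
    (Illustrative; iteration count and eps are arbitrary choices.)"""
    Q = np.exp(scores / eps)
    Q /= Q.sum()
    B, K = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=0, keepdims=True)  # balance columns
        Q /= K
        Q /= Q.sum(axis=1, keepdims=True)  # normalize rows
        Q /= B
    return Q * B  # each row is a distribution over the K word embeddings

def assignment_prediction_loss(vis, txt, W, word_emb, tau=0.1):
    """Swapped assignment prediction (hypothetical sketch).
    vis: (B, Dv) visual features; txt: (B, D) text features;
    W: (Dv, D) linear bridge into the LLM space;
    word_emb: (K, D) frozen LLM word embeddings."""
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    v = l2norm(vis @ W)   # project vision into the LLM embedding space
    t = l2norm(txt)
    E = l2norm(word_emb)

    # soft assignments of each modality to word embeddings via optimal transport
    q_v = sinkhorn(v @ E.T)
    q_t = sinkhorn(t @ E.T)

    # softmax predictions over word embeddings at temperature tau
    def softmax(s):
        z = s / tau
        z -= z.max(axis=1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    p_v = softmax(v @ E.T)
    p_t = softmax(t @ E.T)

    # each modality predicts the *other* modality's assignment (cross-entropy)
    ce_v = -np.mean(np.sum(q_t * np.log(p_v + 1e-9), axis=1))
    ce_t = -np.mean(np.sum(q_v * np.log(p_t + 1e-9), axis=1))
    return 0.5 * (ce_v + ce_t)
```

Minimizing this loss pushes paired image and text features toward consistent distributions over the frozen word embeddings, which is how the linear bridge grounds the LLM's word embedding space in visual data.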

Publication
In International Conference on Learning Representations
Jungin Park
Ph.D.

My research interests include computer vision, video understanding, multimodal learning, and vision-language models.