Image-Text Relation Prediction for Multilingual Tweets

Matīss Rikters, Edison Marrese-Taylor·May 08, 2025

Summary

A study evaluates multilingual Vision-Language Models' prediction of image-text relations in tweets. A balanced dataset from Latvian Twitter, with English translations, was created. Newer, larger models outperform smaller ones, with room for improvement. Performance varies with input language, and results are sensitive to the evaluation dataset. The study aims to compare several recent VLMs on a single GPU, addressing computational linguistics and natural language processing. Contributions include works on machine translation, multimodal model handling, visual instruction processing, and multilingual visual question answering.

Advanced features