Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning
Summary
Paper digest
What problem does the paper attempt to solve? Is this a new problem?
The paper "Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning" aims to address the challenge of multimodal emotion recognition by proposing the Emotion-LLaMA model, which maps audio and visual features to the textual space to achieve high F1 scores across various modalities . This paper focuses on enhancing emotion recognition by integrating different modalities such as text, audio, and images, which is crucial for real-world emotional data analysis . The proposed model leverages multimodal fusion methods and large language models to improve emotional reasoning capabilities, particularly in recognizing micro-expressions and processing audio inputs . While the concept of multimodal emotion recognition is not new, the specific approach and model presented in this paper contribute to advancing the field by addressing challenges related to specialized multimodal emotion instruction datasets and improving the effectiveness of Multimodal Large Language Models (MLLMs) in emotion recognition tasks .
What scientific hypothesis does this paper seek to validate?
The paper seeks to validate the hypothesis that multimodal emotion recognition and reasoning can be substantially improved by an instruction-tuned multimodal language model, which it tests through the development and evaluation of Emotion-LLaMA. The study focuses on enhancing emotion-related applications by leveraging instruction tuning and reasoning capabilities to improve transparency and human-machine interaction in emotion recognition tasks. It examines how well multimodal models capture nuances of emotional expression across audio, visual, and textual data, demonstrating the robustness and versatility of the proposed approach.
What new ideas, methods, or models does the paper propose? What are the characteristics and advantages compared to previous methods?
The paper "Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning" introduces several innovative ideas, methods, and models in the field of multimodal emotion analysis . Here are the key contributions of the paper:
- MERR Dataset: The paper introduces the MERR dataset, which consists of both coarse-grained and fine-grained annotated samples covering a wide range of emotional categories. This dataset enables large models to learn from diverse emotional contexts and generalize to real-world applications, supporting the training and evaluation of multimodal emotion models.
- Emotion-LLaMA Model: The Emotion-LLaMA model integrates audio, visual, and textual inputs through emotion-specific encoders. It uses HuBERT for audio processing and multiview visual encoders (MAE, VideoMAE, EVA) to capture facial details, dynamics, and context. By aligning these features into a modified LLaMA language model, Emotion-LLaMA enhances emotional recognition and reasoning capabilities (a minimal sketch of such feature alignment follows this list).
- Instruction Tuning: The paper applies instruction tuning, which significantly improves the performance of the Emotion-LLaMA model. Fine-tuning with task-specific instructions increases the accuracy of emotion recognition and deepens emotional reasoning, setting a new benchmark for multimodal emotion analysis.
- Evaluation Framework: The paper presents a structured prompt template for evaluating the model's reasoning by measuring the degree of overlap between emotion-related clues in the generated explanations and the ground truth. This framework guides the quantitative assessment of the model's ability to identify and articulate relevant emotional cues.
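The model description above states that HuBERT audio features and MAE/VideoMAE/EVA visual features are aligned into a modified LLaMA language model. Below is a minimal, hedged sketch of what such feature alignment could look like, assuming simple linear adapters and illustrative feature dimensions; the module names and dimensions are assumptions for illustration, not the paper's exact implementation.

```python
# Hypothetical sketch of aligning audio/visual features into an LLM's token space.
# Dimensions, module names, and the linear-adapter design are illustrative assumptions;
# see the official Emotion-LLaMA repository for the actual architecture.
import torch
import torch.nn as nn


class MultimodalAdapter(nn.Module):
    def __init__(self, audio_dim=1024, visual_dim=1408, llm_dim=4096):
        super().__init__()
        # Project each modality's encoder output into the language model's embedding space.
        self.audio_proj = nn.Linear(audio_dim, llm_dim)    # e.g. HuBERT-style audio features
        self.visual_proj = nn.Linear(visual_dim, llm_dim)  # e.g. MAE/VideoMAE/EVA-style visual features

    def forward(self, audio_feats, visual_feats, text_embeds):
        # audio_feats:  (batch, n_audio_tokens, audio_dim)
        # visual_feats: (batch, n_visual_tokens, visual_dim)
        # text_embeds:  (batch, n_text_tokens, llm_dim) from the LLM's embedding layer
        audio_tokens = self.audio_proj(audio_feats)
        visual_tokens = self.visual_proj(visual_feats)
        # Prepend modality tokens to the text prompt tokens so the language model
        # attends over all modalities jointly.
        return torch.cat([audio_tokens, visual_tokens, text_embeds], dim=1)


# Example with random tensors standing in for real encoder outputs.
adapter = MultimodalAdapter()
fused = adapter(torch.randn(1, 8, 1024), torch.randn(1, 32, 1408), torch.randn(1, 16, 4096))
print(fused.shape)  # torch.Size([1, 56, 4096])
```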
Overall, the paper proposes a comprehensive approach to multimodal emotion analysis: a new dataset, the Emotion-LLaMA model, an instruction-tuning strategy, and an evaluation framework for assessing reasoning capabilities. Compared with previous methods, the Emotion-LLaMA model offers several key characteristics and advantages:
- Comprehensive Multimodal Integration: Emotion-LLaMA integrates audio, visual, and textual data through emotion-specific encoders, such as HuBERT for audio and multiview visual encoders (MAE, VideoMAE, EVA). This integration allows the model to capture nuanced emotional expressions across modalities, enhancing the accuracy and reliability of emotion recognition.
- Instruction Tuning: Fine-tuning the model with emotion-specific instructions significantly improves its performance on emotional reasoning tasks, deepening its reasoning and raising its recognition accuracy, and setting a new benchmark in multimodal emotion analysis (an illustrative instruction sample follows this list).
- Superior Performance: Emotion-LLaMA outperforms previous models on multimodal emotion recognition tasks, achieving the highest F1 scores across modalities on datasets such as MER2023 and DFEW. Its capacity to capture emotional nuance and integrate information from multiple modalities yields more precise and contextually relevant recognition.
- Effective Reasoning and Interpretation: Emotion-LLaMA recognizes subtle emotional cues across modalities, combining subtle facial expressions, tone of voice, and multimodal context to understand and interpret emotions more accurately.
- Dataset Contribution: The MERR dataset includes diverse emotional contexts and annotations, enabling large models to learn from varied scenarios and generalize to real-world applications. It serves as a valuable resource for advancing large-scale multimodal emotion model training and evaluation.
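To make the instruction-tuning idea concrete, here is a hedged sketch of what a single multimodal instruction sample and its flattened training prompt might look like. The field names, placeholder tokens, and wording are illustrative assumptions and do not reproduce the MERR dataset's actual schema.

```python
# Hypothetical illustration of a multimodal instruction-tuning sample.
# Field names, placeholder tokens, and wording are assumptions for illustration only;
# they do not reproduce the MERR dataset's actual schema.
sample = {
    "video": "clip_00042.mp4",  # visual stream fed to MAE/VideoMAE/EVA-style encoders
    "audio": "clip_00042.wav",  # audio stream fed to a HuBERT-style encoder
    "instruction": "Describe the person's emotional state and explain the clues that support it.",
    "response": (
        "The speaker appears frustrated: the furrowed brow and tense jaw, "
        "combined with a raised, clipped tone of voice, suggest irritation "
        "with what was just said."
    ),
}


def build_prompt(example: dict) -> str:
    """Flatten one sample into an instruction-following training prompt."""
    return (
        "<video><VideoFeatures></video> <audio><AudioFeatures></audio>\n"
        f"[INST] {example['instruction']} [/INST] {example['response']}"
    )


print(build_prompt(sample))
```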
Overall, the Emotion-LLaMA model stands out for its comprehensive multimodal integration, innovative instruction tuning, superior performance in emotion recognition tasks, effective emotional reasoning, and the contribution of the MERR dataset to the field of multimodal emotion analysis.
Does any related research exist? Who are the noteworthy researchers on this topic? What is the key to the solution mentioned in the paper?
Several related research studies exist in the field of multimodal emotion recognition and reasoning. Noteworthy researchers in this area include Wei-Lin Chiang, Zhuohan Li, Ying Sheng, and Eric P. Xing. Other prominent researchers are Aakanksha Chowdhery, Jacob Devlin, Charles Sutton, and Sebastian Gehrmann. Additionally, researchers such as Yunfei Chu, Jin Xu, and Jingren Zhou have contributed to advancing universal audio understanding through large-scale audio-language models.
The key to the solution is the Emotion-LLaMA model itself, which integrates audio, visual, and textual data to improve the accuracy and reliability of emotion recognition. By mapping audio and visual features into the textual space, the model captures the nuances of emotional expression, leading to more precise recognition results. Emotion-LLaMA outperforms other models by combining information from different modalities into a comprehensive understanding, making it a promising solution for real-world emotion-related applications.
How were the experiments in the paper designed?
The experiments were designed to validate the effectiveness of the Emotion-LLaMA model for multimodal emotion recognition and reasoning. They compared Emotion-LLaMA against previous state-of-the-art supervised methods on the MER2023 Challenge dataset. The results, presented in Table 3 of the paper, show that Emotion-LLaMA, which maps audio and visual features to the textual space, achieved the highest F1 score across modalities, demonstrating superior performance in multimodal emotion recognition. The experiments also included a qualitative analysis of emotion reasoning results across different models, illustrating how the model predicts emotions accurately by integrating information from multiple modalities.
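Since the comparison above is reported in terms of F1 scores, the following minimal sketch shows how such a score can be computed over discrete emotion labels. The label set and the choice of weighted averaging are assumptions for illustration; the MER2023 evaluation scripts define the exact protocol.

```python
# Minimal sketch of computing an F1 score over discrete emotion labels.
# The label set and the weighted-averaging choice are illustrative assumptions;
# consult the MER2023 evaluation scripts for the exact protocol.
from sklearn.metrics import f1_score

labels = ["neutral", "angry", "happy", "sad", "worried", "surprise"]
y_true = ["happy", "angry", "neutral", "sad", "happy", "surprise"]
y_pred = ["happy", "angry", "neutral", "happy", "happy", "surprise"]

print(f1_score(y_true, y_pred, labels=labels, average="weighted"))
```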
What is the dataset used for quantitative evaluation? Is the code open source?
The dataset used for quantitative evaluation in the study is the MER2023 dataset. The code for the project is open source and available on GitHub.
Do the experiments and results in the paper provide good support for the scientific hypotheses that need to be verified? Please analyze.
The experiments and results provide strong support for the paper's hypotheses. Emotion-LLaMA outperformed previous state-of-the-art supervised methods in the Multimodal Emotion Recognition Challenge, achieving the highest F1 score across modalities, which indicates that mapping audio and visual features to the textual space improves emotion recognition accuracy.
Furthermore, the evaluation on the EMER dataset leverages ChatGPT's language understanding and reasoning capabilities to assess the quality and coherence of the emotional reasoning produced by Emotion-LLaMA. Using a structured evaluation prompt and scoring guidelines, this assessment evaluates how coherent and meaningful the model's explanations of predicted emotions are, going beyond simple metrics such as accuracy or F-score.
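As a rough illustration of this LLM-as-judge setup, the sketch below constructs an evaluation prompt that asks a judge model to score how well a generated explanation covers the ground-truth emotional clues. The prompt wording, the 0-10 scale, and the judge model are assumptions for illustration and do not reproduce the paper's actual EMER evaluation template.

```python
# Hedged sketch of an LLM-as-judge evaluation prompt for emotion reasoning.
# Prompt wording, the 0-10 scale, and the judge model are illustrative assumptions;
# the paper's actual EMER evaluation template is not reproduced here.
def build_judge_prompt(ground_truth: str, prediction: str) -> str:
    return (
        "You are scoring an emotion-reasoning explanation.\n"
        "Ground-truth description of the emotional clues:\n"
        f"{ground_truth}\n\n"
        "Model-generated explanation:\n"
        f"{prediction}\n\n"
        "Rate, on a scale from 0 to 10, how completely the explanation covers the "
        "emotion-related clues in the ground truth (facial expression, tone of voice, "
        "spoken content, context). Reply with the number only."
    )


prompt = build_judge_prompt(
    ground_truth="Furrowed brows and a trembling voice indicate anxiety about the exam result.",
    prediction="The speaker sounds nervous; their shaky voice and tense expression suggest worry.",
)
print(prompt)  # This string would then be sent to a judge model such as ChatGPT.
```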
The qualitative analysis of emotion reasoning across different models shows that Emotion-LLaMA provides accurate emotional reasoning by integrating information from multiple modalities. This comprehensive approach improves the model's understanding of emotional expression, leading to more reliable recognition results.
In conclusion, the experiments and results offer robust evidence for the effectiveness of Emotion-LLaMA in multimodal emotion recognition and reasoning. Its ability to generate coherent explanations, outperform previous methods, and integrate information from multiple modalities demonstrates its potential for real-world emotion-related applications.
What are the contributions of this paper?
The paper makes several key contributions:
- Construction of the MERR dataset: The paper introduces the MERR dataset, which includes both coarse-grained and fine-grained annotated samples covering a wide range of emotional categories. This dataset allows models to learn from diverse emotional contexts and generalize to real-world applications, supporting large-scale multimodal emotion model training and evaluation.
- Development of the Emotion-LLaMA model: The model integrates audio processing using HuBERT and multiview visual encoders (MAE, VideoMAE, EVA) to capture facial details, dynamics, and context. By aligning these features into a modified LLaMA language model and applying instruction tuning, Emotion-LLaMA significantly improves its emotional recognition and reasoning capabilities.
- Superior performance: Extensive experiments demonstrate that Emotion-LLaMA outperforms other Multimodal Large Language Models (MLLMs) across multiple datasets, establishing it as the state-of-the-art model in public competitions, with high scores on the EMER dataset and strong F1 scores in multimodal emotion recognition and reasoning.
What work can be continued in depth?
Further work that can be continued in depth based on the provided context includes:
- Exploring the limits of transfer learning with a unified text-to-text transformer to enhance machine learning capabilities.
- Scaling instruction-finetuned language models to improve language modeling tasks and performance.
- Advancing universal audio understanding through unified large-scale audio-language models for enhanced audio processing and analysis.
- Grounding multimodal large language models to the world to improve the efficiency and effectiveness of large language models in various applications.
- Learning transferable visual models from natural language supervision to enhance visual understanding and reasoning in machine learning models.
- Revolutionizing emotion insights with visual instruction tuning to improve emotion recognition and reasoning in multimodal contexts.
- Enhancing multimodal emotion recognition with expression MAE for more accurate and detailed emotion analysis.
- Developing grounded situation recognition transformers with alternate semantic attention refinement for improved situation understanding in multimedia content.
- Advancing online video advertising through video ecommerce to enhance marketing strategies and engagement.