Benchmarking Mental State Representations in Language Models

Matteo Bortoletto, Constantin Ruhdorfer, Lei Shi, Andreas Bulling · June 25, 2024

Summary

This study investigates how language models represent mental states, particularly Theory of Mind (ToM), through extensive benchmarking across model types, sizes, and fine-tuning methods. Key findings include:

1. Larger models and fine-tuning improve performance on belief-inference tasks, yielding more accurate representations of mental states.
2. Performance is highly prompt-sensitive: different prompt variations change how well models interpret mental states, and some prompts measurably improve performance.
3. Activation editing techniques such as contrastive activation addition (CAA) can enhance ToM capabilities without retraining, suggesting that model reasoning can be steered at inference time (see the sketch below).
4. Probing experiments show how model size, fine-tuning, and prompt design shape internal belief representations, with smaller models being more vulnerable to prompt variations.
5. The study discusses the ethical implications of using language models to represent mental states, emphasizing the need for caution and further research.

Overall, the work advances our understanding of how language models process mental states and offers guidance on optimizing their performance while addressing the ethical considerations involved.
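To illustrate the general idea behind contrastive activation addition mentioned in point 3, the sketch below extracts a steering vector as the mean difference in residual-stream activations between contrastive belief statements and adds it back during generation. This is a minimal sketch under stated assumptions, not the paper's implementation: the model name (`gpt2`), the layer index, the steering strength, and the Sally-Anne-style prompt pair are illustrative choices.

```python
# Minimal sketch of contrastive activation addition (CAA) with a Hugging Face
# causal LM. Model name, layer index, steering strength, and prompts are
# illustrative assumptions, not values from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder; the paper studies larger, fine-tuned models
LAYER = 6             # residual-stream layer to steer (illustrative)
ALPHA = 4.0           # steering strength (illustrative)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def last_token_residual(prompt: str) -> torch.Tensor:
    """Residual-stream activation of the final token after block LAYER."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so index LAYER + 1
    # corresponds to the output of transformer block LAYER.
    return out.hidden_states[LAYER + 1][0, -1, :]

# Contrastive pair: identical story, continuations differing only in the
# belief attributed to the protagonist (false belief vs. true belief).
story = ("Sally puts the ball in the basket and leaves. "
         "Anne moves the ball to the box. ")
pairs = [(story + "Sally thinks the ball is in the basket.",
          story + "Sally thinks the ball is in the box.")]

# Steering vector: mean activation difference over all contrastive pairs.
steer = torch.stack([last_token_residual(pos) - last_token_residual(neg)
                     for pos, neg in pairs]).mean(dim=0)

def add_steering(module, inputs, output):
    """Forward hook that adds the scaled steering vector to block LAYER's output."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * steer
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# model.transformer.h is the block list for GPT-2; other architectures name it differently.
handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
question = story + "Where will Sally look for the ball?"
ids = tok(question, return_tensors="pt")
with torch.no_grad():
    completion = model.generate(**ids, max_new_tokens=20)
print(tok.decode(completion[0], skip_special_tokens=True))
handle.remove()
```

In practice, the steering vector would be averaged over many contrastive pairs, and the layer and scaling factor tuned on held-out ToM benchmarks rather than fixed by hand.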
