M3GIA: A Cognition Inspired Multilingual and Multimodal General Intelligence Ability Benchmark

Wei Song, Yadong Li, Jianhua Xu, Guowei Wu, Lingfeng Ming, Kexin Yi, Weihua Luo, Houyi Li, Yi Du, Fangda Guo, Kaicheng Yu·June 08, 2024

Summary

M3GIA is a novel benchmark introduced to evaluate the cognitive abilities of multilingual and multimodal large language models (MLLMs) using the Cattell-Horn-Carroll model of intelligence. It assesses five cognitive factors (fluid reasoning, comprehension-knowledge, visual processing, reading & writing, and quantitative knowledge) across six languages, aiming to provide a more comprehensive understanding than superficial task performance. The study reveals disparities in performance across languages and identifies a "winner takes all" phenomenon. M3GIA differentiates itself by grounding its design in cognitive science, and the benchmark is planned to be released as open source to support model improvement. It has been applied to evaluate models such as GPT-4, showing varying levels of intelligence across languages and cognitive domains.


Paper digest

What problem does the paper attempt to solve? Is this a new problem?

The paper addresses the challenge of evaluating state-of-the-art models through the lens of cognitive science in order to assess their emergent abilities, focusing on large language models (LLMs) and Multimodal Large Language Models (MLLMs). It analyzes these models from a psychological perspective, exploring their human-like cognition, such as the Theory of Mind (ToM) capabilities exhibited by GPT-4 and the ability of MLLMs to process and integrate multimodal information. The paper also highlights that existing benchmarks lack a solid theoretical underpinning and do not systematically evaluate models' cognitive abilities. Evaluating the cognitive abilities of advanced models is not an entirely new problem, but it is a current focus of AI research aimed at understanding the cognitive factors governing task performance.


What scientific hypothesis does this paper seek to validate?

The paper seeks to validate the hypothesis that a General Intelligence Ability (GIA) has emerged in multilingual and multimodal large language models (MLLMs). The study examines the cognitive abilities of these models from a psychological perspective, showing that MLLMs exhibit human-like cognition, including Theory of Mind capabilities similar to human inference patterns. It further investigates the emergence of mental intelligence in large models and a foundational GIA factor that governs their various cognitive abilities.
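The digest does not spell out the paper's exact statistical modeling of GIA. As a rough illustration of the underlying idea, that a single general factor can account for much of the variation across the five CHC factor scores, the following Python sketch extracts the first principal component from hypothetical per-model scores; all values and the PCA-based approach are assumptions made for illustration, not the authors' procedure.

```python
import numpy as np

FACTORS = ["Gf", "Gc", "Gv", "Grw", "Gq"]

# Rows are models, columns are scores on the five CHC factors (all values fabricated).
scores = np.array([
    [0.72, 0.80, 0.65, 0.78, 0.60],
    [0.55, 0.62, 0.50, 0.58, 0.41],
    [0.48, 0.57, 0.44, 0.52, 0.35],
    [0.33, 0.40, 0.31, 0.37, 0.22],
])

centered = scores - scores.mean(axis=0)
_, s, vt = np.linalg.svd(centered, full_matrices=False)

explained = s ** 2 / np.sum(s ** 2)   # share of variance per principal component
loadings = vt[0]                       # how strongly each factor loads on the first component
g_scores = centered @ loadings         # a crude "general ability" score per model

print("Variance explained by the first component:", round(float(explained[0]), 3))
print("Factor loadings:", dict(zip(FACTORS, np.round(loadings, 3))))
print("General-factor scores per model:", np.round(g_scores, 3))
```

If the first component explains most of the variance and loads in the same direction on all five factors, that is the kind of pattern consistent with the "winner takes all" observation: models strong on one factor tend to be strong on the others.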


What new ideas, methods, or models does the paper propose? What are the characteristics and advantages compared to previous methods?

The paper proposes evaluating state-of-the-art models through the lens of cognitive science to explore the mental intelligence emerging from large models. It discusses how Large Language Models (LLMs) exhibit human-like cognition, such as Theory of Mind (ToM) capabilities similar to human inference patterns, and how Multimodal Large Language Models (MLLMs) leverage powerful LLMs to process and integrate multimodal information, showing remarkable emergent abilities such as generating website code from images, understanding memes, and performing mathematical reasoning. Because they process information from diverse sources, these models follow a more holistic cognitive process and resemble human cognition more closely than models limited to linguistic input. On this basis, the paper argues for a systematic evaluation of models' underlying cognitive abilities grounded in cognitive science, emphasizing the importance of understanding the cognitive factors of MLLMs.

In comparison to previous methods, existing benchmarks such as MMBench, MME, and MM-Vet compartmentalize model capabilities across various tasks but often lack a solid theoretical underpinning and fail to provide a comprehensive evaluation of models' cognitive abilities. This underscores the value of a more in-depth, cognitively grounded analysis of the underlying cognitive processes of MLLMs to gain a deeper understanding of their intelligence.


Does any related research exist? Who are the noteworthy researchers in this field? What is the key to the solution mentioned in the paper?

Several related studies exist in this field. Noteworthy researchers mentioned in the paper's context include:

  • Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., Li, B., Luo, P., Lu, T., Qiao, Y., and Dai, J.
  • Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., et al.
  • Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., and Zhou, J.
  • Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., et al.

The key to the solution mentioned in the paper involves aspects such as scaling up vision foundation models, aligning them for generic visual-linguistic tasks, and improving large multimodal models with better captions. These approaches aim to enhance the performance and capabilities of vision-language models for comprehensive evaluation and benchmarking.


How were the experiments in the paper designed?

The experiments were designed to comprehensively evaluate the cognitive abilities of Multimodal Large Language Models (MLLMs) based on the Cattell-Horn-Carroll (CHC) Model of Intelligence. Five key cognitive factors were identified for current MLLMs: Fluid Reasoning (Gf), Comprehension-Knowledge (Gc), Visual Processing (Gv), Reading and Writing (Grw), and Quantitative Knowledge (Gq). These factors were measured through a series of tests spanning six languages: English, Chinese, French, Spanish, Portuguese, and Korean. The experiments compared the cognitive abilities of various MLLMs against human performance and discussed the impact of factors such as the size of the language-model component on cognitive abilities.
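To make this setup concrete, the sketch below shows one way a per-language, per-factor accuracy breakdown could be computed for a multiple-choice benchmark organized like M3GIA. The item schema, field names, and the `predict` callable are assumptions made for illustration and do not reflect the benchmark's released data format or official scoring code.

```python
from collections import defaultdict

def evaluate(items, predict):
    """Return accuracy keyed by (language, cognitive factor).

    items: iterable of dicts with 'language', 'factor', 'question', 'choices', 'answer'
           (field names are assumed here, not the benchmark's actual schema).
    predict: callable mapping an item to the index of the chosen option.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        key = (item["language"], item["factor"])
        total[key] += 1
        if predict(item) == item["answer"]:
            correct[key] += 1
    return {key: correct[key] / total[key] for key in total}

# Toy usage with two hypothetical items and a dummy "model" that always picks option 0.
toy_items = [
    {"language": "en", "factor": "Gq", "question": "2 + 3 = ?",
     "choices": ["5", "6"], "answer": 0},
    {"language": "ko", "factor": "Gc", "question": "...",
     "choices": ["A", "B"], "answer": 1},
]
print(evaluate(toy_items, lambda item: 0))  # {('en', 'Gq'): 1.0, ('ko', 'Gc'): 0.0}
```

Scores produced this way, one per (language, factor) cell, are the kind of raw material a GIA-style aggregate metric could then be built on.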


What is the dataset used for quantitative evaluation? Is the code open source?

The dataset used for quantitative evaluation is the M3GIA benchmark itself, a cognition-inspired multilingual and multimodal general intelligence ability benchmark. The code is open source, released with the aim of facilitating the enhancement of cognitive capabilities in multilingual and multimodal large language models (MLLMs).


Do the experiments and results in the paper provide good support for the scientific hypotheses that need to be verified? Please analyze.

The experiments and results presented in the paper provide substantial support for the hypotheses under investigation. The study analyzes large language models (LLMs) from a cognitive-science perspective, demonstrating their human-like cognition and emergent abilities. It explores the mental intelligence emerging from these models and highlights capabilities such as generating website code from images, understanding memes, and mathematical reasoning. This analysis aligns with a primary motivation of AI research: evaluating state-of-the-art models through the lens of cognitive science.

Moreover, the study discusses the application of Theory of Mind (ToM) tests to large models, noting that GPT-4 exhibits ToM capabilities similar to human inference patterns. Multimodal Large Language Models (MLLMs) have shown impressive emergent abilities by processing and integrating multimodal information, resembling human cognition more closely than models confined to purely linguistic input. The research also discusses the limitations and challenges in evaluating the cognitive factors of MLLMs, emphasizing the need for a more comprehensive understanding of their cognitive abilities.

Overall, the experiments and results in the paper provide a solid foundation for verifying the hypotheses about the cognitive abilities and emergent intelligence of large language models, shedding light on their potential and limitations relative to human cognition.


What are the contributions of this paper?

The paper makes several key contributions:

  • Introducing M3GIA, the first cognition-driven multilingual and multimodal benchmark for evaluating the general intelligence ability of Multimodal Large Language Models (MLLMs).
  • Identifying five key cognitive factors based on the Cattell-Horn-Carroll (CHC) model of intelligence and proposing a novel evaluation metric.
  • Going beyond English to include Chinese, French, Spanish, Portuguese, and Korean in the evaluation, in order to assess the impact of language on the cognitive ability of MLLMs.
  • Collecting data from human participants, revealing that the most advanced MLLM reaches the lower boundary of human intelligence in English, while significant disparities remain in the other five languages assessed.
  • Highlighting the importance of understanding the intelligence of MLLMs beyond task performance and superficial achievements by incorporating cognitive science into the evaluation process.

What work can be continued in depth?

Further work that can be continued in depth includes:

  • Providing a more definitive and persuasive explanation for the underlying causes of the "winner takes all" phenomenon observed in MLLMs, which corroborates the emergence of General Intelligence Ability (GIA) within cutting-edge MLLMs.
  • Expanding the human data used to construct the GIA model and to compare the cognitive abilities of current MLLMs with those of humans, so that a more comprehensive and varied set of human samples improves the accuracy of the GIA model and the objectivity of the findings (see the sketch after this list).
  • Extending M3GIA, the first benchmark to comprehensively evaluate the cognitive abilities of MLLMs under the theoretical umbrella of the well-recognized Cattell-Horn-Carroll (CHC) Model of Intelligence, which categorizes the cognitive capacities of current MLLMs into Fluid Reasoning (Gf), Comprehension-Knowledge (Gc), Visual Processing (Gv), Reading and Writing (Grw), and Quantitative Knowledge (Gq).
  • Exploring whether languages impact the cognitive abilities of MLLMs, since using multilingual data to scale up the capability of MLLMs has become a de facto standard in AI research.
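As a concrete illustration of the human-comparison point in the second bullet above, the sketch below places a model's GIA-style score within a sample of human scores as a percentile. The human sample, the model score, and the normal-distribution placeholder are all fabricated for illustration; the paper's actual human data and scoring procedure are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
human_scores = rng.normal(loc=100.0, scale=15.0, size=300)  # placeholder human sample
model_score = 88.0                                          # placeholder model result

# Fraction of the human sample scoring below the model, expressed as a percentile.
percentile = float(np.mean(human_scores < model_score)) * 100
print(f"Model falls at roughly the {percentile:.0f}th percentile of this human sample")
```

A larger and more varied human sample would narrow the uncertainty of such a comparison, which is exactly the expansion the authors suggest.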


Outline
Introduction
Background
Emergence of M3GIA in evaluating cognitive abilities of MLLMs
Cattell-Horn-Carroll model as a foundation
Objective
Comprehensive assessment of cognitive factors
Disparities and "winner takes all" phenomenon in model performance
Open-source goal for model improvement
Methodology
Data Collection
Selection of cognitive tasks across six languages
Task design for fluid reasoning, comprehension-knowledge, etc.
Data Preprocessing
Adaptation of tasks for multilingual and multimodal models
Standardization and normalization across languages
Cognitive Factors Assessment
Fluid Reasoning
Task description and evaluation
Performance analysis across languages
Comprehension-Knowledge
Language-specific tests and results
Visual Processing
Visual tasks and model responses
Reading & Writing
Literacy tasks and language proficiency evaluation
Quantitative Knowledge
Math and problem-solving tasks in different languages
Disparities and Phenomena
Comparative analysis of model performance
"Winner takes all" effect illustration
Open-Source Initiative
M3GIA's open-access platform and community involvement
Application and Evaluation
GPT-4 and Other Models
M3GIA results for GPT-4 and other MLLMs
Insights into language intelligence variations
Conclusion
Significance of M3GIA for understanding model cognition
Future directions and implications for multilingual AI research
