Allo-AVA: A Large-Scale Multimodal Conversational AI Dataset for Allocentric Avatar Gesture Animation

Saif Punjwani, Larry Heck · October 21, 2024

Summary

The Allo-AVA dataset is a large-scale multimodal resource that addresses the scarcity of high-quality, synchronized text- and audio-driven avatar animation data. It comprises approximately 1,250 hours of diverse video content, complete with audio, transcripts, and precise timestamps for extracted body keypoints. This alignment enables accurate replication of human movements in sync with speech, supporting natural, context-aware avatar animation for applications such as virtual reality and digital assistants.

The dataset totals 7,500 videos with an average duration of 10 minutes, roughly 135 billion extracted keypoints, and 15 million transcribed words. Gestures are organized into 85 categories, emotions into 12, and speaker attributes into 25, covering a broad range of human communicative behavior across demographics. Keypoint extraction combines OpenPose and MediaPipe for high-accuracy pose estimation, with a novel fusion algorithm that integrates both models' outputs to improve keypoint accuracy. Processed in Python with GPU acceleration, each entry includes video, audio, transcript, keypoint, and visualization files, making Allo-AVA a valuable resource for avatar animation research and development.
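
The summary above implies that every extracted keypoint carries a timestamp that can be aligned with the word-level transcript. As a minimal sketch of what such a synchronized record could look like, here is a hypothetical Python schema; all class and field names are assumptions for illustration, not the dataset's actual on-disk format:

```python
# Hypothetical schema for one synchronized Allo-AVA sample. Field names
# and types are assumptions for illustration; the actual release format
# is defined by the dataset itself.
from dataclasses import dataclass, field


@dataclass
class KeypointFrame:
    timestamp_s: float                        # offset into the video, seconds
    joints: list[tuple[float, float, float]]  # (x, y, confidence) per joint


@dataclass
class TranscriptWord:
    word: str
    start_s: float                            # word onset, seconds
    end_s: float                              # word offset, seconds


@dataclass
class AlloAvaSample:
    video_path: str
    audio_path: str
    words: list[TranscriptWord] = field(default_factory=list)
    frames: list[KeypointFrame] = field(default_factory=list)


def frames_during(sample: AlloAvaSample, word: TranscriptWord) -> list[KeypointFrame]:
    """Return the keypoint frames that co-occur with a spoken word."""
    return [f for f in sample.frames if word.start_s <= f.timestamp_s <= word.end_s]
```

The helper captures the core affordance of the dataset: retrieving the pose frames that co-occur with any spoken word.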

Introduction
  Background
    - Overview of the scarcity of high-quality, synchronized text- and audio-driven avatar animation data
    - Importance of the Allo-AVA dataset in addressing this scarcity
  Objective
    - The goal of the Allo-AVA dataset in enhancing natural, context-aware avatar animations
    - Applications of the dataset in virtual reality, digital assistants, and beyond
Dataset Overview
  Content
    - Description of the 1,250 hours of diverse video content
    - Inclusion of audio, transcripts, and precise timestamps for keypoints
  Size and Structure (a quick consistency check follows below)
    - Total number of videos: 7,500
    - Average video duration: 10 minutes
    - Extraction scale: 135 billion keypoints and 15 million transcribed words
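
The headline figures above are internally consistent, which a quick check confirms (numbers taken directly from the summary; the per-video averages are rough implied values, not reported statistics):

```python
# Quick consistency check on the reported statistics.
videos = 7_500
avg_minutes = 10
print(videos * avg_minutes / 60)   # 1250.0 hours -> matches the stated ~1,250 hours

# Rough implied per-video averages, for a sense of scale:
print(135_000_000_000 / videos)    # ~18 million keypoints per video
print(15_000_000 / videos)         # ~2,000 transcribed words per video
```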
  Categorization
    - Overview of the 85 gesture categories, 12 emotion categories, and 25 speaker attribute categories
  Extraction Process
    - Combination of OpenPose and MediaPipe for high-accuracy pose estimation
    - Novel fusion algorithm integrating both models' outputs for improved keypoint accuracy (see the sketch below)
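
The exact fusion procedure is not detailed in this summary. A common approach, and one plausible reading of "integrating both models' outputs," is a per-joint confidence-weighted average; the following is a minimal sketch under that assumption, with all function and variable names invented for illustration:

```python
import numpy as np


def fuse_keypoints(openpose_kp: np.ndarray, mediapipe_kp: np.ndarray) -> np.ndarray:
    """Confidence-weighted fusion of two pose estimates.

    Both inputs are (num_joints, 3) arrays of (x, y, confidence), assumed
    to already share one coordinate frame and joint ordering; in practice
    the two models' joint sets must first be mapped onto a common skeleton.
    """
    c1 = openpose_kp[:, 2:3]
    c2 = mediapipe_kp[:, 2:3]
    total = c1 + c2
    # Fall back to an even 0.5/0.5 split where neither model detected the joint.
    w1 = np.divide(c1, total, out=np.full_like(c1, 0.5), where=total > 0)
    fused_xy = w1 * openpose_kp[:, :2] + (1.0 - w1) * mediapipe_kp[:, :2]
    fused_conf = np.maximum(c1, c2)   # keep the stronger detection score
    return np.concatenate([fused_xy, fused_conf], axis=1)
```

Weighting by detection confidence lets the more reliable estimate dominate where the two models disagree, for instance on occluded hands.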
Processing and Utilization
  Technical Details
    - Python-based processing with GPU acceleration
    - Each entry includes video, audio, transcript, keypoint, and visualization files (see the loading sketch below)
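
Given the stated per-entry files (video, audio, transcript, keypoints, visualization), a loader might walk a release directory as sketched below. The directory layout, file names, and extensions are assumptions; the dataset's own documentation defines the real structure:

```python
import json
from pathlib import Path


def iter_samples(root: str):
    """Yield per-entry file paths, assuming one directory per video.

    The names below (video.mp4, audio.wav, ...) are hypothetical; check
    the release documentation for the actual layout.
    """
    for entry in sorted(Path(root).iterdir()):
        if entry.is_dir():
            yield {
                "video": entry / "video.mp4",
                "audio": entry / "audio.wav",
                "transcript": entry / "transcript.json",
                "keypoints": entry / "keypoints.json",
                "visualization": entry / "visualization.mp4",
            }


# Example: load the keypoints of the first entry (root path is hypothetical).
for sample in iter_samples("allo_ava/"):
    with open(sample["keypoints"]) as f:
        frames = json.load(f)
    break
```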
  Research and Development
    - Potential applications in avatar animation research and development
    - How the dataset can be leveraged for advancements in natural language processing, computer vision, and AI-driven human interaction (see the pairing sketch below)
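
As one illustration of such leverage, the synchronized transcripts and keypoints lend themselves to training text- or speech-to-gesture models. The sketch below pairs per-frame word indices (the conditioning signal) with pose targets; the alignment scheme, shapes, and names are assumptions for illustration, not the paper's method:

```python
import numpy as np


def make_training_pair(word_times, frames, fps: float = 30.0):
    """Align transcript words to keypoint frames for sequence modeling.

    word_times: list of (word, start_s, end_s) tuples
    frames:     (num_frames, num_joints, 3) array of (x, y, confidence)
    Returns per-frame word indices (-1 = silence) as the conditioning
    signal and the (x, y) pose targets -- one simplistic alignment among
    many possible ones.
    """
    num_frames = frames.shape[0]
    labels = np.full(num_frames, -1, dtype=np.int64)
    for idx, (_word, start_s, end_s) in enumerate(word_times):
        lo = int(start_s * fps)
        hi = min(int(end_s * fps) + 1, num_frames)
        labels[lo:hi] = idx
    return labels, frames[:, :, :2]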
Conclusion
  Impact
    - The significance of the Allo-AVA dataset in advancing avatar animation and AI-driven human interaction
    - Future directions for research utilizing the dataset
Basic info
  Type: paper
  arXiv categories: Computation and Language; Computer Vision and Pattern Recognition; Artificial Intelligence