Allo-AVA: A Large-Scale Multimodal Conversational AI Dataset for Allocentric Avatar Gesture Animation

Saif Punjwani, Larry Heck · October 21, 2024

Summary

The Allo-AVA dataset is a large-scale multimodal resource that addresses the scarcity of high-quality, synchronized text- and audio-driven avatar animation data. It comprises approximately 1,250 hours of diverse video content, complete with audio, transcripts, and precise timestamps for extracted body keypoints. This alignment enables accurate replication of human movements in sync with speech, supporting natural, context-aware avatar animation for applications such as virtual reality and digital assistants.

The dataset totals 7,500 videos with an average duration of 10 minutes, roughly 135 billion extracted keypoints, and 15 million transcribed words. Gestures are organized into 85 categories, emotions into 12, and speaker attributes into 25, covering a broad range of human communicative behavior across demographics. Keypoint extraction combines OpenPose and MediaPipe for high-accuracy pose estimation, with a novel fusion algorithm that integrates both models' outputs to improve keypoint accuracy. Processed in Python with GPU acceleration, each entry includes video, audio, transcript, keypoint, and visualization files, making Allo-AVA a valuable resource for avatar animation research and development.
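
The summary above implies that every extracted keypoint carries a timestamp that can be aligned with the word-level transcript. As a minimal sketch of what such a synchronized record could look like, here is a hypothetical Python schema; all class and field names are assumptions for illustration, not the dataset's actual on-disk format:

```python
# Hypothetical schema for one synchronized Allo-AVA sample. Field names
# and types are assumptions for illustration; the actual release format
# is defined by the dataset itself.
from dataclasses import dataclass, field


@dataclass
class KeypointFrame:
    timestamp_s: float                        # offset into the video, seconds
    joints: list[tuple[float, float, float]]  # (x, y, confidence) per joint


@dataclass
class TranscriptWord:
    word: str
    start_s: float                            # word onset, seconds
    end_s: float                              # word offset, seconds


@dataclass
class AlloAvaSample:
    video_path: str
    audio_path: str
    words: list[TranscriptWord] = field(default_factory=list)
    frames: list[KeypointFrame] = field(default_factory=list)


def frames_during(sample: AlloAvaSample, word: TranscriptWord) -> list[KeypointFrame]:
    """Return the keypoint frames that co-occur with a spoken word."""
    return [f for f in sample.frames if word.start_s <= f.timestamp_s <= word.end_s]
```

The helper captures the core affordance of the dataset: retrieving the pose frames that co-occur with any spoken word.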

Introduction
  Background
    - Overview of the scarcity of high-quality, synchronized text- and audio-driven avatar animation data
    - Importance of the Allo-AVA dataset in addressing this scarcity
  Objective
    - The goal of the Allo-AVA dataset in enhancing natural, context-aware avatar animations
    - Applications of the dataset in virtual reality, digital assistants, and beyond
Dataset Overview
  Content
    - Description of the 1,250 hours of diverse video content
    - Inclusion of audio, transcripts, and precise timestamps for keypoints
  Size and Structure (a quick consistency check follows below)
    - Total number of videos: 7,500
    - Average video duration: 10 minutes
    - Extraction scale: 135 billion keypoints and 15 million transcribed words
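
The headline figures above are internally consistent, which a quick check confirms (numbers taken directly from the summary; the per-video averages are rough implied values, not reported statistics):

```python
# Quick consistency check on the reported statistics.
videos = 7_500
avg_minutes = 10
print(videos * avg_minutes / 60)   # 1250.0 hours -> matches the stated ~1,250 hours

# Rough implied per-video averages, for a sense of scale:
print(135_000_000_000 / videos)    # ~18 million keypoints per video
print(15_000_000 / videos)         # ~2,000 transcribed words per video
```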
  Categorization
    - Overview of the 85 gesture categories, 12 emotion categories, and 25 speaker attribute categories
  Extraction Process
    - Combination of OpenPose and MediaPipe for high-accuracy pose estimation
    - Novel fusion algorithm integrating both models' outputs for improved keypoint accuracy (see the sketch below)
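
The exact fusion procedure is not detailed in this summary. A common approach, and one plausible reading of "integrating both models' outputs," is a per-joint confidence-weighted average; the following is a minimal sketch under that assumption, with all function and variable names invented for illustration:

```python
import numpy as np


def fuse_keypoints(openpose_kp: np.ndarray, mediapipe_kp: np.ndarray) -> np.ndarray:
    """Confidence-weighted fusion of two pose estimates.

    Both inputs are (num_joints, 3) arrays of (x, y, confidence), assumed
    to already share one coordinate frame and joint ordering; in practice
    the two models' joint sets must first be mapped onto a common skeleton.
    """
    c1 = openpose_kp[:, 2:3]
    c2 = mediapipe_kp[:, 2:3]
    total = c1 + c2
    # Fall back to an even 0.5/0.5 split where neither model detected the joint.
    w1 = np.divide(c1, total, out=np.full_like(c1, 0.5), where=total > 0)
    fused_xy = w1 * openpose_kp[:, :2] + (1.0 - w1) * mediapipe_kp[:, :2]
    fused_conf = np.maximum(c1, c2)   # keep the stronger detection score
    return np.concatenate([fused_xy, fused_conf], axis=1)
```

Weighting by detection confidence lets the more reliable estimate dominate where the two models disagree, for instance on occluded hands.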
Processing and Utilization
  Technical Details
    - Python-based processing with GPU acceleration
    - Each entry includes video, audio, transcript, keypoint, and visualization files (see the loading sketch below)
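
Given the stated per-entry files (video, audio, transcript, keypoints, visualization), a loader might walk a release directory as sketched below. The directory layout, file names, and extensions are assumptions; the dataset's own documentation defines the real structure:

```python
import json
from pathlib import Path


def iter_samples(root: str):
    """Yield per-entry file paths, assuming one directory per video.

    The names below (video.mp4, audio.wav, ...) are hypothetical; check
    the release documentation for the actual layout.
    """
    for entry in sorted(Path(root).iterdir()):
        if entry.is_dir():
            yield {
                "video": entry / "video.mp4",
                "audio": entry / "audio.wav",
                "transcript": entry / "transcript.json",
                "keypoints": entry / "keypoints.json",
                "visualization": entry / "visualization.mp4",
            }


# Example: load the keypoints of the first entry (root path is hypothetical).
for sample in iter_samples("allo_ava/"):
    with open(sample["keypoints"]) as f:
        frames = json.load(f)
    break
```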
  Research and Development
    - Potential applications in avatar animation research and development
    - How the dataset can be leveraged for advancements in natural language processing, computer vision, and AI-driven human interaction (see the pairing sketch below)
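
As one illustration of such leverage, the synchronized transcripts and keypoints lend themselves to training text- or speech-to-gesture models. The sketch below pairs per-frame word indices (the conditioning signal) with pose targets; the alignment scheme, shapes, and names are assumptions for illustration, not the paper's method:

```python
import numpy as np


def make_training_pair(word_times, frames, fps: float = 30.0):
    """Align transcript words to keypoint frames for sequence modeling.

    word_times: list of (word, start_s, end_s) tuples
    frames:     (num_frames, num_joints, 3) array of (x, y, confidence)
    Returns per-frame word indices (-1 = silence) as the conditioning
    signal and the (x, y) pose targets -- one simplistic alignment among
    many possible ones.
    """
    num_frames = frames.shape[0]
    labels = np.full(num_frames, -1, dtype=np.int64)
    for idx, (_word, start_s, end_s) in enumerate(word_times):
        lo = int(start_s * fps)
        hi = min(int(end_s * fps) + 1, num_frames)
        labels[lo:hi] = idx
    return labels, frames[:, :, :2]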
Conclusion
  Impact
    - The significance of the Allo-AVA dataset in advancing avatar animation and AI-driven human interaction
    - Future directions for research utilizing the dataset
Basic info
  Type: paper
  arXiv categories: Computation and Language; Computer Vision and Pattern Recognition; Artificial Intelligence