HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation

Zhenzhi Wang, Yixuan Li, Yanhong Zeng, Youqing Fang, Yuwei Guo, Wenran Liu, Jing Tan, Kai Chen, Tianfan Xue, Bo Dai, Dahua Lin · July 24, 2024

Summary

HumanVid is a large-scale, high-quality dataset designed for camera-controllable human image animation, combining real-world and synthetic data. It comprises 20,000 human-centric videos at 1080p resolution: the real-world portion is collected from copyright-free internet videos, and the synthetic portion is built from 2,300 copyright-free 3D avatar assets. On top of the dataset, the authors develop a baseline model, CamAnimate, which conditions on both human pose and camera motion and achieves state-of-the-art control over both. The dataset is publicly available and offers a valuable resource for training diffusion models and related applications.

The synthetic data pipeline addresses the scarcity of diverse, large-scale human-centric video data. It proceeds in three stages: character creation, motion retargeting, and 3D scene and camera placement. Videos are rendered in Unreal Engine 5 or Blender, featuring human-like characters built from SMPL-X meshes with clothing, as well as anime characters from user-generated assets. Because the pipeline runs without human supervision, it is fully scalable, and it covers diverse body shapes, skin textures, and 3D clothing and textures to represent a wide range of human appearances.

For clothing, the authors collected 111 unique outfits from the BEDLAM dataset and used commercial simulation software to produce realistic cloth deformations, along with 1,691 unique clothing textures for appearance diversity. Characters are driven by the SMPL-X body model, and examples of clothed characters are shown in the paper. Camera motion is generated by a rule-based pipeline that yields diverse, smooth, and natural trajectories that follow the person, set against varied background textures.

The paper draws on a broad body of prior work in 3D human modeling, animation, and computer vision, including pose estimation (OpenPose, realtime multi-person 2D pose estimation, DensePose), human video synthesis (Everybody Dance Now, MagicAnimate, structure- and content-guided video synthesis), controllable video diffusion (AnimateDiff, SparseCtrl, CameraCtrl, Align Your Latents), text-to-image diffusion (eDiff-I, denoising diffusion probabilistic models), depth estimation (ZoeDepth), synthetic human data (BEDLAM, HSPACE, Playing for 3D Human Recovery), asset and tooling ecosystems (VRoid Hub, Meshcapade, Blender), and the two-time-scale update rule underlying FID evaluation.

In conclusion, HumanVid addresses two persistent challenges in human image animation: the lack of high-quality public datasets and the neglect of camera motion. By pairing real-world and synthetic video with the CamAnimate baseline, it sets a new benchmark for controlling human pose and camera motion and offers a practical resource for researchers and practitioners working toward video and movie production. Illustrative code sketches for several of the pipeline stages described above follow below.
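The character-creation stage builds human-like bodies from SMPL-X meshes with varied shapes. As a minimal sketch (not the authors' code), the public `smplx` package (https://github.com/vchoutas/smplx) can sample diverse body shapes by randomizing the shape coefficients; the model directory below is a placeholder, and the SMPL-X model files must be downloaded separately.

```python
# Minimal sketch: sampling diverse SMPL-X body shapes with the `smplx` package.
# "models/" is a placeholder path to the separately downloaded SMPL-X files.
import torch
import smplx

model = smplx.create("models/", model_type="smplx", gender="neutral")
betas = torch.randn(1, 10) * 2.0        # random shape coefficients -> varied bodies
output = model(betas=betas, return_verts=True)
print(output.vertices.shape)            # torch.Size([1, 10475, 3]) SMPL-X vertices
```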
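The summary states that camera motions come from a rule-based pipeline producing smooth, natural trajectories that follow the person, but it does not give the rules themselves. Below is a hypothetical sketch of one such rule: interpolate a few random pan angles into a smooth orbit at fixed radius around the subject's ground-plane position. The function name and parameters are illustrative, not the paper's.

```python
import numpy as np

def smooth_follow_trajectory(person_xy, n_frames, radius=3.0, height=1.6,
                             pan_range=np.pi / 6, seed=0):
    """Hypothetical rule: keep the camera at a fixed radius from the subject
    while panning smoothly between a few randomly sampled key angles.

    person_xy: (n_frames, 2) ground-plane positions of the subject.
    Returns per-frame camera positions and look-at targets, both (n_frames, 3).
    """
    rng = np.random.default_rng(seed)
    keys = rng.uniform(-pan_range, pan_range, size=4)   # sparse key angles
    angles = np.interp(np.linspace(0, 1, n_frames),     # smooth interpolation
                       np.linspace(0, 1, len(keys)), keys)
    cam_pos = np.stack([
        person_xy[:, 0] + radius * np.sin(angles),
        person_xy[:, 1] - radius * np.cos(angles),
        np.full(n_frames, height),
    ], axis=1)
    # Aim roughly at the subject's torso (1 m above the ground plane).
    look_at = np.concatenate([person_xy, np.full((n_frames, 1), 1.0)], axis=1)
    return cam_pos, look_at

# Usage: a subject walking 2 m along x over 120 frames.
walk = np.stack([np.linspace(0, 2, 120), np.zeros(120)], axis=1)
pos, target = smooth_follow_trajectory(walk, n_frames=120)
```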
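Rendering uses Unreal Engine 5 or Blender. The following is a minimal Blender (`bpy`) sketch of the rendering step, run headlessly via `blender --background --python render_clip.py`; the asset path, light setup, and keyframed dolly move are illustrative placeholders rather than the paper's actual pipeline.

```python
import bpy

# Import an animated character asset (hypothetical FBX export of a clothed SMPL-X body).
bpy.ops.import_scene.fbx(filepath="assets/smplx_character.fbx")

scene = bpy.context.scene
scene.render.engine = "CYCLES"          # Cycles renders reliably in background mode
scene.render.resolution_x = 1920        # 1080p output, matching the dataset
scene.render.resolution_y = 1080

# Add a sun light so the headless render is not black.
sun = bpy.data.objects.new("sun", bpy.data.lights.new("sun", type="SUN"))
scene.collection.objects.link(sun)

# Create a camera and keyframe a simple dolly move; real trajectories would
# come from the rule-based generator sketched above.
cam = bpy.data.objects.new("cam", bpy.data.cameras.new("cam"))
scene.collection.objects.link(cam)
scene.camera = cam
for frame, x in [(1, -3.0), (120, -1.5)]:
    cam.location = (x, -4.0, 1.6)
    cam.keyframe_insert(data_path="location", frame=frame)

scene.frame_start, scene.frame_end = 1, 120
scene.render.filepath = "/tmp/clip/"    # frames written as /tmp/clip/0001.png, ...
bpy.ops.render.render(animation=True)
```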
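CamAnimate's architecture is not detailed in this summary. Schematically, a model that conditions on both human and camera motion must encode a pose sequence and per-frame camera extrinsics into features a video diffusion backbone can consume. The PyTorch module below is a purely illustrative sketch of that idea (joint count, feature width, and fusion by addition are all assumptions), not the paper's design.

```python
import torch
import torch.nn as nn

class PoseCameraConditioner(nn.Module):
    """Schematic conditioner: fuses a 2D pose sequence with per-frame
    camera extrinsics into one feature sequence for a diffusion backbone."""

    def __init__(self, n_joints=18, d_model=320):
        super().__init__()
        self.pose_mlp = nn.Sequential(
            nn.Linear(n_joints * 2, d_model), nn.SiLU(),
            nn.Linear(d_model, d_model),
        )
        # Camera extrinsics: a 3x4 [R|t] matrix flattened to 12 values per frame.
        self.cam_mlp = nn.Sequential(
            nn.Linear(12, d_model), nn.SiLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, pose_seq, cam_seq):
        # pose_seq: (B, T, n_joints, 2); cam_seq: (B, T, 3, 4)
        b, t = pose_seq.shape[:2]
        pose_feat = self.pose_mlp(pose_seq.reshape(b, t, -1))
        cam_feat = self.cam_mlp(cam_seq.reshape(b, t, -1))
        return pose_feat + cam_feat     # (B, T, d_model), injected into the backbone

cond = PoseCameraConditioner()
feat = cond(torch.randn(2, 16, 18, 2), torch.randn(2, 16, 3, 4))
print(feat.shape)                       # torch.Size([2, 16, 320])
```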
