Application of Vision-Language Model to Pedestrians Behavior and Scene Understanding in Autonomous Driving

Haoxiang Gao, Yu Zhao · January 12, 2025

Summary

Autonomous driving systems still struggle to understand pedestrian behavior. Vision-language models, adept at scene understanding and planning, offer promise but demand significant computational resources. This paper analyzes how semantic labels produced by a large language model can be distilled into smaller vision networks, yielding a richer scene representation for autonomous driving decision-making. It discusses deploying large language and vision models on vehicles, focusing on developing specialized models, efficient deployment strategies, and methods for generating actionable signals for vehicle control. The paper introduces a comprehensive taxonomy of pedestrian semantic attributes, aimed at enabling more intelligent and responsive autonomous vehicles. The Waymo Open Dataset, with its diverse real-world scenarios, serves as the data source for pedestrian behavior prediction, and GPT-4V generates pedestrian annotations focused on actions, behaviors, and unusual situations. Using these annotations, the authors formulate pedestrian understanding as a multi-label classification problem, predicting the probability that each semantic label appears in the GPT output. Training uses a binary cross-entropy loss, and CNN and Vision Transformer backbones are compared for producing the semantic embedding; CLIP, an influential model for multi-modal learning, is also introduced. Ensemble models for text generation are evaluated with metrics such as BLEU score, and the pedestrian behavior signals and latent semantic embedding are fed into a trajectory prediction task, where the study reports significant reductions in trajectory error at 3 seconds compared to a baseline.


Introduction
Background
Overview of autonomous driving advancements and challenges
Importance of understanding pedestrians in autonomous driving
Objective
Focus on effective knowledge distillation techniques for enhancing scene representation
Development of specialized models for autonomous driving decision-making
Method
Data Collection
Utilization of the Waymo Open Dataset for pedestrian behavior prediction
Gathering diverse real-world scenarios for training models
Data Preprocessing
Preparation of the dataset for model training
Annotation generation using GPT-4V for pedestrian actions, behaviors, and unusual situations (sketched below)
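A minimal sketch of how such annotations could be requested from GPT-4V through the OpenAI chat-completions API; the prompt wording, model name, and the annotate_pedestrian helper are illustrative assumptions, not the authors' exact pipeline.

```python
# Hypothetical sketch: querying a GPT-4-class vision model for a pedestrian annotation.
# Prompt text and model name are assumptions, not the paper's exact setup.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def annotate_pedestrian(image_path: str) -> str:
    """Ask the vision-language model to describe a pedestrian's action and intent."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # any GPT-4-class vision model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this pedestrian's action, behavior, and any unusual situation."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```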
Knowledge Distillation
Large Language Model Annotations
Extraction of semantic labels from large language models
Distillation of knowledge to smaller vision networks
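One way to turn the free-text GPT annotations into distillation targets for a small vision network is to match them against the semantic-label taxonomy and build a multi-hot vector per pedestrian. A minimal sketch under that assumption; the label list and string-matching rule are illustrative, not the paper's exact procedure.

```python
# Illustrative: convert a GPT annotation string into a multi-hot target vector
# over a fixed semantic-label vocabulary (the vocabulary shown is an assumption).
import torch

LABELS = ["walking", "running", "standing", "crossing_road",
          "looking_at_phone", "pushing_stroller", "waiting_at_curb"]

def annotation_to_target(annotation: str) -> torch.Tensor:
    """Mark label i as present if its (underscore-free) name occurs in the GPT text."""
    text = annotation.lower()
    target = torch.zeros(len(LABELS))
    for i, label in enumerate(LABELS):
        if label.replace("_", " ") in text:
            target[i] = 1.0
    return target
```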
Deployment Strategies
Optimization for on-vehicle deployment of large models
Development of specialized models for efficient resource usage
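The paper discusses deployment strategies rather than a specific toolchain; as one hedged illustration, a trained student network could be dynamically quantized or traced to TorchScript before being shipped to the vehicle. The model choice and input shape below are placeholders.

```python
# Illustrative on-vehicle deployment steps for the distilled student network.
# ResNet-50 and the 224x224 input are placeholders, not the paper's configuration.
import torch
import torchvision

model = torchvision.models.resnet50(weights=None)  # stand-in for the distilled student
model.eval()

# Option 1: dynamic int8 quantization of linear layers to cut memory and CPU latency.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)

# Option 2: trace the fp32 model to TorchScript for a C++ on-vehicle runtime.
dummy = torch.randn(1, 3, 224, 224)
scripted = torch.jit.trace(model, dummy)
scripted.save("student_pedestrian_net.pt")
```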
Pedestrian Semantic Attributes Taxonomy
Comprehensive Taxonomy
Development of a taxonomy for pedestrian semantic attributes
Enhancing autonomous vehicle intelligence and responsiveness
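The taxonomy itself is the paper's contribution; the grouping below is only a hypothetical illustration of how such attributes might be organized into categories, not the paper's actual label set.

```python
# Hypothetical illustration of a pedestrian semantic-attribute taxonomy layout;
# category names and labels are examples, not the paper's actual taxonomy.
PEDESTRIAN_TAXONOMY = {
    "motion":    ["walking", "running", "standing_still", "jaywalking"],
    "attention": ["looking_at_traffic", "looking_at_phone", "distracted"],
    "context":   ["at_crosswalk", "on_sidewalk", "near_parked_car"],
    "special":   ["wheelchair_user", "pushing_stroller", "construction_worker"],
}

# Flatten into the label vocabulary used for multi-label classification.
ALL_LABELS = [label for group in PEDESTRIAN_TAXONOMY.values() for label in group]
```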
Model Formulation and Evaluation
Problem Formulation
Formulation of the pedestrian understanding problem as a multi-label classification
Prediction of probabilities for semantic labels in GPT outputs
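A minimal sketch of the multi-label formulation: a vision backbone produces an embedding, and a linear head with an independent sigmoid per label outputs the probability that each semantic label appears in the GPT annotation. The layer sizes are assumptions.

```python
# Sketch of the multi-label head: per-label probabilities via a sigmoid, so several
# attributes can be active for the same pedestrian. Dimensions are assumptions.
import torch
import torch.nn as nn

class PedestrianAttributeHead(nn.Module):
    def __init__(self, embed_dim: int = 512, num_labels: int = 50):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, num_labels)

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        # Independent sigmoid per label, unlike the softmax of single-label classification.
        return torch.sigmoid(self.classifier(embedding))
```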
Loss Function and Model Backbones
Use of binary cross-entropy loss for training
Comparison of CNN and Vision Transformer backbones for semantic embedding
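A hedged sketch of the training objective and backbone comparison: either a CNN or a Vision Transformer feeds the same multi-label head, trained with binary cross-entropy. The specific torchvision architectures are illustrative choices, not necessarily the ones used in the paper.

```python
# Illustrative training step comparing backbones; resnet50 vs. vit_b_16 are assumptions.
import torch
import torch.nn as nn
import torchvision

def build_backbone(kind: str, embed_dim: int = 512) -> nn.Module:
    if kind == "cnn":
        net = torchvision.models.resnet50(weights=None)
        net.fc = nn.Linear(net.fc.in_features, embed_dim)
    else:  # vision transformer
        net = torchvision.models.vit_b_16(weights=None)
        net.heads = nn.Linear(net.hidden_dim, embed_dim)
    return net

backbone = build_backbone("cnn")
head = nn.Linear(512, 50)            # 50 = assumed number of semantic labels
criterion = nn.BCEWithLogitsLoss()   # binary cross-entropy; sigmoid applied internally

images = torch.randn(8, 3, 224, 224)               # dummy batch
targets = torch.randint(0, 2, (8, 50)).float()     # dummy multi-hot labels
logits = head(backbone(images))
loss = criterion(logits, targets)
loss.backward()
```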
Multi-modal Learning
Introduction of CLIP for multi-modal learning in autonomous driving
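A minimal sketch of using CLIP (via Hugging Face transformers) to score candidate pedestrian attribute descriptions against an image crop; this illustrates the multi-modal approach in general rather than the paper's exact configuration.

```python
# Illustrative CLIP scoring of attribute descriptions for a pedestrian crop.
# Checkpoint name and prompt phrasing are common defaults, not the paper's setup.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("pedestrian_crop.jpg")
texts = ["a pedestrian crossing the road",
         "a pedestrian waiting at the curb",
         "a pedestrian looking at a phone"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # image-text similarity per description
```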
Text Generation and Trajectory Prediction
Ensemble Models for Text Generation
Evaluation of ensemble models for text generation
Focus on metrics like BLEU score for model performance
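BLEU measures n-gram overlap between a generated description and a reference text; a quick sketch with NLTK, where the example sentences are made up.

```python
# Sketch: scoring a generated pedestrian description against a reference annotation
# with sentence-level BLEU. Sentences are made-up examples.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "a pedestrian is crossing the road while looking at a phone".split()
candidate = "a pedestrian crosses the road looking at their phone".split()

score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```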
Trajectory Prediction Tasks
Utilization of pedestrian behavior signals and latent semantic embedding for trajectory prediction
Comparison of trajectory errors at 3 seconds with a baseline
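For the trajectory task, the error is typically measured as the displacement from the ground-truth position at the 3-second horizon; a hedged sketch of that metric, where the 10 Hz sampling rate and tensor shapes are assumptions rather than values from the paper.

```python
# Sketch of the 3-second displacement error used to compare against a baseline.
# Assumes predictions/ground truth sampled at 10 Hz with shape (batch, timesteps, 2).
import torch

def displacement_error_at(pred: torch.Tensor, gt: torch.Tensor,
                          horizon_s: float = 3.0, hz: int = 10) -> torch.Tensor:
    """Mean Euclidean distance between predicted and true positions at a given horizon."""
    t = int(horizon_s * hz) - 1
    return torch.linalg.norm(pred[:, t, :] - gt[:, t, :], dim=-1).mean()

pred = torch.randn(8, 30, 2)   # 8 pedestrians, 30 future steps (3 s at 10 Hz), (x, y)
gt = torch.randn(8, 30, 2)
print(displacement_error_at(pred, gt))
```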
Results and Conclusion
Achievements
Significant reductions in trajectory errors at 3 seconds
Enhanced autonomous vehicle decision-making through improved pedestrian understanding
Future Directions
Ongoing research and development in autonomous driving and pedestrian understanding
Integration of advanced AI techniques for safer and more efficient autonomous vehicles
Basic info
computer vision and pattern recognition
robotics
machine learning
artificial intelligence