Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols

Charlie Griffin, Louis Thomson, Buck Shlegeris, Alessandro Abate · September 12, 2024

Summary

AI-Control Games provide a formal decision-making framework for evaluating the safety and effectiveness of deployment protocols for untrusted AI systems. The framework is a multi-objective, partially observable, stochastic game, and it can be reduced to a set of zero-sum games, which makes it possible to search for optimal protocol strategies. Applied to Trusted Monitoring protocols for deploying untrusted language models, the model finds protocols that improve on those evaluated in prior empirical studies, extends evaluation to new settings, and characterizes the trade-off between safety and usefulness. The game formalizes the interaction between Blue, the safety-oriented protocol, and Red, the untrusted AI, and guides Blue's choice among actions such as auditing a solution, deferring, or running it. Two instantiations, the 'Original Trusted Monitor' and the 'Weak Trusted Monitor', differ in the distribution of suspicion scores produced by the trusted monitor. Formalizing these games supports analysis of protocol properties and identification of Pareto frontiers over mixtures of policies, highlighting how assumptions such as the auditing budget and the number of problems affect safety and usefulness scores. Future work aims to apply the model to broader settings and to explore memory-dependent protocols for Trusted Monitoring.
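To make the setup concrete, the following is a minimal Monte Carlo sketch of one episode of a Trusted Monitoring game. All parameter values, the Gaussian suspicion-score model, and the memoryless policies are illustrative assumptions rather than the paper's calibrated numbers: Blue audits high-suspicion solutions while its budget lasts, defers moderately suspicious ones to a trusted model, and runs the rest; safety is whether no unaudited backdoor is executed, and usefulness is the fraction of problems solved with the untrusted model's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_episode(num_problems=100, audit_budget=0.02,
                     backdoor_prob=0.05, audit_threshold=2.0, defer_threshold=1.0):
    """One play-through of a simple Trusted Monitoring game (illustrative only).

    Red (the untrusted model) occasionally submits a backdoored solution; the trusted
    monitor assigns each solution a suspicion score; Blue (the protocol) audits, defers
    to a trusted model, or runs the solution based on score thresholds and a budget.
    """
    audits_left = int(audit_budget * num_problems)
    useful, backdoor_ran = 0, False

    for _ in range(num_problems):
        is_backdoored = rng.random() < backdoor_prob  # Red's (memoryless) attack policy
        # Assumed suspicion-score model: backdoored code looks more suspicious on average.
        score = rng.normal(1.5, 1.0) if is_backdoored else rng.normal(0.0, 1.0)

        if score >= audit_threshold and audits_left > 0:
            audits_left -= 1
            if is_backdoored:
                break                  # audit catches the backdoor; episode ends safely
            useful += 1                # audited and clean: the solution is used
        elif score >= defer_threshold:
            pass                       # defer to the trusted model: safe but not useful
        else:
            useful += 1                # run the untrusted solution
            if is_backdoored:
                backdoor_ran = True    # an unaudited backdoor was executed
                break

    safety = 0.0 if backdoor_ran else 1.0     # episode-level safety outcome
    usefulness = useful / num_problems        # fraction of problems solved by the untrusted model
    return safety, usefulness

results = np.array([simulate_episode() for _ in range(2000)])
print("estimated safety:", results[:, 0].mean(), "usefulness:", results[:, 1].mean())
```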

Introduction
Background
Overview of AI control and safety in untrusted systems
Importance of formal models in decision-making frameworks
Objective
Objective of using AI-Control Games in protocol evaluation
Focus on multi-objective, partially observable, stochastic games
Method
Data Collection
Methods for collecting data on AI interactions and protocol effectiveness
Data Preprocessing
Techniques for preparing data for game model application
Game Model Development
Zero-Sum Game Reduction
Transformation of complex protocols into zero-sum games
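One way to picture the reduction: for a fixed weight on safety versus usefulness, Blue's weighted objective can be treated as the value of a two-player zero-sum game that Red minimizes. The toy payoff tables and policy names below are invented purely for illustration; the paper derives the actual values from the game model.

```python
# Toy payoff tables over tiny finite policy sets (illustrative numbers, not from the paper):
# SAFETY[(blue, red)] is the safety score of a Blue policy against a Red policy;
# usefulness depends only on Blue, since Red's attacks do not change how often Blue defers.
SAFETY = {("audit-heavy", "rare-attack"): 0.98, ("audit-heavy", "frequent-attack"): 0.95,
          ("audit-light", "rare-attack"): 0.90, ("audit-light", "frequent-attack"): 0.70}
USEFULNESS = {"audit-heavy": 0.80, "audit-light": 0.95}

def zero_sum_value(blue, red, weight):
    # Scalarize the two objectives with a fixed weight; Blue maximizes, Red minimizes.
    return weight * SAFETY[(blue, red)] + (1 - weight) * USEFULNESS[blue]

def maximin(weight, blues=("audit-heavy", "audit-light"), reds=("rare-attack", "frequent-attack")):
    # For each fixed weight the game is zero-sum: report Blue's best worst-case value.
    return max(min(zero_sum_value(b, r, weight) for r in reds) for b in blues)

for w in (0.0, 0.5, 1.0):
    print(f"weight={w}: maximin value {maximin(w):.3f}")
```

Sweeping the weight from 0 to 1 and solving each resulting zero-sum game is what traces out the safety-usefulness frontier discussed below.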
Game Models
'Original Trusted Monitor' and 'Weak Trusted Monitor'
Differences in trusted AI's suspicion score distribution
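The sketch below conveys the distinction with stand-in Gaussian distributions (the paper calibrates the real suspicion-score distributions empirically): the 'weak' monitor separates backdoored from benign code less cleanly, so fewer backdoors fall above any fixed audit threshold.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative stand-ins for the trusted monitor's suspicion-score distributions.
# The Gaussians only convey that the "weak" monitor's benign and backdoored scores overlap more.
MONITORS = {
    "original": {"benign": (0.0, 1.0), "backdoored": (2.5, 1.0)},   # well-separated scores
    "weak":     {"benign": (0.0, 1.0), "backdoored": (1.0, 1.0)},   # heavily overlapping scores
}

def sample_scores(monitor, backdoored, n=10_000):
    mu, sigma = MONITORS[monitor]["backdoored" if backdoored else "benign"]
    return rng.normal(mu, sigma, size=n)

def detection_rate_at_budget(monitor, audit_fraction=0.02):
    # If Blue audits the top `audit_fraction` of scores (judged against the benign
    # distribution), what fraction of backdoored solutions would be audited?
    threshold = np.quantile(sample_scores(monitor, backdoored=False), 1 - audit_fraction)
    return float((sample_scores(monitor, backdoored=True) >= threshold).mean())

for name in MONITORS:
    print(name, "monitor catches ~", round(detection_rate_at_budget(name), 3),
          "of backdoors at a 2% audit budget")
```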
Protocol Evaluation
Improvement over Empirical Studies
Comparison with protocols evaluated in prior empirical studies
New Settings Evaluation
Analysis of protocol effectiveness in novel scenarios
Safety-Usefulness Trade-off
Quantification of trade-offs between safety and protocol utility
Decision-Making Framework
Blue vs. Red Interactions
Formalization of interactions between the safety-oriented protocol (Blue) and the untrusted AI (Red)
Action Guidance
Strategies for auditing a solution, deferring to the trusted model, or running the untrusted solution
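A memoryless Blue policy of this kind can be written as a simple threshold rule over the monitor's suspicion score, with auditing rationed by a budget. The thresholds below are placeholders, not optimized values from the paper.

```python
from enum import Enum

class Action(Enum):
    AUDIT = "audit"    # have a human or trusted process inspect the solution
    DEFER = "defer"    # discard the untrusted solution and use the trusted model's instead
    RUN = "run"        # execute the untrusted model's solution as-is

def blue_action(suspicion_score, audits_remaining,
                audit_threshold=2.0, defer_threshold=1.0):
    """A memoryless threshold policy for Blue (illustrative thresholds only).

    Higher suspicion triggers stronger interventions; auditing stops once the budget is spent.
    """
    if suspicion_score >= audit_threshold and audits_remaining > 0:
        return Action.AUDIT
    if suspicion_score >= defer_threshold:
        return Action.DEFER
    return Action.RUN

print(blue_action(2.7, audits_remaining=3))  # Action.AUDIT
print(blue_action(1.4, audits_remaining=0))  # Action.DEFER
print(blue_action(0.2, audits_remaining=3))  # Action.RUN
```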
Game Analysis
Pareto Frontiers
Identification of optimal policy mixtures
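As an illustration of how such a frontier might be extracted, the sketch below filters a set of hypothetical (safety, usefulness) evaluations down to its non-dominated points; the numbers are made up, and in the paper such points come from solving the game under assumptions like the auditing budget and number of problems. Randomizing between frontier policies attains the convex combinations of their scores.

```python
import numpy as np

def pareto_frontier(points):
    """Return the non-dominated (safety, usefulness) points, sorted by increasing safety.

    A point is dominated if another point is at least as good on both objectives
    and strictly better on one.
    """
    frontier, best_usefulness = [], -np.inf
    for safety, usefulness in sorted(points, reverse=True):   # safest points first
        if usefulness > best_usefulness:                      # not dominated by a safer point
            frontier.append((safety, usefulness))
            best_usefulness = usefulness
    return frontier[::-1]

# Hypothetical evaluations of deterministic Blue policies under a fixed 2% audit budget
# (numbers invented for illustration; the paper computes these from the game model).
candidates = [(0.99, 0.55), (0.97, 0.62), (0.95, 0.70), (0.95, 0.60),
              (0.90, 0.78), (0.85, 0.80), (0.85, 0.75), (0.70, 0.92)]
frontier = pareto_frontier(candidates)
print("non-dominated policies:", frontier)

# Mixed (randomized) policies achieve any convex combination of frontier points,
# so the achievable region is bounded by the upper-right hull of the frontier.
p = 0.5
mix = tuple(p * np.array(frontier[0]) + (1 - p) * np.array(frontier[-1]))
print("50/50 mixture of the two extreme frontier policies:", mix)
```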
Protocol Properties Analysis
Examination of protocol characteristics under varying assumptions
Future Work
Broader Settings Application
Expansion of the model to more diverse AI-control scenarios
Memory-Dependent Protocols
Exploration of Trusted Monitoring protocols whose decisions depend on memory of earlier observations