A Stochastic Dynamical Theory of LLM Self-Adversariality: Modeling Severity Drift as a Critical Process

Jack David Carson · January 28, 2025

Summary

A continuous-time stochastic model describes how large language models can self-amplify bias or toxicity over the course of chain-of-thought reasoning. The model tracks a severity variable x(t) that evolves under a stochastic differential equation, and uses it to identify critical phenomena and phase transitions and to draw out implications for model stability and bias propagation. It distinguishes subcritical from supercritical paths, with a critical threshold marking the point at which bias becomes irreversible. The framework aims to enable formal verification of model stability and to guide adjustments that keep multi-step reasoning safe and free of misaligned outputs. Future work includes extending the model to multiple dimensions and incorporating memory kernels for more realistic context handling.
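To make the idea of a severity variable driven by a stochastic differential equation concrete, the following is a minimal Euler-Maruyama sketch. The summary does not state the drift or diffusion terms, so everything specific here is an assumption made for illustration: the bistable drift `rate * x * (x - theta) * (1 - x)`, the constant noise level `sigma`, the clamping of x to [0, 1], and the names `drift`, `simulate`, `theta`, and `rate` are hypothetical and not taken from the paper.

```python
import math
import random


def drift(x, theta=0.5, rate=4.0):
    # Hypothetical bistable drift (an assumption, not the paper's form):
    # negative for 0 < x < theta, positive for theta < x < 1.
    return rate * x * (x - theta) * (1.0 - x)


def simulate(x0, sigma=0.05, dt=0.01, steps=2000, seed=0, theta=0.5):
    """Euler-Maruyama integration of dx = drift(x) dt + sigma dW."""
    rng = random.Random(seed)
    x = x0
    path = [x]
    for _ in range(steps):
        # Brownian increment over one step: Gaussian with variance dt.
        dw = rng.gauss(0.0, math.sqrt(dt))
        x = x + drift(x, theta) * dt + sigma * dw
        # Keep severity in [0, 1]; this clamping is also an assumption.
        x = min(1.0, max(0.0, x))
        path.append(x)
    return path


if __name__ == "__main__":
    # Start one path below and one above the assumed threshold theta = 0.5.
    for x0 in (0.3, 0.7):
        final = simulate(x0)[-1]
        print(f"x0={x0:.1f} -> final severity {final:.3f}")
```

Under this assumed drift, paths that start below `theta` tend to decay toward low severity (the subcritical regime), while paths that start above it tend to climb toward high severity (the supercritical regime). Sweeping `x0` or `sigma` near `theta` is one way to explore how noise moves trajectories across the threshold.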
