Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification?

Fei Lin, Ziyang Gong, Cong Wang, Yonglin Tian, Tengchao Zhang, Xue Yang, Gen Luo, Fei-Yue Wang · June 12, 2025

Summary

ToxiMol is a benchmark for molecular toxicity repair in drug development: given a toxic molecule, a model must propose a less toxic alternative. Repair success is judged with ToxiEval, an automated evaluation framework that combines toxicity prediction, synthetic accessibility, drug-likeness, and structural similarity. The study assesses 30 models and highlights advances in toxicity understanding, adherence to semantic constraints, and structure-aware molecule editing. The related work it draws on in AI for drug discovery includes SMARTS patterns, multi-task graph learning, transformers, and graph convolutional networks, along with the 2023 work of Liu et al. on visual instruction tuning, which is cited as improving model success rates.

The 27B model performs best, with a 41.6% overall success rate reported across hERG tasks and toxicity endpoint classification. GPT variants, Claude, Gemini, Grok, and GLM are evaluated on 10 toxicity repair tasks, and the 27B model shows the highest average success rate. The TxGemma-9B-predict model, available on Hugging Face under specific terms of use, is used for toxicity prediction.
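The four criteria named in the summary can be read as a conjunction of checks on each candidate molecule. The following is a minimal sketch of that idea, not the paper's implementation: the class and function names, the threshold values, and the rule that a molecule counts as repaired when at least one candidate passes are all assumptions made for illustration. Scores are taken as precomputed inputs, so no cheminformatics library is needed.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CandidateScores:
    """Precomputed scores for one candidate molecule (hypothetical container)."""
    predicted_toxic: bool  # output of a toxicity endpoint classifier
    sa_score: float        # synthetic accessibility; lower means easier to make
    qed: float             # drug-likeness in [0, 1]; higher is better
    similarity: float      # structural similarity to the original, in [0, 1]


@dataclass(frozen=True)
class Thresholds:
    """Illustrative cut-offs; assumed values, not taken from the source."""
    max_sa_score: float = 6.0
    min_qed: float = 0.5
    min_similarity: float = 0.4


def is_successful_repair(s: CandidateScores, t: Thresholds = Thresholds()) -> bool:
    """A candidate passes only if every criterion is satisfied at once."""
    return (
        not s.predicted_toxic
        and s.sa_score <= t.max_sa_score
        and s.qed >= t.min_qed
        and s.similarity >= t.min_similarity
    )


def success_rate(
    candidates_per_molecule: Sequence[Sequence[CandidateScores]],
    t: Thresholds = Thresholds(),
) -> float:
    """Fraction of toxic molecules with at least one passing candidate.

    The "at least one candidate" rule is an assumption of this sketch.
    """
    if not candidates_per_molecule:
        return 0.0
    repaired = sum(
        any(is_successful_repair(c, t) for c in candidates)
        for candidates in candidates_per_molecule
    )
    return repaired / len(candidates_per_molecule)
```

Treating the criteria as a strict conjunction means a candidate that is non-toxic but hard to synthesize, or that drifts too far from the original structure, does not count as a repair, which is why an overall success rate can stay well below 100% even for strong models.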

Introduction
Background
Overview of molecular toxicity in drug development
Importance of creating less toxic alternatives
Objective
To assess and compare the effectiveness of 30 models in molecular toxicity repair using the ToxiEval framework
Method
Data Collection
Description of the ToxiMol dataset and the ToxiEval evaluation framework
Criteria for evaluating repair success
Data Preprocessing
Methods for handling and preparing the dataset
Results
Model Evaluation
Overview of 30 models assessed
Highlighting advancements in understanding toxicity, semantic constraints, and structure-aware molecule editing
AI in Drug Discovery
Focus on SMARTS, multi-task graph learning, transformers, and graph convolutional networks
Liu et al.'s Contribution
Description of visual instruction tuning (2023)
Improvement in model success rates
Top Performing Model
27B model's performance in hERG tasks and toxicity endpoint classification
Overall success rate of 41.6%
Comparative Analysis
GPT Variants
Evaluation of Claude, Gemini, Grok, and GLM in 10 toxicity repair tasks
27B Model
Comparison of average success rates across tasks
Model Availability
TxGemma-9B-predict
Description of the model
Terms of use
Conclusion
Summary of findings
Implications for drug development
Future directions in molecular toxicity repair