Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification?

Fei Lin, Ziyang Gong, Cong Wang, Yonglin Tian, Tengchao Zhang, Xue Yang, Gen Luo, Fei-Yue Wang · June 12, 2025

Summary

ToxiMol is a benchmark task for molecular toxicity repair in drug development: given a toxic molecule, a model must propose a less toxic structural alternative. Candidate repairs are scored with ToxiEval, an automated evaluation framework that judges repair success on toxicity prediction, synthetic accessibility, drug-likeness, and structural similarity. The study assesses 30 models and highlights advances in toxicity understanding, adherence to semantic constraints, and structure-aware molecule editing. Related work in AI-driven drug discovery includes SMARTS patterns, multi-task graph learning, transformers, and graph convolutional networks, as well as the visual instruction tuning introduced by Liu et al. in 2023, which is credited with improving model success rates.

Among the evaluated models, which include GPT variants, Claude, Gemini, Grok, and GLM, a 27B model shows the highest average success rate across the 10 toxicity repair tasks, reaching a 41.6% overall success rate, with hERG tasks and toxicity endpoint classification among its reported results. The TxGemma-9B-predict model, available on Hugging Face under specific terms of use, is used for toxicity prediction.
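The summary names four criteria for repair success and reports success rates per task and on average. A minimal sketch of how such a check could be combined is shown below. It assumes the four scores have already been computed by external tools; the class names, field names, and every threshold are hypothetical illustrations, not the values or the implementation used by ToxiEval.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CandidateScores:
    """Precomputed scores for one proposed repair (all fields are assumptions)."""
    toxicity_prob: float  # predicted probability that the candidate is toxic, in [0, 1]
    sa_score: float       # synthetic accessibility; lower means easier to synthesize
    drug_likeness: float  # drug-likeness score, assumed to lie in [0, 1]
    similarity: float     # structural similarity to the original molecule, in [0, 1]


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical cut-offs chosen only for illustration."""
    max_toxicity_prob: float = 0.5
    max_sa_score: float = 6.0
    min_drug_likeness: float = 0.5
    min_similarity: float = 0.4


def is_successful_repair(s: CandidateScores, t: Thresholds = Thresholds()) -> bool:
    """A repair counts as successful only if all four criteria pass together."""
    return (
        s.toxicity_prob < t.max_toxicity_prob
        and s.sa_score <= t.max_sa_score
        and s.drug_likeness >= t.min_drug_likeness
        and s.similarity >= t.min_similarity
    )


def success_rate(candidates: List[CandidateScores], t: Thresholds = Thresholds()) -> float:
    """Fraction of candidates in one task that pass every criterion."""
    if not candidates:
        return 0.0
    return sum(is_successful_repair(c, t) for c in candidates) / len(candidates)


def average_success_rate(per_task: Dict[str, List[CandidateScores]]) -> float:
    """Unweighted mean of the per-task success rates."""
    rates = [success_rate(cands) for cands in per_task.values()]
    return sum(rates) / len(rates) if rates else 0.0


if __name__ == "__main__":
    good = CandidateScores(toxicity_prob=0.1, sa_score=3.0, drug_likeness=0.7, similarity=0.6)
    drifted = CandidateScores(toxicity_prob=0.1, sa_score=3.0, drug_likeness=0.7, similarity=0.2)
    print(is_successful_repair(good))
    print(is_successful_repair(drifted))
    print(average_success_rate({"task_a": [good], "task_b": [drifted, drifted]}))
```

Requiring all four criteria at once reflects the idea that a repair which lowers predicted toxicity but drifts far from the original structure, or becomes hard to synthesize, should not count as a success. Whether the reported averages weight tasks equally is an assumption of this sketch.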

Introduction

Background

Overview of molecular toxicity in drug development

Importance of creating less toxic alternatives

Objective

To assess and compare the effectiveness of 30 models in molecular toxicity repair on the ToxiMol benchmark, using the ToxiEval evaluation framework

Method

Data Collection

Description of the ToxiMol dataset and the ToxiEval evaluation framework

Criteria for evaluating repair success

Data Preprocessing

Methods for handling and preparing the dataset

Results

Model Evaluation

Overview of 30 models assessed

Highlighting advancements in understanding toxicity, semantic constraints, and structure-aware molecule editing

AI in Drug Discovery

Focus on SMARTS, multi-task graph learning, transformers, and graph convolutional networks

Liu et al.'s Contribution

Description of visual instruction tuning (2023)

Improvement in model success rates

Top Performing Model

27B model's performance in hERG tasks and toxicity endpoint classification

Overall success rate of 41.6%

Comparative Analysis

General-Purpose Models

Evaluation of GPT variants, Claude, Gemini, Grok, and GLM in 10 toxicity repair tasks

27B Model

Comparison of average success rates across tasks

Model Availability

TxGemma-9B-predict

Description of the model

Conclusion

Summary of findings

Implications for drug development

Future directions in molecular toxicity repair

Basic info

papers

computation and language

artificial intelligence
