Drug repurposing offers a promising path to accelerate therapeutic discovery by identifying new uses for approved drugs, thereby reducing cost, risk, and development timelines (1, 2). We present a multi-modal, uncertainty-aware deep learning framework that integrates structured biomedical knowledge graphs, unstructured literature, and molecular structure data to predict drug potency and prioritize repurposing candidates. Our system, centered on the Calibrated-StoNet architecture, fuses embeddings from relational graph neural networks (GNNs), BioBERT (9), and MolBERT (8), and models predictive uncertainty using both heteroscedastic loss and Monte Carlo Dropout. An Engression module further calibrates uncertainty by modeling full conditional distributions. Evaluated on benchmark datasets, our model achieves strong regression performance (RMSE: 0.85), robust uncertainty calibration (ECE: 4.3%), and perfect top-10 ranking accuracy for candidate selection. These results demonstrate the value of principled uncertainty estimation and multi-source data fusion for high-confidence, interpretable AI-driven drug repurposing.
Drug repurposing — identifying new therapeutic uses for existing drugs — is an efficient strategy to reduce the time, cost, and risk associated with traditional drug discovery pipelines. In this talk, I’ll present an AI-based framework that combines open-source tools and multi-modal data fusion to predict drug potency and prioritize repurposing candidates with high confidence.
Our approach integrates three diverse data modalities: (1) structured biomedical knowledge graphs (e.g., drug-disease-gene relationships), (2) unstructured literature (scientific publications and clinical studies), and (3) molecular structure data. The model architecture — based on a calibrated, uncertainty-aware deep learning stack — fuses embeddings from Relational Graph Neural Networks (R-GNNs), BioBERT (for literature-based context), and MolBERT (for SMILES-based molecular representations).
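To make the fusion step concrete, here is a minimal PyTorch sketch of a late-fusion head that concatenates per-modality embeddings and maps them to a single potency prediction. This is an illustration only, not the Calibrated-StoNet architecture itself; the class name `FusionHead` and all embedding dimensions are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Illustrative late-fusion head (hypothetical, not the paper's model):
    concatenate R-GNN, BioBERT, and MolBERT embeddings, then regress potency."""

    def __init__(self, d_graph=128, d_text=768, d_mol=512, d_hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_graph + d_text + d_mol, d_hidden),
            nn.ReLU(),
            nn.Dropout(p=0.1),  # dropout here can double as MC Dropout at inference
            nn.Linear(d_hidden, 1),
        )

    def forward(self, z_graph, z_text, z_mol):
        # z_graph: R-GNN node embedding; z_text: BioBERT [CLS]; z_mol: MolBERT embedding
        z = torch.cat([z_graph, z_text, z_mol], dim=-1)
        return self.mlp(z)

# Random stand-in embeddings for a batch of 4 drugs
head = FusionHead()
y_hat = head(torch.randn(4, 128), torch.randn(4, 768), torch.randn(4, 512))
print(y_hat.shape)  # torch.Size([4, 1])
```

In practice each encoder would be pretrained (or fine-tuned) separately, with the fusion head trained on the downstream potency targets.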
A key innovation is the use of uncertainty modeling to improve interpretability and trustworthiness of predictions. We incorporate heteroscedastic loss functions and Monte Carlo Dropout to estimate predictive variance, while an additional “Engression” module calibrates these uncertainty estimates by modeling full conditional distributions. This ensures not only accuracy but also well-calibrated confidence levels — critical for high-stakes domains like drug discovery.
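The two uncertainty mechanisms above can be sketched in a few lines of PyTorch. This is a generic illustration under stated assumptions (the toy model, dimensions, and helper names are hypothetical): a heteroscedastic Gaussian negative log-likelihood lets the network predict a per-sample variance, and MC Dropout keeps dropout active at inference so the spread across stochastic forward passes approximates epistemic uncertainty.

```python
import torch
import torch.nn as nn

def heteroscedastic_nll(mu, log_var, y):
    """Gaussian NLL with a predicted per-sample log-variance: the network can
    down-weight noisy targets by assigning them larger variance."""
    return (0.5 * (log_var + (y - mu) ** 2 / log_var.exp())).mean()

def mc_dropout_predict(model, x, n_samples=50):
    """Average stochastic forward passes with dropout left on; the standard
    deviation across passes is a rough epistemic-uncertainty estimate."""
    model.train()  # keeps Dropout layers active
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)

# Toy network emitting both a mean and a log-variance (hypothetical dimensions)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Dropout(0.2), nn.Linear(32, 2))
x, y = torch.randn(8, 16), torch.randn(8, 1)
out = model(x)
mu, log_var = out[:, :1], out[:, 1:]
loss = heteroscedastic_nll(mu, log_var, y)
mean, epistemic_std = mc_dropout_predict(model, x)
```

Predicting `log_var` rather than the variance directly keeps the variance positive without constraints; the Engression-style calibration step would then operate on these raw uncertainty estimates.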
All components are built using open-source scientific computing tools in Python, including PyTorch, DGL, HuggingFace Transformers, and Scikit-learn, allowing for full reproducibility and community-driven development. Our model is evaluated on benchmark datasets, showing promising results: strong regression accuracy (RMSE: 0.85), robust uncertainty calibration (ECE: 4.3%), and perfect top-10 ranking precision in repurposing candidate identification.
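As a pointer to how a calibration metric like ECE can be computed for a regression model, here is a NumPy sketch of one common recipe (an assumption about methodology, not necessarily the exact metric used in this work): for Gaussian predictive distributions, compare the nominal coverage of central credible intervals against the empirical coverage, averaged over a grid of levels.

```python
import numpy as np
from math import erf, sqrt

def regression_ece(mu, sigma, y, n_bins=10):
    """Calibration error for Gaussian predictions: mean absolute gap between
    nominal and empirical coverage of central credible intervals."""
    z = (y - mu) / sigma
    # Probability integral transform via the standard-normal CDF (erf avoids SciPy)
    pit = np.array([0.5 * (1 + erf(v / sqrt(2))) for v in z])
    levels = np.linspace(0.05, 0.95, n_bins)
    # A central interval at level L corresponds to |PIT - 0.5| <= L/2
    empirical = np.array([np.mean(np.abs(pit - 0.5) <= lvl / 2) for lvl in levels])
    return float(np.mean(np.abs(empirical - levels)))

rng = np.random.default_rng(0)
mu = rng.normal(size=1000)
y = mu + rng.normal(scale=1.0, size=1000)      # noise matches the claimed sigma
ece_good = regression_ece(mu, np.ones(1000), y)
ece_bad = regression_ece(mu, 0.3 * np.ones(1000), y)  # overconfident predictions
```

A well-specified noise model yields a small ECE, while the overconfident variant (predicted sigma much smaller than the true noise) scores substantially worse, which is exactly the failure mode calibration metrics are meant to expose.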
This work highlights the value of principled uncertainty estimation and data integration across modalities for biomedical AI. The talk will also cover challenges in working with heterogeneous biomedical data, lessons learned in model calibration, and future directions for extending this research in the open-source ecosystem.
Attendees interested in machine learning, graph modeling, or scientific applications of NLP will gain insights into how FOSS tools can be orchestrated to tackle impactful real-world problems in healthcare and drug discovery.