[2411.03537] Two-Stage Pretraining for Molecular Property Prediction in the Wild

by Kevin Tirta Wijaya and 5 other authors

Abstract: Molecular deep learning models have achieved remarkable success in property prediction, but they often require large amounts of labeled data. The challenge is that, in real-world applications, labels are extremely scarce, as obtaining them through laboratory experimentation is both expensive and time-consuming. In this work, we introduce MoleVers, a versatile pretrained molecular model designed for various types of molecular property prediction in the wild, i.e., where experimentally validated labels are scarce. MoleVers employs a two-stage pretraining strategy. In the first stage, it learns molecular representations from unlabeled data through masked atom prediction and extreme denoising, a novel task enabled by our newly introduced branching encoder architecture and dynamic noise scale sampling. In the second stage, the model refines these representations through predictions of auxiliary properties derived from computational methods, such as density functional theory or large language models. Evaluation on 22 small, experimentally validated datasets demonstrates that MoleVers achieves state-of-the-art performance, highlighting the effectiveness of its two-stage framework in producing generalizable molecular representations for diverse downstream properties.
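The two stage-1 objectives described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's actual implementation: the function name, the mask ratio, and the noise scale range are all assumptions made for the example.

```python
import numpy as np

def stage1_pretraining_batch(coords, atom_types, mask_ratio=0.15,
                             noise_scale_range=(0.1, 2.0), rng=None):
    """Sketch of the two first-stage objectives: masked atom prediction
    and denoising with a per-molecule noise scale (dynamic noise scale
    sampling). All hyperparameter values here are illustrative guesses."""
    rng = rng or np.random.default_rng()
    n = len(atom_types)

    # Masked atom prediction: hide a random fraction of atom types;
    # the model is trained to recover them from the remaining context.
    mask = rng.random(n) < mask_ratio
    masked_types = atom_types.copy()
    masked_types[mask] = -1  # -1 stands in for a [MASK] token

    # Dynamic noise scale sampling: draw the noise scale per molecule,
    # so the denoising task ranges from mild to "extreme" perturbations
    # rather than using one fixed scale.
    sigma = rng.uniform(*noise_scale_range)
    noisy_coords = coords + sigma * rng.standard_normal(coords.shape)

    # A model would take (masked_types, noisy_coords) as input and be
    # supervised on the original atom_types and coords.
    return masked_types, mask, noisy_coords, sigma
```

In a training loop, the targets for the denoising branch would be the clean coordinates `coords`, and the targets for the masked-atom branch would be `atom_types[mask]`.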

Submission history

From: Kevin Tirta Wijaya
[v1]
Tue, 5 Nov 2024 22:36:17 UTC (726 KB)
[v2]
Fri, 18 Jul 2025 13:53:09 UTC (550 KB)
