Accurate Chemistry Collection: Coupled cluster atomization energies for broad chemical space

Thermochemical data with sub-chemical accuracy (i.e., within ±1 kcal/mol of sufficiently accurate experimental or theoretical reference data) are essential for the development and improvement of computational chemistry methods. Challenging thermochemical properties such as heats of formation and total atomization energies (TAEs) are of particular interest because they rigorously test the ability of computational chemistry methods to accurately describe complex chemical transformations involving multiple bond rearrangements. Yet existing thermochemical datasets that confidently reach this level of accuracy are limited in either size or scope: datasets with highly accurate reference values include only a small number of data points, whereas larger datasets provide less accurate data or cover only a narrow portion of chemical space. The existing datasets are therefore insufficient for developing data-driven methods with predictive accuracy over a large chemical space. The Microsoft Research Accurate Chemistry Collection (MSR-ACC) will address this challenge. Its first release is the MSR-ACC/TAE25 dataset of 76,879 total atomization energies obtained at the CCSD(T)/CBS level via the W1-F12 thermochemical protocol. The dataset is constructed to exhaustively cover chemical space for all elements up to argon by enumerating and sampling chemical graphs, thus avoiding bias towards any particular subspace of chemical space (such as drug-like, organic, or experimentally observed molecules). With this first dataset in MSR-ACC, we enable data-driven approaches for developing predictive computational chemistry methods with unprecedented accuracy and scope.
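To make the two quantities in the abstract concrete, the following is a minimal sketch of how a total atomization energy is conventionally defined (the sum of the isolated-atom energies minus the energy of the molecule) and how a prediction can be checked against the ±1 kcal/mol threshold. The function names and the numeric inputs are hypothetical placeholders chosen for illustration; they are not taken from MSR-ACC/TAE25 and do not reflect the W1-F12 protocol itself.

```python
# Conversion factor from hartree to kcal/mol (rounded standard value).
HARTREE_TO_KCAL_PER_MOL = 627.509474


def total_atomization_energy(molecule_energy, atom_energies):
    """Return the TAE in kcal/mol.

    molecule_energy: electronic energy of the molecule, in hartree.
    atom_energies: energies of the isolated constituent atoms, in hartree.
    """
    return (sum(atom_energies) - molecule_energy) * HARTREE_TO_KCAL_PER_MOL


def within_chemical_accuracy(predicted, reference, threshold=1.0):
    """True if |predicted - reference| is at most `threshold` kcal/mol."""
    return abs(predicted - reference) <= threshold


if __name__ == "__main__":
    # Placeholder energies for a hypothetical diatomic; NOT real data.
    tae = total_atomization_energy(-1.5, [-0.5, -0.5])
    print(f"TAE = {tae:.3f} kcal/mol")
    # Compare a hypothetical model prediction with a hypothetical reference.
    print(within_chemical_accuracy(predicted=100.4, reference=101.0))
```

A positive TAE under this sign convention means the molecule is bound relative to its separated atoms. The threshold argument defaults to the 1 kcal/mol figure quoted above and can be tightened when comparing methods.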

Publication Downloads

Microsoft Research Accurate Chemistry Collection (MSR-ACC)

June 24, 2025

The Skala functional will enable more accurate, scalable predictions in computational chemistry. It starts with the largest high-accuracy dataset ever built for training deep-learning-based density functional theory (DFT) models. This dataset underpins Skala (coming soon to the Azure AI Foundry catalog), a new machine-learned exchange-correlation functional that reaches experimental accuracy for atomization energies.

What is Density Functional Theory (DFT)?

In this video, Microsoft’s Chris Bishop, Technical Fellow and Director of Microsoft Research AI for Science, explains how Microsoft researchers achieved a breakthrough in the accuracy of density functional theory (DFT) and the challenges they faced. Scientists worldwide use DFT to calculate the properties of molecules and materials. The researchers generated a vast dataset, two orders of magnitude larger than anything scientists had used previously, and then combined it with the power of deep learning. The result is the world’s first deep-learning exchange-correlation (XC) functional, which achieves high accuracy without sacrificing speed. Microsoft’s new deep-learning-powered DFT model has the potential to advance and accelerate scientific discovery in areas such as clean energy, semiconductor technology, and medicine.