Dataset
=======

To train and evaluate machine learning models for molecular systems, high-quality datasets are essential. These datasets typically consist of molecular structures along with their corresponding physical properties, such as energies and forces, computed using quantum mechanical methods.

Several datasets for molecular systems
--------------------------------------

Several high-quality datasets for molecular systems are publicly available. Three of the most prominent are:

- **QM9 dataset**
- **ANI-2x dataset**
- **SPICE dataset**

The QM9 dataset [1], often the first benchmark used in molecular machine learning, provides a large number of computed properties for small organic molecules with up to nine heavy atoms plus their hydrogen atoms, covering five elements in total (C, H, O, N, F). It contains approximately 134,000 stable molecules with properties such as atomization energies, dipole moments, and vibrational frequencies, calculated using density functional theory (DFT) at the B3LYP/6-31G(2df,p) level of theory. However, it does not provide forces, which are crucial for training models intended for molecular dynamics simulations.

The SPICE dataset [2] and the ANI-2x dataset [3] both provide energies and forces for a wide variety of organic molecules. Both are limited to a fixed set of chemical elements (e.g., H, C, N, O, F, S, Cl for ANI-2x) and may not cover the full diversity of chemical space needed for certain applications. Nevertheless, both datasets are suitable for training machine learning models for molecular systems.

Since both datasets contain a large number of data points, training models on the full datasets is time-consuming. Therefore, for initial testing and prototyping, we create a smaller subset of the SPICE dataset containing only molecules / systems with up to 110 atoms. In the following, this subset is referred to as the "a_wpS" dataset (a: initial subset; wpS: with pubSolv, i.e. mostly solvated amino acids from the PubChem database [4]).

a_wpS Dataset
-------------

Properties of the dataset:

- **Format:** HDF5. Additionally, .xyz and .pt files were created for easier data loading.
- **Number of molecules / systems:** 13,909
- **Chemical elements included:** H, C, N, O, F, P, S, Cl, Br, I
- **Maximum number of atoms per molecule / system:** 110
- **Provided properties:**

  - element types / species per atom
  - positions (3D coordinates) per atom
  - total energy of the molecule
  - forces per atom and Cartesian component

References
----------

- **[1] QM9** (Quantum Machine 9): L. Ruddigkeit, R. van Deursen, L. C. Blum, J.-L. Reymond, Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17, J. Chem. Inf. Model. 52, 2864–2875, 2012. R. Ramakrishnan, P. O. Dral, M. Rupp, O. A. von Lilienfeld, Quantum chemistry structures and properties of 134 kilo molecules, Scientific Data 1, 140022, 2014.
- **[2] SPICE**: P. Eastman, P. K. Behara, D. L. Dotson, et al., SPICE, A Dataset of Drug-like Molecules and Peptides for Training Machine Learning Potentials, 2023. *DOI: 10.1038/s41597-022-01882-6*
- **[3] ANI-2x**: C. Devereux, J. S. Smith, K. K. Huddleston, K. Barros, R. Zubatyuk, O. Isayev, A. E. Roitberg, Extending the Applicability of the ANI Deep Learning Molecular Potential to Sulfur and Halogens, 2020. *DOI: 10.1021/acs.jctc.0c00121*
- **[4] PubChem**: Open Chemistry Database, National Institutes of Health.
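
Example: filtering by atom count
--------------------------------

The size criterion used for the a_wpS subset (at most 110 atoms per system) can be illustrated with a minimal sketch. It is not the script that was used to build the subset. The function name ``iter_small_systems`` and the key name ``atomic_numbers`` are assumptions made for illustration, and the toy dictionary only stands in for an opened HDF5 file.

```python
from typing import Any, Iterator, Mapping, Tuple

MAX_ATOMS = 110  # size limit stated for the a_wpS subset


def iter_small_systems(
    systems: Mapping[str, Any],
    max_atoms: int = MAX_ATOMS,
    species_key: str = "atomic_numbers",  # assumed key name, adjust to the actual file layout
) -> Iterator[Tuple[str, Any]]:
    """Yield (name, record) pairs whose atom count is at most max_atoms."""
    for name in systems:
        record = systems[name]
        # One species entry per atom, so the length equals the atom count.
        n_atoms = len(record[species_key])
        if n_atoms <= max_atoms:
            yield name, record


# Toy stand-in for an opened HDF5 file: system name -> per-system record.
toy = {
    "water": {"atomic_numbers": [8, 1, 1]},
    "big": {"atomic_numbers": [6] * 111},   # one atom over the limit
    "edge": {"atomic_numbers": [1] * 110},  # exactly at the limit
}

kept = [name for name, _ in iter_small_systems(toy)]
print(kept)  # ['water', 'edge']
```

Because the function only relies on iteration, indexing by name, and ``len()``, an ``h5py.File`` opened in read mode can be passed in place of the dictionary, provided each top-level group holds a per-atom species dataset under the assumed key.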