Recently we published on the curation of physicochemical data sets that were then made available as Open Data. The work was reported in:
“An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modeling, K. Mansouri, C. Grulke, R. Judson and A.J. Williams, SAR and QSAR in Environmental Research, 2016, Volume 27, Issue 11, Pages 911-937, http://dx.doi.org/10.1080/1062936X.2016.1253611”
The data have since been modeled using a different approach from the one we used, and that work is now reported in http://dx.doi.org/10.1021/acs.jcim.6b00625:
“In Silico Prediction of Physicochemical Properties of Environmental Chemicals Using Molecular Fingerprints and Machine Learning, Q. Zang, K. Mansouri, A.J. Williams, R.S. Judson, D.G. Allen, W.M. Casey, and N.C. Kleinstreuer, J. Chem. Inf. Model., 2017, 57 (1), pp 36–49”
The abstract for the article is below:
There are little available toxicity data on the vast majority of chemicals in commerce. High-throughput screening (HTS) studies, such as those being carried out by the U.S. Environmental Protection Agency (EPA) ToxCast program in partnership with the federal Tox21 research program, can generate biological data to inform models for predicting potential toxicity. However, physicochemical properties are also needed to model environmental fate and transport, as well as exposure potential. The purpose of the present study was to generate an open-source quantitative structure–property relationship (QSPR) workflow to predict a variety of physicochemical properties that would have cross-platform compatibility to integrate into existing cheminformatics workflows. In this effort, decades-old experimental property data sets available within the EPA EPI Suite were reanalyzed using modern cheminformatics workflows to develop updated QSPR models capable of supplying computationally efficient, open, and transparent HTS property predictions in support of environmental modeling efforts. Models were built using updated EPI Suite data sets for the prediction of six physicochemical properties: octanol–water partition coefficient (logP), water solubility (logS), boiling point (BP), melting point (MP), vapor pressure (logVP), and bioconcentration factor (logBCF). The coefficient of determination (R²) between the estimated values and experimental data for the six predicted properties ranged from 0.826 (MP) to 0.965 (BP), with model performance for five of the six properties exceeding those from the original EPI Suite models. The newly derived models can be employed for rapid estimation of physicochemical properties within an open-source HTS workflow to inform fate and toxicity prediction models of environmental chemicals.
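For readers less familiar with the statistic quoted in the abstract, here is a minimal sketch of how a coefficient of determination between experimental and estimated values can be computed. It uses the common definition 1 - SS_res/SS_tot; the paper may define it differently (for example as a squared correlation coefficient), so treat this as an illustration rather than the authors' procedure. The function name `r_squared` and the logP values are made up for the example and are not taken from the curated data sets.

```python
def r_squared(experimental, predicted):
    """Coefficient of determination, computed as 1 - SS_res / SS_tot."""
    if len(experimental) != len(predicted) or len(experimental) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_exp = sum(experimental) / len(experimental)
    # Residual sum of squares: disagreement between experiment and prediction
    ss_res = sum((y - p) ** 2 for y, p in zip(experimental, predicted))
    # Total sum of squares: spread of the experimental values around their mean
    ss_tot = sum((y - mean_exp) ** 2 for y in experimental)
    if ss_tot == 0:
        raise ValueError("experimental values are all identical")
    return 1.0 - ss_res / ss_tot


# Hypothetical logP values, for illustration only
experimental_logp = [1.0, 2.0, 3.0, 4.0]
predicted_logp = [1.5, 2.0, 3.0, 3.5]

print(round(r_squared(experimental_logp, predicted_logp), 3))  # 0.9
```

A value close to 1 means the estimates track the experimental data closely, which is how the 0.826 (MP) to 0.965 (BP) range in the abstract should be read.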