I wasn’t aware of the NCGC Pharmaceutical Collection Browser until today. The work behind the development of the database and the browser is described in this publication:
R. Huang, N. Southall, Y. Wang, A. Yasgar, P. Shinn, A. Jadhav, D.-T. Nguyen, C. P. Austin. The NCGC Pharmaceutical Collection: A Comprehensive Resource of Clinically Approved Drugs Enabling Repurposing and Chemical Genomics. Science Translational Medicine, 2011; 3 (80): 80ps16. DOI: 10.1126/scitranslmed.3001862
Whenever a new database comes online, my first concern is data quality. To get a feel for it I looked at the HTS amenable compounds subset, a dataset of >7600 compounds. I ran a couple of very simple filters to identify potential issues, looking in particular for missing, partial or conflicting stereochemistry. The filters also checked for valence issues and charge imbalance. Based on these checks I estimate that, for the HTS amenable compounds at least, errors affect a minimum of 5% and probably over 10% of the data. This is only an estimate, of course, and it would be a lot of work to clean it all up. I’ll try to take a look at the entire database shortly.
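To give a flavour of what a very simple charge-balance filter can look like, here is a minimal sketch in Python. It is not the filter used to produce the estimate above. It assumes each record is available as a SMILES string, reads only the formal charges written on bracket atoms, and uses made-up record identifiers and function names (`net_formal_charge`, `flag_charge_imbalance`) purely for illustration.

```python
import re

# One bracket atom in a SMILES string, e.g. [Na+], [O-], [NH4+], [Fe+3].
BRACKET_ATOM = re.compile(r"\[([^\[\]]+)\]")
# Charge inside a bracket atom: a run of signs ("++") or a sign plus a count ("+2").
CHARGE = re.compile(r"(\++|-+)(\d*)")


def net_formal_charge(smiles):
    """Sum the formal charges written on the bracket atoms of a SMILES string."""
    total = 0
    for atom in BRACKET_ATOM.findall(smiles):
        match = CHARGE.search(atom)
        if match is None:
            continue  # neutral bracket atom such as [C@@H] or [13CH4]
        signs, count = match.groups()
        sign = 1 if signs[0] == "+" else -1
        magnitude = int(count) if count else len(signs)
        total += sign * magnitude
    return total


def flag_charge_imbalance(records):
    """Return (identifier, net charge) for every record that is not neutral overall."""
    flagged = []
    for identifier, smiles in records:
        charge = net_formal_charge(smiles)
        if charge != 0:
            flagged.append((identifier, charge))
    return flagged


# Hypothetical records for illustration only; none are taken from the NPC data.
records = [
    ("example-1", "CC(=O)[O-].[Na+]"),    # sodium acetate, balanced
    ("example-2", "CC(=O)[O-]"),          # acetate with no counterion
    ("example-3", "C[N+](C)(C)C"),        # tetramethylammonium with no counterion
    ("example-4", "[Ca+2].[Cl-].[Cl-]"),  # calcium chloride, balanced
]

flagged = flag_charge_imbalance(records)
flagged_fraction = len(flagged) / len(records)
print(flagged, flagged_fraction)
```

A flagged record is only a candidate for review, since some species are legitimately registered as ions. The stereochemistry and valence checks are harder to sketch this way and really need a proper cheminformatics toolkit.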
Some examples of the errors I saw are below… Unfortunately there are many hundreds of errors in just this subset of the database. We keep creating databases, and in this case a 90 MB desktop browser solution, but WHO is curating and checking the data? How much keeps getting invested in developing software relative to building quality datasets to use in the various systems? And so it continues….
Charge Balance Issues
Imperfect, Absent and Incorrect Stereochemistry
Valence Issues
And, just to clarify… I am not saying that our own database, ChemSpider, is perfect. It’s not. But the crowds can help us improve it by curating the data online, with immediate effect. One thing I DO like is that the developers thought ahead about getting immediate feedback, as shown below. Unfortunately, when I tried to use it, it returned an error message, so I don’t know whether my feedback got through. I hope to get a response at my email address.