Zen and the Art of Motorcycle Maintenance…for those of you interested in a discussion about Quality, this is a great read (or a great listen on a long ride with an audiobook). Pirsig discusses the Metaphysics of Quality as a philosophy, a theory about reality, and asks questions such as what is real, what is good and what is moral. Phaedrus, the narrator's former self, named after the character from the Plato dialogue of the same name, criticizes his instructors for poorly educating the students.
Now, poorly educating the students is an issue, and this concern has certainly been raised recently with regard to the ChemSpider system. The question this leads to is about Quality: the Quality of LARGE public domain databases. ChemSpider is one example and is not without challenges…to be expected with over 10 million compounds. However, as shown in a couple of specific posts about sodium chloride and, recently, Prussian blue, issues regarding the judgment of quality abound for both our system and other databases.
Rich Apodaca has recently posted a request for information about new free access/free speech/free beer databases to follow on from his very popular posting regarding 32 free chemistry databases. For those of you who do not frequent Depth-First, I HIGHLY recommend a browse…it is one of my top sites for commentary on our domain. There will be a number of databases submitted for inclusion in Rich’s next list. However, the question I will have then will be about Quality. It is a concern whenever we choose whether or not to post certain content…in our opinion there should actually be a quality flag that depends on the data source. Some sources are simply better than others.
We are already in the process of curating the ChemSpider content, both ourselves and with the assistance of some dedicated individuals. Clearly there are issues with some of the content within the index. With 10 million structures, what is one to expect? We believe the database is set to double in size over the next couple of months and, in parallel, the number of potential errors will grow too.
So, the questions I have tumbling around my rather non-Zen brain this time of night are:
1) Assuming perfection is not feasible and errors will occur in a large free database of millions of structures, what level of error/misinformation is acceptable? This is, after all, an issue of cost versus quality in many cases. If you were paying $50 per search, the expectations of quality would be much higher, I would assume.
2) There are different criteria for quality for different data: it may be acceptable to have a poor predicted property for a compound since it IS a prediction, but what if the structure itself is wrong, one stereocenter is mislabeled, or one trade name is misspelled? Can you identify which data must be of the highest quality, and for which data it is acceptable to have errors?
3) What are people's experiences of other large free databases? There are many out there, as posted in Richard Apodaca’s list. What is the quality like?
4) Which public domain free online database is the gold standard by which others should be measured? How good is the database? What level of error content? What type of errors?
Any other commentary is welcomed. The question I posit is “How is Quality measured in terms of public domain free online databases?”
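One possible way to put a number on that question, offered only as a sketch: have a curator check a random sample of records, count the errors of each type, and report the observed error rate with a confidence interval rather than as a single figure. The Python below uses the Wilson score interval for this. The function names (`wilson_interval`, `error_rates_by_type`), the error categories and the sample counts are all hypothetical illustrations and do not come from ChemSpider or any other database.

```python
from __future__ import annotations

import math


def wilson_interval(errors: int, sample_size: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an error proportion.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= errors <= sample_size:
        raise ValueError("errors must be between 0 and sample_size")

    p = errors / sample_size
    z2 = z * z
    denom = 1 + z2 / sample_size
    centre = (p + z2 / (2 * sample_size)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / sample_size + z2 / (4 * sample_size**2)
    ) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


def error_rates_by_type(
    error_counts: dict[str, int], sample_size: int
) -> dict[str, tuple[float, float, float]]:
    """Map each error type to (observed rate, lower bound, upper bound)."""
    report = {}
    for error_type, count in error_counts.items():
        low, high = wilson_interval(count, sample_size)
        report[error_type] = (count / sample_size, low, high)
    return report


if __name__ == "__main__":
    # Hypothetical curation result: 500 randomly sampled records checked by hand.
    counts = {"wrong structure": 12, "mislabeled stereocenter": 20, "misspelled name": 35}
    for error_type, (rate, low, high) in error_rates_by_type(counts, 500).items():
        print(f"{error_type}: {rate:.1%} (95% interval {low:.1%} to {high:.1%})")
```

Reporting each error type separately reflects question 2: a structure error and a misspelled trade name probably do not deserve the same tolerance, so a single overall error rate would hide the distinction that matters.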