Over the past 4 weeks I have been involved with some new and old friends in the world of chemistry to initiate an analysis of “quality” in public chemistry resources. This is work in progress and involves 3 separate groups (let’s call them labs) looking at various resources. Here’s a short description of the project.
The questions we are attempting to answer are:
Core question: what is the quality of data online in public chemistry databases? How accurate and unambiguous is the representation of chemical structures and their measured properties in public chemistry databases?
We’ve started with the top 200 selling drugs. The three labs had to come to an agreement about which of the top 200 drugs were small molecules (many of the top 200 are monoclonal antibodies or polymers, for example). We then had to decide whether we would deal with mixtures and combination drugs. Just identifying the list of NAMES of drugs we wanted to deal with was an iterative negotiation.
Then we decided that each lab would work independently and that each lab would have at least two members working on the same problem independently. That way we would have both intra-lab and inter-lab comparisons. We decided to start on a set of 10 drug names, using the GENERIC name as the name to work from. I started my part of the work just before I had to give a presentation at the EBI last week and was able to gather a lot of the data before the talk.
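To make the intra-lab and inter-lab comparisons concrete, here is a minimal sketch of how agreement on a single drug name might be tabulated. It is only an illustration and not the procedure the labs actually used: the function name `agreement_report`, the lab and curator labels, and the structure identifiers (`"S1"`, `"S2"`) are all hypothetical placeholders, and it assumes each curator's answer has already been reduced to one comparable identifier.

```python
from itertools import combinations


def agreement_report(assignments):
    """Summarize agreement for ONE drug name.

    assignments maps lab -> {curator -> structure identifier}.
    All labels and identifiers are hypothetical placeholders.
    """
    # Intra-lab: do all curators within a lab report the same structure?
    intra = {
        lab: len(set(by_curator.values())) == 1
        for lab, by_curator in assignments.items()
    }

    # Inter-lab: two labs agree only if each is internally consistent
    # and both settled on the same structure.
    inter = {}
    for lab_a, lab_b in combinations(sorted(assignments), 2):
        ids_a = set(assignments[lab_a].values())
        ids_b = set(assignments[lab_b].values())
        inter[(lab_a, lab_b)] = len(ids_a) == 1 and ids_a == ids_b

    return intra, inter


# Made-up example data, not results from the project.
example = {
    "lab1": {"curator1": "S1", "curator2": "S1"},
    "lab2": {"curator1": "S1", "curator2": "S2"},
    "lab3": {"curator1": "S1", "curator2": "S1"},
}
intra, inter = agreement_report(example)
```

In this made-up example lab2 disagrees internally, so it cannot be counted as agreeing with either of the other labs until its two curators reconcile their answers.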
Starting with a chemical name, how do you determine what the “correct” structure for that drug is? Think it’s easy? Try it! Where would you look? How would you confirm? What would the iterative loop look like in order for YOU to assert the chemical structure(s) for the drug “Vytorin”?
For me the process looks something like this. Use a level of “trust and experience” with previously used resources as a starting point and declare “This is the structure of X based on searching on the drug name for X”. Now, cross-reference, iterate, reiterate, find consistencies and collisions in order to come up with a final assertion, a list of consistent structures and the associated sources, and a set of other resources with inconsistent structures and a list of why they differ. Where possible, and if necessary, make edits to change the information (e.g. ChemSpider and Wikipedia). You can see an example of this for Vitamin K in the talk I gave at EBI here. For ten structures I came up with a number of observations for a number of drugs. The screenshot below summarizes some of the results (Click on the image to see the detail).
The table represents the following information, which was true at the time of the search and may already be out of date:
1) A search for thalidomide in ChEBI gives no hits
2) The structures of Zocor and Crestor on Drugbank are incorrect
3) There are no hits for Voglibose and Crestor on Common Chemistry
4) There were 3 incorrect structures on ChemSpider (now edited of course)
5) For most searches on a drug name on PubChem there are MULTIPLE hits and, for the set examined, the correct structure is in the results set. For example, there are 44 structures of Taxol retrieved with the search and the one I assert to be correct is there.
6) There were two incorrect structures on Wikipedia and one drug without an associated structure.
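The cross-referencing step that produces observations like these (group the sources that agree, flag the ones that collide, note the ones with no hit) can be sketched in a few lines. This is a rough illustration under stated assumptions, not the process I actually followed: the function `cross_reference`, the source names and the structure identifiers are hypothetical, it assumes every structure has been reduced to a comparable identifier, and the "most common structure" is only a candidate for a human to review, not an assertion of correctness.

```python
from collections import defaultdict


def cross_reference(structures_by_source):
    """Group sources by the structure they report for one drug name.

    structures_by_source maps source name -> structure identifier,
    or None when the search gave no hit. All names are hypothetical.
    """
    groups = defaultdict(list)
    missing = []
    for source, structure_id in structures_by_source.items():
        if structure_id is None:
            missing.append(source)
        else:
            groups[structure_id].append(source)

    if not groups:
        return None, [], {}, sorted(missing)

    # The most frequently reported structure is only a CANDIDATE.
    # A majority can still be wrong, so it needs manual confirmation.
    # Ties resolve to whichever structure was seen first.
    candidate = max(groups, key=lambda s: len(groups[s]))
    consistent = sorted(groups[candidate])
    collisions = {
        source: structure_id
        for structure_id, sources in groups.items()
        if structure_id != candidate
        for source in sources
    }
    return candidate, consistent, collisions, sorted(missing)


# Made-up example data, not results from the project.
hits = {
    "source_a": "STRUCT-1",
    "source_b": "STRUCT-1",
    "source_c": "STRUCT-2",
    "source_d": None,
    "source_e": "STRUCT-1",
}
candidate, consistent, collisions, missing = cross_reference(hits)
```

The collisions are the interesting part: each one needs a reason recorded for why it differs (stereochemistry, salt form, a different compound entirely) before anything is edited at the source.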
When I started the work I had a “trust” level for a number of the databases. My basic position at that time was as follows. I could rarely find the correct structure for a drug based on a text-based search of PubChem. I would generally find a set of hits and it was a lot of work to determine what was correct. Common Chemistry was excellent…but limited. DailyMed was generally good but structure representations could be abysmal. ChEBI, DrugBank, ChemIDPlus and Wikipedia were generally VERY good. Of all of the sources I used, despite the rich data on PubChem, I struggled most with this resource to find the correct structure. The results started to show that my trust perceptions were being challenged.
In parallel with the work to prepare this small dataset for the presentation at EBI I decided that it was appropriate to ask the community for their views on some of the databases I was looking at in this work…specifically asking how much they “trusted” a resource. Trust means different things to different people. The word, and the question I was asking in the survey, would be interpreted in different ways. But that’s the way we work…so why fight it? The survey is online here…and if you haven’t filled it in PLEASE DO!
The answers to date, from the 46 responders, are below (Click on the image to see the detail):
There are some very interesting results in here…and, I willingly admit, some I find VERY surprising. One person has no experience with Wikipedia? Wow. The majority of people have not heard of PDSP, ChemIDPlus, DailyMed or DrugBank…but without knowing who the people providing feedback are, I should not be shocked. Most of these resources are not for chemists per se but for life scientists. The number of votes for “Always Trust” for ChemSpider and PubChem is very high and, one might say, a compliment. The results are clearly ChemSpider-biased since I put the question to my social network. The difference between the number of people who Always Trust PubChem and those who Commonly Trust PubChem is only one person. This is wildly different from my own views. I have heard people say that PubChem is equivalent in quality to CAS except that it’s free. Sorry folks…afraid not! (I have since heard at the EBI meeting from one of the people from PubChem that it is possible to do searches in certain ways to limit hits. It should be noted that this does not guarantee that the correct structure is retrieved.) On the flip side, the number of people who rarely trust PubChem is also quite high, so someone has had some interesting experiences!
There are a small number of people who NEVER Trust the resources, and early on one person declared they trusted none of them. I trusted myself enough to tell a colleague “that will be Egon Willighagen”, and this was later confirmed in his blog post “Trust has No Place in Science”. That may be true, and the topic of a separate post, but my judgment is pretty good!
How would I fill in the questionnaire? I would NEVER flag “Always Trust” for any of the databases. I would be able to rank the databases in terms of my perceptions and experience, and the resulting trust in the quality of results I would find. The answers WOULD be different before I had conducted the work on the first 10 drugs compared with now, after the pre-work.
As the host of ChemSpider I would prefer that no one “Always Trusts” the resource, as that would stop people from taking care with the data and thereby remove the possible value of them curating the data. However, I am more than happy to have it Commonly Trusted, and we have been working hard to gain the community’s trust in this area.
This work has triggered a number of responses….I’ll make my own comments on their positions separately… but their opinions are worth reading:
Egon Willighagen: Trust has no place in Science
Egon Willighagen: Trust has no place in Science Part 2
Christina Pikas: The role of trust in science. Christina comments: “I think that Anthony (sp.) could have chosen a better word than trust in his survey. ‘which of these have you evaluated and decided you could use? which of these would you prefer to use based on your evaluations of their merit?’” Christina is right. I could have chosen a different word, but I judge (chosen carefully!) that the responses would not have differed much.
There is also a healthy exchange happening on Friendfeed.
This work has only just started. An examination of >150 “small molecule drugs” by three labs is going to provide a lot of data. The work isn’t over and we have much to do. We’re learning a lot in the process about assertional loops, iterative process, collaboration and agreement. It’s a great adventure.