I have been looking at the state of curated data on the internet, and last night I blogged about the messy world of curated data. I should emphasize…none of these commentaries are meant to be harsh. Believe me, I’ve gone through the process of validating data and it’s difficult. There will be mistakes, but what we need are processes and systems to clean these data up efficiently. If I see an error I want to annotate it and let people know there is an error. With today’s technologies it is not difficult.
Let’s take another example from DrugBank.
That listed chemical name above the structure doesn’t look consistent with the structure itself…I don’t see any stereochemistry, certainly no “dihydroxy” and overall…yes, it’s definitely wrong. The actual structure for that name is shown below. It looks like an entire half of the molecule is missing from the DrugBank depiction. The InChI and InChIKey are for the molecule shown in DrugBank, but the link to KEGG is to the molecule shown below…here.
The links on DrugBank to PubChem and ChEBI are to the molecule to the left. In other words, all of the outlinks in the DrugBank record are for the structure on the left. The exception is the actual structure shown on the record: it, and its associated SMILES and InChIs, are for the “2-amino-3,5-dihydro-4H-pyrrolo[2,3-d]pyrimidin-4-one” moiety. Oops.
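This kind of mismatch is the sort of thing a simple automated check could flag. Here is a minimal sketch in Python, assuming you have already collected the InChIKey reported for the record's own structure and for each outlinked entry. The function name, the source labels, and the keys in the example are all made up for illustration; they are not real DrugBank, KEGG, PubChem, or ChEBI data. The sketch relies on one fact about InChIKeys: the first 14-character block encodes the connectivity skeleton, so two keys with different first blocks describe different molecules, not just different stereochemistry.

```python
def find_disagreements(keys_by_source, reference):
    """Compare each source's InChIKey against the reference source's key.

    keys_by_source: dict mapping a source label to the InChIKey it reports.
    reference: the label of the source to compare everything else against.
    Returns a dict of {source: description} for sources that disagree.
    """
    ref_key = keys_by_source[reference].strip().upper()
    ref_skeleton = ref_key.split("-")[0]  # first block: connectivity hash

    report = {}
    for source, key in keys_by_source.items():
        if source == reference:
            continue
        key = key.strip().upper()
        if key == ref_key:
            continue  # identical key, nothing to flag
        if key.split("-")[0] == ref_skeleton:
            # Same skeleton, so the difference is in the later blocks
            # (stereochemistry, isotopes, protonation, and so on).
            report[source] = "same skeleton, other layers differ"
        else:
            report[source] = "different molecule"
    return report


# Hypothetical placeholder keys, NOT real identifiers.
example = {
    "record_structure": "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "kegg": "CCCCCCCCCCCCCC-DDDDDDDDSA-N",
    "pubchem": "CCCCCCCCCCCCCC-DDDDDDDDSA-N",
    "chebi": "CCCCCCCCCCCCCC-DDDDDDDDSA-N",
}

for source, problem in find_disagreements(example, "record_structure").items():
    print(f"{source}: {problem}")
```

With the placeholder data, every outlink agrees with the others but disagrees with the structure on the record, which is the pattern described above. A check like this doesn't tell you which side is right, only that somebody needs to look.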
Recently I pointed out to David Wishart, host of DrugBank, some of the issues I had been seeing, and it appears there will be a major update to DrugBank in the next few weeks that, in theory, will address some, and hopefully all, of these observations.