Continuing Conflicts in the Messy World of Internet Chemistry

16 Aug

I have been looking at the state of curated data on the internet and blogged last night about the messy world of curated data. I should emphasize…none of these commentaries are meant to be harsh. Believe me, I’ve gone through the process of validating data and it’s difficult. There will be mistakes, but what we need are processes and systems to clean these data up efficiently. If I see an error I want to annotate it and let people know there is an error. With today’s technologies it is not difficult.

Let’s take another example from DrugBank.

That listed chemical name above the structure doesn’t look very consistent…I don’t see any stereochemistry, certainly no “dihydroxy” and overall…yes, it’s definitely wrong. The actual structure for that name is shown below. Looks like an entire half of the molecule is missing. The InChI and InChIKey are for the molecule shown in DrugBank but the link to KEGG is to the molecule shown below…here.

The links on DrugBank to PubChem and ChEBI are to the molecule to the left. All of the outlinks in the DrugBank record are for the structure on the left. The one exception is the structure shown on the record itself: it, and its associated SMILES and InChIs, are for the “2-amino-3,5-dihydro-4H-pyrrolo[2,3-d]pyrimidin-4-one” moiety. Oops.
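A mismatch like this, where a record's own structure disagrees with the structures behind its outlinks, is the kind of thing a simple automated check could flag. The following is a minimal sketch, not how DrugBank or any other database actually validates records. It assumes each record carries its own InChIKey plus the InChIKeys fetched from the linked databases; the record layout, the function names and the keys themselves are hypothetical placeholders. It relies only on the standard InChIKey layout: a 14-character first block hashed from the connectivity, a 10-character second block covering the remaining layers such as stereochemistry, and a final protonation character.

```python
def split_inchikey(key: str) -> tuple:
    """Split a standard 27-character InChIKey into its three blocks."""
    parts = key.strip().upper().split("-")
    if [len(p) for p in parts] != [14, 10, 1]:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return tuple(parts)


def compare_keys(a: str, b: str) -> str:
    """Classify how two InChIKeys relate to each other."""
    a_parts, b_parts = split_inchikey(a), split_inchikey(b)
    if a_parts == b_parts:
        return "identical"
    if a_parts[0] == b_parts[0]:
        # Same connectivity hash: the keys differ only in the later
        # layers (for example stereochemistry) or in protonation.
        return "same-skeleton"
    return "different"


def audit_record(record: dict) -> dict:
    """Compare a record's own InChIKey with each cross-referenced key."""
    own = record["inchikey"]
    return {src: compare_keys(own, key) for src, key in record["xrefs"].items()}


# Placeholder data: these are NOT real InChIKeys or real database entries.
record = {
    "id": "EXAMPLE-0001",
    "inchikey": "AAAAAAAAAAAAAA-DDDDDDDDSA-N",
    "xrefs": {
        "source_1": "BBBBBBBBBBBBBB-DDDDDDDDSA-N",
        "source_2": "AAAAAAAAAAAAAA-DDDDDDDDSA-N",
    },
}

report = audit_record(record)
for source, verdict in report.items():
    print(source, verdict)
```

Any outlink reported as "different" would be a candidate for manual review. Since the InChIKey is a hash, this is a screening step rather than proof that two structures match or differ.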

Recently I pointed out to David Wishart, host of DrugBank, some of the issues I had been seeing, and it appears there will be a major update to DrugBank in the next few weeks that, in theory, will address some, and hopefully all, of these observations.


Posted on August 16, 2010 in Uncategorized



One Response to Continuing Conflicts in the Messy World of Internet Chemistry

  1. Cristina L.

    July 31, 2015 at 8:20 am

    Concerning the messy world of internet chemistry and DrugBank that you report on here, and 5 years after this post…I would like to point out an inconsistency I have also found in DrugBank.
    I am working with their data and want to analyse the target profile for each of the drugs listed in DrugBank.
    I have found multiple structures annotated more than twice in the database although they are the same (same InChI, same InChIKey, same SMILES), with different identifiers and names.
    This is not a problem in itself (well, old entries should be removed I guess…), yet judging by their names, most of them correspond to stereoisomers (R/S).
    First of all, stereoisomers should get different and unique InChIKeys/InChIs. Moreover, DrugBank emphasizes that it provides isomeric SMILES, though that does not seem to be the case here…
    Second, the protein associations provided for compounds that share the same structure but are annotated differently should be exactly the same (as they correspond to the same molecule!)…though they are not!
    What is the problem here?
    – Is the name wrongly assigned?
    – Is the structure wrongly assigned?
    – Are the protein associations wrongly assigned?

    If I search PubChem for the compounds by their DrugBank ID, they appear to have the same InChIKey that DrugBank provides… Still, if PubChem has retrieved the data from DrugBank, this might be like a snake biting its own tail.

    Check some examples: DB04352/DB04127, DB02399/DB03815, DB00151/DB03201.

    Would appreciate some hints about what to do or believe!
    Thank you in advance!
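
The situation described in the comment above (entries that share one InChIKey but carry different identifiers and different protein associations) can at least be listed systematically before deciding which annotation to trust. The following is a minimal sketch under stated assumptions: the data have already been exported as (entry ID, InChIKey, target list) rows, and the function name, entry IDs, keys and targets are hypothetical placeholders, not real DrugBank content.

```python
from collections import defaultdict


def find_conflicting_duplicates(rows):
    """Return InChIKeys shared by several entries whose target sets differ."""
    by_key = defaultdict(dict)
    for entry_id, inchikey, targets in rows:
        by_key[inchikey][entry_id] = frozenset(targets)

    conflicts = {}
    for inchikey, entries in by_key.items():
        # A conflict needs at least two entries with the same key
        # and at least two distinct target sets among them.
        if len(entries) > 1 and len(set(entries.values())) > 1:
            conflicts[inchikey] = {eid: sorted(t) for eid, t in entries.items()}
    return conflicts


# Placeholder rows: invented IDs, keys and targets for illustration only.
rows = [
    ("ENTRY_A", "AAAAAAAAAAAAAA-DDDDDDDDSA-N", ["target_1", "target_2"]),
    ("ENTRY_B", "AAAAAAAAAAAAAA-DDDDDDDDSA-N", ["target_3"]),
    ("ENTRY_C", "BBBBBBBBBBBBBB-DDDDDDDDSA-N", ["target_1"]),
    ("ENTRY_D", "CCCCCCCCCCCCCC-DDDDDDDDSA-N", ["target_4"]),
    ("ENTRY_E", "CCCCCCCCCCCCCC-DDDDDDDDSA-N", ["target_4"]),
]

conflicts = find_conflicting_duplicates(rows)
for inchikey, entries in conflicts.items():
    print(inchikey, entries)
```

A listing like this does not say whether the name, the structure or the protein association is the wrong one. It only narrows the problem to the entries that need to be checked against an independent source.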

