Reviewing Data Quality in the NCGC Pharmaceutical Collection Browser


I wasn’t aware of the NCGC Pharmaceutical Collection Browser until today. The work behind the development of the database and the browser is discussed in the publication here:

R. Huang, N. Southall, Y. Wang, A. Yasgar, P. Shinn, A. Jadhav, D.-T. Nguyen, C. P. Austin. The NCGC Pharmaceutical Collection: A Comprehensive Resource of Clinically Approved Drugs Enabling Repurposing and Chemical Genomics. Science Translational Medicine, 2011; 3 (80): 80ps16 DOI: 10.1126/scitranslmed.3001862

Whenever a new database comes online, my first concern is data quality. To get a feel for it I looked at the HTS amenable compounds subset, a dataset of >7600 compounds. I ran a couple of very simple filters to try to identify potential issues with the data. In particular I was looking for the presence, absence or confusion of stereochemistry. The filters also checked for valency issues and charge imbalance. Based on these checks I estimate that, for the HTS amenable compounds at least, the errors in the data amount to a minimum of 5% and probably over 10%. This is an estimate of course, and it would be a lot of work to clean it all up. I’ll try to take a look at the entire database shortly.

Some examples of the errors I saw are below… Unfortunately there are many hundreds of errors in just this subset of the database. We keep creating databases, and in this case a 90 Mbyte desktop browser solution, but WHO is curating and checking the data? How much keeps getting invested in developing software relative to building quality datasets to use in the various systems? And so it continues….

Charge Balance Issues

NCGC Browser Charge Balance Issues - Screenshots from Browser
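
A charge balance check of this kind is simple to sketch. The snippet below is only an illustration, not the filter actually used for this review: it assumes the structures are V2000 mol blocks from an SDF file, that formal charges are recorded on `M  CHG` property lines, and that a record is "balanced" when those charges sum to zero. The function names are hypothetical.

```python
def net_charge(molblock: str) -> int:
    """Sum the formal charges listed on 'M  CHG' lines of a V2000 mol block.

    Assumption: charges are given only on 'M  CHG' lines; the legacy charge
    column of the atom block is ignored in this sketch.
    """
    total = 0
    for line in molblock.splitlines():
        if line.startswith("M  CHG"):
            fields = line.split()
            count = int(fields[2])               # number of (atom, charge) pairs
            pairs = fields[3:3 + 2 * count]      # atom1, charge1, atom2, charge2, ...
            total += sum(int(value) for value in pairs[1::2])
    return total


def is_charge_balanced(molblock: str) -> bool:
    """Flag a record as balanced when its formal charges sum to zero."""
    return net_charge(molblock) == 0


# Property-line fragments only; a real mol block also has header, atom and bond lines.
balanced = "M  CHG  2   1   1   4  -1\nM  END"
unbalanced = "M  CHG  1   1   1\nM  END"

print(is_charge_balanced(balanced))    # True
print(is_charge_balanced(unbalanced))  # False
```

A non-zero total is only a prompt to look at the record by hand, since some entries may legitimately be drawn as isolated ions.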

 

Imperfect, Absent and Incorrect Stereochemistry

Stereo Issues: Left Hand Side NCGC Structures, Right Hand Side "Correct Structures"
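
One way to surface absent or partial stereochemistry is to report how many of a structure's stereocenters are actually defined. The sketch below assumes that a cheminformatics toolkit has already perceived the stereocenters and labelled each one "R", "S" or "?" (undefined); that perception step is not shown, and the function name and label convention are hypothetical.

```python
def stereo_summary(labels):
    """Return (defined, total, status) for a list of stereocenter labels.

    Assumption: each label is 'R' or 'S' for a defined center and '?' for
    an undefined one, as produced by some upstream perception step.
    """
    total = len(labels)
    defined = sum(1 for label in labels if label in ("R", "S"))
    if total == 0:
        status = "no stereocenters"
    elif defined == total:
        status = "fully defined"
    elif defined == 0:
        status = "undefined"
    else:
        status = "partially defined"
    return defined, total, status


defined, total, status = stereo_summary(["R", "S", "?"])
print(f"{defined} of {total} stereocenters defined ({status})")
```

A structure flagged as undefined or partially defined is not necessarily wrong, since racemates are drawn this way too, but it is a useful shortlist for manual review.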

Incorrect Valence Issues

Valence Issues for "Tannic Acid Glycerite" - Screenshot from NCGC Browser
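
A basic valence filter can be sketched in a few lines. This is an illustration under stated assumptions, not the filter used here: the molecule is given as a list of element symbols plus a list of (atom index, atom index, bond order) tuples with explicit hydrogens, the `MAX_VALENCE` table is a deliberately small one for neutral atoms, and formal charges are ignored (so a charged atom such as an ammonium nitrogen would be flagged incorrectly).

```python
# Simplified maximum valences for neutral atoms (assumption for this sketch).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1}


def overvalent_atoms(elements, bonds):
    """Return (index, element, valence) for atoms exceeding MAX_VALENCE.

    elements: list of element symbols, hydrogens explicit.
    bonds: list of (atom_a, atom_b, bond_order) tuples using 0-based indices.
    Elements missing from MAX_VALENCE are skipped rather than guessed at.
    """
    totals = [0] * len(elements)
    for atom_a, atom_b, order in bonds:
        totals[atom_a] += order
        totals[atom_b] += order
    return [
        (index, element, totals[index])
        for index, element in enumerate(elements)
        if element in MAX_VALENCE and totals[index] > MAX_VALENCE[element]
    ]


# A carbon drawn with five single bonds, i.e. a pentavalent carbon.
elements = ["C", "H", "H", "H", "H", "H"]
bonds = [(0, k, 1) for k in range(1, 6)]
print(overvalent_atoms(elements, bonds))  # [(0, 'C', 5)]
```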

And, just to clarify… I am not saying that our own database, ChemSpider, is perfect. It’s not. But the crowds can help us improve it and curate the data online and immediately. One thing I DO like is that the developers thought ahead about getting immediate feedback, as shown below. Unfortunately, when I tried to use it, it reported an error, so I don’t know whether the message got through. I hope to get a response at my email address.

Feedback Screen in the NPC Browser

 

  1. #1 by trung on April 28, 2011 - 11:00 pm

    Hi Tony, do you by any chance have a proxy running on your computer? I’ll see if I can fix this. Thanks for finding the structure errors. Clearly, with limited resources, we have only been able to do any QC on a very small subset (i.e., the “NPC screening” set). Until I get the feedback feature fixed, please send me any and all errors that you find. Thanks so much!

    • #2 by tony on April 29, 2011 - 10:04 am

      The majority of the issues I found were again in the NPC Screening set anyway. I am not sure how it was QC-ed but you should look especially at aluminium stearate, antimony tartrate, cycloprovera, Cortisone 21-phosphate and lots more. In order to review the data I download the data slice and review it locally. The data definitely needs QC’ing again. Can you outline the process you used? I might be able to suggest some tweaks to improve it.

      • #3 by trung on April 29, 2011 - 2:41 pm

        Thanks for the comments Tony. I must admit we’ve been very sloppy with stereo. Since not all data sources are equally reliable (I know I’m preaching to the choir here), we have to relax how we encode the structures so as to achieve the best possible compromise between correctness and uniqueness. Even though we do have a rough ranking of data sources in terms of reliability, it’s just not possible to write code (at least not for me) to pick out the “correct” structure every time. Clearly, manual curation is the key (again, preaching to the choir), and is something that we have very little time dedicated to. (Basically we do this project in whatever spare time we have.) As we’ve noted in the paper, it’s still very much a work-in-progress and it is only through careful scrutiny by experts like you that the NPC resource can be improved. (We’ve struggled in deciding whether to release now but incomplete or wait until the cows come home.)

        The other source of errors that you’ve prominently pointed out is the handling of organometallic structures. This is entirely my fault. I just don’t have a good handle on what’s the “right” way to standardize these structures. I’ve made various attempts to get feedback from the community on our standardizer (e.g., http://blueobelisk.shapado.com/questions/are-there-any-open-source-efforts-to-produce-a-standardizer) but to no avail. Any pointers you have here would be much appreciated!

        This exchange has been very useful. I hope we can continue the dialog in some capacity.

        Regards,
        Trung

        • #4 by sean ekins on April 29, 2011 - 3:38 pm

          So you are saying you published in a science journal and I might add got lots of publicity for an imperfect database? And there is no warning to users! You should retract or correct it immediately.

          • #5 by Rajarshi Guha on April 29, 2011 - 9:13 pm

            A little shrill maybe?

            Yes, it’s not a perfect database (do you have any pointers to any perfect ones?). But as the paper very clearly states:

            “Information will be periodically updated as curation proceeds, new MEs are added as they are registered or approved, and errors are found. This process will benefit enormously from community feedback, and we urge users to employ the error report mechanism on the site (26)”

            I’m not sure why you’re making such a big deal out of this.

          • #6 by trung on April 29, 2011 - 10:51 pm

            Hi Sean,

            This is not at all the reaction that I was hoping for. Clearly, “incomplete” is a poor choice of word; “imperfect” is what I was looking for. Communication (or lack thereof) is one of my many flaws.

            I just want to set the record straight that I have the utmost respect for Tony, not only for what he does for the community but also for what he stands for. If only there were more people like him, then we would probably have more drugs on the market by now. This is the reason why I come here, hoping to make our case to him and the community in helping us resolve some of the blemishes in the database. I certainly respect your opinions, though I think there is a more appropriate/official channel to express them.

            Sincerely,
            Trung

            PS: I just want to make the disclaimer that the opinions expressed here and in the previous post are entirely mine and, in no way, reflect those of my co-authors’.

        • #7 by tony on April 29, 2011 - 11:24 pm

          Trung…thanks for the feedback…some comments from me…

          1) I will do some review work on the dataset and share results with you as I move forward. It will be easier if I can provide feedback through the feedback feature.
          2) I’m acknowledging the comments re. “sloppy with stereo”. I encourage you to consider some type of flag on the structures, as we have done with ChemSpider, that shows that there are X out of Y stereocenters specifically defined, as this becomes an indication of ambiguity and validity of the structure in many cases. This might help the user to understand data quality more directly.
          3) In regards to reliability of data sources I am interested in your rank-ordering. My own perception of data quality was turned on its head recently as we started the validation of the top selling 150 drugs. I was shocked to see errors in what I thought were high quality databases. Would you be willing to share your ranking, if not publicly then by email?
          4) There are some VERY BASIC filters you could have run to highlight issues… certainly releasing the DB with pentavalent carbons was likely an error but I would check for invalid valences for other elements, for charge imbalance and for absent stereocenters. You will find the errors rather quickly.
          5) Organometallics are a challenge yes…for all of us. But there are only a few challenging examples and they should be easy to fix.

          I LOOK FORWARD (!!!) to continuing the dialog. I want to help. And what I want people to understand is that cheminformatics tools built without attention to data quality are an ongoing issue, and the community needs to collaboratively gather around the data and work together to fix it. CHEERS. If you want to connect by email please find me on LinkedIn…http://www.linkedin.com/in/antonywilliams. I am not posting my email here, though it’s in many places, as this post seemed to have upset some, especially “Focus from Washington”, the anonymous commenter on Sean’s post. I will help!

  2. #8 by Barend Mons on April 29, 2011 - 1:53 am

    Tony, very relevant comments of course.
    As stated many times between us: the value of data is obviously intimately related to their quality and there is simply no way around community annotation.
    Maybe pointing people to the ‘solution’ (or at least a beginning of one) to credit people for their curation work is worthwhile:
    http://www.nature.com/ng/journal/v43/n4/full/ng0411-281.html

    • #9 by tony on April 29, 2011 - 10:06 am

      Thanks Barend…thanks for pointing out the paper. I have tweeted it too. I hope that people give it a look. Building interfaces for curation into Open PHACTS will of course be a key activity.

  3. #10 by sean ekins on April 29, 2011 - 8:55 am

    Tony – great analysis..I just blogged on this too http://tinyurl.com/5tsyem7

    • #11 by tony on April 29, 2011 - 10:08 am

      Thanks Sean…and I appreciate the suggestion to look at the quality just of the NPC Screening set. However, as you can see above, I found many of the same types of errors in that set immediately. I identified dozens of errors within a couple of minutes of opening the SDF file…I’ll do some more work looking into this on an upcoming trip.

  4. #12 by Joerg Kurt Wegner on April 29, 2011 - 11:29 am

    Thanks, good analysis, some histograms would be nice to get a feeling for how much is wrong.
    Besides, I really do not like their application, it is clunky and making the data directly available seems much more valuable than forcing people over this interface, while the overall data quality remains questionable. Finally, only structures seem exportable, all other data tables are closed and ‘unusable’ this way. Browsing data alone does not make it useful, does it?

    • #13 by tony on April 29, 2011 - 10:59 pm

      Joerg…it’s going to take me a few weeks to curate the data to a level where I can generate histograms…for sure very time consuming. My first tests with the browser were good. It was responsive and seemed to work well. I’ve been playing with it for two days and it seems slow now. The SDF files contain synonyms, indications, CAS numbers, external IDs etc. but there are a pile of mappings that are not exported that would be valuable to have access to for validation. I think the NPC Screening set can be developed and be very valuable, and it certainly would be good to make it Open to all, especially to work on in the Open PHACTS project for example, and it could get returned to the NCGC to update in their database. It’s a lot of work but doable.

  5. #14 by Markus Sitzmann on April 29, 2011 - 5:04 pm

    Since we talk about curation – this is my molecule of the month:

    http://www.chemspider.com/Chemical-Structure.19053748.html

    :-)

  6. #16 by Noel on April 30, 2011 - 5:31 am

    I think cycloprovera is a great example to use. The informatics resource was necessary in order to build the physical screening collection. We needed a concise physical collection of HTS amenable compounds that covered the APIs from approved drugs for those cases where the platform/reagents for the HTS assay were limiting, e.g. patient-derived fibroblasts where we could only get a few million cells as compared to most HTS-friendly cell lines which can easily be grown up on the scale of billions of cells. The database contains multiple structures/identifiers per ME though we only show a single one through the tool (one can get a sense of this by looking at the multiple CID linkouts from this structure). The focus of course is on the MEs from approved drugs, though we provide the full INN catalog which includes many unapproved drugs, etc. as a service to the community. The Sigma catalog structure and name given for cycloprovera is informative (E8004 β-Estradiol 17-cypionate). Where possible, we provide the links back to INN and others, which are authoritative for structure. Estradiol valerate links out to http://whqlibdoc.who.int/inn/proposed_lists/prop_INN_list35.pdf for instance. Transcription errors are to be expected from such sources. We hope that everyone in the community will be more forthcoming in providing their own curated structures back to the community going forward.

    • #17 by tony on May 10, 2011 - 10:23 pm

      Thanks for the comment Noel. I think you can see from many of the later comments that there are many other errors in the database that really shouldn’t have crept through based on the heuristics defined in the paper. My analysis work is complete and I am presently gathering my thoughts together as part of a publication. Cheers

      • #18 by Noel Southall on May 17, 2011 - 5:39 pm

        I think the key issue is the presentation of a single structure in the browser for a molecular entity record. You can see by the multiple linkouts to CIDs in the various records that there is confusion in the literature over the actual structure of said molecules. Our approach was to build consensus around the semantic concept of molecular entity and then make sure we had a physical sample of it in the screening collection. The browser does do a disservice in choosing to display just one structure based on that consensus. In fact, what we choose to display might not be the correct one (which is why this feedback is so helpful) but we’re pretty sure that the ME is not represented more than once by a physical substance in the screening collection and that if it is amenable to HTS in the first place it is either in the collection or flagged for acquisition.

        When you finish your analysis, please forward it to us – we hope that everyone in the community will be more forthcoming in providing their own curated structures back to the community going forward.

        • #19 by Leslie on May 18, 2011 - 7:44 am

          Noel: you raise an interesting approach. Perhaps going forward, NCGC should show all of the structures that are published and assign probabilities of correctness to each one. Then, make it possible for those probabilities to change as more “evidence” comes in. That evidence can be additional structural information, comments from subscribers, etc.

          My question is – when errors are discovered, do you have a way to return to the original source of the mistake and let them know that they need to clean house, too?

  7. #20 by Kishore Kumar Madala on August 13, 2011 - 3:53 am

    I suggest that you need a person to manually validate the structures in each day’s submissions. That person should be capable of running queries that filter out the correct structures and capture the wrong ones with respect to valence, aromaticity, charge balance, stereochemistry and organometallic structures.

    Can I have all the filters you are applying to achieve correctness of the database?

    • #21 by tony on August 13, 2011 - 10:01 pm

      Catching incorrect structures based on valence, aromaticity, charge balance etc. is EASY. Determining undefined versus fully defined stereo is also easy, but in certain cases of course racemates will be defined this way. The issue is more the agreement between the chemical names and the chemical structure. That is, unfortunately, limited to manual curation processes at present. How does GVKBIO validate name-structure relationships?
