Encouraging Collaboration in Washington as a Hub for Chemistry Databases


On August 25/26 I will be attending the 5th Meeting on U.S. Government Chemical Databases and Open Chemistry. I will have the opportunity to spend time with people I appreciate for the contributions they are making to chemistry: Martin Walker, JC Bradley, Andy Lang, Markus Sitzmann, Ann Richard, Frank Switzer, Evan Bolton, Marc Zimmermann, Wolf Ihlenfeldt, Steve Heller, John Overington, Noel O’Boyle, and many others. It is surely going to be an excellent meeting. The agenda is given here.

Some of the people listed above are associated with “Washington-based databases”: databases that are developed in or around Washington by government-funded organizations such as the FDA, NIH, NCBI/NLM, NCI and NIST. Other government-funded databases based outside Washington are also represented, namely those from the EPA and CDC. If you are not sure what all those acronyms stand for, here you go.

FDA – Food and Drug Administration

NIH – National Institutes of Health

NCBI/NLM – National Center for Biotechnology Information/National Library of Medicine

NCI – National Cancer Institute

EPA – Environmental Protection Agency

CDC – Centers for Disease Control and Prevention

NIST – National Institute of Standards and Technology

One chemistry database conspicuous by its absence is the NCGC data collection contained in the NPC Browser. I’ve written a lot about this one on this blog.

NCGC – NIH Chemical Genomics Center

I am hoping to get to talk to some members of the team if they attend the meeting though.

There will be a LOT of government databases represented at this meeting. I have experience with many of the databases provided by these institutions. The DSSTox database is one of the most highly curated databases based on my review of the data. The NCI resolver is an excellent resource with good quality data in terms of the accuracy of name-structure relationships.

The various databases are developed independently of each other. True, some of the databases contain content from some of the other databases but, as far as I can tell, there is not much collaboration in terms of coordinated curation of data. What would it be like if each of these organizations participated in a roundtable discussion to agree on a process by which to collaboratively validate and curate the data, once and for all? Maybe this meeting can catalyze such a discussion. I would encourage the organizations to take advantage of other data sources that can share their data – ChEBI/ChEMBL is one example! If these various groups coordinated their work then the result could be a dataset of massively improved quality to share across the databases and across the community. If this work were done then the group that assembled the NPC Browser would likely have had a lot less work to do in terms of assembling the data. The various database providers should certainly have provided clean, curated data for many of the top known drugs.

While working on a manuscript reviewing the quality of public domain chemistry databases I assembled a table of 25 of the top selling drugs in the US and checked the data quality in the NPC Browser relative to a gold standard set. The assembly of the data will be discussed in its entirety in a later publication.

25 of the Top Selling Drugs in the USA - Data Quality in the NPC Browser

The errors listed in the table are:

1 Correct skeleton, No stereochemistry
2 Correct skeleton, Missing stereochemistry
3 Correct skeleton, Incorrect stereochemistry
4 Single component of multicomponent structure
5 Multiple components for single component structure
6 No structure returned based on Name Search
7 Incorrect skeleton
8 Multiple structures based on name search
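
For anyone wanting to run a similar check against their own reference set, the categories above could be encoded roughly as follows. This is only a sketch under stated assumptions: the Structure record, the classify function and the sample identifiers are hypothetical and are not the procedure used to build the table. In particular, I am assuming that code 1 means the returned structure carries no stereo labels at all while code 2 means only some of the gold standard centres are labelled, and that a structure can be reduced to a skeleton key, a map of stereocentre labels and a component count.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Structure:
    """Hypothetical, heavily simplified structure record (an assumption)."""
    skeleton: str                                         # connectivity key
    stereo: Dict[int, str] = field(default_factory=dict)  # centre -> "R"/"S"
    components: int = 1                                   # disconnected parts


def classify(hits: List[Structure], gold: Structure) -> Optional[int]:
    """Return the error code (1-8) for a name search, or None if it matches."""
    if not hits:
        return 6  # no structure returned based on name search
    if len(hits) > 1:
        return 8  # multiple structures based on name search
    hit = hits[0]
    if hit.components == 1 and gold.components > 1:
        return 4  # single component of multicomponent structure
    if hit.components > 1 and gold.components == 1:
        return 5  # multiple components for single component structure
    if hit.skeleton != gold.skeleton:
        return 7  # incorrect skeleton
    if hit.stereo == gold.stereo:
        return None  # agrees with the gold standard
    if not hit.stereo:
        return 1  # correct skeleton, no stereochemistry (assumed meaning)
    # Any label that disagrees with the gold standard counts as incorrect.
    if any(gold.stereo.get(c) != label for c, label in hit.stereo.items()):
        return 3  # correct skeleton, incorrect stereochemistry
    return 2  # correct skeleton, some centres missing (assumed meaning)


# Usage with made-up records: one of two stereocentres is unlabelled.
gold = Structure("skel-A", {1: "R", 2: "S"})
print(classify([Structure("skel-A", {1: "R"})], gold))
```

The order of the checks matters in this sketch: search-level problems (codes 6 and 8) are tested first, then component counts, then the skeleton, and stereochemistry is only compared once the skeleton is known to agree.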

 

Clearly there are a lot of errors in the structures associated with 25 of the best selling drugs on the US market. These should be the easy ones to get right as they are so well known! Collaboration between the domain’s top database providers would almost certainly have helped. This would not necessarily be an issue of meshing technologies but of agreeing on a common goal: to have the highest quality data available. Since the government puts so much money into the development of these databases it would be appropriate to have some oversight and a push for aligning efforts. Collaboration is essential!

With that in mind…a shameless pointer to how Sean Ekins, Maggie Hupcey and I BELIEVE in the need for collaboration…our book. If we can encourage others working on the government chemistry databases to adopt active collaborative approaches, wonderful things could happen.

Collaborative Computational Technologies for Biomedical Research

 

  1. #1 by Egon Willighagen on August 2, 2011 - 8:39 am

    I am looking forward to any and all online coverage, as I will be attending this exciting meeting!

  2. #3 by trung on August 3, 2011 - 1:39 am

    Hi Tony, we (NCGC) for sure will be there to face the firing squad 8-). We look forward to the roundtable discussion. A quick comment on your table: Would it be possible for you to qualify the numbers with the software version in your presentation? I suspect the numbers above were based on older versions of the software. Most of the issues identified should be fixed with the current version. Also, you might want to qualify the top 25 list with the year. Tegaserod was discontinued (2007?) so it’s unlikely to still be in the top 25.
    Trung

  3. #4 by tony on August 3, 2011 - 6:49 am

    Looking forward to seeing you at the meeting.

    The Excel Spreadsheet was assembled about a month ago. What I should do moving forward is make sure that I quote the exact version/date. Definitely a good request. Thank you!

    The List of the Bestselling Drugs that I used was the one listed on Wikipedia from 2006. I will submit a request that it is updated: http://en.wikipedia.org/wiki/List_of_bestselling_drugs

  4. #5 by tony on August 3, 2011 - 6:50 am

    This is an up to date list of the top 200 selling drugs: http://www.drugs.com/top200.html

  5. #6 by tony on August 3, 2011 - 6:55 am

    Trung…I will hopefully blog tonight about your new curation system. I really like it! Very impressive. I will make sure that I comment that any and all reviews I make of data are simply points in time as the data are continually changing. It is the same with ChemSpider: the data are constantly under review. We have it easier as we are a website rather than an installable software package, so it’s always just ChemSpider. All curation edits are logged in the History file and available for review. Best wishes.
