
Encouraging Collaboration in Washington as a Hub for Chemistry Databases

01 Aug

On August 25/26 I will be attending the 5th Meeting on U.S. Government Chemical Databases and Open Chemistry. I will have the opportunity to spend time with people I appreciate for the contributions they are making to chemistry: Martin Walker, JC Bradley, Andy Lang, Markus Sitzmann, Ann Richard, Frank Switzer, Evan Bolton, Marc Zimmermann, Wolf Ihlenfeldt, Steve Heller, John Overington, Noel O’Boyle, and many others. It is surely going to be an excellent meeting. The agenda is given here.

Some of the people listed above are associated with “Washington-based databases”: databases developed in or around Washington by government-funded organizations such as the FDA, NIH, NCBI/NLM, NCI and NIST. Other government-funded databases based outside Washington are also represented, namely those of the EPA and CDC. If you are not sure what all those acronyms stand for, here you go.

FDA – Food and Drug Administration

NIH – National Institutes of Health

NCBI/NLM – National Center for Biotechnology Information/National Library of Medicine

NCI – National Cancer Institute

EPA – Environmental Protection Agency

CDC – Centers for Disease Control and Prevention

NIST – National Institute of Standards and Technology

One chemistry database conspicuous by its absence is the NCGC data collection contained in the NPC Browser. I’ve written a lot about this one on this blog.

NCGC – NIH Chemical Genomics Center

I am hoping to get to talk to some members of the team if they attend the meeting though.

There will be a LOT of government databases represented at this meeting. I have experience with many of the databases provided by these institutions. The DSSTox database is one of the most highly curated databases based on my review of the data. The NCI resolver is an excellent resource with good quality data in terms of the accuracy of name-structure relationships.

The various databases are developed independently of each other. True, some of the databases contain content from some of the other databases but, as far as I can tell, there is not much collaboration in terms of coordinated curation of data. What would it be like if each of these organizations participated in a roundtable discussion to agree on a process by which to collaboratively validate and curate the data, once and for all? Maybe this meeting can catalyze such a discussion. I would encourage the organizations to take advantage of other data sources that can share their data; ChEBI/ChEMBL is one example! If these various groups coordinated their work then the result could be a dataset of massively improved quality to share across the databases and across the community. If this work were done then the group that assembled the NPC Browser would likely have had a lot less work to do in assembling the data. The various database providers should certainly have provided clean, curated data for many of the top known drugs.

While working on a manuscript reviewing the quality of public domain chemistry databases I assembled a table of 25 of the top selling drugs in the US and checked the data quality in the NPC Browser relative to a gold standard set. The assembly of the data will be discussed in its entirety in a later publication.

25 of the Top Selling Drugs in the USA - Data Quality in the NPC Browser

The errors listed in the table are:

1 Correct skeleton, No stereochemistry
2 Correct skeleton, Missing stereochemistry
3 Correct skeleton, Incorrect stereochemistry
4 Single component of multicomponent structure
5 Multiple components for single component structure
6 No structure returned based on Name Search
7 Incorrect skeleton
8 Multiple structures based on name search
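For anyone wanting to repeat this kind of review on their own data, the eight error categories above could be encoded and tallied along the following lines. This is only a minimal sketch: the enum member names, the `tally_errors` helper and the `drug_A`-style records are hypothetical placeholders and are not the values from my table.

```python
from collections import Counter
from enum import IntEnum


class StructureError(IntEnum):
    """Error codes 1-8, following the list in the post (member names are assumed)."""
    NO_STEREO = 1
    MISSING_STEREO = 2
    INCORRECT_STEREO = 3
    SINGLE_COMPONENT_OF_MULTI = 4
    MULTI_COMPONENT_FOR_SINGLE = 5
    NO_STRUCTURE_FROM_NAME = 6
    INCORRECT_SKELETON = 7
    MULTIPLE_STRUCTURES_FROM_NAME = 8


def tally_errors(records):
    """Count how often each error code occurs across all records.

    `records` maps a drug name to the list of error codes found for it.
    """
    counts = Counter()
    for codes in records.values():
        counts.update(StructureError(code) for code in codes)
    return counts


# Placeholder records only; these are NOT the values from the table.
example = {
    "drug_A": [],
    "drug_B": [2],
    "drug_C": [3, 8],
    "drug_D": [2],
}

counts = tally_errors(example)
drugs_with_errors = sum(1 for codes in example.values() if codes)
```

Recording the software version and review date alongside each record would make such a tally easier to compare over time.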

 

Clearly there are a lot of errors in the structures associated with 25 of the best selling drugs on the US market. These should be the easy ones to get right as they are so well known!!! Collaboration between the domain’s top database providers would almost certainly have helped. This would not necessarily be an issue of meshing technologies but of agreeing on a common goal to have the highest quality data available. Since the government puts so much money into the development of these databases, it would be appropriate to have some oversight and a push for aligning efforts. Collaboration is essential!

With that in mind…a shameless pointer to how Sean Ekins, Maggie Hupcey and I BELIEVE in the need for collaboration…our book. If we can encourage others working on the government chemistry databases to adopt active collaborative approaches, wonderful things could happen.

Collaborative Computational Technologies for Biomedical Research

 

 

About tony

Founder of ChemZoo Inc., the host of ChemSpider (www.chemspider.com). ChemSpider is an open access online database of chemical structures and property transaction based services to enable chemists around the world to data mine chemistry databases. The Royal Society of Chemistry acquired ChemSpider in May 2009. Presently working as a consortium member of the OpenPHACTS IMI project (http://www.openphacts.org/). This focuses on how drug discovery can utilize semantic technologies to improve decision making and brings together 22 European team members to develop an infrastructure to link together public and private data for the drug discovery community. I am also involved with the PharmaSea FP7 project (http://www.pharma-sea.eu/) trying to identify new classes of marine natural products with potential pharmacological activity. I am also one of the hosts for three wikis for Science: ScientistsDB, SciMobileApps and SciDBs. Over the past decade I held many responsibilities including the direction of the development of scientific software applications for spectroscopy and general chemistry, directing marketing efforts, sales and business development collaborations for the company. Eight years’ experience of analytical laboratory leadership and management. Experienced in experimental techniques, implementation of new NMR technologies, walk-up facility management, research and development, manufacturing support and teaching. Ability to provide situation analysis, creative solutions and establish good working relationships. Prolific author with over 150 peer-reviewed scientific publications, 3 patents and over 300 public presentations. Specialties: Leadership in the domain of free access Chemistry, Product and project management, Organizational and Leadership development, Competitive analysis and Business Development, Entrepreneurial.


6 Responses to Encouraging Collaboration in Washington as a Hub for Chemistry Databases

  1. Egon Willighagen

    August 2, 2011 at 8:39 am

    I am looking forward to any and all online coverage, as I will be attending this exciting meeting!

     
  2. trung

    August 3, 2011 at 1:39 am

    Hi Tony, we (NCGC) for sure will be there to face the firing squad 8-). We look forward to the roundtable discussion. A quick comment on your table: Would it be possible for you to qualify the numbers with the software version in your presentation? I suspect the numbers above were based on older versions of the software. Most of the issues identified should be fixed with the current version. Also, you might want to qualify the top 25 list with the year. Tegaserod was discontinued (2007?) so it’s unlikely to still be in the top 25.
    Trung

     
  3. tony

    August 3, 2011 at 6:49 am

    Looking forward to seeing you at the meeting.

    The Excel Spreadsheet was assembled about a month ago. What I should do moving forward is make sure that I quote the exact version/date. Definitely a good request. Thank you!

    The List of the Bestselling Drugs that I used was the one listed on Wikipedia from 2006. I will submit a request that it is updated: http://en.wikipedia.org/wiki/List_of_bestselling_drugs

     
  4. tony

    August 3, 2011 at 6:50 am

    This is an up to date list of the top 200 selling drugs: http://www.drugs.com/top200.html

     
  5. tony

    August 3, 2011 at 6:55 am

    Trung…I will hopefully blog tonight about your new curation system. I really like it! Very impressive. I will make sure to note that any and all reviews I make of data are simply points in time, as the data are continually changing. We have the same thing with ChemSpider: the data are constantly under review. We have it easier as we are a website rather than an installable software package, so it’s always just ChemSpider. All curation edits are logged in the History file and available for review. Best wishes.

     
