Search results for ‘NPC’

Continuing Review of the NPC Browser Content – Most Cleanup is the Responsibility of the Hosts

In the past two weeks I have been in a number of discussions regarding my blog posts about the NPC Browser. My last blog post brought a comment from Ajit Jadhav, one of the authors of the original Science Translational Medicine publication about the NPC Browser. Ajit commended Sean and me on our light-hearted approach to discussing the issues of quality.

Specifically, he picked me up on the fact that American Cockroach IS listed on DailyMed as a medication. VERY interesting! He commented:

“Tony, Thanks for the amusing post. See here for more details of one example, american cockroach, which is Antigen Laboratories’ allergenic extract: http://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?id=12809.

And we can go on. But I would rather keep moving in a forward direction in life.

Regarding NPC… in case if it’s not clear yet, the collection is a small subset of HTS amenable compounds. The other content in the NPC Browser is supplementary.

Regarding you and Sean Ekins, you guys should go on the road as a comedic duo act. After all the serious scientific talks, the two of you can be the entertainment. One can be called Spinning and the other can be called Wheel.

I will volunteer to do the drum rolls for you :)

Gents, have fun working. Or… spinning if you enjoy that more. Apparently, the NPC Browser has hit a nerve in each of you. So I will check back on the blog to see what other entertainment you’re dishing out.  The more outrageous, the better! It just reveals more about you than the NPC Browser :)

Ajit”

My response is here and I insert a slice below.

“NPC was not ORIGINALLY described as a small set of HTS amenable compounds according to the Science Trans Med paper that describes it. According to the paper, and I quote “…the NCGC Pharmaceutical Collection (NPC) – a definitive collection of drugs registered or approved for use in humans or animals.” It also states that the “NPC is the most comprehensive and accurate exposition to date of MEs registered or approved for human or veterinary use worldwide.” Having reviewed a subset of structures related to a particular class of compounds, over 140 entities, with a >70% failure in “accuracy”, I have to question this statement. I judge that the Merck Index (book form or electronic form) is a better collection. In case you are not aware of this resource, details are:

http://www.merckbooks.com/mindex/referenceset.html

As I blogged in my post “Rabbits, Potatoes and other Vegetables in the NCGC Database” there are some interesting things in the database. Responding to a comment on that post I commented on other things listed in the database.

WATERCRESS
WATERMELON
WHEAT
WHEAT BRAN
WHEAT ENDOSPREM
WHEAT GERM
WHEAT GLUTEN
WHEAT GLUTEN
WHEAT MIDDLINGS
WHEAT MIDDLINGS
WHEAT MIDDLINGS
WHEY
WHITE FISH
WHITE MUSTARD
WHITE OAK BARK
WHITE PEPPER
WHITE WILLOW EXTRACT
WILD ROSE EXTRACT
WINE

I’ve searched these in DailyMed …not much luck I’m afraid 🙁

I DO believe that the list below would give me hits in DailyMed, but these members of the NCGC pharmaceutical collection are likely just a little generic!

List of "generics" in the NCGC pharmaceutical collection

It’s likely that almost all DailyMed labels contain “ingredients, water and additives”. I wonder how many of them contain “self heal” though.

As defined in the original paper: “…the NCGC Pharmaceutical Collection (NPC) – a definitive collection of drugs registered or approved for use in humans or animals.” Also: “NPC is the most comprehensive and accurate exposition to date of MEs registered or approved for human or veterinary use worldwide.” I challenge that based on the observations above.

I have to argue that it is time to do some very basic browsing of the entries in the database that are simply text entries with no structures. There are MANY that are distinct chemicals for which the chemical structure can easily be located. There are also many common terms that should simply be deleted from the dataset. Hundreds, in fact. I judge that one good evening of work would catch many of the most obvious terms that are in error. I doubt that a crowdsourcing approach will address this, and this very basic clean-up is the responsibility of the database hosts. It’s certainly a reputation issue. Ajit commented “The other content in the NPC Browser is supplementary”. I am trying to understand how. It doesn’t align with my interpretation of the paper or that of many of the people who have been discussing the data set with me in recent weeks.

Searching for “Complete Synonyms” in PubChem and the NPC Browser

I am interested in feedback from online databases as to the expected behavior of a search. PubChem has a Complete Synonym search that limits a chemical name search to the synonym field. Without that fielded search, the search runs across all text in a record, I assume. The difference in the results is shown below. The top image shows a search for Taxol returning 59 results.

A search for Taxol in PubChem

Below is a search on Taxol[completesynonym]. This search returns 5 hits for Taxol.

I wonder whether most users of PubChem know that they need to add the [completesynonym] definition to limit the search? You might want to try Diamond and Diamond[completesynonym] as searches and look at the results.
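
For anyone who wants to script the two searches rather than type them into the web page, here is a minimal sketch in Python. It only builds the query URLs and does not contact any server. It assumes the NCBI E-utilities `esearch` endpoint and the `pccompound` database name, neither of which is mentioned above, and the function name `pubchem_search_url` is made up for illustration. The `[completesynonym]` qualifier is the one used in the search above.

```python
from urllib.parse import urlencode

# Assumption: the NCBI E-utilities esearch endpoint, with "pccompound" as the
# PubChem Compound database name. Neither is taken from the post itself.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubchem_search_url(name, synonym_only=False):
    """Build a search URL for a chemical name (hypothetical helper).

    With synonym_only=True the [completesynonym] qualifier is appended,
    which limits the search to the synonym field.
    """
    term = f"{name}[completesynonym]" if synonym_only else name
    return ESEARCH + "?" + urlencode({"db": "pccompound", "term": term})


# Unrestricted search across all text in a record.
print(pubchem_search_url("Taxol"))
# Search limited to complete synonyms.
print(pubchem_search_url("Taxol", synonym_only=True))
```

The two URLs differ only in the qualifier, which is exactly the detail a casual user is unlikely to know about.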

I am assuming that a similar type of search can be conducted on the NPC Browser to limit results, as a search on the drug Lidocaine returns 14 chemicals, all of them different. If this search exists I have missed it. Can anyone comment?

With ChemSpider we do our utmost to return a single structure for a clearly unique name such as Taxol and Lidocaine. We believe that’s what most people would expect. Thoughts and comments welcome.

Thanks for the acknowledgment from the hosts of the NPC Browser

Over the past few weeks I have been reviewing the NPC Browser and NCGC data content and have written a number of posts on this blog. I have exchanged a small number of emails with the team and they have graciously acknowledged my efforts on the NPC webpage as shown below.

Acknowledgements

The NPC resource has benefited immensely from community feedback  since its initial release.  We are particularly grateful to the following individuals who have generously donated their time  and/or resources in helping us improve the NPC software and database:

  • Tudor Oprea, Oleg Ursu, and Sunset Molecular LLC graciously donated the drug subset of the WOMBAT database. This dataset was instrumental in enabling us to validate a large number of curated structures (i.e., curation of curations so to speak).
  • Manish Sud was an early adopter of the NPC resource. His thorough analysis helped us debugged a number of errors in the software as well as database.
  • Matthew Hall of NCI provided valuable feedback on the handling of metal containing compounds (certainly beyond organometallics).
  • Antony Williams’ critical scrutinies of the compound content revealed numerous errors in the original version of the database. He also provided valuable feedback on other issues related to the software and data curation.

We also would like to extend our gratitude to everyone who has  contributed to the curation effort of the NPC database.  As a token of our appreciation, we have created, for each curator, a dedication  badge within the software to acknowledge his/her contribution.”

Thanks for the acknowledgments guys! I was curating data again yesterday…glad to participate.

Posted on July 26, 2011 in NPC Browser and NCGC Collection

Duplicate compounds in the NPC Browser and NCGC Dataset

I am presently working on a couple of articles, book chapters and guest blog posts regarding quality in public domain chemistry databases. In so doing I have continued to work through the data contained within the NPC Browser that I have blogged about many times before. I HAVE been adding curation comments to the data as I have worked through them and have removed inappropriately associated chemical names. Eventually it became too much of a burden relative to getting my own work done, as there are so many edits required. What I have been looking for specifically is examples of what I thought would exist in the database: a failure to deduplicate. Deduplication, in terms of chemistry databases, is collapsing together records based on the same chemical structure. This sounds easy but it isn’t necessarily so. Consider some of the complexities of collapsing tautomers. SIMPLE collapsing can be done by generating InChIKeys and deduplicating on them, but InChI tautomer detection is imperfect and this approach will fail regularly. The majority of the cheminformatics toolkits have their own ways of generating fingerprints to deal with this issue of deduplication.

While browsing the database I came across Ranitidine, the active component of the well-known drug Zantac. I found two records in the database. They are shown below and numbered as 1/2 and 2/2.

Ranitidine record 1.

Ranitidine record 2.

I have compared these records as molfiles. I have compared the SMILES strings (below).

CN\C(NCCSCC1=CC=C(CN(C)C)O1)=C/[N+]([O-])=O
CN\C(NCCSCC1=CC=C(CN(C)C)O1)=C/[N+]([O-])=O

I have compared the InChIs and the InChIKeys.

InChI=1S/C13H22N4O3S/c1-14-13(9-17(18)19)15-6-7-21-10-12-5-4-11(20-12)8-16(2)3/h4-5,9,14-15H,6-8,10H2,1-3H3/b13-9+
InChI=1S/C13H22N4O3S/c1-14-13(9-17(18)19)15-6-7-21-10-12-5-4-11(20-12)8-16(2)3/h4-5,9,14-15H,6-8,10H2,1-3H3/b13-9+

VMXUWOKSQNHOCA-UKTHLTGXSA-N
VMXUWOKSQNHOCA-UKTHLTGXSA-N

Try as I might, I don’t see a difference between these structures. Why were they not deduplicated? This leads to the question of how many more duplicates are in the database, and why. I have no idea. It is just an observation.
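
The SIMPLE collapsing described earlier in this post can be sketched in a few lines of Python. This is only an illustration of grouping records by InChIKey, not a description of how the NPC Browser or any toolkit actually deduplicates. The record layout and the function names `group_by_inchikey` and `find_duplicates` are assumptions made for the example; the InChIKeys are the two shown above. As already noted, this approach inherits the limits of InChI tautomer handling.

```python
from collections import defaultdict

# Hypothetical record layout: (record_id, name, inchikey).
# The InChIKeys are the ones listed above for the two Ranitidine records.
records = [
    ("1/2", "Ranitidine", "VMXUWOKSQNHOCA-UKTHLTGXSA-N"),
    ("2/2", "Ranitidine", "VMXUWOKSQNHOCA-UKTHLTGXSA-N"),
]


def group_by_inchikey(records):
    """Collect record ids under their InChIKey, preserving input order."""
    groups = defaultdict(list)
    for record_id, _name, inchikey in records:
        groups[inchikey].append(record_id)
    return dict(groups)


def find_duplicates(records):
    """Return only the InChIKeys shared by more than one record."""
    return {
        key: ids
        for key, ids in group_by_inchikey(records).items()
        if len(ids) > 1
    }


duplicates = find_duplicates(records)
for key, ids in duplicates.items():
    print(f"{key} is shared by records {', '.join(ids)}")
```

Run over a whole export of a database, a check like this would give a first count of exact duplicates, which is the question I am asking above.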

Confusing Search Results in the NPC Browser

For the past few days I have been in Prague at the PharmSciFair meeting. Beautiful city (with a little too much graffiti on the glorious architecture), good meeting and way too long a flight. I am traveling alone and despite the networking the disrupted sleep patterns have me working on various projects as I struggle to get more than 4 hours of sleep a night. The joys of sleep deprivation include productivity! After a break from looking at the NPC Browser and my earlier commentaries I did an update of the browser when I restarted it last night as Sean’s Collabchem blog has been keeping me updated on new changes.  Sean had informed me in two separate posts (X, Y) about the new disclaimer that states:

We are very much aware of a number of issues with the database and software. Please bear with us as we work meticulously to resolve them. Below are some of the specific issues we are addressing in the next update (which is likely to be version 1.1.0) of the software in a couple of weeks:

  • A new database release that incorporates a number of improvements to compound records (e.g., stereochemistry annotations, 2D layout, etc.).
  • There exists an indicator associated with each compound record to notify whether the record has or has not been manually curated. If the record has been curated, curators are acknowledged with proper attribution.
  • A simple curation mechanism is integrated with each compound record so as to facilitate error reporting at finer resolutions (e.g., on an attribute basis).
  • Access to all curated records can be done through a single filter mechanism.
  • About ~2,000 QC LC/MS spectra are available for the “NPC screening” subset, courtesy of NCGC’s Bill Leister and the Analytical Chemistry group.
  • Other minor bug fixes and enhancements.

I started to poke around looking for the new features and discuss some below. Some definite moves in the right direction! Like it. I’ll comment on them in a separate post.

As I played more with the system and tried some new searches I started to get very concerned with the results I received. For example, a search on Taxol gives TWO answers. One is Taxol and the other is Ixabepilone, see below. This is weird. They are NOT the same structure at all, so why would a search on one compound name bring back a separate “drug”, which is really what the browser is supposed to provide us access to? The original paper reports “the creation of a definitive, complete, and non-redundant list of all approved molecular entities”. Certainly the two compounds are non-redundant (compare Taxol and Ixabepilone). My first thought is that the search is looking for associated information that has been attached to the compound somehow. I found it under the therapeutic tab where it says: “Like taxol, Ixabepilone binds to the αβ-tubulin heterodimer subunit.” If it’s that subtle, it will likely give rise to some very interesting challenges (see below). It will mean that I might search for a drug and ANY mention of that drug will retrieve hits. I expect that most people would expect to retrieve the drug itself, not every mention of that drug. Maybe I’m wrong.

To take this to an extreme, let’s search for “Manganese” and see what we get in the NPC Browser. For one, we see elemental manganese as the “compound” but associated with the label for the ionic form.

We also get a LOT of organic compounds; one is shown below. This is not an inherently obvious result, but maybe it is exactly as it’s expected to be.

Of the 17 compounds retrieved with a search on Manganese, only 5 actually have manganese in the formula. Do you find these results confusing? I would expect a synonym-only search to retrieve just Mn++ (and maybe Mn if that is distinguished from Mn++ as a drug).
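
To make the difference concrete, here is a toy sketch in Python of a search across all text versus a synonym-only search, plus a check for whether an element actually appears in a formula. I do not know how the NPC Browser search is implemented, so this is only my guess at the behavior seen above. The records are invented for the example (apart from the Ixabepilone sentence quoted earlier), and all function names are hypothetical.

```python
import re

# Toy records, invented for illustration; this is NOT NPC Browser data.
RECORDS = [
    {"synonyms": ["Manganese"], "formula": "Mn",
     "notes": "Elemental manganese."},
    {"synonyms": ["Compound X"], "formula": "C10H12N2O",
     "notes": "Hypothetical organic compound whose description mentions manganese."},
    {"synonyms": ["Taxol"], "formula": "C47H51NO14",
     "notes": "Hypothetical description text."},
    {"synonyms": ["Ixabepilone"], "formula": "C27H42N2O5S",
     "notes": "Like taxol, Ixabepilone binds to the αβ-tubulin heterodimer subunit."},
]


def full_text_search(term):
    """Match the term anywhere in the synonyms or the free-text notes."""
    t = term.lower()
    return [
        r for r in RECORDS
        if any(t in s.lower() for s in r["synonyms"]) or t in r["notes"].lower()
    ]


def synonym_search(term):
    """Match only records that carry the term as a complete synonym."""
    t = term.lower()
    return [r for r in RECORDS if any(t == s.lower() for s in r["synonyms"])]


def contains_element(formula, symbol):
    """True if the element symbol appears in the molecular formula."""
    return symbol in re.findall(r"[A-Z][a-z]?", formula)


def names(records):
    """First synonym of each record, used here as a display name."""
    return [r["synonyms"][0] for r in records]


print(names(full_text_search("Taxol")))   # any mention of the name
print(names(synonym_search("Taxol")))     # the drug itself only
hits = full_text_search("Manganese")
print(names([r for r in hits if contains_element(r["formula"], "Mn")]))
```

In this toy set the full-text search on Taxol also pulls in Ixabepilone because of the sentence in its notes, while the synonym-only search returns the single record most people would expect.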

Review of NCGC Dataset in the NPC Browser Finished

For the past couple of weeks I have been looking at the NPC browser and the dataset contained within it. I am using it as an example of what type of data is finding its way into the public domain for use by Life Scientists. I had the “opportunity” to take a couple of LONG flights to and from Europe last week and late nights in hotels. During the trip I finished my review of the data. This does NOT mean that I have a fully curated dataset …no chance. That would take a few weeks to assemble! However, it is enough data to insert some of the conclusions into a paper that has just returned from review as well as provide data for a paper presently being assembled. With that said I’m unlikely to report much more on the data until that paper is through review.

What I can say is that the dataset does not seem to align with a lot of the comments in the original paper listed below.

R. Huang, N. Southall, Y. Wang, A. Yasgar, P. Shinn, A. Jadhav, D.-T. Nguyen, C. P. Austin. The NCGC Pharmaceutical Collection: A Comprehensive Resource of Clinically Approved Drugs Enabling Repurposing and Chemical Genomics. Science Translational Medicine, 2011; 3 (80): 80ps16 DOI: 10.1126/scitranslmed.3001862

The data has hardly been curated and many of the suggested heuristics applied to the assembly of the dataset failed, based on what came through in the set that was issued. One of my favorite “drugs” in the screening set is shown below. I doubt Mn2+ is easily marketed as a drug, and having Mn2+ labeled as Selenium oxide, cadmium salt (1:1) seems a little strange. Having it labeled as Strontium tetraborate or barium tetraborate seems just as weird. This is one of many…many others will be discussed in a publication presently in development. Watch this space.

One of the "drugs" from the HTS Screening Set in the NPC Browser

Support for Common Compounds in the NPC Browser. Data Quality Part 3

Time for bed but a simple observation. There is no reason that potassium dichromate cannot be accurately represented in a database. We do it on ChemSpider with no issue here. But for some reason it equates to Cr(I) hydroxide in the NPC Browser? Ummm….nope. Definitely not.

The structure for potassium dichromate....it is not Cr(I) hydroxide

FREE ACCESS A quality alert and call for improved curation of public chemistry databases

Our publication “A quality alert and call for improved curation of public chemistry databases” has been published in Drug Discovery Today. The abstract is given below.

This article has generated quite a response. It has produced a lot of email traffic, both positive and negative (mostly positive, I am glad to say), and the response has been sufficient that the article has now been made Free Access here. Your comments are welcome.

ABSTRACT 

“In the last ten years, public online databases have rapidly become trusted valuable resources upon which researchers rely for their chemical structures and data for use in cheminformatics, bioinformatics, systems biology, translational medicine and now drug repositioning or repurposing efforts. Their utility depends on the quality of the underlying molecular structures used. Unfortunately, the quality of much of the chemical structure-based data introduced to the public domain is poor. As an example we describe some of the errors found in the recently released NIH Chemical Genomics Center ‘NPC browser’ database as an example. There is an urgent need for government funded data curation to improve the quality of internet chemistry and to limit the proliferation of errors and wasted efforts.”

Posted on September 13, 2011 in Publications and Presentations

The story behind one publication – Or how not to win friends and influence people

This blog post is co-authored by Sean Ekins and me. We survived the challenges of getting this article published together, so we’ll share the blog post too!

We recently published an editorial in Drug Discovery Today. It is entitled “A Quality Alert and Call for Improved Curation of Public Chemistry Databases” (http://dx.doi.org/10.1016/j.drudis.2011.07.007). The abstract is below:

“In the last ten years, public online databases have rapidly become trusted valuable resources upon which researchers rely for their chemical structures and data for use in cheminformatics, bioinformatics, systems biology, translational medicine and now drug repositioning or repurposing efforts. Their utility depends on the quality of the underlying molecular structures used. Unfortunately, the quality of much of the chemical structure-based data introduced to the public domain is poor. As an example we describe some of the errors found in the recently released NIH Chemical Genomics Center “NPC browser” database as an example. There is an urgent need for government funded data curation to improve the quality of internet chemistry and to limit the proliferation of errors and wasted efforts.”

We certainly welcome your feedback on the article if you have read it. We have the PDF of the final published form of the article and will be able to circulate a limited number to interested parties (please contact us here in the comment section and provide emails).

Like everything we do in this complex-connected-world there is a long and interesting story behind this article. We would like to share it because we think it will be of interest and if nothing else it shows how difficult it is to get an important alert out there. OK, so it’s not on the scale of an asteroid coming towards earth in the next hour but the implications are far reaching and people need to know now to avert future scientific errors. Prior to our article being accepted for publication in DDT the manuscript had taken an interesting trip. Here it is in all its gory detail.

In a conversation one day Sean asked Tony whether or not he had ever written up the observations regarding the quality of data that he had been investigating, and blogging about, for almost 4 years. Tony commented that no, he’d considered it but had never got to it. Sean suggested that there was likely a wider audience for the data being gathered than just the readers of the blog. Of course he was right…how many people would know Tony’s blog versus the number of people who might read an article in a journal? We have co-authored several articles in recent years and Sean has been a keen recipient of many of the curated datasets that Tony has assembled in order to use them in modeling studies. It was natural that we work together on a manuscript regarding the quality of data in Public Domain databases serving chemistry. We thought everybody would want to hear about it. How wrong we were!

We put an initial manuscript together and submitted it in late December 2010 as a Christmas present to the journal Science. We aimed BIG (why not? It was of general interest as molecule databases are proliferating) as we thought that the issue we had identified was an important one and urgent attention was needed. Simply put, millions of our tax dollars are invested in building public domain databases that are funded by grants, both in the US and elsewhere (PubChem, etc.). Surely the quality of chemical structure data is important to everyone? Get a structure wrong and it’s no longer the molecule you say it is, and that confuses everyone. The response from Science was “Thank you for submitting your manuscript “A Quality Alert for Chemistry Databases” to Science. We discussed the manuscript extensively-but unfortunately we will not be able to publish it. Although you have framed this issue as a data management and/or chemical complexity problem, it seems to us that it’s primarily economics-quality control is not cost-free.” One rejection down.

Now if you were in our shoes what would you do next? That is right…we chose next to submit to Nature, and the paper was again rejected rather quickly. So we regrouped and then chose to submit to one of the top Open Access journals (PLoS Computational Biology), where it was accepted as being of interest and sent out for review (we were hopeful). We received some good feedback from the reviewers, some of which is abstracted below.

Reviewer #1: This manuscript brings forward a critical issue of chemical data quality in publicly available databases of biologically active molecules that has become available relatively recently.  …the authors should spend more time and give more examples from sister disciplines as well as illustrate how the use of wrong structures (or wrong data) affects molecular modeling/cheminformatics investigations. …the authors should provide compelling examples when using wrong structures without prior data curation led to erroneous models or predictions.
In summary, I believe that providing more screaming examples as to how data curation impacts the outcome of scientific research in Cheminformatics, medicinal chemistry, chemical biology, etc., besides stating the obvious (for any scientific discipline) need to be rigorous and accurate when assembling and curating the data, will add significantly to the appeal of this highly relevant, potentially influential, and timely publication.

Reviewer #2: The quality of data freely available at the internet is more and more becoming a crucial issue in the medicinal chemistry community. This no longer only affects academic institutions which cannot afford paying for commercial databases, but also pharmaceutical industry, which is dumping these data into their in house data warehouses. However,  up to now the issue of quality was almost exclusively addressed from the side of biological data, as everyone assumed that a chemical structure is unambiguous, easy to check and thus the probability of errors is quite low. In addition, most of the data are provided by the owners or come from literature sources, where one can assume that a rigorous quality check has been performed.  In light of this, the present study is a highly valuable contribution to make the community aware of the fact that this might not be the case.… I propose to add several examples where a structure is represented indeed as wrong structure in that way that it would lead to a wrong logP, wrong TPSA or wrong number of H-bond acceptors.

Reviewer #3: This paper focuses on database errors in chemistry-based resources, specifically chemical names being associated with incorrect chemical structures. It serves as a firm reminder to scientists to check their facts when working with freely available data sources. It also highlights issues of data aggregation and the flow of chemistry data between resources.…No recent clear examples of publications were provided where the data obtained from a public resource had an incorrect “dramatic effect” on results or gave “misleading predictions”.

One might even suggest that it seemed more like an “infomercial” geared towards favorable treatment towards the author’s public data resource than an actual scientific publication.  There seems little point to publish this work: without systematic study to show the extent of the issue; its effect on the literature; attempts to help identify the source errors (Lack of rigorous data exchange standards? File conversion issues? Disagreement over the actual chemical structure {as a function of time}?); strong, accurate and exemplary examples to demonstrate the issue; and suggestions on ways to remediate the situation.  As this paper stands, without a serious attempt to address the real issues, this reviewer recommends rejection of this publication.

The paper was rejected. At the same time as this rejection the NCGC data collection was released with the NPC Browser with much fanfare in Science Translational Medicine and Sean passed the dataset along to Tony for review of the data. It was a timely occurrence and a PERFECT example of the issues with data quality that we had been writing about in the articles rejected by all the major journals to this point. We commented back to the PLoS Computational Biology editor as follows “When this perspective was written we wrote it in a manner that it was not a research paper per se. I believe that we can address ALL comments from the reviewers using data that has been assembled in the past two weeks as a result of examining the dataset released by the NIH’s NCGC team. Some very basic comments are made below and I would expand on these issues in the edits to the manuscript. If you could please review the blogposts below and let me know whether you would accept an edited submission. We believe that our review of this particular dataset is a perfect example of the alert to data quality that we are pointing to and gives direct examples. Thank you.

http://www.chemconnector.com/2011/04/28/reviewing-data-quality-in-the-ncgc-pharmaceutical-collection-browser/

http://www.chemconnector.com/2011/05/02/what-is-a-drug-data-quality-in-the-ncgc-pharmaceutical-collection-browser-part-2/

http://www.chemconnector.com/2011/05/08/data-quality-in-the-ncgc-pharmaceutical-collection-browser-part-4/

Unfortunately, we could not sway the editor to accept the commentary despite inserting a lot of details about the NCGC dataset quality and fully addressing, we believe, all reviewers’ comments.

Since the manuscript about the NPC Browser had been published in STM we thought it was appropriate to draw attention to the data quality in the original publication in the same journal. With a newly updated manuscript we then submitted to STM. That rejection was fast: “Thank you for submitting your manuscript “A Quality Alert and Call for Improved Curation of Public Chemistry Databases” to Science Translational Medicine. During the initial screening process your manuscript did not receive a high enough priority rating to warrant in-depth review. We are therefore notifying you that your submitted manuscript has been rejected.”

They did not want to allow us to report on the issue that was perfectly represented in an article published in their journal and, in our opinion, had been inadequately reviewed in regards to the quality of data in the database reported and highlighted in their journal. We did our best to explain the connection of our article to the original paper regarding the NPC Browser but made no progress in having the decision reconsidered.

At this point we made a decision to submit as an editorial to Drug Discovery Today. We have published in this journal before because it has a wide readership in both the industry and elsewhere and covers chemistry and informatics topics. We have also had very positive experiences with both the review and publishing processes, which were fair and fast. In addition, this journal is outside of any potential “political circles of influence”, being Europe-based and with a broad international editorial board and audience. It was suggested to us that we were struggling to get the article published as we were highlighting potentially sensitive issues. We’re not saying we agree…but it was suggested. We submitted the balanced article in the form detailing our findings with the NPC Browser dataset, received very positive feedback and the rest is history. The article is now out. Finally. Eight months after it was submitted in its original form.

Publishing this article has been a very enlightening experience. With one journal we had two positive reviewers with suggested changes and one negative reviewer. We addressed all suggested changes but it was still rejected. We submitted to three other journals in total and were rejected outright. Our article is, we believe, an appropriate and well-founded editorial about the quality of data. The details about the NPC browser and NCGC Dataset were knitted into the article based on timing…it was the last public domain dataset to be released after we had written the original article. We have been accused of having something against the NCGC database.  It is but one representative of the data quality issue we have reported on. Many tens of hours have been spent reviewing the data quality and providing feedback and guidance on our findings. We are not against any database…we are for improving data quality.

What has it taught us that might be applicable to other people’s publishing experiences?

1. It appears that pointing out an important problem and offering solutions does not win you friends. Possibly it was too contentious and this is the reason it did not get published in some of the journals it was submitted to? Or maybe it was badly written…you can be the judge.

3. It is possible that doing scientific analyses of database quality is not something reviewers or journals want to hear about because they publish articles that use such databases to create new analyses and therefore new papers that they in turn publish. If you turn around and tell them that these very databases have errors then what is the quality of the output? Certainly not 100% trustworthy. But then again what is?

4. Many scientists TRUST chemistry and biology databases  that are so often reused, reanalyzed and integrated with new cheminformatics or bioinformatics tools. The authors of such articles do not appear to analyze for problems caused by poor data quality or hypotheses that are incorrect due to poor underlying data.

5. We believe that raising an alternative viewpoint that challenges the status quo, even in an area that one would think was solved (e.g. structure quality), does not get a fair or balanced airing in the mainstream media.

Thankfully, the wider dissemination of the DDT editorial should get more people thinking about it. We have prepared an extensive manuscript that provides an even deeper analysis of the topic and we will be submitting that to journals. If any journal editors are reading this blog post and would like to invite us to submit the complete manuscript we’d love to hear from you.

This is, in our opinion, a newsworthy topic. The very fact that the NPC browser and underlying datasets had issues that illustrated the problems we had alerted Science, Nature and PLoS Computational Biology etc. to months before, attests to that. We hope that those journals will agree that this is a serious issue. Who do we turn to in order to report a problem that affects everyone in the scientific community? The multiple submission processes, with reformatting and reworking, consumed many hours. The blog is certainly a simpler, openly reviewed format for reporting.

It should be noted, in the interest of full disclosure, that Sean is on the editorial board of the DDT journal (which comes with no financial reward) but that Tony submitted the article. Also, neither Sean nor I have a financial interest in improving database quality, but we believe it is the right thing to do.

 

Encouraging Collaboration in Washington as a Hub for Chemistry Databases

On August 25/26 I will be attending the 5th Meeting on U.S. Government Chemical Databases and Open Chemistry. I will have the opportunity to spend time with people I appreciate for the contributions they are making to chemistry: Martin Walker, JC Bradley, Andy Lang, Markus Sitzmann, Ann Richard, Frank Switzer, Evan Bolton, Marc Zimmermann, Wolf Ihlenfeldt, Steve Heller, John Overington, Noel O’Boyle, and many others. It is surely going to be an excellent meeting. The agenda is given here.

Some of the people listed above are associated with “Washington-based databases”: databases that are developed in or around Washington by government-funded organizations (the FDA, NIH, NCBI/NLM, NCI and NIST). Other government-funded databases that are not Washington-based are also represented, namely those of the EPA and CDC. If you are not sure what all those acronyms stand for, here you go.

FDA – Food and Drug Administration

NIH – National Institutes of Health

NCBI/NLM – National Center for Biotechnology Information/National Library of Medicine

NCI – National Cancer Institute

EPA – Environmental Protection Agency

CDC – Centers for Disease Control and Prevention

NIST – National Institute of Standards and Technology

One organization with a chemistry database that is conspicuous by its absence is the NCGC, whose data collection is contained in the NPC Browser. I’ve written a lot about this one on this blog.

NCGC – NIH Chemical Genomics Center

I am still hoping to talk to some members of the team if they attend the meeting.

There will be a LOT of government databases represented at this meeting. I have experience with many of the databases provided by these institutions. The DSSTox database is one of the most highly curated databases based on my review of the data. The NCI resolver is an excellent resource with good quality data in terms of the accuracy of name-structure relationships.

The various databases are developed independently of each other. True, some of the databases contain content from some of the other databases but, as far as I can tell, there is not much collaboration in terms of coordinated curation of data. What would it be like if each of these organizations participated in a roundtable discussion to agree on a process by which to collaboratively validate and curate the data, once and for all? Maybe this meeting can catalyze such a discussion. I would encourage the organizations to take advantage of other data sources that can share their data. ChEBI/ChEMBL is one example! If these various groups coordinated their work then the result could be a dataset of massively improved quality to share across the databases and across the community. If this work were done then the group that assembled the NPC Browser would likely have had a lot less work to do in terms of assembling the data.

The various database providers should certainly have provided clean, curated data for many of the top known drugs. While working on a manuscript reviewing the quality of public domain chemistry databases I assembled a table of 25 of the top selling drugs in the US and checked the data quality in the NPC Browser relative to a gold standard set. The assembly of the data will be discussed in its entirety in a later publication.

25 of the Top Selling Drugs in the USA - Data Quality in the NPC Browser

The errors listed in the table are:

1 Correct skeleton, No stereochemistry
2 Correct skeleton, Missing stereochemistry
3 Correct skeleton, Incorrect stereochemistry
4 Single component of multicomponent structure
5 Multiple components for single component structure
6 No structure returned based on Name Search
7 Incorrect skeleton
8 Multiple structures based on name search
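
For readers who want to tabulate a review like this themselves, the eight error codes above could be encoded roughly as follows. This is only a sketch: the class name, the function names and the placeholder records are assumptions made for illustration, and they are not the process or the data behind the table.

```python
from collections import Counter
from enum import IntEnum


class StructureError(IntEnum):
    """The eight error codes listed above (member names are hypothetical)."""
    NO_STEREOCHEMISTRY = 1
    MISSING_STEREOCHEMISTRY = 2
    INCORRECT_STEREOCHEMISTRY = 3
    SINGLE_COMPONENT_OF_MULTICOMPONENT = 4
    MULTIPLE_COMPONENTS_FOR_SINGLE = 5
    NO_STRUCTURE_FROM_NAME_SEARCH = 6
    INCORRECT_SKELETON = 7
    MULTIPLE_STRUCTURES_FROM_NAME_SEARCH = 8


def tally_errors(records):
    """Count how often each error code occurs.

    `records` maps a drug name to a list of error codes;
    an empty list means the structure matched the reference set.
    """
    counts = Counter()
    for codes in records.values():
        for code in codes:
            counts[StructureError(code)] += 1
    return counts


def fraction_with_errors(records):
    """Fraction of drugs that have at least one error code."""
    if not records:
        return 0.0
    flagged = sum(1 for codes in records.values() if codes)
    return flagged / len(records)


# Placeholder records only; these are NOT results from the table.
example_records = {
    "placeholder_drug_a": [2],
    "placeholder_drug_b": [],
    "placeholder_drug_c": [4, 8],
}

for error, count in sorted(tally_errors(example_records).items()):
    print(int(error), error.name, count)
print(fraction_with_errors(example_records))
```

Keeping the codes in one enumeration means that a review done by several curators would at least use the same labels, which is a small version of the coordination argued for in this post.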

 

Clearly there are a lot of errors in the structures associated with 25 of the best selling drugs on the US market. These should be the easy ones to get right as they are so well known!!! Collaboration between the domain’s top database providers would almost certainly have helped. This would not necessarily be an issue of meshing technologies but of agreeing on a common goal to have the highest quality data available. Since the government puts so much money into the development of these databases it would be appropriate to have some oversight and a push for aligning efforts. Collaboration is essential!

With that in mind…a shameless pointer to how Sean Ekins, Maggie Hupcey and I BELIEVE in the need for collaboration…our book. If we can encourage others working on the government chemistry databases to adopt active collaborative approaches, wonderful things could happen.

Collaborative Computational Technologies for Biomedical Research

 

 
