
Category Archives: Data Quality

How Many Varieties of Bosutinib Made it to Public Domain Databases

Recently Derek Lowe blogged about Bosutinib and the confusion, especially among suppliers, regarding what that compound actually is. The short version of the story is that TWO different compounds have been provided by suppliers under the name Bosutinib. He commented in the blog that “Odds are, some batch of the wrong stuff has been resold through the supplier community since at least 2006 – this sort of thing goes on all the time.”

Ryan Sasaki from ACD/Labs wrote a follow-up blog post that contains the structures of both forms that have been sold. I was wondering which of these was most prominent in the public domain databases. I chose to check this by drawing the compounds, generating the InChIs and InChIKeys, and seeing what showed up.
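For anyone wanting to repeat this kind of check, here is a minimal sketch using the open source RDKit toolkit (assuming it is installed). The SMILES string is a placeholder rather than either Bosutinib form; the point is simply how an InChI and InChIKey can be generated from a drawn structure and then used as search terms.

# Minimal sketch: generate an InChI and InChIKey for a structure
# (assumes RDKit is installed; the SMILES is a placeholder, not Bosutinib)
from rdkit import Chem

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # placeholder structure (aspirin)
mol = Chem.MolFromSmiles(smiles)

inchi = Chem.MolToInchi(mol)        # full standard InChI string
inchikey = Chem.MolToInchiKey(mol)  # 27-character hashed key

print(inchi)
print(inchikey)  # paste this key into a database or web search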

The two structures for Bosutinib

A search for both InChIKeys gives 162 hits for the compound on the left and ZERO hits for the compound on the right. See below.

Bosutinib Structure 1

Bosutinib Structure 2

It turns out that searching by InChIKey for the incorrect structure (the one on the right, copied from Ryan’s blog post) turns up NO hits. Only the correct structure made it into the public domain databases, it seems. INTERESTING!

 

 

Posted on June 4, 2012 in Data Quality

 

On the Accuracy of Chemical Structures Found on the Internet

A poster presented at the ACS Meeting in San Diego with the UNC Chapel Hill group…

On the Accuracy of Chemical Structures Found on the Internet

The Internet has been widely lauded as a great equalizer of information access.  However, the absence of any central authority on content places the burden on the end-user to verify the quality of the information accessed.  We have examined the accuracy of the chemical structures of ca. 200 major pharmaceutical products that can be found on the internet.  We have demonstrated that while erroneous structures are commonplace, it is possible to determine the correct structures by utilizing a carefully defined structure validation workflow.  In addition, we and others have shown that the use of un-curated structures affects the accuracy of cheminformatics investigations such as QSAR modeling. Furthermore, models built for carefully curated datasets can be used to correct erroneously reported biological data.  We posit that chemical datasets must be carefully curated prior to any cheminformatics investigations.  We summarize best practices developed in our groups for data curation.

 

 

The long-term cost of inferior database quality

Our recent Drug Discovery Today article

One of the highlights of the past year has been my continued collaboration with Sean Ekins on the issues of data quality, data modeling and the application of mobile technologies. Recently our commentary on the long-term cost of inferior database quality was published in Drug Discovery Today and is available online here.

 

Posted on December 30, 2011 in Data Quality, General Communications

 

Terminal Dimethyl means Death by Methane twice

When writing talks I try to find interesting (and, where possible, fun) examples of how challenging managing chemistry data is for all of us who maintain tens of thousands, or in our case millions, of compound pages for the community to use. I have told many stories over the past few years about the challenges we collectively face with data quality and how errors flow between our databases unabated. My latest example, used in my recent talk at the EBI (ChemSpider – An Online Database and Registration System Linking the Web), is the structure known as Terminal Dimethyl, presently on PubChem, DrugBank, Wolfram Alpha and PDBe. It was originally inherited into ChemSpider too but has since been deprecated. I left a comment on DrugBank a couple of weeks ago but it hasn’t been published yet…generally such errors are removed VERY quickly by the DrugBank hosts. I added a comment to Wolfram Alpha and received a canned response, with no changes to the record as yet.

There ARE ways to communally resolve these issues and I will blog about that shortly.

 

Posted on October 31, 2011 in Data Quality, Humor, Quality and Content

 


Why I do not trust web statistics

Those of us who host websites like to use stats. We like to review our traffic and our hits, and some of us follow our Alexa ranking or our Compete ranking. I have blogged previously about Alexa applied to ChemSpider and Compete applied to ChemSpider. The Alexa and Compete rankings for the site are available online. These are PUBLIC rankings, and the value of public rankings such as these is the relative comparison they allow…according to Compete, ChemSpider’s unique users are about half those of PubChem and nine times those of eMolecules.

Whether this is true or not is a whole different question. Why? Because I simply don’t believe most public stats on websites like this. For accurate measures of traffic I would choose internal Google Analytics code over any of the public website measures.

Here’s an example of how stats get distorted. I’d like to celebrate the result I am about to report but I DOUBT it’s true! Yesterday I gave a talk to a group of students and faculty at the University of Arkansas, Little Rock. The presentation is below.

 

Based on feedback from my host, Bob Belford, the presentation was well received. I enjoyed giving the presentation and used the BigBlueButton system to do it, with Skype for audio, and then shared my desktop from Skype for a live demo of ChemSpider. However, I doubt it’s my best ever presentation! Within 5 hours I had received an email from Slideshare saying that it was being talked about on LinkedIn, that it was one of the top talked-about presentations, and that they had put it on the Slideshare home page! Overnight it had registered >10,000 views.

My Slideshare presentation on the Slideshare Home Page

Now, I wish that 10,000 people had looked at the presentation! That’s what it says now! But I believe the stats are likely inflated: people have probably viewed the Slideshare home page 10,000 times while my presentation was featured in the “Hot on LinkedIn” section. I may be wrong…maybe it is that popular! And if so, thanks for reading.

 

 
 


The story behind one publication – Or how not to win friends and influence people

This blog post is co-authored by Sean Ekins and me…we survived the challenges of getting this article published together, so we’ll share the blog post together too!

We recently published an editorial in Drug Discovery Today. It is entitled “A Quality Alert and Call for Improved Curation of Public Chemistry Databases” (http://dx.doi.org/10.1016/j.drudis.2011.07.007). The abstract is below:

In the last ten years, public online databases have rapidly become trusted, valuable resources upon which researchers rely for their chemical structures and data for use in cheminformatics, bioinformatics, systems biology, translational medicine and now drug repositioning or repurposing efforts. Their utility depends on the quality of the underlying molecular structures used. Unfortunately, the quality of much of the chemical structure-based data introduced to the public domain is poor. As an example we describe some of the errors found in the recently released NIH Chemical Genomics Center “NPC browser” database. There is an urgent need for government-funded data curation to improve the quality of internet chemistry and to limit the proliferation of errors and wasted efforts.

We certainly welcome your feedback on the article if you have read it. We have the PDF of the final published form of the article and will be able to circulate a limited number of copies to interested parties (please contact us here in the comment section and provide your email address).

Like everything we do in this complex, connected world, there is a long and interesting story behind this article. We would like to share it because we think it will be of interest and, if nothing else, it shows how difficult it is to get an important alert out there. OK, so it’s not on the scale of an asteroid heading towards Earth in the next hour, but the implications are far-reaching and people need to know now to avert future scientific errors. Prior to our article being accepted for publication in DDT the manuscript had taken an interesting trip. Here it is in all its gory detail.

In a conversation one day Sean asked Tony whether he had ever written up the observations regarding data quality that he had been investigating, and blogging about, for almost four years. Tony commented that no, he’d considered it but had never got around to it. Sean suggested that there was likely a wider audience for the data being gathered than just the readers of the blog. Of course he was right…how many people know Tony’s blog compared with the number who might read an article in a journal? We have co-authored several articles in recent years and Sean has been a keen recipient of many of the curated datasets that Tony has assembled for use in modeling studies. It was natural that we work together on a manuscript regarding the quality of data in public domain databases serving chemistry. We thought everybody would want to hear about it – how wrong we were!

We put an initial manuscript together and submitted it in late December 2010 as a Christmas present to the journal Science. We aimed BIG (why not? It was of general interest, as molecule databases are proliferating) because we thought the issue we had identified was an important one needing urgent attention. Simply put, millions of our tax dollars are invested in building grant-funded public domain databases, both in the US and elsewhere (PubChem etc.). Surely the quality of chemical structure data is important to everyone? Get a structure wrong and it’s no longer the molecule you say it is, and that confuses everyone. The response from Science was “Thank you for submitting your manuscript “A Quality Alert for Chemistry Databases” to Science. We discussed the manuscript extensively – but unfortunately we will not be able to publish it. Although you have framed this issue as a data management and/or chemical complexity problem, it seems to us that it’s primarily economics – quality control is not cost-free.” One rejection down.

Now, if you were in our shoes, what would you do next? That’s right…we chose to submit to Nature, and the paper was again rejected rather quickly. So we regrouped and then chose to submit to one of the top Open Access journals (PLoS Computational Biology), where it was judged to be of interest and sent out for review (we were hopeful). We received some good feedback from the reviewers, some of which is excerpted below.

Reviewer #1: This manuscript brings forward a critical issue of chemical data quality in publicly available databases of biologically active molecules that has become available relatively recently.  …the authors should spend more time and give more examples from sister disciplines as well as illustrate how the use of wrong structures (or wrong data) affects molecular modeling/cheminformatics investigations. …the authors should provide compelling examples when using wrong structures without prior data curation led to erroneous models or predictions.
In summary, I believe that providing more screaming examples as to how data curation impacts the outcome of scientific research in Cheminformatics, medicinal chemistry, chemical biology, etc., besides stating the obvious (for any scientific discipline) need to be rigorous and accurate when assembling and curating the data, will add significantly to the appeal of this highly relevant, potentially influential, and timely publication.

Reviewer #2: The quality of data freely available at the internet is more and more becoming a crucial issue in the medicinal chemistry community. This no longer only affects academic institutions which cannot afford paying for commercial databases, but also pharmaceutical industry, which is dumping these data into their in house data warehouses. However,  up to now the issue of quality was almost exclusively addressed from the side of biological data, as everyone assumed that a chemical structure is unambiguous, easy to check and thus the probability of errors is quite low. In addition, most of the data are provided by the owners or come from literature sources, where one can assume that a rigorous quality check has been performed.  In light of this, the present study is a highly valuable contribution to make the community aware of the fact that this might not be the case.… I propose to add several examples where a structure is represented indeed as wrong structure in that way that it would lead to a wrong logP, wrong TPSA or wrong number of H-bond acceptors.

Reviewer #3: This paper focuses on database errors in chemistry-based resources, specifically chemical names being associated with incorrect chemical structures.  It serves as a firm reminder to scientists to check their facts when working with freely available data sources.  It also highlights issues of data aggregation and the flow of chemistry data between resources.…No recent clear examples of publications were provided where the data obtained from a public resource had an incorrect “dramatic effect” on results or gave “misleading predictions”.

One might even suggest that it seemed more like an “infomercial” geared towards favorable treatment towards the author’s public data resource than an actual scientific publication.  There seems little point to publish this work: without systematic study to show the extent of the issue; its effect on the literature; attempts to help identify the source errors (Lack of rigorous data exchange standards? File conversion issues? Disagreement over the actual chemical structure {as a function of time}?); strong, accurate and exemplary examples to demonstrate the issue; and suggestions on ways to remediate the situation.  As this paper stands, without a serious attempt to address the real issues, this reviewer recommends rejection of this publication.

The paper was rejected. At the same time as this rejection, the NCGC data collection was released with the NPC Browser, with much fanfare, in Science Translational Medicine, and Sean passed the dataset along to Tony for review. It was a timely occurrence and a PERFECT example of the data quality issues we had been writing about in the manuscript rejected by all the major journals up to this point. We commented back to the PLoS Computational Biology editor as follows: “When this perspective was written we wrote it in a manner that it was not a research paper per se. I believe that we can address ALL comments from the reviewers using data that has been assembled in the past two weeks as a result of examining the dataset released by the NIH’s NCGC team. Some very basic comments are made below and I would expand on these issues in the edits to the manuscript. If you could please review the blogposts below and let me know whether you would accept an edited submission. We believe that our review of this particular dataset is a perfect example of the alert to data quality that we are pointing to and gives direct examples. Thank you.”

http://www.chemconnector.com/2011/04/28/reviewing-data-quality-in-the-ncgc-pharmaceutical-collection-browser/

http://www.chemconnector.com/2011/05/02/what-is-a-drug-data-quality-in-the-ncgc-pharmaceutical-collection-browser-part-2/

http://www.chemconnector.com/2011/05/08/data-quality-in-the-ncgc-pharmaceutical-collection-browser-part-4/

Unfortunately, we could not sway the editor to accept the commentary despite inserting a lot of detail about the quality of the NCGC dataset and fully addressing, we believe, all of the reviewers’ comments.

Since the manuscript describing the NPC Browser had been published in STM, we thought it appropriate to draw attention to the data quality issues with the original publication in the same journal. With a newly updated manuscript we then submitted to STM. That rejection was fast: “Thank you for submitting your manuscript “A Quality Alert and Call for Improved Curation of Public Chemistry Databases” to Science Translational Medicine. During the initial screening process your manuscript did not receive a high enough priority rating to warrant in-depth review. We are therefore notifying you that your submitted manuscript has been rejected.”

They did not want to allow us to report on an issue that was perfectly represented by an article published in their own journal – one that, in our opinion, had been inadequately reviewed with regard to the quality of the data in the database it reported. We did our best to explain the connection between our article and the original paper regarding the NPC Browser but made no progress in having the decision reconsidered.

At this point we decided to submit the work as an editorial to Drug Discovery Today. We have published in this journal before because it has a wide readership in industry and elsewhere and covers chemistry and informatics topics. We have also had very positive experiences with both the review and publishing processes, which are fair and fast. In addition, this journal sits outside any potential “political circles of influence”, being Europe-based with a broad international editorial board and audience. It was suggested to us that we were struggling to get the article published because we were highlighting potentially sensitive issues. We’re not saying we agree…but it was suggested. We submitted the article, in a balanced form detailing our findings with the NPC Browser dataset, received very positive feedback, and the rest is history. The article is now out. Finally. Eight months after it was submitted in its original form.

Publishing this article has been a very enlightening experience. With one journal we had two positive reviewers with suggested changes and one negative reviewer. We addressed all suggested changes but it was still rejected. We submitted to three other journals in total and were rejected outright. Our article is, we believe, an appropriate and well-founded editorial about the quality of data. The details about the NPC Browser and NCGC dataset were knitted into the article purely because of timing…it was the latest public domain dataset to be released after we had written the original article. We have been accused of having something against the NCGC database. It is but one representative of the data quality issue we have reported on. Many tens of hours have been spent reviewing the data quality and providing feedback and guidance on our findings. We are not against any database…we are for improving data quality.

What has it taught us that might be applicable to other people’s publishing experiences?

1. It appears that pointing out an important problem and offering solutions does not win you friends. Possibly it was too contentious and this is the reason it did not get published in some of the journals it was submitted to? Or maybe it was badly written…you can be the judge.

2. It is possible that scientific analyses of database quality are not something reviewers or journals want to hear about, because they publish articles that use such databases to create new analyses and therefore new papers. If you turn around and tell them that these very databases contain errors, then what is the quality of the output? Certainly not 100% trustworthy. But then again, what is?

3. Many scientists TRUST the chemistry and biology databases that are so often reused, reanalyzed and integrated with new cheminformatics or bioinformatics tools. The authors of such articles do not appear to check for problems caused by poor data quality, or for hypotheses that are incorrect due to poor underlying data.

4. We believe that raising an alternative viewpoint that challenges the status quo, even in an area one would think was solved – e.g. structure quality – does not get a fair or balanced airing in the mainstream media.

Thankfully, the wider dissemination of the DDT editorial should get more people thinking about it. We have prepared an extensive manuscript that provides an even deeper analysis of the topic and we will be submitting that to journals. If any journal editors are reading this blog post and would like to invite us to submit the complete manuscript, we’d love to hear from you.

This is, in our opinion, a newsworthy topic. The very fact that the NPC Browser and its underlying datasets had issues illustrating the problems we had alerted Science, Nature, PLoS Computational Biology and others to months before attests to that. We hope that those journals will agree that this is a serious issue. Who do we turn to to report a problem that affects everyone in the scientific community? The multiple submission processes, with reformatting and reworking, consumed many hours. The blog is certainly a simpler, openly reviewed format for reporting.

It should be noted, in the interest of full disclosure, that Sean is on the editorial board of the DDT journal – which comes with no financial reward – but that Tony submitted the article. Also, neither Sean nor I have a financial interest in improving database quality; we simply believe it is the right thing to do.

 

The Open Notebook Science Melting Point Data Book

Over the past few weeks I have been collaborating with JC Bradley and Andy Lang to develop a curated source of melting point data. JC, as usual, has eloquently detailed the story to this point over on his blog. The findings to date have been fascinating and I won’t retell them on this blog. I recommend these posts from JC Bradley, read in order…they tell the bulk of the story, spread over a number of months. Fascinating reading in terms of quality, cross-validation of data and errors in the data sources.

Chemical Information Validation Results from Fall 2010

Alfa Aesar melting point data now openly available

Validating Melting Point Data from Alfa Aesar, EPI and MDPI

Open Modeling of Melting Point Data

More Open Melting Points from EPI and other sources: on the path to ultimate curation

The quest to determine the melting point of 4-benzyltoluene

More on 4-benzyltoluene and the impact of melting point data curation and transparency

The 4-benzyltoluene melting point twist

Rapid analysis of melting point trends and models using Google Apps Scripts

 

My role has been to help out with the processing of data, curating data using some of the procedures developed while reviewing data over the past few years, and helping to source data. It’s been a great collaboration but JC and Andy have done the heavy lifting…and done it well!

The work has culminated in the release of a book, posted to Nature Precedings and available via Lulu, as JC has described here. As he said, “Now that the book has been accepted by Nature Precedings, it provides a convenient mechanism for citation via DOI, a formal author list, version control, etc. The book is also now available from LuLu.com either as a free PDF download or a physical copy. Because the book runs 699 pages (it covers 2706 unique compounds) the lowest price we could get is $30.96, which just covers printing and shipping.” It will be interesting to see whether people buy the book or simply go electronic. Time will tell.

Our Melting Point Data Book

 

 

Ongoing Comparisons between Microsoft Academic Search and Google Scholar Citations

I have been blogging on Google Scholar Citations in recent days and noticing some interesting details (1,2,3). I have been in exchanges with the Microsoft Academic Search support team on Twitter trying to collapse multiple accounts. They are helping.

I have since continued my comparison to look for differences between the two platforms. There are some very obvious differences. One GLARING example…on Google Scholar my top-cited paper has 50 citations. On Microsoft Academic Search it has 3. BIG difference!

Citations on Google Scholar

 

Citations on Microsoft Academic Search

 

Posted on August 7, 2011 in Data Quality, General Communications

 


Encouraging Collaboration in Washington as a Hub for Chemistry Databases

On August 25/26 I will be attending the 5th Meeting on U.S. Government Chemical Databases and Open Chemistry. I will have the opportunity to spend time with people I appreciate for the contributions they are making to chemistry: Martin Walker, JC Bradley, Andy Lang, Markus Sitzmann, Ann Richard, Frank Switzer, Evan Bolton, Marc Zimmermann, Wolf Ihlenfeldt, Steve Heller, John Overington, Noel O’Boyle, and many others. It is surely going to be an excellent meeting. The agenda is given here.

Some of the people listed above are associated with “Washington-based databases” – databases developed in or around Washington by government-funded organizations: the FDA, NIH, NCBI/NLM, NCI and NIST. There are also other government-funded, non-Washington-based databases represented – from the EPA and CDC. If you are not sure what all those acronyms stand for, then here you go.

FDA – Food and Drug Administration

NIH – National Institutes of Health

NCBI/NLM – National Center for Biotechnology Information/National Library of Medicine

NCI – National Cancer Institute

EPA – Environmental Protection Agency

CDC – Centers for Disease Control and Prevention

NIST – National Institute of Standards and Technology

One chemistry database conspicuous by its absence is the NCGC data collection contained in the NPC Browser. I’ve blogged a lot about that one here.

NCGC – NIH Chemical Genomics Center

I am hoping to talk to some members of the team if they attend the meeting, though.

There will be a LOT of government databases represented at this meeting. I have experience with many of the databases provided by these institutions. The DSSTox database is one of the most highly curated databases based on my review of the data. The NCI resolver is an excellent resource with good quality data in terms of the accuracy of name-structure relationships.

The various databases are developed independently of each other. True, some of the databases contain content from some of the others but, as far as I can tell, there is not much collaboration in terms of coordinated curation of data. What would it be like if each of these organizations participated in a roundtable discussion and agreed on a process by which to collaboratively validate and curate the data, once and for all? Maybe this meeting can catalyze such a discussion. I would encourage the organizations to take advantage of other data sources that can share their data – ChEBI/ChEMBL is one example! If these various groups coordinated their work, the result could be a massively improved quality dataset to share across the databases and across the community. If this work were done then the group that assembled the NPC Browser would likely have had a lot less work to do in assembling the data. The various database providers should certainly have provided clean, curated data for many of the top known drugs. While working on a manuscript reviewing the quality of public domain chemistry databases I assembled a table of 25 of the top-selling drugs in the US and checked the data quality in the NPC Browser relative to a gold-standard set. The assembly of the data will be discussed in its entirety in a later publication.

25 of the Top Selling Drugs in the USA - Data Quality in the NPC Browser

The errors listed in the table are:

1 Correct skeleton, No stereochemistry
2 Correct skeleton, Missing stereochemistry
3 Correct skeleton, Incorrect stereochemistry
4 Single component of multicomponent structure
5 Multiple components for single component structure
6 No structure returned based on Name Search
7 Incorrect skeleton
8 Multiple structures based on name search

 

Clearly there are a lot of errors in the structures associated with 25 of the best-selling drugs on the US market. These should be the easy ones to get right, as they are so well known! Collaboration between the domain’s top database providers would almost certainly have helped. This would not necessarily be an issue of meshing technologies but of agreeing on a common goal: to have the highest-quality data available. Since the government puts so much money into the development of these databases, it would be appropriate to have some oversight and a push for aligning efforts. Collaboration is essential!

With that in mind…a shameless pointer to how Sean Ekins, Maggie Hupcey and I BELIEVE in the need for collaboration…our book. If we can encourage others working on the government chemistry databases to adopt active collaborative approaches, wonderful things could happen.

Collaborative Computational Technologies for Biomedical Research

 

 


Duplicate compounds in the NPC Browser and NCGC Dataset

I am presently working on a couple of articles, book chapters and guest blog posts regarding quality in public domain chemistry databases. In so doing I have continued to work through the data contained within the NPC Browser, which I have blogged about many times before. I HAVE been adding curation comments to the data as I have worked through them and have removed inappropriately associated chemical names. Eventually it became too much of a burden relative to getting my own work done, as there are so many edits required. What I have been looking for specifically are examples of something I expected to exist in the database – failures to deduplicate. Deduplication, in terms of chemistry databases, is the collapsing together of records that represent the same chemical structure. This sounds easy but it isn’t necessarily so…consider some of the complexities of collapsing tautomers. SIMPLE collapsing can be done by generating InChIKeys and deduplicating on those, but InChI tautomer detection is imperfect and this approach will fail regularly. Most cheminformatics toolkits have their own ways of generating fingerprints to deal with this issue of deduplication.
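As a minimal sketch of that simple InChIKey-based approach (assuming the RDKit toolkit is available – this is not the NPC Browser’s actual registration logic), collapsing a list of records might look like the following. The SMILES strings here are illustrative placeholders, not entries from the database.

# Minimal sketch of InChIKey-based deduplication (assumes RDKit is installed).
# The records below are illustrative placeholders, not entries from the NPC Browser.
from collections import defaultdict
from rdkit import Chem

records = [
    ("record-1", "CC(=O)Oc1ccccc1C(=O)O"),          # aspirin, drawn one way
    ("record-2", "OC(=O)c1ccccc1OC(C)=O"),          # aspirin, drawn another way
    ("record-3", "CN1C=NC2=C1C(=O)N(C)C(=O)N2C"),   # caffeine
]

groups = defaultdict(list)
for record_id, smiles in records:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        continue  # skip structures that fail to parse
    key = Chem.MolToInchiKey(mol)  # standard InChIKey
    groups[key].append(record_id)

for key, ids in groups.items():
    if len(ids) > 1:
        print(f"Potential duplicates under {key}: {ids}")

Records that share an InChIKey are only candidates for merging, and because standard InChI tautomer handling is imperfect, genuine duplicates can also end up with different keys – which is exactly the limitation noted above.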

While browsing the database I came across Ranitidine, the active component of the well known drug Zantac. I found two records in the database. They are shown below and numbered as 1/2 and 2/2.

Ranitidine record 1.

Ranitidine record 2.

I have compared these records as molfiles. I have compared the SMILES strings (below).

CN\C(NCCSCC1=CC=C(CN(C)C)O1)=C/[N+]([O-])=O
CN\C(NCCSCC1=CC=C(CN(C)C)O1)=C/[N+]([O-])=O

I have compared the InChIs and the InChIKeys (below).

InChI=1S/C13H22N4O3S/c1-14-13(9-17(18)19)15-6-7-21-10-12-5-4-11(20-12)8-16(2)3/h4-5,9,14-15H,6-8,10H2,1-3H3/b13-9+
InChI=1S/C13H22N4O3S/c1-14-13(9-17(18)19)15-6-7-21-10-12-5-4-11(20-12)8-16(2)3/h4-5,9,14-15H,6-8,10H2,1-3H3/b13-9+

VMXUWOKSQNHOCA-UKTHLTGXSA-N
VMXUWOKSQNHOCA-UKTHLTGXSA-N

Try as I might, I don’t see a difference between these structures. Why were they not deduplicated? This leads to the question of how many more duplicates are in the database, and why. I have no idea…just an observation.
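For completeness, here is a minimal sketch of how this comparison can be scripted with the RDKit toolkit (assuming it is installed), using the two SMILES strings exactly as shown above:

# Minimal sketch: confirm the two exported Ranitidine SMILES describe the same structure
# (assumes RDKit is installed; the strings are the ones shown above).
from rdkit import Chem

smiles_1 = r"CN\C(NCCSCC1=CC=C(CN(C)C)O1)=C/[N+]([O-])=O"
smiles_2 = r"CN\C(NCCSCC1=CC=C(CN(C)C)O1)=C/[N+]([O-])=O"

mol_1 = Chem.MolFromSmiles(smiles_1)
mol_2 = Chem.MolFromSmiles(smiles_2)

print(Chem.MolToSmiles(mol_1) == Chem.MolToSmiles(mol_2))      # canonical SMILES agree
print(Chem.MolToInchiKey(mol_1) == Chem.MolToInchiKey(mol_2))  # InChIKeys agree

Both checks come back True for these strings, which is consistent with the records being duplicates.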
