The story behind one publication – Or how not to win friends and influence people


This blog post is co-authored by Sean Ekins and me. We survived the challenges of getting this article published together, so we’ll share the blog post together too!

We recently published an editorial in Drug Discovery Today entitled “A Quality Alert and Call for Improved Curation of Public Chemistry Databases” (http://dx.doi.org/10.1016/j.drudis.2011.07.007). The abstract is below:

“In the last ten years, public online databases have rapidly become trusted valuable resources upon which researchers rely for their chemical structures and data for use in cheminformatics, bioinformatics, systems biology, translational medicine and now drug repositioning or repurposing efforts. Their utility depends on the quality of the underlying molecular structures used. Unfortunately, the quality of much of the chemical structure-based data introduced to the public domain is poor. As an example we describe some of the errors found in the recently released NIH Chemical Genomics Center “NPC browser” database as an example. There is an urgent need for government funded data curation to improve the quality of internet chemistry and to limit the proliferation of errors and wasted efforts.”

We certainly welcome your feedback on the article if you have read it. We have the PDF of the final published form of the article and will be able to circulate a limited number to interested parties (please contact us here in the comment section and provide emails).

Like everything we do in this complex, connected world, there is a long and interesting story behind this article. We would like to share it because we think it will be of interest and, if nothing else, it shows how difficult it is to get an important alert out there. OK, so it’s not on the scale of an asteroid heading towards Earth in the next hour, but the implications are far-reaching and people need to know now to avert future scientific errors. Prior to our article being accepted for publication in Drug Discovery Today (DDT), the manuscript had taken an interesting trip. Here it is in all its gory detail.

In a conversation one day, Sean asked Tony whether he had ever written up the observations regarding the quality of data that he had been investigating, and blogging about, for almost 4 years. Tony commented that no, he’d considered it but had never got to it. Sean suggested that there was likely a wider audience for the data being gathered than just the readers of the blog. Of course he was right…how many people would know Tony’s blog versus the number of people who might read an article in a journal? We have co-authored several articles in recent years, and Sean has been a keen recipient of many of the curated datasets that Tony has assembled, using them in modeling studies. It was natural that we work together on a manuscript regarding the quality of data in public domain databases serving chemistry. We thought everybody would want to hear about it. How wrong we were!

We put an initial manuscript together and submitted it in late December 2010 as a Christmas present to the journal Science. We aimed BIG (why not? It was of general interest, as molecule databases are proliferating) because we thought that the issue we had identified was an important one and urgent attention was needed. Simply put, millions of our tax dollars are invested, through grants in the US and elsewhere, in building public domain databases (PubChem etc.). Surely the quality of chemical structure data is important to everyone? Get a structure wrong and it’s no longer the molecule you say it is, and that confuses everyone. The response from Science was: “Thank you for submitting your manuscript “A Quality Alert for Chemistry Databases” to Science. We discussed the manuscript extensively-but unfortunately we will not be able to publish it. Although you have framed this issue as a data management and/or chemical complexity problem, it seems to us that it’s primarily economics-quality control is not cost-free.” One rejection down.

Now, if you were in our shoes, what would you do next? That is right…we submitted to Nature next, and the paper was again rejected rather quickly. So we regrouped and submitted to one of the top Open Access journals (PLoS Computational Biology), where it was accepted as being of interest and sent out for review (we were hopeful). We received some good feedback from the reviewers, some of which is excerpted below.

Reviewer #1: This manuscript brings forward a critical issue of chemical data quality in publicly available databases of biologically active molecules that has become available relatively recently.  …the authors should spend more time and give more examples from sister disciplines as well as illustrate how the use of wrong structures (or wrong data) affects molecular modeling/cheminformatics investigations. …the authors should provide compelling examples when using wrong structures without prior data curation led to erroneous models or predictions.
In summary, I believe that providing more screaming examples as to how data curation impacts the outcome of scientific research in Cheminformatics, medicinal chemistry, chemical biology, etc., besides stating the obvious (for any scientific discipline) need to be rigorous and accurate when assembling and curating the data, will add significantly to the appeal of this highly relevant, potentially influential, and timely publication.

Reviewer #2: The quality of data freely available at the internet is more and more becoming a crucial issue in the medicinal chemistry community. This no longer only affects academic institutions which cannot afford paying for commercial databases, but also pharmaceutical industry, which is dumping these data into their in house data warehouses. However,  up to now the issue of quality was almost exclusively addressed from the side of biological data, as everyone assumed that a chemical structure is unambiguous, easy to check and thus the probability of errors is quite low. In addition, most of the data are provided by the owners or come from literature sources, where one can assume that a rigorous quality check has been performed.  In light of this, the present study is a highly valuable contribution to make the community aware of the fact that this might not be the case.… I propose to add several examples where a structure is represented indeed as wrong structure in that way that it would lead to a wrong logP, wrong TPSA or wrong number of H-bond acceptors.

Reviewer #3: This paper focuses on database errors in chemistry-based resources, specifically chemical names being associated with incorrect chemical structures.  It serves as a firm reminder to scientists to check their facts when working with freely available data sources.  It also highlights issues of data aggregation and the flow of chemistry data between resources.…No recent clear examples of publications were provided where the data obtained from a public resource had an incorrect “dramatic effect” on results or gave “misleading predictions”.

One might even suggest that it seemed more like an “infomercial” geared towards favorable treatment towards the author’s public data resource than an actual scientific publication.  There seems little point to publish this work: without systematic study to show the extent of the issue; its effect on the literature; attempts to help identify the source errors (Lack of rigorous data exchange standards? File conversion issues? Disagreement over the actual chemical structure {as a function of time}?); strong, accurate and exemplary examples to demonstrate the issue; and suggestions on ways to remediate the situation.  As this paper stands, without a serious attempt to address the real issues, this reviewer recommends rejection of this publication.

The paper was rejected. At the same time as this rejection, the NCGC data collection was released with the NPC Browser, with much fanfare, in Science Translational Medicine, and Sean passed the dataset along to Tony for review of the data. It was a timely occurrence and a PERFECT example of the issues with data quality that we had been writing about in the manuscript rejected by all the major journals to this point. We commented back to the PLoS Computational Biology editor as follows: “When this perspective was written we wrote it in a manner that it was not a research paper per se. I believe that we can address ALL comments from the reviewers using data that has been assembled in the past two weeks as a result of examining the dataset released by the NIH’s NCGC team. Some very basic comments are made below and I would expand on these issues in the edits to the manuscript. If you could please review the blogposts below and let me know whether you would accept an edited submission. We believe that our review of this particular dataset is a perfect example of the alert to data quality that we are pointing to and gives direct examples. Thank you.”

http://www.chemconnector.com/2011/04/28/reviewing-data-quality-in-the-ncgc-pharmaceutical-collection-browser/

http://www.chemconnector.com/2011/05/02/what-is-a-drug-data-quality-in-the-ncgc-pharmaceutical-collection-browser-part-2/

http://www.chemconnector.com/2011/05/08/data-quality-in-the-ncgc-pharmaceutical-collection-browser-part-4/

Unfortunately, we could not sway the editor to accept the commentary, despite inserting a lot of detail about the NCGC dataset quality and, we believe, fully addressing all of the reviewers’ comments.

Since the manuscript about the NPC Browser had been published in Science Translational Medicine (STM), we thought it was appropriate to draw attention to the data quality issues in that publication in the same journal. With a newly updated manuscript, we then submitted to STM. That rejection was fast: “Thank you for submitting your manuscript “A Quality Alert and Call for Improved Curation of Public Chemistry Databases” to Science Translational Medicine. During the initial screening process your manuscript did not receive a high enough priority rating to warrant in-depth review. We are therefore notifying you that your submitted manuscript has been rejected.”

They did not want to allow us to report on an issue that was perfectly represented by an article published in their own journal, an article that, in our opinion, had been inadequately reviewed with regard to the quality of the data in the database it reported and highlighted. We did our best to explain the connection of our article to the original paper regarding the NPC Browser but made no progress in having the decision reconsidered.

At this point we decided to submit the manuscript as an editorial to Drug Discovery Today. We have published in this journal before because it has a wide readership in both industry and elsewhere and covers chemistry and informatics topics. We have also had very positive experiences with both its review and publishing processes, which have been fair and fast. In addition, this journal is outside of any potential “political circles of influence”, being Europe-based and with a broad international editorial board and audience. It was suggested to us that we were struggling to get the article published because we were highlighting potentially sensitive issues. We’re not saying we agree…but it was suggested. We submitted the balanced article in a form detailing our findings with the NPC Browser dataset, received very positive feedback, and the rest is history. The article is now out. Finally. Eight months after it was submitted in its original form.

Publishing this article has been a very enlightening experience. With one journal we had two positive reviewers with suggested changes and one negative reviewer. We addressed all suggested changes but it was still rejected. We submitted to three other journals in total and were rejected outright. Our article is, we believe, an appropriate and well-founded editorial about the quality of data. The details about the NPC browser and NCGC dataset were knitted into the article because of timing…it was the latest public domain dataset to be released after we had written the original article. We have been accused of having something against the NCGC database. It is but one representative of the data quality issue we have reported on. Many tens of hours have been spent reviewing the data quality and providing feedback and guidance on our findings. We are not against any database…we are for improving data quality.

What has it taught us that might be applicable to other people’s publishing experiences?

1. It appears that pointing out an important problem and offering solutions does not win you friends. Possibly it was too contentious, and this is the reason it did not get published in some of the journals it was submitted to? Or maybe it was badly written…you can be the judge.

3. It is possible that doing scientific analyses of database quality is not something reviewers or journals want to hear about because they publish articles that use such databases to create new analyses and therefore new papers that they in turn publish. If you turn around and tell them that these very databases have errors then what is the quality of the output? Certainly not 100% trustworthy. But then again what is?

4. Many scientists TRUST the chemistry and biology databases that are so often reused, reanalyzed and integrated with new cheminformatics or bioinformatics tools. The authors of such articles do not appear to check for problems caused by poor data quality or for hypotheses that are incorrect due to poor underlying data.

5. We believe that raising an alternative viewpoint that challenges the status quo, even in an area that one would think was solved (e.g. structure quality), does not get a fair or balanced airing in the mainstream media.

Thankfully, the wider dissemination of the DDT editorial should get more people thinking about it. We have prepared an extensive manuscript that provides an even deeper analysis of the topic and we will be submitting that to journals. If any journal editors are reading this blog post and would like to invite us to submit the complete manuscript, we’d love to hear from you.

This is, in our opinion, a newsworthy topic. The very fact that the NPC browser and underlying datasets had issues that illustrated the problems we had alerted Science, Nature, PLoS Computational Biology and others to months before attests to that. We hope that those journals will agree that this is a serious issue. Who do we turn to in order to report a problem that affects everyone in the scientific community? The multiple submission processes, with reformatting and reworking, consumed many hours. The blog is certainly a simpler, openly reviewed format for reporting.

It should be noted, in the interest of full disclosure, that Sean is on the editorial board of the DDT journal (which comes with no financial reward) but that Tony submitted the article. Also, neither Sean nor I have a financial interest in improving database quality, but we believe it is the right thing to do.

  1. #1 by Barry on September 8, 2011 - 2:10 am

    I’d be interested in a copy of your DDT editorial.

  2. #2 by tony on September 13, 2011 - 7:38 am

    Barry…the article has now been made free to access here: http://www.drugdiscoverytoday.com/download/617
