Archive for category Quality and Content

Accessing chemical health and safety data online using Royal Society of Chemistry resources

This is the second presentation I gave at the ACS Meeting in Indianapolis

Accessing chemical health and safety data online using Royal Society of Chemistry resources

The internet has opened up access to large amounts of chemistry related data that can be harvested and assembled into rich resources of value to chemists. The Royal Society of Chemistry’s ChemSpider database has assembled an electronic collection of over 28 million chemicals from over 400 data sources and some of the assembled data is certainly of value to those searching for chemical health and safety information. Since ChemSpider is a text and structure searchable database chemists are able to find relevant information using both of their general search approaches. This presentation will provide an overview of the types of chemical health and safety data and information made available via ChemSpider and discuss how the data are sourced, aggregated and validated. We will examine how the data can be made available via mobile devices and examine the issue of data quality and its potential impacts on such a database.



Will the correct structure of Fluvastatin please stand up

Eventually there will be simple answers to the question commonly asked by chemists: “What is the chemical structure of INSERT NAME?” This is going to be true for drugs as the various online databases work together to clean up, curate, qualify and declare what a chemical structure is for a particular drug. While we can have the purist’s argument about structure drawings not representing reality (for example, that compounds are atoms bonded together by shared clouds of electrons that at any point in time may be changing, reorganizing, tautomerizing etc.), the reality is also that we need a common language for information exchange, and in the world of visual depictions for chemistry the layout in a 2D structure diagram is it. As we come together as a community to agree on preferred ways to standardize chemicals to assist in representations in databases, for example, this situation will improve. The efforts of the FDA to define structure representation standards, with the support of pharma, will contribute. For now we are left with the challenges of different representations in different databases as well as simply the quality of data being fed into these databases. These are some of the issues we are trying to resolve as we build Open PHACTS. We are trying to link data from various resources, noting and resolving conflicts when we can, and curating as necessary with the ultimate intention that this information will flow out into the community and be picked up by the database hosts and addressed, fixed, challenged as appropriate.

I’ve been looking for a new example showing the challenges of data integration considering that in Open PHACTS at present we are integrating chemistry from three primary data sets (for now)… DrugBank, ChEBI, ChEMBL. So, let’s consider Fluvastatin. Determining the “correct” chemical structure representation for a compound is usually an iterative loop, but let’s see what we can find in our datasets as we iterate. I KNOW from 4 years of looking at chemistry on Wikipedia that the data quality for chemical compound representations is very good. So, starting there we find the Wikipedia record here. The DrugBox links to a number of records in other databases.

One of these is ChemSpider and it has the SAME representation. On ChEBI the representation is inconsistent with no defined stereochemistry (except the E- double bond). Since ChEBI is manually curated and the compound carries 3 stars this should be correct. There are two records LINKED from this ChEBI record.

rel-(3R,5R)-fluvastatin (CHEBI:38566)

rel-(3R,5S)-fluvastatin (CHEBI:38561)

On DrugBank the compound has INVERTED stereochemistry from that on ChemSpider and Wikipedia… WP and ChemSpider have 3R,5S while DrugBank has 3S,5R but it DOES say in the pharmacodynamics section “It is prepared as a racemate of two erythro enantiomers of which the 3R,5S enantiomer exerts the pharmacologic effect.” confirming that the 3R,5S form is the ACTIVE form.

ChEMBL matches Wikipedia and ChemSpider here.

So, to summarize what we get when we search for Fluvastatin

Stereo 3R,5S for Wikipedia, ChemSpider, ChEMBL

Stereo 3S,5R for DrugBank

No stereo for ChEBI
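The summary above can be turned into a small conflict check. This is only a minimal sketch using the assignments listed in this post; the function names are made up for illustration and this is not how Open PHACTS actually does it.

```python
from collections import defaultdict

# Stereo assignments for fluvastatin as summarized in this post.
# None means the record carries no defined stereocentres.
stereo_by_source = {
    "Wikipedia": "3R,5S",
    "ChemSpider": "3R,5S",
    "ChEMBL": "3R,5S",
    "DrugBank": "3S,5R",
    "ChEBI": None,
}


def group_by_assignment(records):
    """Group source names by the stereo assignment they report (hypothetical helper)."""
    groups = defaultdict(list)
    for source, stereo in records.items():
        groups[stereo].append(source)
    return dict(groups)


def has_conflict(records):
    """True when the sources do not all report the same assignment."""
    return len(set(records.values())) > 1


groups = group_by_assignment(stereo_by_source)
for stereo, sources in groups.items():
    label = stereo if stereo is not None else "no stereo"
    print(f"{label}: {', '.join(sorted(sources))}")
print("conflict:", has_conflict(stereo_by_source))
```

A check like this only tells us that the sources disagree. Deciding which representation is “correct” still needs a chemist to look at it.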

Welcome to the complexities of name-structure relationships. These are some of the challenges we need to deal with on Open PHACTS. DailyMed defines the sodium fluvastatin as “Fluvastatin sodium is sodium (±)-(3R*, 5S*, 6E)-7-[3-(p-fluorophenyl)-1-isopropylindol-2-yl]-3,5-dihydroxy-6-heptanoate” so the relative form….




Presentation at NFAIS 2012 on Five Years of Experience of Crowdsourcing Chemistry for the Community

I just gave a presentation at the NFAIS conference in Philadelphia with the conference focus being “Born of Disruption: An Emerging New Normal for the Information Landscape”. I was on a panel with Lee Dirks from Microsoft Research and Kristen Fisher-Ratan from PLoS. Both gave very interesting talks and it was a pleasure to be on the panel with them.

My talk was entitled “Crowdsourcing Chemistry for the Community – 5 Years of Experiences” with the abstract below.

“ChemSpider is one of the internet’s primary resources for chemists. ChemSpider is a structure-centric platform and hosts over 26 million unique chemical entities sourced from over 400 different data sources and delivers information including commercial availability, associated publications, patents, analytical data, experimental and predicted properties. ChemSpider serves a rather unique role to the community in that any chemist has the ability to deposit, curate and annotate data. In this manner they can contribute their skills, and data, to any chemist using the system. A number of parallel projects have been developed from the initial platform including ChemSpider SyntheticPages, a community generated database of reaction syntheses, and the Learn Chemistry wiki, an educational wiki for secondary school students.

This presentation will provide an overview of the project in terms of our success in engaging scientists to contribute to crowdsourcing chemistry. We will also discuss some of our plans to encourage future participation and engagement in this and related projects.”

The talk is embedded below. I thank the organizers for the ability to ask questions during the talk and get responses using a clicker feedback system (I didn’t realise ahead of time that the questions would consume a few seconds and ran over on my talk..agh). I will get the answers to the questions and post them in a separate post. Interesting answers…



Improving Online Chemistry One Structure at a Time

Last week I was in the United Kingdom for numerous meetings and at the end of the week struggled to drive north to Macclesfield to the AstraZeneca site there to give a presentation on ChemSpider for an old colleague of mine from the Eastman Kodak company. I had not seen Tony Bristow in well over a decade but we reminisced about the good old days at Kodak (Tony worked in Harrow, UK and I worked in Rochester, NY. Tony is a Mass Spectrometrist and I am an NMR Spectroscopist by training). We also discussed how scientists are increasingly tapping into the ChemSpider resource to aid in the identification of chemical compounds using, especially, Mass Spectrometry. We have numerous examples now of when people are solving their structure ID issues directly by searching ChemSpider and are building up a portfolio of success stories.

The presentation I gave is below and loaded on SlideShare in case you want to download it.



Copy Editors Redefine Standard Units During the Proofing Process

I write a lot of publications, averaging about one peer-reviewed publication or book chapter per month. I have published with a number of publishers including my employer (Royal Society of Chemistry), with Elsevier, Wiley, Springer, ACS and many others. The experience with each publisher is different but, generally, pleasant and high quality. However, once in a while the experience is “interesting”. I especially have had some very interesting peer-review “experiences”. But that is not the point of this post. This post is about the other end of the process…paper reviewed, paper accepted and into proofing stage.

Last month Sean Ekins and I had a paper accepted and we listed in the paper some physicochemical parameters. These included logP, pKa, Lipinski parameters and Polar Surface Area, commonly known as PSA. When we got the paper back for proofing PSA had been replaced by “Prostate Specific Antigen“. It was a good catch on Sean’s part as first proofreader to spot it! How would that happen? One has to imagine a set of scripts that are searching for abbreviations and doing a find and replace. For PSA clearly context matters. For most biology papers the prostate specific antigen conversion for PSA might make sense. It doesn’t really make sense for chemistry and QSAR modeling. So, it’s all about context.
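To make the context point concrete, here is a minimal sketch of the kind of script I am imagining. Everything in it is hypothetical (the lookup tables, the `expand` function, the example sentence); I have no idea what the publisher's tooling actually looks like.

```python
import re

# Hypothetical house-style table: one expansion per abbreviation, no context.
NAIVE_EXPANSIONS = {"PSA": "Prostate Specific Antigen"}

# Hypothetical context-aware tables keyed by subject area.
EXPANSIONS_BY_FIELD = {
    "biology": {"PSA": "Prostate Specific Antigen"},
    "chemistry": {"PSA": "Polar Surface Area"},
}


def expand(text, table):
    """Replace every all-caps abbreviation found in `table` with its expansion."""
    def repl(match):
        # Abbreviations missing from the table are left untouched.
        return table.get(match.group(0), match.group(0))
    return re.sub(r"\b[A-Z]{2,}\b", repl, text)


sentence = "We calculated logP, pKa and PSA for each compound."
naive = expand(sentence, NAIVE_EXPANSIONS)
aware = expand(sentence, EXPANSIONS_BY_FIELD["chemistry"])
print(naive)
print(aware)
```

The blind table turns a QSAR descriptor into a clinical biomarker. Picking the table by subject area is the smallest possible fix, and it still assumes somebody classified the paper correctly.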

We recently submitted an article in relation to our work on Computer Assisted Structure Elucidation. This is at a time when our book on CASE is about to go to the printers! This is one of our most interesting applications of ACD/Structure Elucidator and will be discussed in more detail when the paper is published. The paper is going to be published with Wiley’s Magnetic Resonance in Chemistry. MRC is my favorite NMR journal by far and I am always happy to publish there! After all these years I was shocked when the feedback from the copy-editors for our paper said…

The copy-editor was suggesting that we change all instances of PPM for chemical shift to mg/g. Excuse me? Are you serious? First of all, PPM is THE defined unit for chemical shift. Did IUPAC change it without us knowing? PPM is a dimensionless unit, based on Hz/MHz, thus the 10^-6 dependence. Even if it was in terms of Gauss (another interpretation of the mg/G) it should be microGauss/Gauss, so mcg/G.
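For anyone who wants the Hz/MHz arithmetic spelled out, here is a minimal sketch. The function name and the numbers (a 2900 Hz offset on a 400 MHz instrument) are illustrative assumptions, not values from our paper.

```python
import math


def chemical_shift_ppm(offset_hz, spectrometer_mhz):
    """Chemical shift in ppm: offset from the reference (Hz) over spectrometer frequency (MHz)."""
    if spectrometer_mhz <= 0:
        raise ValueError("spectrometer frequency must be positive")
    return offset_hz / spectrometer_mhz


# Illustrative values only: a resonance 2900 Hz from the reference at 400 MHz.
shift_400 = chemical_shift_ppm(2900.0, 400.0)

# The same resonance on a 600 MHz instrument sits 4350 Hz from the reference,
# but the shift in ppm is unchanged because the unit is a pure ratio.
shift_600 = chemical_shift_ppm(4350.0, 600.0)

# Written as Hz over Hz, the ratio is a few parts in a million.
ratio = 2900.0 / 400.0e6

print(shift_400, shift_600, ratio)
assert math.isclose(shift_400, shift_600)
```

No mass appears anywhere in the calculation, which is why milligrams per gram makes no sense as a replacement.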

Anyway, it makes no sense, right? Surely it is just an oversight, just a one-off? Unfortunately no…this entire paper HAS been published with every PPM reference to chemical shift changed into mg/g. How did that happen? We have to imagine a search and replace, acceptance of the “house style” by the author and no oversight by the editor post-proofing. The result: chemical shifts are now quoted in milligrams/gram. Terrific! Surely a context issue of some type…but truth be told, I am not sure for what!

Is this a side effect of non-skilled copy-editors? A result of off-shoring? Whatever the reason it is wrong..unless IUPAC truly decided on a new standard????! NOT….


Terminal Dimethyl means Death by Methane twice

When writing talks I try to find interesting (and where possible fun) examples of how challenging the world of managing chemistry data is for all of us that work in the world of managing 10s of thousands, or in our case millions, of compound pages for the community to use. I have told many stories over the past few years of the challenges we collectively have in regards to data quality and how it flows between our databases unabated. My latest example, used at the recent talk at the EBI (ChemSpider – An Online Database and Registration System Linking the Web), was the structure known as Terminal Dimethyl presently on PubChem, DrugBank, Wolfram Alpha and PDBe. It was originally inherited into ChemSpider also but has been deprecated. I left a comment on DrugBank a couple of weeks ago but it hasn’t been published yet…generally such errors are removed VERY quickly by the DrugBank hosts. I added a comment to Wolfram Alpha and received a canned response and no changes to the record as yet.

There ARE ways to communally resolve these issues and I will blog about that shortly.



Open Notebook Science and One Future for Scientific Research

A few weeks ago I was invited to give a presentation to the Board of Directors at Burroughs Wellcome. I was very interested in taking this opportunity to discuss my views on Open Science, Open Notebook Science, Open Data etc with this group of very esteemed scientists. However, it turned out it clashed with a planned vacation. Since my friend and frequent co-author Sean Ekins is an evangelist for open science for drug discovery, improving data quality, and Mobile Apps, and since we think alike on so many levels, I asked Sean whether he’d want to give the presentation. And, always welcoming adventure Sean jumped at the chance to present.

As it turned out Hurricane Rina resulted in us cancelling our vacation so I ended up attending the presentation with Sean. While we had bounced the slides between each other prior to the presentation Sean did a terrific job as the presenter and we had some very interesting questions regarding what is standing in the way of open science, especially around chemistry databases (of compounds), what are good examples of bioinformatics projects that are successful, and whether there are “risks” inherent to Open Science, especially in regards to what is shared online in public compound databases. I thoroughly enjoyed the meeting, short as it was and am glad that we were given the opportunity.

Sean has eloquently outlined the nature of the presentation at his site (he is Collabchem) and the presentation is below for your comments and review. I recommend that you check out Sean’s other presentations too!



Mobile Chemistry and Generation App

A presentation given today at the ICIC Meeting in Barcelona #icic2011

While the internet has been revolutionizing our access to data and information via our computers, computers have been miniaturizing to the point where a smart phone offers capabilities that many desktops could not deliver less than a decade ago. Mobile browser technology and app-based delivery for software have now delivered into our hands further access to data via phones, pads and tablets. Whether it be in the form of chemical calculators, accessing publishers’ websites or public domain databases containing millions of chemical structures, mobile chemistry is here and is expanding in capability and coverage at a dramatic rate. This presentation will review the status of mobile devices and how they are being used to enable chemists.





The story behind one publication – Or how not to win friends and influence people

This blog post is co-authored by Sean Ekins and me…we survived the challenges of getting this article published together so we’ll share the blogpost together also!

We recently published an editorial in Drug Discovery Today. It is entitled “A Quality Alert and Call for Improved Curation of Public Chemistry Databases”. The abstract is below:

“In the last ten years, public online databases have rapidly become trusted valuable resources upon which researchers rely for their chemical structures and data for use in cheminformatics, bioinformatics, systems biology, translational medicine and now drug repositioning or repurposing efforts. Their utility depends on the quality of the underlying molecular structures used. Unfortunately, the quality of much of the chemical structure-based data introduced to the public domain is poor. As an example we describe some of the errors found in the recently released NIH Chemical Genomics Center “NPC browser” database. There is an urgent need for government funded data curation to improve the quality of internet chemistry and to limit the proliferation of errors and wasted efforts.”

We certainly welcome your feedback on the article if you have read it. We have the PDF of the final published form of the article and will be able to circulate a limited number to interested parties (please contact us here in the comment section and provide emails).

Like everything we do in this complex-connected-world there is a long and interesting story behind this article. We would like to share it because we think it will be of interest and if nothing else it shows how difficult it is to get an important alert out there. OK, so it’s not on the scale of an asteroid coming towards earth in the next hour but the implications are far reaching and people need to know now to avert future scientific errors. Prior to our article being accepted for publication in DDT the manuscript had taken an interesting trip. Here it is in all its gory detail.

In a conversation one day Sean asked Tony whether or not he had ever written up the observations regarding the quality of data that he had been investigating, and blogging about, for almost 4 years. Tony commented that no, he’d considered it but had never got to it. Sean  suggested that there was likely a wider audience for the data being gathered than just the readers of the blog. Of course he was right…how many people would know Tony’s blog versus the number of people who might read an article in a journal. We have co-authored several articles in recent years and Sean has been a keen recipient of many of the curated datasets that Tony has assembled in order to use them in modeling studies. It was natural that we work together on a manuscript regarding the quality of data in Public Domain databases serving chemistry. We thought everybody would want to hear about it – how wrong we were!

We put an initial manuscript together and submitted it in late December 2010 as a Christmas present to the journal Science. We aimed BIG (why not? It was of general interest as molecule databases are proliferating) as we thought that the issue we had identified was an important one and urgent attention was needed. Simply put, millions of our tax dollars are invested in building public domain databases that are funded by grants to do this both in the US and elsewhere (PubChem etc). Surely the quality of chemical structure data is important to everyone? Get a structure wrong and it’s no longer the molecule you say it is and that confuses everyone. The response from Science was “Thank you for submitting your manuscript “A Quality Alert for Chemistry Databases” to Science. We discussed the manuscript extensively-but unfortunately we will not be able to publish it. Although you have framed this issue as a data management and/or chemical complexity problem, it seems to us that it’s primarily economics-quality control is not cost-free.” One rejection down.

Now if you were in our shoes what would you do next? That is right…we chose next to submit to Nature, and the paper was again rejected rather quickly. So we regrouped and then chose to submit to one of the top Open Access journals (PLoS Computational Biology) where it was accepted as being of interest and sent out for review (we were hopeful). We received some good feedback from the reviewers, some of which is abstracted below.

Reviewer #1: This manuscript brings forward a critical issue of chemical data quality in publicly available databases of biologically active molecules that has become available relatively recently.  …the authors should spend more time and give more examples from sister disciplines as well as illustrate how the use of wrong structures (or wrong data) affects molecular modeling/cheminformatics investigations. …the authors should provide compelling examples when using wrong structures without prior data curation led to erroneous models or predictions.
In summary, I believe that providing more screaming examples as to how data curation impacts the outcome of scientific research in Cheminformatics, medicinal chemistry, chemical biology, etc., besides stating the obvious (for any scientific discipline) need to be rigorous and accurate when assembling and curating the data, will add significantly to the appeal of this highly relevant, potentially influential, and timely publication.

Reviewer #2: The quality of data freely available at the internet is more and more becoming a crucial issue in the medicinal chemistry community. This no longer only affects academic institutions which cannot afford paying for commercial databases, but also pharmaceutical industry, which is dumping these data into their in house data warehouses. However,  up to now the issue of quality was almost exclusively addressed from the side of biological data, as everyone assumed that a chemical structure is unambiguous, easy to check and thus the probability of errors is quite low. In addition, most of the data are provided by the owners or come from literature sources, where one can assume that a rigorous quality check has been performed.  In light of this, the present study is a highly valuable contribution to make the community aware of the fact that this might not be the case.… I propose to add several examples where a structure is represented indeed as wrong structure in that way that it would lead to a wrong logP, wrong TPSA or wrong number of H-bond acceptors.

Reviewer #3: This paper focuses on database errors in chemistry-based resources, specifically chemical names being associated with incorrect chemical structures. It serves as a firm reminder to scientists to check their facts when working with freely available data sources. It also highlights issues of data aggregation and the flow of chemistry data between resources.…No recent clear examples of publications were provided where the data obtained from a public resource had an incorrect “dramatic effect” on results or gave “misleading predictions”.

One might even suggest that it seemed more like an “infomercial” geared towards favorable treatment towards the author’s public data resource than an actual scientific publication.  There seems little point to publish this work: without systematic study to show the extent of the issue; its effect on the literature; attempts to help identify the source errors (Lack of rigorous data exchange standards? File conversion issues? Disagreement over the actual chemical structure {as a function of time}?); strong, accurate and exemplary examples to demonstrate the issue; and suggestions on ways to remediate the situation.  As this paper stands, without a serious attempt to address the real issues, this reviewer recommends rejection of this publication.

The paper was rejected. At the same time as this rejection the NCGC data collection was released with the NPC Browser with much fanfare in Science Translational Medicine and Sean passed the dataset along to Tony for review of the data. It was a timely occurrence and a PERFECT example of the issues with data quality that we had been writing about in the articles rejected by all the major journals to this point. We commented back to the PLoS Computational Biology editor as follows: “When this perspective was written we wrote it in a manner that it was not a research paper per se. I believe that we can address ALL comments from the reviewers using data that has been assembled in the past two weeks as a result of examining the dataset released by the NIH’s NCGC team. Some very basic comments are made below and I would expand on these issues in the edits to the manuscript. If you could please review the blogposts below and let me know whether you would accept an edited submission. We believe that our review of this particular dataset is a perfect example of the alert to data quality that we are pointing to and gives direct examples. Thank you.”

Unfortunately, we could not sway the editor to accept the commentary despite inserting a lot of details about the NCGC dataset quality and fully addressing, we believe, all reviewers’ comments.

Since the manuscript about the NPC Browser had been published in STM we thought it was appropriate to draw attention to the data quality in the original publication in the same journal. With a newly updated manuscript we then submitted to STM. That rejection was fast “Thank you for submitting your manuscript “A Quality Alert and Call for Improved Curation of Public Chemistry Databases” to Science Translational Medicine. During the initial screening process your manuscript did not receive a high enough priority rating to warrant in-depth review. We are therefore notifying you that your submitted manuscript has been rejected.”

They did not want to allow us to report on the issue that was perfectly represented in an article published in their journal and, in our opinion, had been inadequately reviewed in regards to the quality of data in the database reported and highlighted in their journal. We did our best to explain the connection of our article to the original paper regarding the NPC Browser but made no progress in having the decision reconsidered.

At this point we made a decision to submit as an editorial to Drug Discovery Today. We have published in this journal before because it has a wide readership in both the industry and elsewhere and covers chemistry and informatics topics. We also have had very positive experience with both the review and publishing processes, being fair and fast. In addition this journal is outside of any potential “political circles of influence” being Europe-based and with a broad international editorial board and audience. It was suggested to us that we were struggling to get the article published as we were highlighting potentially sensitive issues. We’re not saying we agree…but it was suggested. We submitted the balanced article in the form detailing our findings with the NPC Browser dataset, received very positive feedback and the rest is history. The article is now out. Finally. Eight months after it was submitted in its original form.

Publishing this article has been a very enlightening experience. With one journal we had two positive reviewers with suggested changes and one negative reviewer. We addressed all suggested changes but it was still rejected. We submitted to three other journals in total and were rejected outright. Our article is, we believe, an appropriate and well-founded editorial about the quality of data. The details about the NPC browser and NCGC Dataset were knitted into the article based on timing…it was the last public domain dataset to be released after we had written the original article. We have been accused of having something against the NCGC database.  It is but one representative of the data quality issue we have reported on. Many tens of hours have been spent reviewing the data quality and providing feedback and guidance on our findings. We are not against any database…we are for improving data quality.

What has it taught us that might be applicable to other people’s publishing experiences?

1. It appears that pointing out an important problem and offering solutions does not win you friends. Possibly it was too contentious and this is the reason it did not get published in some of the journals it was submitted to? Or maybe it was badly written…you can be the judge.

3. It is possible that doing scientific analyses of database quality is not something reviewers or journals want to hear about because they publish articles that use such databases to create new analyses and therefore new papers that they in turn publish. If you turn around and tell them that these very databases have errors then what is the quality of the output? Certainly not 100% trustworthy. But then again what is?

4. Many scientists TRUST chemistry and biology databases  that are so often reused, reanalyzed and integrated with new cheminformatics or bioinformatics tools. The authors of such articles do not appear to analyze for problems caused by poor data quality or hypotheses that are incorrect due to poor underlying data.

5. We believe that raising an alternative viewpoint that challenges the status quo, even in an area that one would think was solved (e.g. structure quality), does not get a fair or balanced airing in the mainstream media.

Thankfully, the wider dissemination of the DDT editorial should get more people thinking about it. We have prepared an extensive manuscript that provides an even deeper analysis of the topic and we will be submitting that to journals. If any journal editors are reading this blog post and would like to welcome us to submit the complete manuscript we’d love to hear from you.

This is, in our opinion, a newsworthy topic. The very fact that the NPC browser and underlying datasets had issues that illustrated the problems we had alerted Science, Nature and PLoS Computational Biology etc. to months before attests to that. We hope that those journals will agree that this is a serious issue. Who do we turn to in order to report a problem that affects everyone in the scientific community? The multiple submission processes, with reformatting and reworking, consumed many hours. The blog is certainly a simpler, openly reviewed format for reporting.

It should be noted in the interest of full disclosure that Sean is on the editorial board of the DDT journal, which comes with no financial reward, but that Tony submitted the article. Also, neither Sean nor I have a financial interest in improving database quality, but we believe it is the right thing to do.


Potential Issues with Google Scholar Citations

I have blogged recently about my experiences with Google Scholar Citations (1,2). It has been useful in highlighting what science I have published that people might find of interest as well as trends in citation patterns. It has also highlighted some potential issues in the data.

My Top Citations on Google Scholar Citations

I must admit I was quite surprised to see that the top cited paper was one from Eastman Kodak company where we looked at interactions between Sodium Dodecyl sulfate and gelatin, followed by work I did at the University of Ottawa. This work was in 1994 and 1991 respectively. This work was almost 20 years ago so it does make sense that the aggregation of citations over the years might have reached those levels. However, I would have expected that my work in the areas of NMR prediction, Computer-Assisted Structure Elucidation (CASE) and Indirect Covariance would have garnered a lot more citations, but that work did come about 10 years later. It is good to see that the more recent papers, for example that from 2008 on internet-based tools for communication and collaboration in chemistry, has garnered a following.

My papers that supposedly have no citations

Above is shown a list of my papers from as far back as 1990 that appear to not have any citations. There are also a lot of recent papers listed that I KNOW are cited, multiple times, as they have been referred to in some of my own publications. For example, the second one in the list, from 2009, entitled “Computer-assisted methods for molecular structure elucidation: realizing a spectroscopist’s dream” is an Open Access article and according to the journal statistics it is the top read article of all time on the Journal of Cheminformatics as shown here with, as of this writing, 10770 accesses. In fact, if I search the article directly on Google Scholar I find it IS cited 7 times as shown below.


The JChemInf Paper on Google Scholar shows 7 Citations

I don’t know why it shows up as cited in Google Scholar but not in Google Scholar Citations. However, the same issue exists for the paper on the Spectral Game. See below. It shows no citations on Google Scholar Citations but shows 7 on Google Scholar.

Citations for the Spectral Game Paper

Notice that in BOTH cases the article is listed as the Journal of Cheminformatics, not as the title of the paper. Maybe THIS is the reason the citations are missed. Maybe the publisher for the Journal of Cheminformatics is not exposed in a manner that has the publications indexed properly? Maybe….



