This is a presentation I gave at the National Institute of Standards and Technology on July 30th 2014
Experiences in Hosting Big Chemistry Data Collections for the Community
Access to scientific information has changed dramatically as a result of the web and its underpinning technologies. The quantities of data, the array of tools available to search and analyze, the devices and the shift in community participation continue to expand while the pace of change does not appear to be slowing. RSC hosts a number of chemistry data resources for the community including ChemSpider, one of the community’s primary online public compound databases. Containing tens of millions of chemical compounds and their associated data, ChemSpider serves data to tens of thousands of chemists every day. The platform offers the ability for crowdsourcing, enabling the community to deposit and curate data. This presentation will provide an overview of the expanding reach of this cheminformatics platform and the nature of the solutions that it helps to enable, including structure validation, text mining and semantic markup. ChemSpider is limited in scope as a chemical compound database and we are presently architecting the RSC Data Repository, a platform that will enable us to extend our reach to include chemical reactions, analytical data, and diverse data depositions from chemists across various domains. We will also discuss the possibilities it offers in terms of supporting data modeling and sharing. The future of scientific information and communication will be underpinned by these efforts, influenced by increasing participation from the scientific community.
Data Mining Dissertations and Adventures and Experiences in the World of Chemistry
This presentation was given at the CLIR/DLF Postdoctoral Fellowship Summer Seminar at Bryn Mawr College in Pennsylvania on July 29th 2014. The intention was to communicate what we are doing in the fields of text and data mining in the domain of chemistry, specifically around mining the RSC publication archive and chemistry dissertations and theses. How would these experiences map over to the humanities?
MOST people who are reading this blog post have likely performed peer review at some point. I have reviewed a lot of manuscripts over the years, and the process has changed a lot over the past decade in many ways. A couple of examples of how things have changed for me:
1) More requests to review papers – and I increasingly turn down requests because they are from journals I have never heard of (some may call them “predatory publishers”), some are in areas for which I have no expertise (e.g. electrical engineering), and sometimes because I simply don’t have time.
2) I have seen papers I have reviewed show up essentially untouched in other journals (no edits and simply reformatted) and commonly these “refused papers” are accepted into what I deem to be “lower quality” publications.
Of course over the past ten years I’ve also had a lot of papers go through peer review for myself and my co-authors. This experience has also been very interesting, if not entertaining. Some examples:
1) I have experienced the third reviewer where an editor has held up a manuscript or demanded changes to match some of their own expectations while the other reviewers recommended publishing as is.
2) I have had the request to shorten excellent manuscripts to help with “page limits”….in the electronic age???
3) I have been on the receiving end of non-scientific reviews that have blocked a paper. My personal favorite “Mobile apps are a fad of the youth.”
My best story of peer review, and an example where modern technologies would have been so enabling at the time, is as follows.
I was asked to review a paper regarding the performance of Carbon-13 NMR prediction. A slice of the abstract says:
“Further we compare the neural network predictions to those of a wide variety of other 13C chemical shift prediction tools including incremental methods (CHEMDRAW, SPECTOOL), quantum chemical calculation (GAUSSIAN, COSMOS), and HOSE code fragment-based prediction (SPECINFO, ACD/CNMR, PREDICTIT NMR) for the 47 13C-NMR shifts of Taxol, a natural product including many structural features of organic substances. The smallest standard deviations were achieved here with the neural network (1.3 ppm) and SPECINFO (1.0 ppm).”
This was an important time for me as this paper was comparing various NMR predictors and comparing the performance based on ONE chemical structure. And while any one point of comparison is up for discussion there were 47 shifts so you could argue it is a bigger data set. One of the programs under review was a PRODUCT that I managed at ACD/Labs, CNMR Predictor. Therefore I clearly had a concern as, essentially, the success of this product was partly responsible for my income. Any comparison that made the software look poor in performance was an issue. Was this a conflict of interest…maybe…but I judge myself to still be objective.
Table 3 listed the experimental shifts as well as the predicted shifts from the different algorithms and the size of the accompanying circle/ellipse was a visual indicator of a large difference between experimental and predicted. We will assume that all experimental assignments are correct and that there are no transcription errors between the predicted values from each algorithm and input into the table. A piece of Table 3 is shown below.
I kind of pride myself on being a little bit of a stickler for detail when it comes to reviewing data quality. Those of you who read this blog will know that. As I reviewed the data I was a little puzzled by the magnitude of the errors for certain Carbon nuclei, specifically for Carbons 23 and 27.
What was interesting to me was that the experimental shifts for 23 and 27 were 142.0, 133.2 ppm respectively yet the predicted shifts were 132.8, 142.7 ppm respectively. It struck me that they looked like they were switched. This was what drew my attention to reviewing the data in more detail. I will cut a long story short but I redrew the molecule of Taxol as input into the same version of software that was used for the publication and got a DIFFERENT answer than that reported. I was able to distinguish WHY it was different…it was down to the orientation of a bond in the input molecule that was input by the reporting authors and this made the CNMR prediction worse.
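The "they look switched" suspicion can be expressed as a simple check. The sketch below is illustrative only and is not the procedure I followed during the review: it uses just the two pairs of values quoted above and flags a pair of carbons when exchanging their predicted shifts would sharply reduce the summed absolute error. The function name and the 2.0 ppm threshold are assumptions made for the example.

```python
from itertools import combinations


def find_possible_swaps(experimental, predicted, min_gain_ppm=2.0):
    """Return (carbon_a, carbon_b, gain) for pairs whose predictions look exchanged.

    gain is the reduction in summed absolute error (ppm) obtained by
    swapping the two predicted values. The threshold is an arbitrary choice.
    """
    flagged = []
    for a, b in combinations(sorted(experimental), 2):
        as_reported = (abs(experimental[a] - predicted[a])
                       + abs(experimental[b] - predicted[b]))
        if_swapped = (abs(experimental[a] - predicted[b])
                      + abs(experimental[b] - predicted[a]))
        gain = as_reported - if_swapped
        if gain >= min_gain_ppm:
            flagged.append((a, b, round(gain, 1)))
    return flagged


# Values for carbons 23 and 27 as quoted in the text (ppm).
experimental = {23: 142.0, 27: 133.2}
predicted = {23: 132.8, 27: 142.7}

print(find_possible_swaps(experimental, predicted))  # [(23, 27, 17.6)]
```

A flag like this says nothing about WHY the values look exchanged. It only tells you where to start looking, which in this case led to the input structure.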
I reported this detail to the editors in a detailed letter and recommended the manuscript for publication with the caveat that the numbers for the column representing CNMR 6.0 be edited to accurately reflect the performance of the algorithm and provide the details. I was shocked to see the manuscript published later WITHOUT any of the edits made for the numbers and inaccurately representing the performance of the algorithm. I contacted the editors and after a couple of exchanges received quite a dressing down: the editor overseeing the manuscript refused to get between a commercial concern and reported science.
What does this mean? That software companies don’t do science and only academics do? I have similar experience of my colleagues in industry being treated with bias relative to my colleagues in academia. I believe my friends in industry, commercial concerns and academia can all be objective scientists….and after all, doesn’t academia teach the chemists that come out to industry and the commercial software world? These are my experiences…I welcome any comments you may have about the bias. BUT, back to the story…
The manuscript was published in June 2002 and as product manager I had to deal with questions around algorithmic performance for many months because “the peer-review literature said…”. This was NOT the only instance of a situation like this as a couple of years later it was reported that ACD/CNMR could not handle stereochemistry only to determine with the scientist who wrote the paper that he had thrown a software switch that affected his results. Software can be tricky and unfortunately the best performance can often come through the hands of those that write the software. Sad but true in many cases.
In August 2004 we published an addendum with one of the original authors regarding the work describing the entire situation in detail. It was over two years from the original publication to the final addendum. I do not believe there was any malicious intent on behalf of the authors of the original manuscript but that was in the days where the only place to issue a rebuttal was in the journal and we could not get editorial support to do it. How would it happen today if a paper came out that was suspicious? There are a myriad of tools available now….
Yes, I would blog the story here, as I am doing now. Yes I would express concern at the situation on Twitter with the hope of gaining redress. I would likely tell the story in a Slideshare presentation and make a narrated movie and make it available via an embed in the Slideshare presentation on my account. I would hope that the publisher nowadays would at least allow me to add a comment to the article but I do understand that this comment would likely be monitored and mediated and they may choose not to expose it to the readers. I like the implementation on PLoS and have used it on one of our articles previously.
Could I maybe make use of a technology like Kudos that I have started using? I have reported it on this blog already here. I certainly could not claim the ORIGINAL article and start associating information with it regarding the performance of the algorithms…and that is a shame. But MAYBE in the future Kudos would consider letting OTHER people make comments and associate information/data with an article on Kudos. Risky? Maybe. However, I can claim the rebuttal that I was a co-author on and start associating information with that….certainly the original paper and ultimately linking to this blog. In fact, in the future is a rebuttal going to be a manuscript that I publish out on something like Figshare, grab a DOI there and maybe ask Kudos to treat that as a published rebuttal? Peer review of that rebuttal could then happen as comments on Figshare and Kudos directly and maybe in the future Kudos Views and Altmetric measures of that become a measure of the importance. We live in very interesting times as these technologies expand, mesh and integrate.
The Importance of the InChI Identifier as a Foundation Technology for eScience Platforms at the Royal Society of Chemistry
This is a presentation I gave today at Bio-IT 2014 here in Boston. I was in the company of a number of my favorite people to be on the agenda with… Steve Heller, Steve Boyer, Evan Bolton and Chris Southan.
The Royal Society of Chemistry hosts one of the largest online chemistry databases containing almost 30 million unique chemical structures. The database, ChemSpider, provides the underpinning for a series of eScience projects allowing for the integration of chemical compounds with our archive of scientific publications, the delivery of a reaction database containing millions of reactions as well as a chemical validation and standardization platform developed to help improve the quality of structural representations on the internet. The InChI has been a fundamental part of each of our projects and has been pivotal in our support of international projects such as the Open PHACTS semantic web project integrating chemistry and biology data and the PharmaSea project focused on identifying novel chemical components from the ocean with the intention of identifying new antibiotics. This presentation will provide an overview of the importance of InChI in the development of many of our eScience platforms and how we have used it specifically in the ChemSpider project to provide integration across hundreds of websites and chemistry databases across the web. We will discuss how we are now expanding our efforts to develop a Global Chemistry Network encompassing efforts in Open Source Drug Discovery and the support of data management for neglected diseases.
This is the second presentation I gave at the ACS Meeting in Indianapolis
Accessing chemical health and safety data online using Royal Society of Chemistry resources
The internet has opened up access to large amounts of chemistry-related data that can be harvested and assembled into rich resources of value to chemists. The Royal Society of Chemistry’s ChemSpider database has assembled an electronic collection of over 28 million chemicals from over 400 data sources and some of the assembled data is certainly of value to those searching for chemical health and safety information. Since ChemSpider is a text and structure searchable database, chemists are able to find relevant information using both of their general search approaches. This presentation will provide an overview of the types of chemical health and safety data and information made available via ChemSpider and discuss how the data are sourced, aggregated and validated. We will examine how the data can be made available via mobile devices and examine the issue of data quality and its potential impacts on such a database.
Eventually there will be simple answers to the question commonly asked by chemists: “What is the chemical structure of INSERT NAME?” This is going to be true for drugs as the various online databases work together to clean up, curate, qualify and declare what a chemical structure is for a particular drug. While we can have the purist’s argument about structure drawings not representing reality (for example, that compounds are atoms bonded together by shared clouds of electrons that at any point in time may be changing, reorganizing, tautomerizing etc.), the reality is also that we need a common language for information exchange, and in the world of visual depictions for chemistry the layout in a 2D structure diagram is it. As we come together as a community to agree on preferred ways to standardize chemicals to assist in representations in databases, for example, this situation will improve. The efforts of the FDA to define structure representation standards, with the support of pharma, will contribute. For now we are left with the challenges of different representations in different databases as well as simply the quality of data being fed into these databases. These are some of the issues we are trying to resolve as we build Open PHACTS. We are trying to link data from various resources, noting and resolving conflicts when we can, and curating as necessary with the ultimate intention that this information will flow out into the community and be picked up by the database hosts and addressed, fixed, challenged as appropriate.
I’ve been looking for a new example showing the challenges of data integration considering that in Open PHACTS at present we are integrating chemistry from three primary data sets (for now)… DrugBank, ChEBI, ChEMBL. So, let’s consider Fluvastatin. The usual challenge of trying to determine what the “correct” chemical structure representation is for the compound is an iterative loop but let’s see what we can find in our datasets as we iterate. I KNOW from 4 years of looking at chemistry on Wikipedia that the data quality for chemical compound representations is very good. So, starting there we find the Wikipedia record here. The DrugBox links to a number of records in other databases.
One of these is ChemSpider and it has the SAME representation. On ChEBI the representation is inconsistent with no defined stereochemistry (except the E- double bond). Since ChEBI is manually curated and the compound carries 3 stars this should be correct. There are two records LINKED from this ChEBI record.
On DrugBank the compound has INVERTED stereochemistry from that on ChemSpider and Wikipedia… WP and ChemSpider have 3R,5S while DrugBank has 3S,5R but it DOES say in the pharmacodynamics section “It is prepared as a racemate of two erythro enantiomers of which the 3R,5S enantiomer exerts the pharmacologic effect.” confirming that the 3R,5S form is the ACTIVE form.
ChEMBL matches Wikipedia and ChemSpider here.
So, to summarize what we get when we search for Fluvastatin:
Stereo 3R,5S for Wikipedia, ChemSpider, ChEMBL
Stereo 3S,5R for DrugBank
No stereo for ChEBI
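The comparison summarized above can be sketched in a few lines. This is a toy illustration, not how Open PHACTS actually reconciles records: the labels are typed in by hand from the list above, the function names are made up for the example, and treating "swap every R and S" as the mirror image is a simplification that only works for plain R/S descriptor strings like these. A real pipeline would compare full structure representations rather than label strings.

```python
from collections import defaultdict


def invert(label):
    """Swap R and S in a label such as '3R,5S' to get the mirror-image label."""
    return label.translate(str.maketrans("RS", "SR"))


def group_by_stereo(records):
    """Group source names by the stereo label each one reports."""
    groups = defaultdict(list)
    for source, label in records.items():
        groups[label].append(source)
    return dict(groups)


# Stereo labels for Fluvastatin as found in each source (None = no stereo defined).
records = {
    "Wikipedia": "3R,5S",
    "ChemSpider": "3R,5S",
    "ChEMBL": "3R,5S",
    "DrugBank": "3S,5R",
    "ChEBI": None,
}

groups = group_by_stereo(records)
for label, sources in groups.items():
    print(label or "no stereo", "->", ", ".join(sources))

# Flag groups that are mirror images of one another.
defined = [label for label in groups if label]
enantiomer_pairs = [(a, b) for a in defined for b in defined
                    if a < b and invert(a) == b]
print(enantiomer_pairs)  # [('3R,5S', '3S,5R')]
```

Even this crude grouping makes the three camps obvious: one label shared by three sources, its mirror image in a fourth, and nothing at all in the fifth.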
Welcome to the complexities of name-structure relationships. These are some of the challenges we need to deal with on Open PHACTS. DailyMed defines fluvastatin sodium as “Fluvastatin sodium is sodium (±)-(3R*, 5S*, 6E)-7-[3-(p-fluorophenyl)-1-isopropylindol-2-yl]-3,5-dihydroxy-6-heptanoate” so the relative form….
I just gave a presentation at the NFAIS conference in Philadelphia with the conference focus being “Born of Disruption: An Emerging New Normal for the Information Landscape”. I was on a panel with Lee Dirks from Microsoft Research and Kristen Fisher-Ratan from PLoS. Both gave very interesting talks and it was a pleasure to be on the panel with them.
My talk was entitled “Crowdsourcing Chemistry for the Community – 5 Years of Experiences” with the abstract below.
“ChemSpider is one of the internet’s primary resources for chemists. ChemSpider is a structure-centric platform and hosts over 26 million unique chemical entities sourced from over 400 different data sources and delivers information including commercial availability, associated publications, patents, analytical data, experimental and predicted properties. ChemSpider serves a rather unique role to the community in that any chemist has the ability to deposit, curate and annotate data. In this manner they can contribute their skills, and data, to any chemist using the system. A number of parallel projects have been developed from the initial platform including ChemSpider SyntheticPages, a community generated database of reaction syntheses, and the Learn Chemistry wiki, an educational wiki for secondary school students.
This presentation will provide an overview of the project in terms of our success in engaging scientists to contribute to crowdsourcing chemistry. We will also discuss some of our plans to encourage future participation and engagement in this and related projects.”
The talk is embedded below. I thank the organizers for the ability to ask questions during the talk and get responses using a clicker feedback system (I didn’t realise ahead of time that the questions would consume a few seconds and ran over on my talk..agh). I will get the answers to the questions and post them in a separate post. Interesting answers…
Last week I was in the United Kingdom for numerous meetings and at the end of the week struggled to drive north to Macclesfield to the AstraZeneca site there to give a presentation on ChemSpider for an old colleague of mine from the Eastman Kodak company. I had not seen Tony Bristow in well over a decade but we reminisced about the good old days at Kodak (Tony worked in Harrow, UK and I worked in Rochester, NY. Tony is a Mass Spectrometrist and I am an NMR Spectroscopist by training). We also discussed how scientists are increasingly tapping into the ChemSpider resource to aid in the identification of chemical compounds using, especially, Mass Spectrometry. We have numerous examples now of when people are solving their structure ID issues directly by searching ChemSpider and are building up a portfolio of success stories.
The presentation I gave is below and loaded on SlideShare in case you want to download it.
I write a lot of publications, averaging about one peer-reviewed publication or book chapter per month. I have published with a number of publishers including my employer (Royal Society of Chemistry), with Elsevier, Wiley, Springer, ACS and many others. The experience with each publisher is different but, generally, pleasant and high quality. However, once in a while the experience is “interesting”. I especially have had some very interesting peer-review “experiences”. But that is not the point of this post. This post is about the other end of the process…paper reviewed, paper accepted and into proofing stage.
Last month Sean Ekins and I had a paper accepted and we listed in the paper some physicochemical parameters. These included logP, pKa, Lipinski parameters and Polar Surface Area, commonly known as PSA. When we got the paper back for proofing PSA had been replaced by “Prostate Specific Antigen”. It was a good catch on Sean’s part as first proofreader to spot it! How would that happen? One has to imagine a set of scripts that are searching for abbreviations and doing a find and replace. For PSA clearly context matters. For most biology papers the prostate specific antigen conversion for PSA might make sense. It doesn’t really make sense for chemistry and QSAR modeling. So, it’s all about context.
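To make the point concrete, here is a guess at what such a script might look like, next to a version that takes a crude look at context first. I have no idea what the publisher actually runs. The expansion table, the cue words and the function names below are all invented for this illustration.

```python
import re

# Hypothetical expansion table: one abbreviation, two domain-specific meanings.
EXPANSIONS = {
    "PSA": {
        "chemistry": "Polar Surface Area",
        "biology": "Prostate Specific Antigen",
    }
}

# Hypothetical cue words suggesting a chemistry / QSAR context.
CHEMISTRY_CUES = {"logp", "pka", "lipinski", "qsar"}


def naive_expand(text):
    """Context-blind find and replace: always apply one fixed expansion."""
    return re.sub(r"\bPSA\b", "Prostate Specific Antigen", text)


def guess_domain(text):
    """Very crude context check: look for chemistry cue words in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return "chemistry" if words & CHEMISTRY_CUES else "biology"


def context_aware_expand(text):
    """Pick the expansion that matches the guessed domain of the text."""
    domain = guess_domain(text)
    return re.sub(r"\bPSA\b", EXPANSIONS["PSA"][domain], text)


sentence = "Descriptors included logP, pKa, Lipinski parameters and PSA."
print(naive_expand(sentence))
print(context_aware_expand(sentence))
```

The first call produces the prostate version, the second the polar surface area version. Even a check this simple would have caught our case, although a human proofreader who knows the field remains the real safety net.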
We recently submitted an article in relation to our work on Computer Assisted Structure Elucidation. This is at a time when our book on CASE is about to go to the printers! This is one of our most interesting applications of ACD/Structure Elucidator and will be discussed in more detail when the paper is published. The paper is going to be published with Wiley’s Magnetic Resonance in Chemistry. MRC is my favorite NMR journal by far and I am always happy to publish there! After all these years I was shocked when the feedback from the copy-editors for our paper said…
The copy-editor was suggesting that we change all instances of PPM for chemical shift to mg/g. Excuse me? Are you serious? First of all PPM is THE defined unit for chemical shift. Did IUPAC change it without us knowing? PPM is a dimensionless unit, based on Hz/MHz, thus the 10^-6 dependence. Even if it was in terms of Gauss (another interpretation of the mg/G) it should be microGauss/Gauss, so mcg/G.
Anyway, it makes no sense right? Surely it is just an oversight, just a one off? Unfortunately no…this entire paper HAS been published with every PPM reference to chemical shift changed into mg/g. How did that happen? We have to imagine an automated search and replace, acceptance of the “house style” by the author and no oversight by the editor post-proofing. The result: chemical shifts are now quoted in milligrams/gram. Terrific! Surely a context issue of some type…but truth be told, I am not sure for what!
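For anyone who wants the Hz/MHz arithmetic spelled out, here is a minimal sketch. The numbers are hypothetical (a 125 MHz carbon observe frequency and a 17750 Hz offset from the reference), chosen only so that the result is easy to check, and the function names are mine.

```python
def chemical_shift_ppm(offset_hz, spectrometer_mhz):
    """Chemical shift in ppm: offset from the reference in Hz divided by
    the spectrometer frequency in MHz."""
    return offset_hz / spectrometer_mhz


def chemical_shift_as_ratio(offset_hz, spectrometer_mhz):
    """The same quantity as a pure Hz/Hz ratio; multiply by 1e6 to get ppm."""
    return offset_hz / (spectrometer_mhz * 1e6)


# Hypothetical example values, not taken from any real spectrum.
offset_hz = 17750.0
spectrometer_mhz = 125.0

print(chemical_shift_ppm(offset_hz, spectrometer_mhz))  # 142.0
print(chemical_shift_as_ratio(offset_hz, spectrometer_mhz) * 1e6)
```

Hz divided by Hz leaves no unit at all, just a factor of one in a million. There is no mass anywhere in the calculation, which is why milligrams per gram is meaningless here.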
Is this a side effect of non-skilled copy-editors? A result of off-shoring? Whatever the reason it is wrong..unless IUPAC truly decided on a new standard????! NOT….
When writing talks I try to find interesting (and where possible fun) examples of how challenging the world of managing chemistry data is for all of us that work in the world of managing tens of thousands, or in our case millions, of compound pages for the community to use. I have told many stories over the past few years of the challenges we collectively have in regards to data quality and how it flows between our databases unabated. My latest example used at the recent talk at the EBI (ChemSpider – An Online Database and Registration System Linking the Web) was the structure known as Terminal Dimethyl presently on PubChem, DrugBank, Wolfram Alpha and PDBe. It was originally inherited into ChemSpider also but has been deprecated. I left a comment on DrugBank a couple of weeks ago but it hasn’t been published yet…generally such errors are removed VERY quickly by the DrugBank hosts. I added a comment to Wolfram Alpha and received a canned response and no changes to the record as yet.
There ARE ways to communally resolve these issues and I will blog about that shortly.