Archive for category Quality and Content
This is a presentation that I delivered at the ACS Division of Chemical Information meeting regarding “Reproducibility, Reporting, Sharing & Plagiarism” at ACS Denver on 23rd March 2015.
I took the opportunity to set aside my hat as VP of Strategic Development at RSC, and as a member of the cheminformatics group that built ChemSpider and works on other related RSC projects. Instead I presented on how the LACK OF MANDATES from publishers regarding submission of data accompanying the articles I help write is actually weakening my scientific record, because the data is not being shared in the most useful forms possible for the benefit of the community. I think publishers would benefit from pushing me for MORE data, in fairly general standards, and from allowing me (and others) to download that data in the form of molecules (and collections), spectral data, CSV files etc.
Yes, I am a Williams. And THAT is an incredibly common surname. But I am an Antony Williams, notice no H in the name, i.e. NOT Anthony. In the field of chemistry there are not many of us around…a couple I know of, but not many overall. Google Scholar does an extremely good job of automatically associating my newly published articles with my Citations profile here: https://scholar.google.com/citations?user=O2L8nh4AAAAJ
I am assuming that this is done by understanding the type of work I publish on, some of the co-author network maps that have been established as my profile has developed, etc. I assume their approach is very intelligent relative to some of the more commonplace searches that have been implemented… certainly the results are GOOD.
I noticed one disastrous example today when our article “ChemTrove: Enabling a Generic ELN to Support Chemistry Through the Use of Transferable Plug-ins and Online Data Sources” was published in the Journal of Chemical Information and Modeling here. Right there to the left of the abstract is an offer to look at other content by the authors.
I was interested to see what else ACS knew about my content so I clicked on my name…which performed this search: http://pubs.acs.org/action/doSearch?ContribStored=Williams%2C+A and provided me with 96 articles by Andrew Williams (mostly), by Aaron Williams, by Anthony Williams (not me) and Allan Williams (to name a few). Eventually I managed to find 3 that were associated with me by searching the list for Antony Williams but none of those I published as Antony J. Williams were recovered.
Also, my colleague Valery Tkachenko is listed as an author with a misspelling, as Valery Tkachenkov. What is simply inappropriate in my opinion is how the process took the list of our submitted names, copied below directly from the submitted manuscript, and changed them to someone else’s interpretation of how we would want to see our names listed.
Aileen E. Day*†, Simon J. Coles‡, Colin L. Bird‡, Jeremy G. Frey‡, Richard J. Whitby‡, Valery E. Tkachenko§, Antony J. Williams§
Notice that for Aileen and Jeremy the middle initials were expanded, Colin had his middle initial changed from L. to I., Richard, Valery and I had our middle initials dropped and Valery had a v added to his surname. Why not simply copy and paste the names from the manuscript?
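To illustrate why this matters for discoverability, here is a minimal sketch (not ACS’s actual search logic, just an assumption about how a naive surname-plus-initial match behaves) showing how distinct authors collapse onto the same search key:

```python
# Illustrative sketch: a naive "surname, first-initial" author match
# conflates every A. Williams into one bucket, which is exactly the
# 96-mixed-hits problem described above.

def normalize(name):
    """Collapse an author name to a 'surname, initial' search key."""
    surname, _, given = name.partition(",")
    given = given.strip()
    initial = given[0].lower() if given else ""
    return f"{surname.strip().lower()}, {initial}"

authors = ["Williams, Antony J.", "Williams, Anthony", "Williams, Andrew",
           "Williams, Antony", "Williams, Aaron"]

# All five distinct people collapse to one key, so a search on
# "Williams, A" returns all of them mixed together.
keys = {normalize(a) for a in authors}
print(keys)  # {'williams, a'}
```

An identifier-based match (such as ORCID, discussed below) avoids this collapse entirely, which is why copying names verbatim from the manuscript is only a partial fix.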
I will point out that this is a “Just Accepted” manuscript, and likely the changes in names will be caught and edited, especially now that I have pointed them out. “Just Accepted” does come with some disclaimers:
While they can edit the names to match what we originally provided, I don’t think that will fix the issue of finding all of my articles in ACS journals: when I navigated to one of my other articles here, http://pubs.acs.org/doi/abs/10.1021/es0713072, and ran the search from my listed name, it returned exactly the same 96 hits.
Maybe a thought: use my ORCID profile http://orcid.org/0000-0002-2668-4821 to look for ACS journal articles associated with my name?
Unfortunately the data is already out in the wild: when I claimed the article on Kudos, all of the name-spelling issues had clearly spilled over via the DOI: https://www.growkudos.com/articles/10.1021%252Fci5005948
Ah…the things that surprise me….or not.
This is a presentation I gave at the National Institute of Standards and Technology on July 30th 2014
Experiences in Hosting Big Chemistry Data Collections for the Community
Access to scientific information has changed dramatically as a result of the web and its underpinning technologies. The quantities of data, the array of tools available to search and analyze them, the devices, and the shift in community participation all continue to expand, and the pace of change does not appear to be slowing. RSC hosts a number of chemistry data resources for the community including ChemSpider, one of the community’s primary online public compound databases. Containing tens of millions of chemical compounds and their associated data, ChemSpider serves data to tens of thousands of chemists every day. The platform offers the ability to crowdsource, enabling the community to deposit and curate data. This presentation will provide an overview of the expanding reach of this cheminformatics platform and the nature of the solutions it helps to enable, including structure validation, text mining and semantic markup. ChemSpider is limited in scope as a chemical compound database, and we are presently architecting the RSC Data Repository, a platform that will enable us to extend our reach to include chemical reactions, analytical data, and diverse data depositions from chemists across various domains. We will also discuss the possibilities it offers in terms of supporting data modeling and sharing. The future of scientific information and communication will be underpinned by these efforts, influenced by increasing participation from the scientific community.
Data Mining Dissertations and Adventures and Experiences in the World of Chemistry
This presentation was given at the CLIR/DLF Postdoctoral Fellowship Summer Seminar at Bryn Mawr college in Pennsylvania on July 29th 2014. The intention was to communicate what we are doing in the fields of text and data mining in the domain of chemistry and specifically around mining the RSC archive publication and chemistry dissertations and theses. How would these experiences map over to the humanities?
MOST people who are reading this blog post have likely performed peer review over the years, and I have reviewed a lot of manuscripts myself. The process has changed a lot over the past decade in many ways. A couple of examples of how things have changed for me:
1) More requests to review papers – and I increasingly turn down requests because they are from journals I have never heard of (some may call them “predatory publishers”), some are in areas for which I have no expertise (e.g. electrical engineering), and sometimes because I simply don’t have time.
2) I have seen papers I have reviewed show up essentially untouched in other journals (no edits and simply reformatted) and commonly these “refused papers” are accepted into what I deem to be “lower quality” publications.
Of course over the past ten years I’ve also had a lot of papers go through peer review for myself and my co-authors. This experience has also been very interesting, if not entertaining. Some examples:
1) I have experienced the “third reviewer”, where an editor has held up a manuscript or demanded changes to match their own expectations while the other reviewers said publish as is.
2) I have had the request to shorten excellent manuscripts to help with “page limits”….in the electronic age???
3) I have been on the receiving end of non-scientific reviews that have blocked a paper. My personal favorite “Mobile apps are a fad of the youth.”
My best story of peer review, and an example where modern technologies would have been so enabling at the time, is as follows.
I was asked to review a paper regarding the performance of Carbon-13 NMR prediction (this paper). A slice of the abstract says:
“Further we compare the neural network predictions to those of a wide variety of other 13C chemical shift prediction tools including incremental methods (CHEMDRAW, SPECTOOL), quantum chemical calculation (GAUSSIAN, COSMOS), and HOSE code fragment-based prediction (SPECINFO, ACD/CNMR, PREDICTIT NMR) for the 47 13C-NMR shifts of Taxol, a natural product including many structural features of organic substances. The smallest standard deviations were achieved here with the neural network (1.3 ppm) and SPECINFO (1.0 ppm).”
This was an important time for me, as this paper compared the performance of various NMR predictors based on ONE chemical structure. And while any single point of comparison is up for discussion, there were 47 shifts, so you could argue it is a bigger data set. One of the programs under review was a PRODUCT that I managed at ACD/Labs, CNMR Predictor. I clearly had a concern as, essentially, the success of this product was partly responsible for my income, and any comparison that made the software look poor in performance was an issue. Was this a conflict of interest? Maybe… but I judge myself to still be objective.
Table 3 listed the experimental shifts as well as the predicted shifts from the different algorithms and the size of the accompanying circle/ellipse was a visual indicator of a large difference between experimental and predicted. We will assume that all experimental assignments are correct and that there are no transcription errors between the predicted values from each algorithm and input into the table. A piece of Table 3 is shown below.
I kind of pride myself on being a little bit of a stickler for detail when it comes to reviewing data quality. Those of you who read this blog will know that. As I reviewed the data I was a little puzzled by the magnitude of the errors for certain Carbon nuclei, specifically for Carbons 23 and 27.
What was interesting to me was that the experimental shifts for 23 and 27 were 142.0 and 133.2 ppm respectively, yet the predicted shifts were 132.8 and 142.7 ppm respectively. It struck me that they looked like they had been switched. This was what drew my attention to reviewing the data in more detail. I will cut a long story short: I redrew the molecule of Taxol, input it into the same version of the software that was used for the publication, and got a DIFFERENT answer than that reported. I was able to determine WHY it was different: it was down to the orientation of a bond in the molecule as input by the reporting authors, and this made the CNMR prediction worse.
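The transposition check I did by eye can be sketched in a few lines. This is a hedged illustration, not the tooling I used at the time: it flags pairs of shifts where exchanging the two predicted values collapses both residuals, using the two Taxol values quoted above (the 1.5 ppm tolerance is my own illustrative choice):

```python
# Sketch of the check that flagged carbons 23 and 27: if swapping two
# predicted shifts makes both residuals collapse, the assignments were
# probably transposed in the table.

def swap_suspects(experimental, predicted, tol=1.5):
    """Return index pairs (i, j) where exchanging predicted[i] and
    predicted[j] brings both predictions within `tol` ppm of experiment
    and reduces the combined absolute error."""
    suspects = []
    n = len(experimental)
    for i in range(n):
        for j in range(i + 1, n):
            direct = abs(experimental[i] - predicted[i]) + abs(experimental[j] - predicted[j])
            swapped = abs(experimental[i] - predicted[j]) + abs(experimental[j] - predicted[i])
            if (swapped < direct
                    and abs(experimental[i] - predicted[j]) < tol
                    and abs(experimental[j] - predicted[i]) < tol):
                suspects.append((i, j))
    return suspects

# Carbons 23 and 27 from the paper: experimental 142.0 / 133.2 ppm,
# predicted 132.8 / 142.7 ppm -- a near-perfect fit once exchanged.
exp = [142.0, 133.2]
pred = [132.8, 142.7]
print(swap_suspects(exp, pred))  # [(0, 1)]
```

A flagged pair only suggests a transposition somewhere between prediction and table; confirming it still required re-running the software, as described above.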
I reported this detail to the editors in a detailed letter and recommended the manuscript for publication with the caveat that the numbers in the column representing CNMR 6.0 be edited to accurately reflect the performance of the algorithm, and that the details be provided. I was shocked to see the manuscript published later WITHOUT any of those edits, inaccurately representing the performance of the algorithm. I contacted the editors and, after a couple of exchanges, received quite a dressing down: the editor overseeing the manuscript refused to get between a commercial concern and reported science.
What does this mean? That software companies don’t do science and only academics do? I have had similar experiences of my colleagues in industry being treated with bias relative to my colleagues in academia. I believe my friends in industry, commercial concerns and academia can all be objective scientists… and after all, doesn’t academia teach the chemists that go out into industry and the commercial software world? These are my experiences… I welcome any comments you may have about the bias. BUT, back to the story…
The manuscript was published in June 2002, and as product manager I had to deal with questions around algorithmic performance for many months because “the peer-reviewed literature said…”. This was NOT the only instance of a situation like this: a couple of years later it was reported that ACD/CNMR could not handle stereochemistry, only to determine, with the scientist who wrote the paper, that he had thrown a software switch that affected his results. Software can be tricky, and unfortunately the best performance can often come through the hands of those who write the software. Sad but true in many cases.
In August 2004 we published an addendum with one of the original authors describing the entire situation in detail. It was over two years from the original publication to the final addendum. I do not believe there was any malicious intent on behalf of the authors of the original manuscript, but that was in the days when the only place to issue a rebuttal was in the journal, and we could not get editorial support to do it. How would it happen today if a suspicious paper came out? There is a myriad of tools available now…
Yes, I would blog the story here, as I am doing now. Yes, I would express concern at the situation on Twitter with the hope of gaining redress. I would likely tell the story in a SlideShare presentation, make a narrated movie, and make it available via an embed in the SlideShare presentation on my account. I would hope that the publisher nowadays would at least allow me to add a comment to the article, though I do understand that this comment would likely be moderated and they might choose not to expose it to the readers. I like the implementation on PLoS and have used it on one of our articles previously.
Could I maybe make use of a technology like Kudos, which I have started using? I have reported on it on this blog already here. I certainly could not claim the ORIGINAL article and start associating information with it regarding the performance of the algorithms… and that is a shame. But MAYBE in the future Kudos would consider letting OTHER people make comments and associate information/data with an article on Kudos. Risky? Maybe. However, I can claim the rebuttal that I was a co-author on and start associating information with that… certainly the original paper and ultimately a link to this blog. In fact, in the future, is a rebuttal going to be a manuscript that I publish out on something like Figshare, grab a DOI there, and maybe ask Kudos to treat as a published rebuttal? Peer review of that rebuttal could then happen as comments on Figshare and Kudos directly, and maybe in the future Kudos views and Altmetric measures of that become a measure of its importance. We live in very interesting times as these technologies expand, mesh and integrate.
The Importance of the InChI Identifier as a Foundation Technology for eScience Platforms at the Royal Society of Chemistry
This is a presentation I gave today at Bio-IT 2014 here in Boston. I was in the company of a number of my favorite people to be on the agenda with… Steve Heller, Steve Boyer, Evan Bolton and Chris Southan.
The Importance of the InChI Identifier as a Foundation Technology for eScience Platforms at the Royal Society of Chemistry
The Royal Society of Chemistry hosts one of the largest online chemistry databases containing almost 30 million unique chemical structures. The database, ChemSpider, provides the underpinning for a series of eScience projects allowing for the integration of chemical compounds with our archive of scientific publications, the delivery of a reaction database containing millions of reactions as well as a chemical validation and standardization platform developed to help improve the quality of structural representations on the internet. The InChI has been a fundamental part of each of our projects and has been pivotal in our support of international projects such as the Open PHACTS semantic web project integrating chemistry and biology data and the PharmaSea project focused on identifying novel chemical components from the ocean with the intention of identifying new antibiotics. This presentation will provide an overview of the importance of InChI in the development of many of our eScience platforms and how we have used it specifically in the ChemSpider project to provide integration across hundreds of websites and chemistry databases across the web. We will discuss how we are now expanding our efforts to develop a Global Chemistry Network encompassing efforts in Open Source Drug Discovery and the support of data management for neglected diseases.
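One reason the InChI (and its hashed InChIKey) works so well as integration glue across hundreds of databases is its fixed, machine-checkable format. As a minimal sketch (a shape check only, it says nothing about chemical correctness), the standard InChIKey is always 27 characters in a 14-10-1 uppercase-letter layout:

```python
import re

# Minimal sketch: the fixed InChIKey layout (14 + 10 + 1 uppercase
# letters, hyphen-separated) is what makes it a clean join key across
# databases. This validates the shape only, not the chemistry.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")

def looks_like_inchikey(s):
    return bool(INCHIKEY_RE.match(s))

# InChIKey for benzene (a well-known published value):
print(looks_like_inchikey("UHOVQNZJYSORNB-UHFFFAOYSA-N"))  # True
print(looks_like_inchikey("benzene"))                       # False
```

Because two databases that standardize a structure the same way derive the same key, a plain string join on InChIKeys links records with no name matching at all.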
This is the second presentation I gave at the ACS Meeting in Indianapolis
Accessing chemical health and safety data online using Royal Society of Chemistry resources
The internet has opened up access to large amounts of chemistry-related data that can be harvested and assembled into rich resources of value to chemists. The Royal Society of Chemistry’s ChemSpider database has assembled an electronic collection of over 28 million chemicals from over 400 data sources, and some of the assembled data is certainly of value to those searching for chemical health and safety information. Since ChemSpider is a text- and structure-searchable database, chemists are able to find relevant information using both of their general search approaches. This presentation will provide an overview of the types of chemical health and safety data and information made available via ChemSpider and discuss how the data are sourced, aggregated and validated. We will examine how the data can be made available via mobile devices and examine the issue of data quality and its potential impacts on such a database.
Eventually there will be simple answers to the question commonly asked by chemists: “What is the chemical structure of INSERT NAME?” This is going to be true for drugs as the various online databases work together to clean up, curate, qualify and declare what the chemical structure is for a particular drug. While we can have the purist’s argument that structure drawings do not represent reality (compounds are atoms bonded together by shared clouds of electrons that at any point in time may be changing, reorganizing, tautomerizing etc.), the reality is also that we need a common language for information exchange, and in the world of visual depictions for chemistry the 2D structure diagram is it. As we come together as a community to agree on preferred ways to standardize chemicals, to assist in representations in databases for example, this situation will improve. The efforts of the FDA to define structure representation standards, with the support of pharma, will contribute. For now we are left with the challenges of different representations in different databases, as well as simply the quality of the data being fed into them. These are some of the issues we are trying to resolve as we build Open PHACTS. We are trying to link data from various resources, noting and resolving conflicts when we can, and curating as necessary, with the ultimate intention that this information will flow out into the community and be picked up by the database hosts and addressed, fixed, or challenged as appropriate.
I’ve been looking for a new example of the challenges of data integration, considering that in Open PHACTS at present we are integrating chemistry from three primary data sets (for now): DrugBank, ChEBI and ChEMBL. So, let’s consider Fluvastatin. Trying to determine the “correct” chemical structure representation for a compound is an iterative loop, but let’s see what we can find in our datasets as we iterate. I KNOW from 4 years of looking at chemistry on Wikipedia that the data quality for chemical compound representations there is very good. So, starting there, we find the Wikipedia record here. The DrugBox links to a number of records in other databases.
One of these is ChemSpider, and it has the SAME representation. On ChEBI the representation is inconsistent, with no defined stereochemistry (except the E- double bond). Since ChEBI is manually curated and the compound carries 3 stars, this should be correct. There are two records LINKED from this ChEBI record.
On DrugBank the compound has INVERTED stereochemistry relative to ChemSpider and Wikipedia: WP and ChemSpider have 3R,5S while DrugBank has 3S,5R. However, DrugBank DOES say in the pharmacodynamics section “It is prepared as a racemate of two erythro enantiomers of which the 3R,5S enantiomer exerts the pharmacologic effect.”, confirming that the 3R,5S form is the ACTIVE form.
ChEMBL matches Wikipedia and ChemSpider here.
So, to summarize what we get when we search for Fluvastatin:
Stereo 3R,5S for Wikipedia, ChemSpider, ChEMBL
Stereo 3S,5R for DrugBank
No stereo for ChEBI
Welcome to the complexities of name-structure relationships. These are some of the challenges we need to deal with on Open PHACTS. Dailymed defines the sodium fluvastatin as “Fluvastatin sodium is sodium (±)-(3R*, 5S*, 6E)-7-[3-(p-fluorophenyl)-1-isopropylindol-2-yl]-3,5-dihydroxy-6-heptanoate” so the relative form….
I just gave a presentation at the NFAIS conference in Philadelphia with the conference focus being “Born of Disruption: An Emerging New Normal for the Information Landscape”. I was on a panel with Lee Dirks from Microsoft Research and Kristen Fisher-Ratan from PLoS. Both gave very interesting talks and it was a pleasure to be on the panel with them.
My talk was entitled “Crowdsourcing Chemistry for the Community – 5 Years of Experiences”, with the abstract below.
“ChemSpider is one of the internet’s primary resources for chemists. ChemSpider is a structure-centric platform and hosts over 26 million unique chemical entities sourced from over 400 different data sources and delivers information including commercial availability, associated publications, patents, analytical data, experimental and predicted properties. ChemSpider serves a rather unique role to the community in that any chemist has the ability to deposit, curate and annotate data. In this manner they can contribute their skills, and data, to any chemist using the system. A number of parallel projects have been developed from the initial platform including ChemSpider SyntheticPages, a community generated database of reaction syntheses, and the Learn Chemistry wiki, an educational wiki for secondary school students.
This presentation will provide an overview of the project in terms of our success in engaging scientists to contribute to crowdsourcing chemistry. We will also discuss some of our plans to encourage future participation and engagement in this and related projects.”
The talk is embedded below. I thank the organizers for the ability to ask questions during the talk and get responses using a clicker feedback system (I didn’t realise ahead of time that the questions would consume a few seconds and ran over on my talk..agh). I will get the answers to the questions and post them in a separate post. Interesting answers…
Last week I was in the United Kingdom for numerous meetings, and at the end of the week made the long drive north to the AstraZeneca site in Macclesfield to give a presentation on ChemSpider for an old colleague of mine from the Eastman Kodak company. I had not seen Tony Bristow in well over a decade, but we reminisced about the good old days at Kodak (Tony worked in Harrow, UK and I worked in Rochester, NY; Tony is a Mass Spectrometrist and I am an NMR Spectroscopist by training). We also discussed how scientists are increasingly tapping into the ChemSpider resource to aid in the identification of chemical compounds using, especially, Mass Spectrometry. We now have numerous examples of people solving their structure ID issues directly by searching ChemSpider, and we are building up a portfolio of success stories.
The presentation I gave is below and loaded on SlideShare in case you want to download it.