
Open Science for Identifying “Known Unknown” Chemicals

I am happy to announce the publication of the article “Open Science for Identifying ‘Known Unknown’ Chemicals”. I have been involved with two other articles about the identification of “known unknowns”.

The first one was a ChemSpider article: “Identification of ‘known unknowns’ utilizing accurate mass data and ChemSpider”. Journal of The American Society for Mass Spectrometry. 23: 179–185. doi:10.1007/s13361-011-0265-y.

The second one was a recent article from the EPA: “Identifying known unknowns using the US EPA’s CompTox Chemistry Dashboard”. Analytical and Bioanalytical Chemistry. 409: 1729–1735. doi:10.1007/s00216-016-0139-z.

The most recent publication was a collaboration with Emma Schymanski from Eawag, and it was a real pleasure to write it together. If you are interested in how Open Science can contribute to the challenges associated with the identification of known unknowns, check out our latest publication!


Why Have I Pushed so Much Traffic To Twitter This Weekend? GAMING or SAVVY?

Next Tuesday, November 29th, I am leading a two-hour workshop as described here:

The NC-ACS together with RTI International is excited to provide dinner and a workshop titled “Building an Online Profile Using Social Networking and Amplification Tools for Scientists”!

DATE AND TIME: Tue, November 29, 2016, 6:00 PM – 9:00 PM EST

LOCATION: The Frontier, 800 Park Offices Drive, Triangle, NC 27709

The event includes dinner from The Farmery starting at 6PM! The workshop will begin promptly at 6:30PM.

Please remember to bring your computer and let our speaker, Antony Williams, help you build your online profile!

Space is limited!  Please register here:

In advance of that gathering I was fortunate to have two papers published last week and I wanted to show how I could use Social Media to drive attention, views, downloads and altmetrics to those papers. They are:

“Programmatic conversion of crystal structures into 3D printable files using Jmol”

“An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modelling”

I started pushing the 3D printing article out on Friday morning, noticed a surge in attention early in the day, and saw it continue throughout the day. I kept promoting it throughout the weekend and saw less attention. While it is possible that I saturated my network of connections, I think it is more likely that people are simply away from their computers at the weekend, so Twitter gets less attention from the overall network. That’s my hypothesis, yet to be proven. It SHOULD be noted that the initial surge in altmetrics came from the publisher themselves when they pushed it out for us as authors. I suggest making sure your PUBLISHER is pushing out your article via Twitter as part of their service. And BOOK PUBLISHERS should be using Twitter in the same way.

For the automated curation procedure and QSAR modeling paper, I FOUND it had been published at about midnight on Friday, as I kept checking back to see when it was finally out. (Emails to authors would be a good idea, don’t you think?) I pushed that out after midnight on Friday, and the attention and corresponding altmetrics are way lower than for the 3D article. Maybe it’s because the article is less interesting (but I don’t agree with that for my network). Maybe, and more likely I think, a release on Friday night and into Saturday gets less overall Twitter attention (see original hypothesis). But it could be that I simply saturated the network with my first 3D printing posting. It’s not possible to tease this out with this one experiment so there will be others. Maybe the study has already been done???

In any case the 3D printing one has good altmetric scores now (40 as of 12:50pm on Sunday) and the QSAR modeling paper is lagging (a score of 4). I think a big contribution to the lagging altmetrics for the QSAR modeling paper is the fact that SAR and QSAR in Environmental Research from Taylor and Francis may not have much of a following and may not tweet out the article directly (the last comments I saw about SAR and QSAR on Twitter were mostly in 2013). One other MAJOR contributing factor may be that JChemInf is FULLY Open Access and our 3D article is fully Open. The SAR and QSAR article in Taylor and Francis has an Open Access option and we didn’t use it, yet. Again, just hypotheses.

Thanks to @JChemInf for doing their job well re. pushing it out to Twitter. I think it helped….


Our dire need to mandate data standards and expectations for scientific publishing

This is a presentation that I delivered at the ACS Division of Chemical Information meeting regarding “Reproducibility, Reporting, Sharing & Plagiarism” at ACS Denver on 23rd March 2015.

I took the opportunity to take off my hats as VP of Strategic Development at RSC and as a member of the cheminformatics group that built ChemSpider and works on other RSC projects related to it. Instead I presented on how a LACK OF MANDATES from publishers regarding the submission of data accompanying the articles I am involved in writing is actually weakening my scientific record, as data is not getting shared in the most useful forms possible for the benefit of the community. I think there would be benefits if publishers started pushing me for MORE data, in fairly general standards, and allowed me (and others) to download the data in the form of molecules (and collections), spectral data, CSV files etc.



Pacifichem: The Increasing Influence of Openness in the Domain of Chemistry (#325)

Pacifichem 2015 will see me co-hosting two sessions at the meeting. The first one is “The Evolving Nature of Scholarly Communication: Connecting Scholars with Each Other and with Society”, and the second one is “The Increasing Influence of Openness in the Domain of Chemistry”.

The call for abstracts is open for BOTH sessions now until April 3rd. I believe that the majority of readers of this blog almost certainly would be interested in attending these sessions and hopefully contributing to them so please submit your abstracts soon before the deadline expires!

Outline of the Session:

Chemists are being impacted by openness every day. Open Access publishing is being encouraged by various funding agencies that support our work, we are using open source code on a daily basis, whether we know it or not, and we are increasingly accessing open data via Internet searches. The proliferation of open science is providing access to an increasing number of free and re-usable data sources for chemistry. Crowdsourcing platforms allow chemists to contribute data, annotations and assertions, including chemical compounds, reaction schema, and analytical data. Collectively these data are facilitating education, enhancing discoverability and underpinning decision making in the laboratory. Through this symposium we aim to bring together participants serving up resources for the community and engage the audience in reviewing the success, opportunities and future of open science in its various forms. The future development of science will be increasingly impacted by open contributions and innovation, and chemistry in particular is one of the scientific domains presently gaining momentum.


A presentation at Research Square: The Benefits of Participation in the Social Web of Science

Yesterday I had the privilege of giving a presentation at Research Square in Durham. It was certainly an ideal environment to present in, with a very receptive audience….but how could it not be with their mission being to provide “research communication without roadblocks”. As the MC for the day commented about when she joined Research Square, “I thought ‘I’d found my peeps’”. So many of the conversations over lunch were about commonality of views..and it appears…our networks are so similar….yup, definitely my type of peeps. 🙂

If you don’t think you know Research Square then maybe you know some of their brands? Rubriq, Journal Guide and American Journal Experts.

The Benefits of Participation in the Social Web of Science

With the flourishing environment of platforms for sharing data, establishing an online profile and engaging in scientific discourse through alternative modes of publishing and participation, there are numerous potential benefits. However, while many scientists invest significant amounts of time in sharing their activities and opinions with friends and family the majority do not make use of the new opportunities to participate in the developing social web of science, despite the potential impact and influence on future careers. We now have many new ways to contribute to science outside of the classical publishing model. These include the ability to annotate and curate data, to “publish” in new ways on blogs and micropublishing sites, and many of these activities can be as part of a growing crowdsourcing network. Our efforts in this area are already being indexed and exposed on the internet via our publications, presentations and data and increasingly we are being quantified. This presentation will provide an overview of the various types of networking and collaborative sites available to scientists and ways to expose their scientific activities online. Many of these can ultimately contribute to the developing metrics of a scientist as identified in the new world of alternative metrics. Participation offers a great opportunity to develop a scientific profile within the community and may ultimately be very beneficial, especially to scientists early in their career.


A chemistry data repository to serve them all

A presentation that I am giving around UK universities in September/October 2014

A chemistry data repository to serve them all

Over the past five years the Royal Society of Chemistry has become world renowned for its public domain compound database that integrates chemical structures with online resources and available data. ChemSpider regularly serves over 50,000 users per day who are seeking chemistry related data. In parallel we have used ChemSpider and available software services to underpin a number of grant-based projects that we have been involved with: Open PHACTS – a semantic web project integrating chemistry and biology data, PharmaSea – seeking out new natural products from the ocean and the National Chemical Database Service for the United Kingdom. We are presently developing a new architecture that will offer broader scope in terms of the types of chemistry data that can be hosted. This presentation will provide an overview of our Cheminformatics activities at RSC, the development of a new architecture for a data repository that will underpin a global chemistry network, and the challenges ahead, as well as our activities in releasing software and data to the chemistry community.


The future of scientific information & communication presented at the SUNY Potsdam Academic Festival

This is a LONG presentation….I talk about the “It’s All About Me” attitude that can positively feed science….we want to share OUR science, we want people to know about our opinions, our activities, our collaborators, we want to get funding, recognition and attribution. And why not…it can all be to the benefit of science.

This presentation was given at the SUNY Potsdam Academic Festival

The future of scientific information & communication

Our access to scientific information has changed in ways that were hardly imagined even by the early pioneers of the internet. The immense quantities of data and the array of tools available to search and analyze online content continue to expand while the pace of change does not appear to be slowing. While scientists now have access to the enormous capacities and capability of the internet the vast majority of scientific communication continues to be through peer-reviewed scientific journals. The measure of a scientist’s contribution is primarily represented by their publication profile and the citations to their published works and offers an incomplete view of their activities. However, we are at the beginning of a new revolution where the ability to communicate offers the opportunity to embrace new forms of publishing and where scientific participation and influence will be measured in new ways. This presentation will provide an overview of our new generation of “openness” in which open source, open standards, open access and open data are proliferating. The future of scientific information and communication will be underpinned by these efforts, influenced by increasing participation from the scientific community and facilitated collaboration, and will ultimately accelerate scientific progress.


Why Open Drug Discovery Needs Four Simple Rules for Licensing Data and Models

There are a number of people in my domain that I have great appreciation for and that I enjoy working with. So, the chance to co-author rules for licensing data with Sean Ekins and John Wilbanks was an opportunity too good to miss. There are a lot of opinions, rants and views on data licensing floating around the internet, discussed at conferences and over beverages. Meanwhile we have opinions too and have shared them in this perspective in PLoS Computational Biology: “Why Open Drug Discovery Needs Four Simple Rules for Licensing Data and Models”.


Open Notebook Science and One Future for Scientific Research

A few weeks ago I was invited to give a presentation to the Board of Directors at Burroughs Wellcome. I was very interested in taking this opportunity to discuss my views on Open Science, Open Notebook Science, Open Data etc with this group of very esteemed scientists. However, it turned out it clashed with a planned vacation. Since my friend and frequent co-author Sean Ekins is an evangelist for open science for drug discovery, improving data quality, and Mobile Apps, and since we think alike on so many levels, I asked Sean whether he’d want to give the presentation. And, always welcoming adventure Sean jumped at the chance to present.

As it turned out Hurricane Rina resulted in us cancelling our vacation so I ended up attending the presentation with Sean. While we had bounced the slides between each other prior to the presentation Sean did a terrific job as the presenter and we had some very interesting questions regarding what is standing in the way of open science, especially around chemistry databases (of compounds), what are good examples of bioinformatics projects that are successful, and whether there are “risks” inherent to Open Science, especially in regards to what is shared online in public compound databases. I thoroughly enjoyed the meeting, short as it was and am glad that we were given the opportunity.

Sean has eloquently outlined the nature of the presentation at his site (he is Collabchem) and the presentation is below for your comments and review. I recommend that you check out Sean’s other presentations too!



A Response to Raw Data in Organic Chemistry Papers and Open Science

@mattoddchem has posted “Raw Data in Organic Chemistry Papers/Open Science”, reflecting that to “share data in my field I have to think about how to best share those data”. I recommend you read his whole post before continuing.

Mat has made a number of comments (my annotations are added as AJW>):

What are the downsides of posting raw data in organic chemistry, either in papers or to lab book posts:

1) You have to save the data and then upload them. Well, this was a problem in 1995, but not now.

AJW> The problem is definitely not in the technology of uploading to a website. It can be easy to do BUT there is still an activation barrier: do I want to share, can I share (do I have the rights or will my boss/supervisor be upset?), and will there be negative repercussions for sharing?

2) The data files are large. Not really. A 1H NMR spectrum is ca. 200KB.

AJW> Depending on the magnetic field strength and chosen resolution, file sizes can be less than half this size or twice this size. It also depends a lot on format. Binary file formats can give you 100 KB, which blows up to half a megabyte as JCAMP (assuming none of the compressed formats).

3) It’s a pain. Yes, a little. But we must suffer for things we love.

AJW> Yes indeed….

4) People might find mistakes in my spectra/assignments. Yes. You’re a scientist. This is a Good Thing.

AJW> If we were afraid of the reexamination of our data we shouldn’t get into science, we shouldn’t publish and we shouldn’t share. How many scientists are like that???
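To put rough numbers on the file-size point under item 2, here is a back-of-envelope sketch. The point count (32K), the 4 bytes per binary point and the 12 characters per text value are all illustrative assumptions, not properties of any particular vendor format or JCAMP flavor; the only aim is to show why an uncompressed text export ends up several times larger than the binary file.

```python
# Back-of-envelope size estimate for one 1D NMR spectrum.
# All numbers below are illustrative assumptions, not vendor specifications.

def binary_size_bytes(n_points: int, bytes_per_point: int = 4) -> int:
    """Size of the data block when each point is a fixed-width binary number."""
    return n_points * bytes_per_point

def text_size_bytes(n_points: int, chars_per_value: int = 12) -> int:
    """Size when each point is written as plain text (digits, sign, separator)."""
    return n_points * chars_per_value

n_points = 32 * 1024  # assumed: 32K real points
binary_kb = binary_size_bytes(n_points) / 1024
text_kb = text_size_bytes(n_points) / 1024

print(f"binary: {binary_kb:.0f} KB, uncompressed text: {text_kb:.0f} KB")
```

With these assumed values the text form is three times the binary size, which is the same order of growth as the 100 KB to half a megabyte mentioned above. Different point counts or a compressed JCAMP form would change the ratio.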

Mat went on to ask what types of raw data should be posted:

“If we can establish that we should be posting raw data, then what kinds of data should we share, and how? This post is meant to outline an answer, and ask for feedback from anyone who’s already thought about this.

1) X-ray crystallography. This is the exception. Data are routinely deposited raw, and may be downloaded. Not always the case, but XRD blazes a trail here.

AJW> I’m not a crystallographer…I’m an NMR jock by training. So the term “raw data” to me means what comes off the instrument. I think that most crystallography data is processed to extract the CIF file so it’s not raw per se. In NMR raw would be the binary file format from the vendor as a FID file, the time domain data. A frequency domain, phased, baseline corrected and referenced spectrum would not be raw data. The FID is not the most useful form of the spectrum for sure, and real time processing of the FID with all necessary corrections (referencing and phasing etc) is not going to succeed for all spectra.

2) NMR spectroscopy. The big one. IUPAC recommends the JCAMP-DX file format. Jean-Claude Bradley has been a proponent of this format, and has demonstrated how it can be used in all kinds of applications.

AJW> JCAMP is the de facto standard for data exchange in NMR. Most of the vendors (probably all) can autosave their data in JCAMP or at least export to JCAMP. Third-party processing applications import and export JCAMP, some in all of its various flavors….for example, the ACD/NMR Processor package can export all of the various JCAMP flavors….including real only, real and imaginary, with extensions for integrals, peaks, assignments etc. The five JCAMP forms are listed. There are of course no guarantees that all of these various JCAMP flavors can be interpreted by other programs. JCAMP has that type of complexity unfortunately.

ACD/NMR Processor Export to JCAMP


We’ve played with it, and in one of our recent papers we deposited all the NMR data in this format in the SI. We’ve been posting JCAMP-DX files in our online electronic lab notebooks, e.g. here.

AJW> I downloaded a number of these and read them easily into my NMR processing package that I use on my PC. I submitted the first structure in the series 2a (N-(3-Azidopropyl)-1-methylpyrrole-2-carboxamide) to ChemSpider (it wasn’t on the database already) and then added the H1 and C13 spectra with a few clicks (see here how). The spectra are flagged as Open Data and available for everyone to download and reuse. I am assuming this is okay since the PLoS article is Open Access, though the data itself isn’t flagged as Open. The data are on ChemSpider here and shown below.

Spectra in ChemSpider


My opinion of this file format (both generating it, and reading it) has not been great. There are two formats, I understand, and we found that if we saved the data in the wrong format, we couldn’t read the data with certain programs, but could with others. i.e. we had to get the generation of the file just right.

AJW> There are way more flavors than two! See above in the image. Over 10 years at ACD/Labs we had to change the JCAMP reader tens of times to support all of the different implementations of the “standard” JCAMP. A standard set is not a standard followed…

That kind of trickiness, though small, just inevitably means people won’t bother to generate or use the files on a mass scale (unless the journals decide to back it). PDF’s popularity is based on the ubiquity of the reader. JCAMP-DX works well with Jspecview, a free, open source NMR data reader. We’ve not enjoyed our experiences with this, either, though it’s a wonderful endeavour. This led us to look at whether there was a need for saving the data in a particular format, or whether we could just save the raw data, and process those data with a free piece of software. After looking at this with our resident NMR guru, Ian Luck, we found that saving raw data is easy (it’s just a copy and paste of what’s produced by the machine) and that the raw data can be read by free software such as Spinworks or ACDLabs, obviously in addition to our in-house software. This seems ideal? Does anyone have the reason IUPAC prefers a derived data format over the raw data, other than JCAMP-DX is a single file? Aren’t raw data likely to be the most generically useful long-term?

AJW> Assuming that raw data means binary file formats from a spectrometer vendor, I can comment that binary files have big issues in terms of longevity. Binary files are that way for a reason…one reason being that the vendors have proprietary information regarding their data acquisition and corrections in the format, which is then dealt with efficiently by the vendor’s software. Processing Varian, Bruker or Jeol files from many years ago, written by their archaic software, is likely not easy even in their own modern software. Depending on the vendors to provide their binary file formats to code against generally requires permission AND legal contracts. When I joined a place of work about 20 years ago we had a room of old Winchester drives in OLD Bruker and Varian formats. Not only could we NOT read the files, we couldn’t mount the Winchester drives. What about 5.25″ drives, 3.5″ drives, optical disks etc.? Lots of that data is almost certainly rotted at this point. JCAMP does offer longevity. Even when a spectrum cannot be read, it is generally not difficult to hack the reader. Fixes are easy.

I don’t know if people have experience of this. I was in touch with one of the ACS journals recently, who indicated that their view was that the journal is not a data repository, and that posting of raw data (which was in their view to some extent desirable) should be posted elsewhere, e.g. to an institutional repository. This is an option. I think it’s less convenient. PLoS seem happy to host the data.

AJW> PLoS are hosting the data but in order to view the data you have to have a JCAMP viewer or processing package that will read the JCAMP spectrum. What a reader might like to be able to do is view the spectrum on the Supp Info page. This is of course possible as can be done using the Embed Functionality of ChemSpider’s spectra. The spectra could be posted to ChemSpider and embedded in the PLoS pages if the journal allowed it. There are multiple advantages to this including the spectra being made available to the Spectral Game.

3) IR data. Don’t know if there is a standard. If the file is small, saving raw data could be encouraged. Would allow easy comparisons of fingerprint regions.

AJW> JCAMP is also a standard for IR. In fact I think JCAMP was developed for IR first? See here for IR spectra on ChemSpider.

4) Mass spectrometry. It’s not clear to me there is a huge advantage here to sharing raw data, for a typical low res experiment?

AJW> Most mass spec for chemists is done using soft ionization for mass only. Fragmentation obviously is of more value. If it’s Electron Impact then JCAMP will suffice. See here. In general NetCDF is the preferred format but we don’t support that on ChemSpider at present.

5) HPLC data. Again, the outputs are fairly simple, and I’m not clear about the advantage of raw data (which I’m assuming would be absorbance vs. time table). Would (perhaps) permit verification that traces have not been cropped to remove pesky impurities.

AJW> Adding HPLC support would not be difficult for a simple HPLC trace. If it was 2D data of LC and UV, that would be problematic.

6) Anything else?

AJW> Raman, Electron Spin Resonance, 2DNMR, Near infrared, Far infrared, UV-Vis etc. All of these, except ESR, are supported in ChemSpider. Check out this example of H1, C13 and THREE 2D NMR spectra!”
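On the point that a JCAMP file can usually be rescued by hacking the reader: the format is plain text built from labeled records of the form `##LABEL=value`, so a first pass at reading one is short. The sketch below is only an illustration under stated assumptions: the sample file content is made up, the function names are hypothetical, and it reads header records only without decoding any data table. It is not how ChemSpider or any vendor package reads JCAMP.

```python
# Minimal sketch of reading JCAMP-DX labeled data records (LDRs).
# It only collects "##LABEL=value" header lines; it does not decode
# the data tables, and real files vary between vendors.
import re

SAMPLE = """\
##TITLE=example spectrum (made-up data)
##JCAMP-DX=5.01 $$ comment after the value
##DATA TYPE=NMR SPECTRUM
##NPOINTS=4
##XYDATA=(X++(Y..Y))
10.0 1 2 3 4
##END=
"""

def normalize(label: str) -> str:
    """Upper-case a label and drop spaces, dashes, slashes and underscores."""
    return re.sub(r"[ \-/_]", "", label).upper()

def read_headers(text: str) -> dict:
    """Return {normalized label: value} for every ## line in the text."""
    headers = {}
    for line in text.splitlines():
        if not line.startswith("##"):
            continue  # data lines and continuation lines are skipped here
        label, _, value = line[2:].partition("=")
        value = value.split("$$", 1)[0].strip()  # strip trailing $$ comments
        headers[normalize(label)] = value
    return headers

headers = read_headers(SAMPLE)
# Which table form the file uses is one hint at the "flavor" a reader must handle.
table_forms = [k for k in ("XYDATA", "XYPOINTS", "PEAKTABLE", "NTUPLES") if k in headers]
print(headers["TITLE"], headers["JCAMPDX"], table_forms)
```

Because everything is readable text, a file that a given program rejects can at least be opened, inspected and patched by hand, which is not true of an undocumented binary format.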

An alternative format of interest might be AnIML, the Analytical Information Markup Language. This format was started a number of years ago and the site does look static, but you can follow AnIML on Twitter and it is still alive. I am a little pessimistic that it will provide much value in the near future as the standard will first need to be released, then accepted, then tools will need to be developed/extended to support the format. The challenge then is that either some organization will need to write all the format converters for all of the vendor instrument binary file formats (and support them for tech support etc) OR the vendors themselves will need to export AnIML, with similar issues to those that arose with JCAMP where the vendors developed their own flavors. Clearly a validation suite for checking AnIML files would be appropriate. Overall AnIML has promise but it has taken years to get here already and I would say it will have limited impact in the foreseeable future.

For Mat’s spectral data it would be great to deposit it en masse to ChemSpider rather than for us to deal with it one spectrum and one compound at a time. That is why we already have developed mass deposition tools. If we receive the JDX files in individual directories with the associated molfile then we can do mass deposition. The spectra are then available on ChemSpider immediately, are available to the Spectral Game and also will be fed to our NEW implementation of SpectraSchool that is presently under development and will use spectra under ChemSpider as the foundation data. I’d say the solution for Mat is mostly in place…
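As a rough idea of what preparing such a batch could look like on the depositor's side, here is a minimal sketch that pairs JDX files with the molfile in the same directory. The layout (one compound per directory, one `.mol` file plus one or more `.jdx` files), the file names and the function name are assumptions for illustration; the actual ChemSpider deposition tools are not shown and nothing is uploaded.

```python
# Sketch: pair JCAMP-DX spectra with the molfile in the same directory,
# as a pre-check before a bulk deposition. The layout (one compound per
# directory, *.jdx plus one *.mol) is an assumption for illustration;
# no ChemSpider API is called here.
from pathlib import Path
import tempfile

def collect_depositions(root: Path) -> list:
    """Return (molfile, [spectra]) for each subdirectory holding exactly one molfile."""
    batches = []
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        molfiles = sorted(folder.glob("*.mol"))
        spectra = sorted(folder.glob("*.jdx"))
        if len(molfiles) == 1 and spectra:
            batches.append((molfiles[0], spectra))
    return batches

# Build a throwaway example tree so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "compound_2a").mkdir()
    for name in ("2a.mol", "2a_1H.jdx", "2a_13C.jdx"):
        (root / "compound_2a" / name).write_text("placeholder")
    (root / "no_structure").mkdir()
    (root / "no_structure" / "orphan.jdx").write_text("placeholder")

    batches = collect_depositions(root)
    summary = [(mol.name, [s.name for s in spectra]) for mol, spectra in batches]

print(summary)
```

Directories without a structure file are skipped rather than guessed at, since a spectrum with no associated compound cannot be attached to a record.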
