PITTCON Poster: ChemSpider – building an online database of open spectra

This is a poster I presented at Pittcon on Wednesday, March 11th, 2015.

ChemSpider – building an online database of open spectra

ChemSpider is an online database of over 30 million chemical compounds sourced from over 500 different sources including government laboratories, chemical vendors, public resources and publications. Developed with the intention of building a community for chemists, ChemSpider allows its users to deposit data including structures, properties, links to external resources and various forms of spectral data. Over the past few years ChemSpider has aggregated almost 20,000 high-quality NMR and IR spectra and continues to expand as the community deposits additional types of data. The majority of spectral data is licensed as Open Data, allowing it to be downloaded and reused in presentations, lesson plans and for teaching purposes. This poster will present our existing technology and our plans to host a million spectra in our developing online data repository.

Presentations at the ACS Meeting in Denver

Having just returned from Pittcon late last night, I am now turning my attention to the next set of presentations to be given at the ACS Denver meeting. These are listed below. If any of the blog readers will be at the ACS meeting it would be great to catch up. See you there.

PAPER TITLE: Importance of data standards for large scale data integration in chemistry (final paper number: CINF 39)
DAY & TIME OF PRESENTATION: Wednesday, March 25, 2015 from 11:20 AM – 11:50 AM
ROOM & LOCATION: Room 110 – Colorado Convention Center

Increasingly, online databases are being used for the purpose of structure identification. In many cases a compound unknown to an investigator is already known in the chemical literature or an online database, and these “known unknowns” are commonly available in these aggregated internet resources. Compounds of this type in commercial, environmental, forensic, and natural product samples can be identified by searching these large aggregated databases, querying by either elemental composition or monoisotopic mass. We will report on the search approaches that we offer on aggregated compound databases hosted by the Royal Society of Chemistry and how these resources can be used for the purpose of structure identification. We will also report on our progress in the area of hosting interactive spectral data, including assignments, on our data repository and how we are using our analytical data platform for the purpose of natural product dereplication.
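As a rough illustration of the monoisotopic mass query mentioned in the abstract above, here is a minimal Python sketch. The tiny candidate list, the function names and the 5 ppm tolerance are all assumptions made for illustration, and the element table is deliberately incomplete. It is not a description of how the RSC search is implemented.

```python
import re

# Monoisotopic masses (in u) of the most abundant isotope of a few elements.
# A deliberately small table; a real tool would cover the whole periodic table.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.0030740,
    "O": 15.9949146,
    "S": 31.9720710,
}

def monoisotopic_mass(formula):
    """Sum isotope masses for a simple formula such as 'C8H10N4O2'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total

def search_by_mass(compounds, target, ppm=5.0):
    """Return names whose monoisotopic mass lies within +/- ppm of target."""
    tolerance = target * ppm / 1e6
    return [
        name
        for name, formula in compounds.items()
        if abs(monoisotopic_mass(formula) - target) <= tolerance
    ]

# Hypothetical stand-in for an aggregated compound database.
CANDIDATES = {
    "caffeine": "C8H10N4O2",
    "theophylline": "C7H8N4O2",
    "aspirin": "C9H8O4",
}

hits = search_by_mass(CANDIDATES, 194.0804)
print(hits)
```

With a measured mass of 194.0804 only the caffeine entry falls inside the tolerance window. In a database of tens of millions of compounds the same query would typically return many isomers, which is why ranking the hits matters.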


PAPER TITLE: Give me kudos for taking responsibility for self-marketing my scientific publications and increase impact (final paper number: CINF 8)
DAY & TIME OF PRESENTATION: Sunday, March 22, 2015 from 2:15 PM – 2:40 PM
ROOM & LOCATION: Room 110 – Colorado Convention Center

The authoring of a scientific publication can represent the culmination of many tens if not hundreds of hours of data collection and analysis. The authoring and peer-review process itself often represents a major undertaking in terms of assembling the publication and passing through review. Considering the amount of work invested in the production of a scientific article it is therefore quite surprising that authors, post-publication, invest very little effort in communicating the value and potential impact of their article to the community. Social networking has clearly demonstrated the ability to self-market and drive attention. At the same time, the increasing volume of literature (over a million new articles are published every year) requires authors to take on a more direct role in ensuring their work gets read and cited. This requirement may grow with the emergence of a range of metrics at the article level, shifting attention away from where a researcher publishes to the performance of their individual articles. Therefore, a separate platform to facilitate social networking and other discovery tools to communicate the value of published science to the community would be of value. In parallel, the possibility to enhance an article by linking to additional information (presentations, videos, blog posts etc.) allows for enrichment of the article post-publication, a capability not available via the publisher’s platform. This presentation will provide a personal overview of the experiences of using the Kudos Platform and how it ultimately benefits my ability to communicate an integrated view of my research to the community.



PAPER TITLE: Providing access to a million NMR spectra via the web (final paper number: CHED 91)
SESSION: NMR Spectroscopy in the Undergraduate Curriculum
DAY & TIME OF PRESENTATION: Sunday, March 22, 2015 from 4:15 PM – 4:35 PM
ROOM & LOCATION: Gold – Sheraton Denver Downtown Hotel

Access to large-scale collections of NMR spectral data can serve a number of purposes in teaching spectroscopy to students. The data can be used for teaching purposes in lectures, as training data sets for spectral interpretation and structure elucidation, and to underpin educational resources such as the Royal Society of Chemistry’s SpectralGame (www.spectralgame.com). These resources have been available for a number of years but have been limited to rather small collections of spectral data, specifically only about 3000 spectra. In order to expand the data collection and provide richer resources for the community we have been gathering data from various laboratories and, as part of a research project, we have used text-mining approaches to extract spectral data from articles and patents in the form of textual strings and utilized algorithms to convert the data into spectral representations. While these spectra are reconstructions of text representations of the original spectral data, we are investigating their value for the purpose of structure identification. This presentation will report on the processes of extracting structure-spectral pairs from text, approaches to performing automated spectral verification and our intention to assemble a spectral collection of a million NMR spectra and make them available online.
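To give a feel for what converting a textual string into a spectral representation can look like, here is a minimal Python sketch. It is an illustration under stated assumptions, not the algorithm used in the project: the example string is made up, the regular expression only handles simple entries of the form "shift (multiplicity, nH)", and multiplet splitting and coupling constants are ignored so that every entry becomes a single Lorentzian line.

```python
import re

def parse_shifts(text):
    """Extract (shift_ppm, n_protons) pairs such as '3.71 (s, 3H)'."""
    pattern = r"(\d+\.\d+)\s*\(\s*\w+\s*,\s*(\d+)H\)"
    return [(float(shift), int(n)) for shift, n in re.findall(pattern, text)]

def lorentzian_spectrum(peaks, width_ppm=0.02, points=1001, max_ppm=10.0):
    """Sum one Lorentzian line per peak on a 0..max_ppm axis.

    Each line has its maximum height equal to the number of protons.
    The linewidth and axis range are arbitrary illustrative choices.
    """
    half = width_ppm / 2.0
    axis = [max_ppm * i / (points - 1) for i in range(points)]
    intensity = [
        sum(n * half**2 / ((x - shift) ** 2 + half**2) for shift, n in peaks)
        for x in axis
    ]
    return axis, intensity

# Hypothetical text string of the kind found in an experimental section.
text = "1H NMR (400 MHz, CDCl3) δ 7.26 (s, 1H), 3.71 (s, 3H), 2.10 (s, 3H)"

peaks = parse_shifts(text)
axis, intensity = lorentzian_spectrum(peaks)
print(peaks)
```

The resulting axis and intensity lists can be plotted or written out in a spectral format. A fuller reconstruction would also need to model the multiplicities and J couplings reported in the text.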


PAPER TITLE: Using online chemistry databases to facilitate structure identification in mass spectral data (final paper number: ANYL 45)
SESSION: Advances in Mass Spectrometry
DAY & TIME OF PRESENTATION: Tuesday, March 24, 2015 from 8:45 AM – 9:05 AM
ROOM & LOCATION: Aspen Room A – Embassy Suites Denver – Downtown Convention Center

The Royal Society of Chemistry hosts large scale data collections and provides access to the data to the chemistry community. The largest RSC data set of wide scale interest to the community offers access to tens of millions of compounds. The host platform, ChemSpider, is limited as it is a structure centric hub only. A new architecture, the RSC data repository, has been developed that extends support to reactions, spectral data, crystallography data and related property data. It is also the architecture underlying a series of exemplar projects for managing data for a number of diverse laboratories. The adoption of data standards for the integration and distribution of data has been essential. Specific standards include molecular structure formats such as molfiles and InChIs, and spectral data formats such as JCAMP. This presentation will report on our development of the data repository, the importance of utilizing standards for data integration, the flexible nature of the architecture to deliver solutions for various laboratories and our efforts to develop new large data collections. This includes text-mining efforts to extract large spectrum-structure collections from large corpuses.

Micropublishing of 200 words isn’t new but the Journal of Brief Ideas is

Nature recently posted about a Journal that Publishes 200 Word Articles. The reporter commented “it is the latest online journal promises to bring a little brevity to science by accepting submissions of 200 words or less”. Initially I thought it was a Nature experiment but it isn’t. The intention around this new Journal of Brief Ideas is outlined here: http://beta.briefideas.org/about.

Some of the comments on the Nature post are interesting. This one from Bob Buntrock, who I know well from the Chemical Information list server, probably represents a large number of people:

“200 words is not even a good abstract in most cases. Sop to the Twitter crowd. Since I do not nor plan to use social media for scientific communication, I’ll never use it and I’ll tend not to respect it.”

Personally, I BELIEVE in micropublishing. That’s why when I joined RSC over 5 years ago and we unveiled ChemSpider at our first conference in Glasgow the NEW idea that Valery Tkachenko and I pitched was to take advantage of our knowledge of cheminformatics, chemical data handling in ChemSpider and the increasing activities in blogging and microblogging and apply them to something called “ChemSpider Syntheses”. The ChemSpider Journal of Chemistry had been run as an experiment already, and is still online. We had already shown that Open Access articles such as those from MDPI Molecules could be hosted in the ChemMantis platform and marked up with interactive chemical widgets. We were already aware of the great work done by the SyntheticPages group and we chose to collaborate to create ChemSpider SyntheticPages (CSSP) as announced here.

Since then CSSP has accepted many articles and become the host of all of the Olympicene synthetic steps. The story of Olympicene is in this YouTube video and the list of synthetic steps is here. Peter Scott has told his story about CSSP and submissions have continued.

I took a look at some of these articles and if I exclude the Title, data such as NMR list of shifts and Chemicals Used then MANY ChemSpider SyntheticPages articles are about 200-250 words (i.e. the Procedure and the Authors Comments). All articles submitted to CSSP go through a fairly light review process from one of the editorial team, generally in about 24 hours, then are published and the community can comment on them – open peer review.

I also believe in the possibilities associated with Nanopublishing and nanopublications and there is work afoot to unveil some of these from text-mining efforts.

While our micropublishing efforts are focused on chemistry, and syntheses specifically, I believe there are other opportunities. Certainly Figshare, Slideshare and Dryad can all host micropublications already. The Journal of Brief Ideas is a new approach and an experiment worth watching! Good luck to them!

My Spoiler Alert about Netflix and the House of Cards

This is NOT about chemistry. If you are expecting chemistry stop here and move on. If you are a watcher of House of Cards on Netflix I may ruin your enjoyment of the show if you read further. You choose…

I work a lot of hours. I generally start my day at around 6am, work an 8-9 hour day during normal working hours and then when my boys are in bed (or on the days I don’t have them) I commonly get back in front of my computer between 8pm and midnight/1am. Doing this I have an opportunity to get involved outside of my normal work activities to work with collaborators to do some exciting science (specifically with people like Sean Ekins, Alex Clark and Gary E. Martin). I also spend a lot of time looking at what is going on with social networking tools and data sharing platforms that may be of value to scientists.

While I am working at night I consume a lot of movies and series via Netflix and Amazon Prime. I have cut off cable TV to the house, put up a digital antenna (well, my very skilled friend did) and use cable internet almost exclusively for entertainment now.

Both Netflix and Amazon now have specific programming that doesn’t make it to general TV channels. For example Orange is the New Black and House of Cards on Netflix and Alpha House and Bosch on Amazon Prime. I have enjoyed them all. Amusing, intelligent, edgy and shocking covers the general flavor of what these programs have as entertainment value.

It took me a couple of episodes of House of Cards to really get into it but with Season 3 just around the corner I am excited to see whether or not my interpretation of many scenes, acts and breadcrumb clues throughout the show is right. The relationship between Frank and Claire Underwood is really what this show is about…everything else around it, the manipulations, the espionage, the nastiness and the supposed harsh realities of Washington pale in comparison to the farce that is the relationship between Frank and Claire. And it makes for addictive TV for sure.

I am not going to drag out why I have come to this conclusion as there are sooooooooo many clues. I may be totally off base here but here’s throwing out my judgement of the shocker that will show up one day for House of Cards…and I find no evidence that this is suggested anywhere on the internet.

Frank and Claire are brother and sister (or at least family)

Think about all their “rendezvous” with partners outside of their marriage, and even one with their security man/driver. But notice they never rendezvous with each other…EVER (that I recall). They definitely love and respect each other, they are both hungry for power and for what it brings, and there are lots of stories about their history but with no evidence for how they met, their backgrounds etc. I think they are working together to get to ultimate power…and the Presidency is that for sure.

Either I am way off base (though it would be a great ending to a show)…or maybe I applied my analytical skills in a good way. Time will tell…I love a good puzzler, a good story and a shocking ending. It happens in some of the science I do occasionally! Always fun.



Amazon and the management of my Author Profile

I have authored or co-authored a number of book chapters with friends and collaborators over the years and have been privileged to do so (see the book chapter section in my CV). I have also been involved with authoring or editing a number of books; one is presently in press and two will hopefully be finished by the end of the year. They are listed below.

IN PRESS Computer-based Structure Elucidation from Spectral Data: The Art of Solving Problems with ACD/Structure Elucidator, Mikhail E. Elyashberg, Antony J. Williams, Springer Link

Modern NMR Approaches for the Structure Elucidation of Natural Products PART 1 by Gary E. Martin, David Rovnyak and Antony J. Williams, in preparation, Royal Society of Chemistry Link

Applications of Modern NMR Approaches to the Structure Elucidation of Natural Products PART 2 by Gary E. Martin, David Rovnyak and Antony J. Williams, in preparation, Royal Society of Chemistry Link

I really like the Amazon Author Page for keeping all of this information integrated. My page is here: http://www.amazon.com/Antony-J.-Williams/e/B004YRPRV2 and I update it with book chapters and books as they come out.

Yesterday I commented on the issues of poor name searching related to the Chemtrove article published recently. When I added the Springer book to my profile today I took a look at the capabilities of Amazon assuming that things would likely be a lot better. In order to do this I found the Springer book here: http://www.amazon.com/Computer-Based-Structure-Elucidation-Spectral-Data/dp/3662464012/

and clicked on the name Antony J. Williams listed as shown here:

Click on the name Antony J. Williams to find related books

The result set is shown here. Only two of these books are mine, the rest are shown below and yes, the one entitled “I Hate Sex” remains on my list. I did NOT write it! I couldn’t see why the Granular Materials book is listed against me until I looked at all 30 editors/contributors for the book here and saw that Antony and Williams (very much separated) were shown in the list. The All Hallows book makes more sense as it has David Williams, Antony Oldknow, but the last one about dinoflagellates has the same issue as the granular materials book…Antony and Williams in the list of authors.

Four of the books found searching against Antony J. Williams

Having added the book to my profile, however, made a dramatic difference as it now links to my author page and lists all of my books rather than performing a loose search. This is a good overall solution but DID need my participation to claim my book. To date the evidence suggests that Google really have their act together in terms of being able to automatically associate articles with my profile, but for other platforms such as Kudos, ORCID and Amazon Author Pages the author is best advised to stay involved with CLAIMing their articles to ensure good association and high quality data.

How fast book prices plummet when I am a contributor

I doubt I am to blame. At least I hope not…

The reality is that most books plummet in price very quickly now after release. Actually, so do CDs…when I buy a CD as a pre-order on Amazon it is common for the price guarantee of Amazon to kick in and reduce it below the pre-order price before I even receive it. Amazon really have a good thing going in terms of enabling what is likely the vast majority of the resale market for books (well maybe eBay also).

As an indication of the change in value of books from release to buying a discounted or used version of a book, I can look at my author profile on Amazon and see the price for a new copy, a discounted copy and used copies.

As an example of the discounted books available a partial screenshot is below. The book regarding Collaborative Computational Technologies must have sold really well because there are lots of used copies for sale it seems! At just over a $3 starting price. BARGAIN…grab one for each of your family members….

Discount prices of books on my Amazon Author Profile

A TERRIBLE implementation of Name Searching on ACS Journals

Yes, I am a Williams. And THAT is an incredibly common surname. But I am an Antony Williams, notice no H in the name, i.e. NOT Anthony. In the field of chemistry there are not many of us around…a couple I know of, but not many overall. Google Scholar does an extremely good job of automatically associating my newly published articles with my Citations profile here: https://scholar.google.com/citations?user=O2L8nh4AAAAJ

The last five articles automatically associated with my profile. I do NOT make any associations manually at this point.

I am assuming that this is done by understanding the type of work I publish on, some of the co-author network maps that have been established as my profile has developed, etc. I assume that their approach is very intelligent relative to some of the more commonplace searches that have been implemented…certainly the results are GOOD.

I noticed one disastrous example today when our article “ChemTrove: Enabling a Generic ELN to Support Chemistry Through the Use of Transferable Plug-ins and Online Data Sources” was published on the Journal of Chemical Information and Modeling here. Right there to the left of the abstract is an offer to look at other content by the authors.

Look for related content by the authors on JCIM

I was interested to see what else ACS knew about my content so I clicked on my name…which performed this search: http://pubs.acs.org/action/doSearch?ContribStored=Williams%2C+A  and provided me with 96 articles by Andrew Williams (mostly), by Aaron Williams, by Anthony Williams (not me) and Allan Williams (to name a few). Eventually I managed to find 3 that were associated with me by searching the list for Antony Williams but none of those I published as Antony J. Williams were recovered.
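I have no idea how the ACS search is actually implemented, but the behavior looks like what you would get from matching on surname plus first initial only. As a minimal sketch (the function names and the small author list are made up for illustration, using the names from the result set described above), compare that loose key with a key built from the full given name and surname while ignoring middle initials:

```python
def surname_initial_key(name):
    """Loose key: surname plus first initial, e.g. 'Williams, A'."""
    parts = name.replace(".", "").split()
    return f"{parts[-1]}, {parts[0][0]}"

def given_surname_key(name):
    """Stricter key: full given name plus surname, middle initials ignored."""
    parts = name.replace(".", "").split()
    return (parts[0].lower(), parts[-1].lower())

# Illustrative author names taken from the search results described above.
AUTHORS = [
    "Andrew Williams",
    "Aaron Williams",
    "Anthony Williams",
    "Allan Williams",
    "Antony Williams",
    "Antony J. Williams",
]

# Every author collapses onto the same loose key.
loose = [a for a in AUTHORS if surname_initial_key(a) == "Williams, A"]

# The stricter key keeps both spellings of my name and drops the rest.
target = given_surname_key("Antony J. Williams")
strict = [a for a in AUTHORS if given_surname_key(a) == target]

print(len(loose), strict)
```

Even the stricter key cannot separate two different people who share a given name and surname, which is where a persistent identifier such as an ORCID iD does a better job than any string matching.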

Also, my colleague Valery Tkachenko is listed as an author with his name misspelled as Valery Tkachenkov. What is simply inappropriate in my opinion is how the process involved taking the list of our submitted names (copied below directly from the submitted manuscript) and changing them to their own interpretation of how we would want to see our names listed.

From this:

Aileen E. Day*†, Simon J. Coles, Colin L. Bird, Jeremy G. Frey, Richard J. Whitby, Valery E. Tkachenko§, Antony J. Williams§

To this:

Names changed from the original manuscript to those produced at submission

Notice that for Aileen and Jeremy the middle initials were expanded, Colin had his middle initial changed from L. to I.,  Richard, Valery and I had our middle initials dropped and Valery had a v added to his surname. Why not simply copy and paste the names from the manuscript?

I will point out that this is a “Just Accepted” manuscript and likely the changes in names will be caught and edited, especially now I have just pointed them out. “Just accepted” does have some disclaimers:

The disclaimers regarding Just Accepted manuscripts

While they can edit the names to match what we originally provided, I don’t think it will fix the issue regarding finding all of my articles on ACS journals: when I navigated to one of my other articles here, http://pubs.acs.org/doi/abs/10.1021/es0713072, and did the search from my listed name it found exactly the same 96 hits.

Maybe a thought would be to use my ORCID profile http://orcid.org/0000-0002-2668-4821 to look for ACS journal articles associated with my name?

Unfortunately the data is already out in the wild: when I claimed the article on Kudos all of the name spelling issues had clearly spilled over via the DOI: https://www.growkudos.com/articles/10.1021%252Fci5005948

Names transferred via DOI to the Grow Kudos Platform

Ah…the things that surprise me….or not.

Hosting public domain chemicals data online for the community – the challenges of handling materials

This is the presentation I gave at the Opportunities in Material Informatics meeting in Madison, Wisconsin.

Hosting public domain chemicals data online for the community – the challenges of handling materials

The Royal Society of Chemistry hosts one of the world’s richest collections of online chemistry data that is free-to-access for the community. ChemSpider presently hosts over 30 million unique chemical compounds together with associated data, accessible via a number of search techniques. With almost 50,000 unique users per day from around the world the site offers scientists the ability to investigate the world of small molecules via property searches, analytical data and predictive models. The challenges associated with providing a similar platform for “materials” are manifold but, if they could be addressed, would offer a valuable service to the materials community. This presentation will provide an overview of how ChemSpider was built, our efforts to expand the capabilities to a more encompassing data repository and some of the challenges faced to embrace the diverse world of materials informatics and online data access.

Speaking at “Opportunities in Materials Informatics” workshop

Those of you who have been following my blog over the years are likely very aware of some of the work that we have done on ChemSpider, ChemSpider SyntheticPages, OpenPHACTS, PharmaSea and the Chemical Database Service. We have many years of investment in the ChemSpider projects of course but our primary focus now is working on the data repository project that we published an early outline of last year. The majority of our projects have centered on small organic molecules and, I judge, we have been successful in addressing many of the informatics challenges…many, but certainly not all! For sure, when we started ChemSpider over 7 years ago we were not aware of some of the issues we would face as we developed it and there has been so much learned in the process. For sure we have been able to spend time with, collaborate with and push forward with some of the best minds in cheminformatics…and to them we are grateful.

Tomorrow I will give an overview of our work in the field of building small molecule databases for the community – with a focus on ChemSpider of course but also knitting together many of the other areas I have been connected to. I am privileged to have been invited to participate in the “Opportunities in Materials Informatics” workshop in Madison, Wisconsin. There are many experts in the field attending (see list) and I am sure I will learn WAY MORE than I will be able to share in terms of wisdom! In some ways I will be out of my depth hearing about some of the approaches people are taking to develop informatics platforms for materials data. I will put up my slide deck in the usual place on Slideshare when it’s ready. Follow the Twitter feed for the workshop at #matinformatics2015.

Informatics 2.0 for the Analytical Sciences: Big Data, the Semantic Web, and Metadata – Invitation for Abstracts

Stuart Chalk and I will be hosting a symposium at the 250th ACS Meeting in Boston, MA, August 16-20, 2015, in the ACS Analytical Chemistry Division entitled “Informatics 2.0 for the Analytical Sciences: Big Data, the Semantic Web, and Metadata”. The theme for the 250th meeting is “Innovation from Discovery to Application”.

Symposium Synopsis – Informatics 2.0 for the Analytical Sciences: Big Data, the Semantic Web, and Metadata

Many disciplines in the sciences are rapidly moving into the informatics realm as a way to accelerate scientific progress. While the analytical sciences have been in informatics for a long time (by their nature), the majority of the attention has been focused on LIMS systems (v1.0) and the information needs around those products. This session will highlight research activities in informatics that will move the analytical sciences into the semantic web arena through the development of data standards, controlled vocabularies, ontologies and semantic annotation.

To submit an abstract (by March 16th) go to http://www.acs.org/content/acs/en/meetings/abstract-submissions/acsnm250/division-of-analytical-chemistry.html


