Our dire need to mandate data standards and expectations for scientific publishing

This is a presentation that I delivered at the ACS Division of Chemical Information meeting regarding “Reproducibility, Reporting, Sharing & Plagiarism” at ACS Denver on 23rd March 2015.

I took the opportunity to take off my hat as VP of Strategic Development at RSC and as a member of the cheminformatics group that built ChemSpider and works on other related RSC projects. Instead I presented on how a LACK OF MANDATES from publishers regarding the submission of data accompanying the articles I am involved in writing is actually weakening my scientific record, because data is not being shared in the forms most useful to the community. I think there would be benefits if publishers started pushing me for MORE data, in fairly general standards, and allowed me (and others) to download the data in the form of molecules (and collections), spectral data, CSV files etc.

 


Providing Access to a Million NMR Spectra via the web

This presentation was given at the ACS Denver meeting on March 22nd 2015 in a CHED Division symposium.

Providing Access to a Million NMR Spectra via the web

Antony Williams, Alexey Pshenichnov, Peter Corbett, Daniel Lowe, Carlos Coba

Access to large-scale collections of NMR spectral data can serve a number of purposes in teaching spectroscopy to students. The data can be used in lectures, as training data sets for spectral interpretation and structure elucidation, and to underpin educational resources such as the Royal Society of Chemistry’s Learn Chemistry. These resources have been available for a number of years but have been limited to rather small collections of only about 3000 spectra. In order to expand the data collection and provide richer resources for the community we have been gathering data from various laboratories and, as part of a research project, we have used text-mining approaches to extract spectral data from articles and patents in the form of textual strings and used algorithms to convert the data into spectral representations. While these spectra are reconstructions of text representations of the original spectral data, we are investigating their value for structure identification. This presentation will report on the processes of extracting structure-spectral pairs from text, approaches to performing automated spectral verification and our intention to assemble a spectral collection of a million NMR spectra and make them available online.
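To give a feel for the text-to-spectrum step described in the abstract, here is a minimal Python sketch. It is not the pipeline used in the project: the regular expression, the function names and the example string are all assumptions made for illustration, and the reconstruction draws one Lorentzian line per reported shift while ignoring multiplicity and coupling constants.

```python
import re

# Hypothetical example of a 1H NMR string as it might appear in an experimental section.
EXAMPLE = "1H NMR (400 MHz, CDCl3) δ 7.26 (s, 5H), 4.12 (q, J = 7.1 Hz, 2H), 1.25 (t, J = 7.1 Hz, 3H)"

# Matches entries such as "7.26 (s, 5H)" or "4.12 (q, J = 7.1 Hz, 2H)".
PEAK_PATTERN = re.compile(r"(\d+\.\d+)\s*\(([^)]*?)(\d+)\s*H\)")


def parse_peaks(nmr_text):
    """Return (shift_ppm, proton_count) pairs found in a 1H NMR text string."""
    return [(float(shift), int(count))
            for shift, _details, count in PEAK_PATTERN.findall(nmr_text)]


def lorentzian_spectrum(peaks, ppm_min=0.0, ppm_max=10.0, points=1001, half_width=0.02):
    """Sum one Lorentzian line per peak, scaled by its proton count.

    Simplification (assumption): every signal is drawn as a single line,
    so multiplet structure is lost.
    """
    step = (ppm_max - ppm_min) / (points - 1)
    axis = [ppm_min + i * step for i in range(points)]
    intensity = [
        sum(n * half_width ** 2 / ((x - shift) ** 2 + half_width ** 2)
            for shift, n in peaks)
        for x in axis
    ]
    return axis, intensity


peaks = parse_peaks(EXAMPLE)
axis, intensity = lorentzian_spectrum(peaks)
```

The resulting `axis` and `intensity` lists can be plotted or compared against a predicted spectrum, which is roughly where automated verification would begin.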

 


Give me kudos for taking responsibility for self-marketing my scientific publications and increase impact

This presentation was given at the ACS Denver meeting on March 22nd 2015 in a CINF Division symposium.

Give me kudos for taking responsibility for self-marketing my scientific publications and increase impact.

Antony Williams, Will Russell, Melinda Kenneway and Louise Peck

The authoring of a scientific publication can represent the culmination of many tens if not hundreds of hours of data collection and analysis. The authoring and peer-review process itself often represents a major undertaking in terms of assembling the publication and passing through review. Considering the amount of work invested in the production of a scientific article it is therefore quite surprising that authors, post-publication, invest very little effort in communicating the value and potential impact of their article to the community. Social networking has clearly demonstrated the ability to self-market and drive attention. At the same time, the increasing volume of literature (over a million new articles are published every year) requires authors to take on a more direct role in ensuring their work gets read and cited. This requirement may grow with the emergence of a range of metrics at the article level, shifting attention away from where a researcher publishes to the performance of their individual articles. Therefore, a separate platform that facilitates social networking and offers other discovery tools to communicate the value of published science to the community would be of value. In parallel, the possibility of enhancing an article by linking to additional information (presentations, videos, blog posts etc.) allows for enrichment of the article post-publication, a capability not available via the publisher’s platform. This presentation will provide a personal overview of the experiences of using the Kudos Platform and how it ultimately benefits my ability to communicate an integrated view of my research to the community.

 


PITTCON poster: Dealing with the complex challenge of managing diverse analytical chemistry data online

This is a talk I presented at Pittcon on Wednesday March 13th, 2015

Dealing with the complex challenge of managing diverse analytical chemistry data online

The Royal Society of Chemistry provides open access to data associated with tens of millions of chemical compounds. The richness and complexity of the data have continued to expand dramatically, and the original vision of providing an integrated hub for structure-centric data has been delivered across the world to hundreds of thousands of users. With the intention of expanding the reach to cover more diverse aspects of chemistry-related data including compounds, reactions and analytical data, to name just a few data types, we are in the process of delivering a Chemistry Data Repository. The data repository will manage the challenges of associated metadata and the various levels of required security (private, shared and public), and will expose the data as appropriate using semantic web technologies. Ultimately this platform will become the host for all chemicals, reactions and analytical data contained within RSC publications and specifically supplementary information. This presentation will report on the challenges of managing “Big Data” for chemists around the world and providing access to tools for structure dereplication, spectral database searching and the crowdsourcing of the world’s largest spectral database.

 


PITTCON Poster: Using an online database of chemical compounds for the purpose of structure identification

This is a poster I presented at Pittcon on Wednesday March 9th, 2015

Using an online database of chemical compounds for the purpose of structure identification

Online databases can be used for the purposes of structure identification. The Royal Society of Chemistry provides access to an online database containing tens of millions of compounds and this has been shown to be a very effective platform for the development of tools for structure identification. Since in many cases an unknown to an investigator is known in the chemical literature or a reference database, these “known unknowns” are commonly available now on aggregated internet resources. These types of compounds in commercial, environmental, forensic, and natural product samples can be identified by searching against these large aggregated databases, querying by either elemental composition or monoisotopic mass. Searching by elemental composition is the preferred approach, but it is often difficult to determine a unique elemental composition for compounds with molecular weights greater than 600 Da. In these cases, searching by the monoisotopic mass is advantageous. In either case, the search results can be refined by appropriate filtering to identify the compounds. We will report on integrated filtering and search approaches on our aggregated compound database for the purpose of structure identification and review our progress in using the platform for natural product dereplication purposes.
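The two query modes mentioned above can be illustrated with a short Python sketch. This is a toy under stated assumptions, not the ChemSpider search implementation: the four-compound `CANDIDATES` table, the function names and the 5 ppm default tolerance are all hypothetical, and the formula parser only handles simple formulae without brackets or charges.

```python
import re

# Monoisotopic masses (Da) of the most abundant isotope of a few common elements.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503223,
    "N": 14.00307400443,
    "O": 15.99491461957,
    "S": 31.9720711744,
}

# Hypothetical stand-in for a compound database: name -> molecular formula.
CANDIDATES = {
    "caffeine": "C8H10N4O2",
    "theobromine": "C7H8N4O2",
    "theophylline": "C7H8N4O2",
    "aspirin": "C9H8O4",
}


def element_counts(formula):
    """Parse a simple formula such as 'C8H10N4O2' into {'C': 8, 'H': 10, ...}."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def monoisotopic_mass(formula):
    """Sum the monoisotopic masses of all atoms in the formula."""
    return sum(MONOISOTOPIC[el] * n for el, n in element_counts(formula).items())


def search_by_formula(formula, candidates=CANDIDATES):
    """Return names whose elemental composition matches exactly."""
    wanted = element_counts(formula)
    return [name for name, f in candidates.items() if element_counts(f) == wanted]


def search_by_mass(mass, candidates=CANDIDATES, ppm=5.0):
    """Return names whose monoisotopic mass lies within a ppm tolerance of the query."""
    return [name for name, f in candidates.items()
            if abs(monoisotopic_mass(f) - mass) / mass * 1e6 <= ppm]
```

Querying `search_by_mass(180.0647)` on this toy table returns both theobromine and theophylline, which share a formula, while excluding aspirin, whose nominal mass is the same. Isomers like these are why the hit list still needs the kind of filtering described above.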

 


PITTCON Poster: ChemSpider – building an online database of open spectra

This is a poster I presented at Pittcon on Wednesday March 11th, 2015

ChemSpider – building an online database of open spectra

ChemSpider is an online database of over 30 million chemical compounds sourced from over 500 different sources including government laboratories, chemical vendors, public resources and publications. Developed with the intention of building a community for chemists, ChemSpider allows its users to deposit data including structures, properties, links to external resources and various forms of spectral data. Over the past few years ChemSpider has aggregated almost 20000 high quality NMR and IR spectra and continues to expand as the community deposits additional types of data. The majority of the spectral data is licensed as Open Data, allowing it to be downloaded and reused in presentations, lesson plans and for teaching purposes. This poster will present our existing technology and our plans to host a million spectra in our developing online data repository.


Presentations at the ACS Meeting in Denver

Having just returned from Pittcon late last night I am now turning my attention to the next set of presentations to be given at the ACS Denver meeting. These are listed below. If any of the blog readers will be at the ACS meeting it would be great to catch up. See you there.

PAPER TITLE: Importance of data standards for large scale data integration in chemistry (final paper number: CINF 39)
DAY & TIME OF PRESENTATION: Wednesday, March 25, 2015 from 11:20 AM – 11:50 AM
ROOM & LOCATION: Room 110 – Colorado Convention Center

ABSTRACT
Increasingly, online databases are being used for the purpose of structure identification. In many cases an unknown to an investigator is known in the chemical literature or an online database and these “known unknowns” are commonly available in these aggregated internet resources. These types of compounds in commercial, environmental, forensic, and natural product samples can be identified by searching against these large aggregated databases, querying by either elemental composition or monoisotopic mass. We will report on the search approaches that we offer on aggregated compound databases hosted by the Royal Society of Chemistry and how these resources can be used for the purpose of structure identification. We will also report on our progress in the area of hosting interactive spectral data, including assignments, on our data repository and how we are using our analytical data platform for the purpose of natural product dereplication.

 

PAPER TITLE: Give me kudos for taking responsibility for self-marketing my scientific publications and increase impact (final paper number: CINF 8)
DAY & TIME OF PRESENTATION: Sunday, March 22, 2015 from 2:15 PM – 2:40 PM
ROOM & LOCATION: Room 110 – Colorado Convention Center

ABSTRACT
The authoring of a scientific publication can represent the culmination of many tens if not hundreds of hours of data collection and analysis. The authoring and peer-review process itself often represents a major undertaking in terms of assembling the publication and passing through review. Considering the amount of work invested in the production of a scientific article it is therefore quite surprising that authors, post-publication, invest very little effort in communicating the value and potential impact of their article to the community. Social networking has clearly demonstrated the ability to self-market and drive attention. At the same time, the increasing volume of literature (over a million new articles are published every year) requires authors to take on a more direct role in ensuring their work gets read and cited. This requirement may grow with the emergence of a range of metrics at the article level, shifting attention away from where a researcher publishes to the performance of their individual articles. Therefore, a separate platform that facilitates social networking and offers other discovery tools to communicate the value of published science to the community would be of value. In parallel, the possibility of enhancing an article by linking to additional information (presentations, videos, blog posts etc.) allows for enrichment of the article post-publication, a capability not available via the publisher’s platform. This presentation will provide a personal overview of the experiences of using the Kudos Platform and how it ultimately benefits my ability to communicate an integrated view of my research to the community.

 

 

PAPER TITLE: Providing access to a million NMR spectra via the web (final paper number: CHED 91)
SESSION: NMR Spectroscopy in the Undergraduate Curriculum
DAY & TIME OF PRESENTATION: Sunday, March 22, 2015 from 4:15 PM – 4:35 PM
ROOM & LOCATION: Gold – Sheraton Denver Downtown Hotel

ABSTRACT
Access to large-scale collections of NMR spectral data can serve a number of purposes in teaching spectroscopy to students. The data can be used in lectures, as training data sets for spectral interpretation and structure elucidation, and to underpin educational resources such as the Royal Society of Chemistry’s SpectralGame (www.spectralgame.com). These resources have been available for a number of years but have been limited to rather small collections of only about 3000 spectra. In order to expand the data collection and provide richer resources for the community we have been gathering data from various laboratories and, as part of a research project, we have used text-mining approaches to extract spectral data from articles and patents in the form of textual strings and used algorithms to convert the data into spectral representations. While these spectra are reconstructions of text representations of the original spectral data, we are investigating their value for structure identification. This presentation will report on the processes of extracting structure-spectral pairs from text, approaches to performing automated spectral verification and our intention to assemble a spectral collection of a million NMR spectra and make them available online.

 

PAPER TITLE: Using online chemistry databases to facilitate structure identification in mass spectral data (final paper number: ANYL 45)
SESSION: Advances in Mass Spectrometry
DAY & TIME OF PRESENTATION: Tuesday, March 24, 2015 from 8:45 AM – 9:05 AM
ROOM & LOCATION: Aspen Room A – Embassy Suites Denver – Downtown Convention Center

ABSTRACT
The Royal Society of Chemistry hosts large scale data collections and provides access to the data to the chemistry community. The largest RSC data set of wide scale interest to the community offers access to tens of millions of compounds. The host platform, ChemSpider, is limited as it is a structure centric hub only. A new architecture, the RSC data repository, has been developed that extends support to reactions, spectral data, crystallography data and related property data. It is also the architecture underlying a series of exemplar projects for managing data for a number of diverse laboratories. The adoption of data standards for the integration and distribution of data has been essential. Specific standards include molecular structure formats such as molfiles and InChIs, and spectral data formats such as JCAMP. This presentation will report on our development of the data repository, the importance of utilizing standards for data integration, the flexible nature of the architecture to deliver solutions for various laboratories and our efforts to develop new large data collections. This includes text-mining efforts to extract large spectrum-structure collections from large corpuses.
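As a small illustration of why a standard such as JCAMP helps with integration, the Python sketch below reads the labelled records of a JCAMP-DX style file into a dictionary and expands an uncompressed `(X++(Y..Y))` data table. It is not the repository's own code: the `SAMPLE` text is a made-up five-point spectrum, the function names are hypothetical, and the compressed data forms that real files often use are not handled.

```python
# Hypothetical five-point spectrum written in JCAMP-DX style.
SAMPLE = """\
##TITLE=Hypothetical example spectrum
##JCAMP-DX=4.24
##DATA TYPE=INFRARED SPECTRUM
##XUNITS=1/CM
##YUNITS=TRANSMITTANCE
##FIRSTX=400
##LASTX=404
##NPOINTS=5
##XYDATA=(X++(Y..Y))
400 0.91 0.90 0.88
403 0.85 0.80
##END=
"""


def parse_jcamp(text):
    """Split JCAMP-DX style text into header records and raw data lines."""
    header, data_lines, in_data = {}, [], False
    for raw in text.splitlines():
        line = raw.split("$$")[0].strip()  # "$$" starts a comment in JCAMP-DX
        if not line:
            continue
        if line.startswith("##"):
            label, _, value = line[2:].partition("=")
            label = label.replace(" ", "").upper()  # "DATA TYPE" -> "DATATYPE"
            if label == "END":
                break
            in_data = label == "XYDATA"
            header[label] = value.strip()
        elif in_data:
            data_lines.append(line)
    return header, data_lines


def xy_points(header, data_lines):
    """Expand uncompressed (X++(Y..Y)) lines into evenly spaced (x, y) pairs."""
    first_x = float(header["FIRSTX"])
    last_x = float(header["LASTX"])
    n_points = int(header["NPOINTS"])
    step = (last_x - first_x) / (n_points - 1)
    y_factor = float(header.get("YFACTOR", 1))
    # The first token on each line is the X checkpoint; the rest are Y values.
    ys = [float(tok) for line in data_lines for tok in line.split()[1:]]
    return [(first_x + i * step, y * y_factor) for i, y in enumerate(ys)]


header, data_lines = parse_jcamp(SAMPLE)
points = xy_points(header, data_lines)
```

Because the metadata travel with the data as labelled records, a consumer can discover units and point counts without knowing which instrument or vendor produced the file.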


Micropublishing of 200 words isn’t new but the Journal of Brief Ideas is

Nature recently posted about a Journal that Publishes 200 Word Articles. The reporter commented “it is the latest online journal promises to bring a little brevity to science by accepting submissions of 200 words or less”. Initially I thought it was a Nature experiment but it isn’t. The intention around this new Journal of Brief Ideas is outlined here: http://beta.briefideas.org/about.

Some of the comments on the Nature post are interesting. This one from Bob Buntrock, whom I know well from the Chemical Information list server, probably represents a large number of people:

“200 words is not even a good abstract in most cases. Sop to the Twitter crowd. Since I do not nor plan to use social media for scientific communication, I’ll never use it and I’ll tend not to respect it.”

Personally, I BELIEVE in micropublishing. That’s why when I joined RSC over 5 years ago and we unveiled ChemSpider at our first conference in Glasgow the NEW idea that Valery Tkachenko and I pitched was to take advantage of our knowledge of cheminformatics, chemical data handling in ChemSpider and the increasing activities in blogging and microblogging and apply them to something called “ChemSpider Syntheses”. The ChemSpider Journal of Chemistry had been run as an experiment already, and is still online. We had already shown that Open Access articles such as those from MDPI Molecules could be hosted in the ChemMantis platform and marked up with interactive chemical widgets. We were already aware of the great work done by the SyntheticPages group and we chose to collaborate to create ChemSpider SyntheticPages (CSSP) as announced here.

Since then CSSP has accepted many articles and became the host of all of the Olympicene synthetic steps. The story of Olympicene is in this YouTube video and the list of synthetic steps is here. Peter Scott has told his story about CSSP and submissions have continued.

I took a look at some of these articles and if I exclude the Title, data such as NMR list of shifts and Chemicals Used then MANY ChemSpider SyntheticPages articles are about 200-250 words (i.e. the Procedure and the Authors Comments). All articles submitted to CSSP go through a fairly light review process from one of the editorial team, generally in about 24 hours, then are published and the community can comment on them – open peer review.

I also believe in the possibilities associated with Nanopublishing and nanopublications and there is work afoot to unveil some of these from text-mining efforts.

While our micropublishing efforts are focused on chemistry, and syntheses specifically, I believe there are other opportunities. Certainly Figshare, Slideshare and Dryad can all host micropublications already. The Journal of Brief Ideas is a new approach and an experiment worth watching! Good luck to them!


My Spoiler Alert about Netflix and the House of Cards

This is NOT about chemistry. If you are expecting chemistry stop here and move on. If you are a watcher of House of Cards on Netflix I may ruin your enjoyment of the show if you read further. You choose…

I work a lot of hours. I generally start my day at around 6am, work an 8-9 hour day during normal working hours and then when my boys are in bed (or on the days I don’t have them) I commonly get back in front of my computer between 8pm and midnight/1am. Doing this I have an opportunity to get involved outside of my normal work activities and work with collaborators to do some exciting science (specifically with people like Sean Ekins, Alex Clark and Gary E. Martin). I also spend a lot of time looking at what is going on with social networking tools and data sharing platforms that may be of value to scientists.

While I am working at night I consume a lot of movies and series via Netflix and Amazon Prime. I have cut off cable TV to the house, put up a digital antenna (well, my very skilled friend did) and use cable internet almost exclusively for entertainment now.

Both Netflix and Amazon now have specific programming that doesn’t make it to general TV channels. For example Orange is the New Black and House of Cards on Netflix and Alpha House and Bosch on Amazon Prime. I have enjoyed them all. Amusing, intelligent, edgy and shocking covers the general flavor of what these programs have as entertainment value.

It took me a couple of episodes of House of Cards to really get into it but with Season 3 just around the corner I am excited to see whether or not my interpretation of many scenes, acts and breadcrumb clues throughout the show is right. The relationship between Frank and Claire Underwood is really what this show is about…everything else around it, the manipulations, the espionage, the nastiness and the supposed harsh realities of Washington pale in comparison to the farce that is the relationship between Frank and Claire. And it makes for addictive TV for sure.

I am not going to drag out why I have come to this conclusion as there are sooooooooo many clues. I may be totally off base here but here’s throwing out my judgement of the shocker that will show up one day for House of Cards…and I find no evidence that this is suggested anywhere on the internet.

Frank and Claire are brother and sister (or at least family)

Think about all their “rendezvous” with partners outside of their marriage, and even one with their security man/driver. But notice they never rendezvous with each other…EVER (that I recall). They definitely love and respect each other, they are both hungry for power and for what it brings, and there are lots of stories about their history but with no evidence for how they met, their backgrounds etc. I think they are working together to get to ultimate power…and the Presidency is that for sure.

Either I am way off base, though it would be a great ending to a show…or maybe I applied my analytical skills in a good way. Time will tell. I love a good puzzler, a good story and a shocking ending. It happens in some of the science I do occasionally! Always fun.

 


Amazon and the management of my Author Profile

I have authored or co-authored a number of book chapters with friends and collaborators over the years and have been privileged to do so (see the book chapter section in my CV). I have also been involved with authoring or editing a number of books and one is presently in press and two will hopefully be finished by end of year. They are listed below.

IN PRESS Computer-based Structure Elucidation from Spectral Data: The Art of Solving Problems with ACD/Structure Elucidator, Mikhail E. Elyashberg, Antony J. Williams, Springer Link

Modern NMR Approaches for the Structure Elucidation of Natural Products PART 1 by Gary E. Martin, David Rovnyak and Antony J. Williams, in preparation, Royal Society of Chemistry Link

Applications of Modern NMR Approaches to the Structure Elucidation of Natural Products PART 2 by Gary E. Martin, David Rovnyak and Antony J. Williams, in preparation, Royal Society of Chemistry Link

I really like the Amazon Author Page for keeping all of this information integrated. My page is here: http://www.amazon.com/Antony-J.-Williams/e/B004YRPRV2 and I update it with book chapters and books as they come out.

Yesterday I commented on the issues of poor name searching related to the Chemtrove article published recently. When I added the Springer book to my profile today I took a look at the capabilities of Amazon assuming that things would likely be a lot better. In order to do this I found the Springer book here: http://www.amazon.com/Computer-Based-Structure-Elucidation-Spectral-Data/dp/3662464012/

and clicked on the name Antony J. Williams listed as shown here:

Click on the name Antony J. Williams to find related books


The result set is shown here. Only two of these books are mine, the rest are shown below and yes, the one entitled “I Hate Sex” remains on my list. I did NOT write it! I couldn’t see why the Granular Materials book is listed against me until I looked at all 30 editors/contributors for the book here and saw that Antony and Williams (very much separated) were shown in the list. The All Hallows book makes more sense as it has David Williams, Antony Oldknow, but the last one about dinoflagellates has the same issue as the granular materials book…Antony and Williams in the list of authors.

Four of the books found searching against Antony J. Williams


Adding the book to my profile, however, made a dramatic difference as it now links to my author page and lists all of my books rather than performing a loose search. This is a good overall solution but DID need my participation to claim my book. To date the evidence suggests that Google really have their act together in terms of being able to automatically associate articles with my profile, but for other platforms such as Kudos, ORCID and Amazon Author Pages authors are best advised to stay involved in CLAIMing their articles to ensure good association and high quality data.
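The difference between a loose name search and a per-author match can be sketched in a few lines of Python. To be clear, this is only my guess at the kind of matching that would produce the results described above, not how Amazon actually searches; the function names are hypothetical and dropping single-letter initials is an assumption.

```python
def name_tokens(name):
    """Lower-case a name, strip periods and drop single-letter initials (assumption)."""
    cleaned = (part.strip(".").lower() for part in name.split())
    return {part for part in cleaned if len(part) > 1}


def loose_match(query, contributors):
    """True if every query token appears anywhere in the combined contributor list."""
    pool = set()
    for contributor in contributors:
        pool |= name_tokens(contributor)
    return name_tokens(query) <= pool


def strict_match(query, contributors):
    """True only if one single contributor carries all of the query tokens."""
    wanted = name_tokens(query)
    return any(wanted <= name_tokens(contributor) for contributor in contributors)


# The two contributor names come from the All Hallows book mentioned above.
all_hallows = ["David Williams", "Antony Oldknow"]
springer_book = ["Mikhail E. Elyashberg", "Antony J. Williams"]
```

With these definitions `loose_match("Antony J. Williams", all_hallows)` is true while `strict_match` on the same list is false, which is the false-positive pattern seen in the result set. Both functions accept the Springer book, so a per-author match would keep the genuine hits.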
