Archive for category Data Quality

Comparing the EPA CompTox Dashboard with ChemSpider for MS-based Structure Identification

It’s almost ten years, this April, since ChemSpider was released to the public at the 233rd ACS meeting in Chicago. For two years, prior to being acquired by RSC in May 2009, we worked very closely with a number of mass spectrometry vendors including Waters (Micromass), Thermo and Agilent. I always believed that the work we did with ChemSpider could be highly valued by the mass spectrometry community. This was especially true after we published the work on the identification of known unknowns with James Little. ChemSpider has certainly become highly recognized, and used, by an increasing number of mass spectrometry vendors (through the ChemSpider Web Services).

A few months ago Andrew McEachran joined our team as a postdoc. By combining my experience in applying ChemSpider to structure identification, his mass spectrometry skills and experience, and the work of our tremendous development team on the CompTox Chemistry Dashboard, we were able to make some further advances in the “identification of known unknowns”. Our efforts were recently reported in the publication “Identifying known unknowns using the US EPA’s CompTox Chemistry Dashboard”. Readers are pointed to the summary tables in the article (results) demonstrating the improved performance of the CompTox Chemistry Dashboard based on high quality data sources and new approaches to rank ordering results based on formula and mass searching.

We recently rolled out new functionality and “MS-Ready structure batch-based searching” to offer even greater support for MS-based structure identification. We will report on further extensions to this work at the Spring ACS Meeting.

The AltMetrics for the Article are shown below


The EPA Online Prediction Physicochemical Prediction Platform to Support Environmental Scientists

This poster was presented at the American Chemical Society meeting in Philadelphia in August 2016, at the Sci-Mix gathering and in the ENVR section on Wednesday.

Sci-Mix: August 22, 2016, from 8:00 PM to 10:00 PM

SESSION TIME: Wednesday, August 24, 2016, 6:00 PM – 8:00 PM
Hall D – Pennsylvania Convention Center

Poster Title: The EPA Online Prediction Physicochemical Prediction Platform to Support Environmental Scientists

As part of our efforts to develop a public platform to provide access to predictive models we have attempted to disentangle the influence of the quality versus quantity of data available to develop and validate QSAR models. Using a thorough manual review of the data underlying the well-known EPI Suite software, we developed automated processes for the validation of the data using a KNIME workflow. This includes approaches to validate different chemical structure representations (e.g. molfile and SMILES) and identifiers (chemical names and registry numbers), and methods to standardize the data into QSAR-consumable formats for modeling. Our efforts to quantify and segregate data into various quality categories have allowed us to thoroughly investigate the resulting models developed from these data slices, as well as to examine whether or not efforts put into the development of large high-quality datasets have the expected pay-off in terms of prediction performance. Machine-learning approaches have been applied to create a series of models that have been used to generate predicted physicochemical and environmental parameters for over 700,000 chemicals. These data are available online via the EPA’s iCSS Chemistry Dashboard. This abstract does not reflect U.S. EPA policy.
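One of the simpler identifier checks of the kind described in the abstract, validating registry numbers, can be illustrated with the CAS Registry Number check digit. The following is a minimal Python sketch and not the KNIME workflow itself. The function name `is_valid_casrn` is hypothetical, and the only rule assumed is the published CAS checksum: the final digit equals the weighted sum of the preceding digits, weighted 1, 2, 3, ... from the right, modulo 10.

```python
import re


def is_valid_casrn(casrn: str) -> bool:
    """Return True if the string is a well-formed CAS RN with a correct check digit.

    Hypothetical helper for illustration only; it checks format and checksum,
    not whether the number is actually assigned to a substance.
    """
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", casrn.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost digit
    # before the check digit, then take the sum modulo 10.
    total = sum(weight * int(d) for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == check_digit


if __name__ == "__main__":
    for candidate in ("7732-18-5", "58-08-2", "7732-18-4"):
        print(candidate, is_valid_casrn(candidate))
```

A check like this only catches transcription errors in the number itself. Confirming that a registry number, name and structure all refer to the same chemical still requires cross-checking against curated sources.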


BIA-10-2474, confusions in chemical structure and the need for EARLY clarity in chemical structures

My blog has been fairly inactive for the past few months, driven primarily by my move from working on cheminformatics at the Royal Society of Chemistry to working at the National Center for Computational Toxicology at the Environmental Protection Agency. While I stopped working on ChemSpider about 18 months before I left RSC (to focus on the developing RSC Data Repository), my focus on data quality and my long-standing interest in “accuracy in chemical structure representations” have never dwindled. At the EPA-NCCT we are very focused on working to produce high quality chemical structure databases, following on from the work of my colleague Ann Richard who initiated work on DSSTox over a decade ago.

It was therefore with great interest that I became aware of the confusion regarding the chemical structure of BIA-10-2474, a drug that has attracted a lot of attention because of a clinical trial with negative outcomes. I am entering the story late compared to my long-time collaborators and friends Sean Ekins, Chris Southan and Alex Clark, but more about their work later. The news to date is best summarized at Derek’s In the Pipeline blog and on David Kroll’s post on Forbes.

Based on my history of helping to curate chemical structures on Wikipedia (starting one Christmas in 2008), my experience is that Wikipedia is a GOOD PLACE to source high quality structures, especially after the work invested in curating chemical data over the years. The first structure for BIA-10-2474 that was reported on Wikipedia is shown below.

ORIGINAL BIA structure

On January 16th Chris performed his usual thorough examination of structure integrity and links to public sources (he is a master in this domain!) but commented specifically “The molecular identity of BIA-10-2474 can only be formally verified directly by BIAL or indirectly from regulatory documentation they may have submitted”, as the chemical structure itself was inferred from the name.

Nevertheless my friends Sean Ekins and Alex Clark were already investigating what OPEN MODELS may be able to predict about the chemical: See here, here and here. You should be impressed regarding what is possible when running a molecular structure through several Bayesian models in Alex’s mobile app called PolyPharma!

By January 21st Chris was commenting that the structure had changed, and highlighted an extract from what was exposed by Figaro that listed the chemical name: 3-(1-(cyclohexyl(methyl)carbamoyl)-1H-imidazol-4-yl)pyridine 1-oxide. Want to know what that name means as a structure? Take the name “3-(1-(cyclohexyl(methyl)carbamoyl)-1H-imidazol-4-yl)pyridine 1-oxide” and paste it into the free online service OPSIN. The results are shown below.

OPSIN BIA Structure
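The same name-to-structure conversion can also be scripted rather than pasted into the web page. The sketch below is a hedged illustration in Python: it assumes that the OPSIN web service accepts a URL-encoded name followed by a format extension such as `.smi` under `https://opsin.ch.cam.ac.uk/opsin/`. That URL pattern is an assumption that should be checked against the current OPSIN documentation, and the helper names `opsin_url` and `name_to_smiles` are hypothetical.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of the OPSIN web service; verify before relying on it.
OPSIN_BASE = "https://opsin.ch.cam.ac.uk/opsin/"


def opsin_url(name: str, fmt: str = "smi") -> str:
    """Build the request URL for a chemical name (hypothetical helper)."""
    # Percent-encode everything, including parentheses and spaces.
    return OPSIN_BASE + quote(name, safe="") + "." + fmt


def name_to_smiles(name: str, timeout: float = 10.0) -> str:
    """Fetch the SMILES string for a systematic name; needs network access."""
    with urlopen(opsin_url(name), timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


if __name__ == "__main__":
    bia_name = "3-(1-(cyclohexyl(methyl)carbamoyl)-1H-imidazol-4-yl)pyridine 1-oxide"
    print(opsin_url(bia_name))
    # print(name_to_smiles(bia_name))  # uncomment to query the live service
```

The live call is left commented out so the sketch can be run offline. A structure generated this way is still only as reliable as the name it was derived from.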

That structure has now found its way to Wikipedia (updated on the 21st January – check out the edits between the two forms of the article here).

FINAL BIA structure

Sean Ekins has maintained a running series of blog posts here. Using a stack of openly accessible algorithms and websites Sean has now produced a whole series of predictions for the “final molecule”. Chris Southan has also continued to expand his work and I direct you to his latest blogpost for more information. Nice stuff Chris.

It took days after news of the drug trial results started to appear before the chemical structure was actually identified (i.e. the structure was blinded). How much work, and how much confusion, was created by having the drug structure blinded? We have to imagine that the authorities had faster access to the details!

It is understandable that companies keep their chemical structures hidden. Patents are intentionally obfuscating (with a compound going into a trial commonly hidden among hundreds if not tens of thousands of chemicals that could be enumerated from a Markush structure). Until then Chris Southan will continue to educate the world about how competitive intelligence investigations are done.


Our dire need to mandate data standards and expectations for scientific publishing

This is a presentation that I delivered at the ACS Division of Chemical Information meeting regarding “Reproducibility, Reporting, Sharing & Plagiarism” at ACS Denver on 23rd March 2015.

I took the opportunity to take off my hat as VP of Strategic Development at RSC, and as a member of the cheminformatics group that built ChemSpider and works on other RSC projects related to it. Instead I presented on how a LACK OF MANDATES from publishers regarding the submission of data accompanying the articles I am involved in writing is actually weakening my scientific record, because the data are not being shared in the forms most useful to the community. I think there would be benefits if publishers started pushing me for MORE data, in fairly general standards, and allowed me (and others) to download the data in the form of molecules (and collections), spectral data, CSV files etc.



Providing Access to a Million NMR Spectra via the web

This presentation was given at the ACS Denver meeting on March 22nd 2015 in a CHED Division symposium

Providing Access to a Million NMR Spectra via the web

Antony Williams, Alexey Pshenichnov, Peter Corbett, Daniel Lowe, Carlos Coba

Access to large scale collections of NMR spectral data can serve a number of purposes in teaching spectroscopy to students. The data can be used in lectures, as training data sets for spectral interpretation and structure elucidation, and to underpin educational resources such as the Royal Society of Chemistry’s Learn Chemistry. These resources have been available for a number of years but have been limited to rather small collections of spectral data, specifically only about 3000 spectra. In order to expand the data collection and provide richer resources for the community we have been gathering data from various laboratories and, as part of a research project, we have used text-mining approaches to extract spectral data from articles and patents in the form of textual strings and utilized algorithms to convert the data into spectral representations. While these spectra are reconstructions of text representations of the original spectral data, we are investigating their value for the purpose of structure identification. This presentation will report on the processes of extracting structure-spectral pairs from text, approaches to performing automated spectral verification and our intention to assemble a spectral collection of a million NMR spectra and make them available online.
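To make the idea of converting a textual peak listing into a spectral representation concrete, here is a minimal Python sketch. It is not the text-mining pipeline described in the abstract. The regular expression, the Lorentzian line shape, the line width and the helper names are all illustrative assumptions, and the parser only handles simple entries of the form `shift (multiplicity, nH)` without coupling constants.

```python
import re
from typing import List, Tuple

Peak = Tuple[float, int]  # (chemical shift in ppm, number of protons)

# Matches entries such as "7.26 (s, 1H)"; coupling constants are not handled.
PEAK_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*\(\s*[a-z]+\s*,\s*(\d+)\s*H")


def parse_peaks(text: str) -> List[Peak]:
    """Extract (shift, proton count) pairs from a textual 1H NMR listing."""
    return [(float(shift), int(count)) for shift, count in PEAK_PATTERN.findall(text)]


def lorentzian(x: float, center: float, width: float) -> float:
    """Lorentzian line of unit height with full width at half maximum `width`."""
    half = width / 2.0
    return half ** 2 / ((x - center) ** 2 + half ** 2)


def reconstruct(peaks: List[Peak], start: float = 0.0, stop: float = 10.0,
                points: int = 2001, width: float = 0.02):
    """Sum one Lorentzian per peak, scaled by its proton count."""
    step = (stop - start) / (points - 1)
    axis = [start + i * step for i in range(points)]
    intensity = [sum(n * lorentzian(x, shift, width) for shift, n in peaks)
                 for x in axis]
    return axis, intensity


if __name__ == "__main__":
    listing = "1H NMR (400 MHz, CDCl3) δ 7.26 (s, 1H), 3.71 (s, 3H), 2.05 (s, 3H)"
    peaks = parse_peaks(listing)
    axis, intensity = reconstruct(peaks)
    print(peaks)
    print(max(intensity))
```

A reconstruction like this ignores multiplet splitting and real line shapes, which is one reason the value of such spectra for structure identification has to be investigated rather than assumed.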



PITTCON poster: Dealing with the complex challenge of managing diverse analytical chemistry data online

This is a talk I presented at Pittcon on Wednesday March 13th, 2015

Dealing with the complex challenge of managing diverse analytical chemistry data online

The Royal Society of Chemistry provides open access to data associated with tens of millions of chemical compounds. The richness and complexity of the data has continued to expand dramatically and the original vision for providing an integrated hub for structure-centric data has been delivered across the world to hundreds of thousands of users. With an intention of expanding the reach to cover more diverse aspects of chemistry-related data including compounds, reactions and analytical data, to name just a few data-types, we are in the process of delivering a Chemistry Data Repository. The data repository will manage the challenges of associated metadata, the various levels of required security (private, shared and public) and exposing the data as appropriate using semantic web technologies. Ultimately this platform will become the host for all chemicals, reactions and analytical data contained within RSC publications and specifically supplementary information. This presentation will report on the challenges of managing “Big Data” for chemists around the world and providing access to tools for structure dereplication, spectral database searching and the crowdsourcing of the world’s largest spectral database.



PITTCON Poster: Using an online database of chemical compounds for the purpose of structure identification

This is a poster I presented at Pittcon on Wednesday March 9th, 2015

Using an online database of chemical compounds for the purpose of structure identification

Online databases can be used for the purposes of structure identification. The Royal Society of Chemistry provides access to an online database containing tens of millions of compounds and this has been shown to be a very effective platform for the development of tools for structure identification. Since in many cases an unknown to an investigator is known in the chemical literature or a reference database, these “known unknowns” are commonly available now on aggregated internet resources. These types of compounds in commercial, environmental, forensic, and natural product samples can be identified by searching against these large aggregated databases, querying by either elemental composition or monoisotopic mass. Searching by elemental composition is the preferred approach, but it is often difficult to determine a unique elemental composition for compounds with molecular weights greater than 600 Da. In these cases, searching by the monoisotopic mass is advantageous. In either case, the search results can be refined by appropriate filtering to identify the compounds. We will report on integrated filtering and search approaches on our aggregated compound database for the purpose of structure identification and review our progress in using the platform for natural product dereplication purposes.
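The two query modes described in the abstract, by elemental composition and by monoisotopic mass, can be sketched in a few lines of Python. This is a toy illustration only and not the ChemSpider implementation: the three-compound `CANDIDATES` table, the function names and the ppm tolerance are assumptions made for the example, and only a handful of elements are supported.

```python
import re
from typing import Dict, List, Tuple

# Monoisotopic masses (Da) of the most abundant isotope of each element.
MONOISOTOPIC = {
    "C": 12.0, "H": 1.00782503207, "N": 14.0030740048,
    "O": 15.99491461956, "S": 31.97207100, "P": 30.97376163, "Cl": 34.96885268,
}

# Toy stand-in for an aggregated compound database (illustrative only).
CANDIDATES = {"caffeine": "C8H10N4O2", "theophylline": "C7H8N4O2", "glucose": "C6H12O6"}


def parse_formula(formula: str) -> Dict[str, int]:
    """Turn a simple formula such as 'C8H10N4O2' into element counts."""
    counts: Dict[str, int] = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if element not in MONOISOTOPIC:
            raise ValueError(f"unsupported element: {element}")
        counts[element] = counts.get(element, 0) + int(number or 1)
    return counts


def monoisotopic_mass(formula: str) -> float:
    """Sum the monoisotopic masses of all atoms in the formula."""
    return sum(MONOISOTOPIC[el] * n for el, n in parse_formula(formula).items())


def search_by_formula(formula: str, database: Dict[str, str]) -> List[str]:
    """Return names whose elemental composition equals the query composition."""
    query = parse_formula(formula)
    return [name for name, f in database.items() if parse_formula(f) == query]


def search_by_mass(mass: float, database: Dict[str, str],
                   ppm: float = 5.0) -> List[Tuple[str, float]]:
    """Return (name, ppm error) pairs within tolerance, best match first."""
    hits = []
    for name, formula in database.items():
        error = (monoisotopic_mass(formula) - mass) / mass * 1e6
        if abs(error) <= ppm:
            hits.append((name, error))
    return sorted(hits, key=lambda hit: abs(hit[1]))


if __name__ == "__main__":
    print(search_by_formula("C8H10N4O2", CANDIDATES))
    print(search_by_mass(180.0634, CANDIDATES, ppm=5.0))
    print(search_by_mass(180.0634, CANDIDATES, ppm=10.0))
```

In this toy table glucose and theophylline differ by only about 7 ppm in monoisotopic mass, so widening the tolerance from 5 to 10 ppm returns both. That is the situation in which the additional filtering mentioned in the abstract becomes necessary.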



A TERRIBLE implementation of Name Searching on ACS Journals

Yes, I am a Williams. And THAT is an incredibly common surname. But I am an Antony Williams, notice no H in the name, i.e. NOT Anthony. In the field of chemistry there are not many of us around…a couple I know of, but not many overall. Google Scholar does an extremely good job of automatically associating my newly published articles with my Citations profile here:

The last five articles automatically associated with my profile. I do NOT make any associations manually at this point.

I am assuming that this is done by understanding the type of work I publish on, some of the co-author network maps that have been established as my profile has developed etc. I assume that their approach is very intelligent relative to some of the more commonplace searches that have been implemented… certainly the results are GOOD.

I noticed one disastrous example today when our article “ChemTrove: Enabling a Generic ELN to Support Chemistry Through the Use of Transferable Plug-ins and Online Data Sources” was published on the Journal of Chemical Information and Modeling here. Right there to the left of the abstract is an offer to look at other content by the authors.

Look for related content by the authors on JCIM

I was interested to see what else ACS knew about my content so I clicked on my name… which performed a search and provided me with 96 articles by Andrew Williams (mostly), by Aaron Williams, by Anthony Williams (not me) and Allan Williams (to name a few). Eventually I managed to find 3 that were associated with me by searching the list for Antony Williams, but none of those I published as Antony J. Williams were recovered.

Also, my colleague Valery Tkachenko is listed as an author with his name misspelled as Valery Tkachenkov. What is simply inappropriate in my opinion is that the process involved taking the list of our submitted names, copied below directly from the submitted manuscript, and changing them to their own interpretation of how we would want to see our names listed.

From this:

Aileen E. Day*†, Simon J. Coles, Colin L. Bird, Jeremy G. Frey, Richard J. Whitby, Valery E. Tkachenko§, Antony J. Williams§

To This:

Names changed from the original manuscript to those produced at submission

Notice that for Aileen and Jeremy the middle initials were expanded, Colin had his middle initial changed from L. to I., Richard, Valery and I had our middle initials dropped, and Valery had a v added to his surname. Why not simply copy and paste the names from the manuscript?

I will point out that this is a “Just Accepted” manuscript and likely the changes in names will be caught and edited, especially now I have just pointed them out. “Just accepted” does have some disclaimers:

The disclaimers regarding Just Accepted manuscripts

While they can edit the names to match what we originally provided, I don’t think it will fix the issue of finding all of my articles on ACS journals: when I navigated to one of my other articles and did the search from my listed name, it found exactly the same 96 hits.

Maybe a thought to use my ORCID profile to look for ACS journal articles associated with my name?

Unfortunately the data is already out in the wild as when I claimed the article on Kudos all of the name spelling issues had clearly spilled over via the DOI:

Names transferred via DOI to the Grow Kudos Platform

Ah…the things that surprise me….or not.


A chemistry data repository to serve them all

A presentation that I am giving around UK universities in September/October 2014

A chemistry data repository to serve them all

Over the past five years the Royal Society of Chemistry has become world renowned for its public domain compound database that integrates chemical structures with online resources and available data. ChemSpider regularly serves over 50,000 users per day who are seeking chemistry related data. In parallel we have used ChemSpider and available software services to underpin a number of grant-based projects that we have been involved with: Open PHACTS – a semantic web project integrating chemistry and biology data, PharmaSea – seeking out new natural products from the ocean and the National Chemical Database Service for the United Kingdom. We are presently developing a new architecture that will offer broader scope in terms of the types of chemistry data that can be hosted. This presentation will provide an overview of our Cheminformatics activities at RSC, the development of a new architecture for a data repository that will underpin a global chemistry network, and the challenges ahead, as well as our activities in releasing software and data to the chemistry community.


Open innovation and chemistry data management contributions from the Royal Society of Chemistry resulting from the Open PHACTS project at #ACSsanfran

This is my presentation on Thursday 14th August at the ACS Meeting in San Francisco

Open innovation and chemistry data management contributions from the Royal Society of Chemistry resulting from the Open PHACTS project

The Royal Society of Chemistry was pleased to contribute to the Open PHACTS project, a 3 year project funded by the Innovative Medicines Initiative fund from the European Union. For three years we developed our existing platforms, created new and innovative widgets and data platforms to handle chemistry data, extended existing chemistry ontologies and embraced the semantic web open standards. As a result RSC served as the centralized chemistry data hub for the project. With the conclusion of the Open PHACTS project we will report on our experiences resulting from our participation in the project and provide an overview of what tools, capabilities and data have been released into the community as a result of our participation and how this may influence future projects. This will include the Open PHACTS open chemistry data dump including the chemistry related data in chemistry and semantic web consumable formats as well as some of the resulting chemistry software released to the community. The Open PHACTS project resulted in significant contributions to the chemistry community as well as the supporting pharmaceutical companies and biomedical community.
