Author Archives: tony

About tony

Antony (Tony) J. Williams received his BSc in 1985 from the University of Liverpool (UK) and his PhD in 1988 from the University of London (UK). His PhD research used Nuclear Magnetic Resonance to study the effects of high pressure on molecular motions within lubricant-related systems. He moved to Ottawa, Canada to work for the National Research Council, performing fundamental research on the electron paramagnetic resonance of radicals trapped in single crystals. Following his postdoctoral position he became the NMR Facility Manager for the University of Ottawa. Tony then joined the Eastman Kodak Company in Rochester, New York as their NMR Technology Leader. He led the laboratory to develop quality control across multiple spectroscopy labs and helped establish walk-up laboratories providing NMR, LC-MS and other forms of spectroscopy to hundreds of chemists across multiple sites. This included the delivery of spectroscopic data to the desktop, automated processing, and his initial interests in computer-assisted structure elucidation (CASE) systems. He also worked with a team to develop the world's first web-based LIMS, WIMS, capable of chemical structure searching and spectral display. With his developing cheminformatics skills and passion for data management he left corporate America to join a small start-up company working out of Toronto, Canada. He joined ACD/Labs as their NMR Product Manager and held various roles, including Chief Science Officer, during his 10 years with the company. His responsibilities included managing over 50 products at one time prior to developing a product management team, as well as managing sales, marketing, technical support and technical services. ACD/Labs was one of Canada's Fast 50 Tech Companies and one of Forbes' Fast 500 companies in 2001. His primary passions during his tenure with ACD/Labs were the continued adoption of web-based technologies and the development of automated structure verification and elucidation platforms.
While at ACD/Labs he suggested the possibility of developing a public resource for chemists attempting to integrate internet-available chemical data. He finally pursued this vision with some close friends as a hobby project in the evenings, and the result was the ChemSpider database. Even while running out of a basement on hand-built servers, the website developed a large community following that eventually culminated in its acquisition by the Royal Society of Chemistry (RSC), based in Cambridge, United Kingdom. Tony joined the organization, together with some of the other ChemSpider team, and became their Vice President of Strategic Development. At RSC he continued to develop cheminformatics tools, specifically ChemSpider, and was the technical lead for the chemistry aspects of the Open PHACTS project, a project focused on the delivery of open data, open source and open systems to support the pharmaceutical sciences. He was also the technical lead for the UK National Chemical Database Service and the RSC lead for the PharmaSea project, which attempted to identify novel natural products from the ocean. He left RSC in 2015 to become a Computational Chemist in the National Center for Computational Toxicology at the Environmental Protection Agency, where he is bringing his skills to bear working with a team on the delivery of a new software architecture for the management and delivery of data, algorithms and visualization tools. The "Chemistry Dashboard" was released on April 1st, no fooling, and provides access to over 700,000 chemicals, experimental and predicted properties and a developing link network to support the environmental sciences. Tony remains passionate about computer-assisted structure elucidation and verification approaches and continues to publish in this area. He is also passionate about teaching scientists to benefit from the developing array of social networking tools for scientists and is known as the ChemConnector on the networks.
Over the years he has had adjunct roles at a number of institutions and presently enjoys working with scientists at both UNC Chapel Hill and NC State University. He is widely published with over 200 papers and book chapters and was the recipient of the Jim Gray Award for eScience in 2012. In 2016 he was awarded the North Carolina ACS Distinguished Speaker Award.

PRESENTATION ACS SPRING 2018: Overview of open resources to support automated structure verification and elucidation

Overview of open resources to support automated structure verification and elucidation

Cheminformatics methods form an essential basis for providing analytical scientists with access to data, algorithms and workflows. There are an increasing number of free online databases (compound databases, spectral libraries, data repositories) and a rich collection of software approaches that can be used to support automated structure verification and elucidation, specifically for Nuclear Magnetic Resonance (NMR) and Mass Spectrometry (MS). This presentation will provide an overview of the freely available data, tools, databases and approaches that support chemical structure verification and elucidation, highlight some of the known issues regarding data quality, and suggest approaches for resolving them. The importance of structure and spectral standards for data exchange will be discussed, especially with regard to how spectral data can be made openly available to the community via online tools and through scientific publishing. This work does not necessarily reflect U.S. EPA policy.

Leave a comment

Posted on March 26, 2018 in ACS Meetings


PRESENTATION ACS SPRING 2018: Sharing chemical structures with peer-reviewed publications. Are we there yet?

Sharing chemical structures with peer-reviewed publications. Are we there yet?

In the domain of chemistry one of the greatest benefits to publishing research is that data can be shared. Unfortunately, the vast majority of chemical structure data associated with scientific publications remain locked up in document form, primarily in PDF files or trapped on webpages. Despite the explosive growth of online chemical databases and the overall maturity of cheminformatics platforms, many barriers stifle the exchange of chemical structures via publications. These challenges include incomplete support by accepted standards (especially InChI) for advanced stereochemistry, organometallic compounds and generic “Markush” representations, the difference between human-readable and computer-readable forms of data, and challenges with the computer representation of chemical structures. To address these obstacles to chemical structure sharing, US EPA National Center for Computational Toxicology scientists are using a combination of cheminformatics applications and online repositories to distribute chemical structure data associated with their publications. This presentation will describe how EPA-NCCT chemical structure data that is amenable to indexing and distribution are shared and highlight the benefit of open data sharing for modeling, data integration, and increasing research impact. This abstract does not reflect U.S. EPA policy.


PRESENTATION ACS SPRING 2018: Using the US EPA’s CompTox Chemistry Dashboard for structure identification and non-targeted analyses

Using the US EPA’s CompTox Chemistry Dashboard for structure identification and non-targeted analyses

Antony J. Williams, Andrew D. McEachran, Seth Newton, Kristin Isaacs, Katherine Phillips, Nancy Baker, Christopher Grulke and Jon R. Sobus

High resolution mass spectrometry (HRMS) and non-targeted analysis (NTA) are advancing the identification of emerging contaminants in environmental matrices, improving the means by which exposure analyses can be conducted. However, confidence in structure identification of unknowns in NTA presents challenges to analytical chemists. Structure identification requires integration of complementary data types such as reference databases, fragmentation prediction tools, and retention time prediction models. The goal of this research is to optimize and implement structure identification functionality within the US EPA’s CompTox Chemistry Dashboard, an open chemistry resource and web application containing data for ~760,000 substances. Rank-ordering the number of sources associated with chemical records within the Dashboard (Data Source Ranking) improves the identification of unknowns by bringing the most likely candidate structures to the top of a search results list. Database searching has been further optimized with the generation of MS-Ready Structures. MS-Ready structures are de-salted, stripped of stereochemistry, and mixture separated to replicate the form of a chemical observed via HRMS. Functionality to conduct batch searching of molecular formulae and monoisotopic masses was designed and released to improve searching efforts. Finally, a scoring-based identification scheme was developed, optimized, and surfaced via the Dashboard using multiple data streams contained within the database underlying the Dashboard. The scoring-based identification scheme improved the identification of unknowns over previous efforts using data source ranking alone. Combining these steps within an open chemistry resource provides a freely available software tool for structure identification and NTA. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
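The MS-Ready idea described above can be illustrated with a deliberately naive SMILES-level sketch. The real Dashboard pipeline uses a full cheminformatics toolkit rather than string manipulation, and the function here is purely illustrative:

```python
def ms_ready(smiles: str) -> str:
    """Naive SMILES-level sketch of an MS-Ready transformation:
    separate mixture/salt components, keep the largest one,
    and strip stereochemistry markers."""
    # SMILES components (salts, mixture constituents) are '.'-separated
    components = smiles.split(".")
    # Crude de-salting: keep the longest component
    # (real pipelines choose the largest *organic* fragment)
    largest = max(components, key=len)
    # Strip stereo markers: '@' (tetrahedral), '/' and '\' (double-bond geometry)
    for marker in ("@", "/", "\\"):
        largest = largest.replace(marker, "")
    return largest

# Sodium acetate: drop the counter-ion
print(ms_ready("CC(=O)O.[Na+]"))      # CC(=O)O
# L-alanine: drop the stereocentre annotation
print(ms_ready("C[C@@H](N)C(=O)O"))   # C[CH](N)C(=O)O
```

Mapping each MS-Ready form back to the original registered substances is what allows a single HRMS observation to match all the salts and stereoisomers it could plausibly represent.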

Leave a comment

Posted on March 25, 2018 in ACS Meetings


PRESENTATION ACS SPRING 2018: Adding Complex Expert Knowledge into Chemical Databases: Transforming Surfactants in Wastewater

Adding Complex Expert Knowledge into Chemical Databases: Transforming Surfactants in Wastewater

PRESENTED by Emma Schymanski

The increasing popularity of high mass accuracy non-target mass spectrometry methods has yielded extensive identification efforts based on chemical compound databases. Candidate structures are often retrieved with either exact mass or molecular formula from large resources such as PubChem, ChemSpider or the EPA CompTox Chemistry Dashboard. Additional data (e.g. fragmentation, physicochemical properties, reference and data source information) is then used to select potential candidates, depending on the experimental context. However, these strategies require the presence of substances of interest in these compound databases, which is often not the case as no database can be fully inclusive. A prominent example with clear data gaps is surfactants, which are used in many products in our daily lives, yet are often absent as discrete structures in compound databases. Linear alkylbenzene sulfonates (LAS) are a common, high use and high priority surfactant class with highly complex transformation behaviour in wastewater. Despite extensive reports in the environmental literature, few of the LAS and none of the related transformation products were found in any compound databases during an investigation into Swiss wastewater effluents, despite these forming the most intense signals. The LAS surfactant class will be used to demonstrate how the coupling of environmental observations with high resolution mass spectrometry and detailed literature data (expert knowledge) on the transformation of these species can be used to progressively "fill the gaps" in compound databases. The LAS and their transformation products have been added to the CompTox Chemistry Dashboard using a combination of "representative structures" and "related structures" starting from the structural information contained in the literature. 
By adding this information into a centralized open resource, future environmental investigations can now profit from the expert knowledge previously scattered throughout the literature. Note: This abstract does not reflect US EPA policy.



PRESENTATION ACS Spring 2018: Curating “Suspect Lists” for International Non-target Screening Efforts

Curating “Suspect Lists” for International Non-target Screening Efforts

Emma L. Schymanski1, Reza Aalizadeh2, Nikolaos S. Thomaidis2, Juliane Hollender3, Jaroslav Slobodnik4, Antony J. Williams5

1Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Campus Belval, Luxembourg.
2National and Kapodistrian University of Athens, Department of Chemistry, Panepistimiopolis Zografou, 157 71 Athens, Greece.
3Eawag: Swiss Federal Institute of Aquatic Science and Technology, Dübendorf, Switzerland.
4Environmental Institute, Okružná 784/42, 972 41 Koš, Slovak Republic.
5National Center for Computational Toxicology, US EPA, Research Triangle Park, Durham, NC, USA.

PRESENTED by Emma Schymanski

The NORMAN Network is a unique network of reference laboratories, research centres and related organisations for the monitoring of emerging environmental substances, throughout Europe and across the world. Key activities of the network include prioritization of emerging substances and non-target screening. A recent collaborative trial revealed that suspect screening (using specific lists of chemicals to find "known unknowns") was a very common and efficient way to expedite non-target screening (Schymanski et al. 2015, DOI: 10.1007/s00216-015-8681-7). As a result, the NORMAN Suspect Exchange was founded and members were encouraged to submit their suspect lists. To date 20 lists of highly varying substance numbers (between 52 and 30,418), quality and information content have been uploaded, including valuable information previously unavailable to the public. All preparation and curation was done within the network using open access cheminformatics toolkits. Additionally, members expressed a desire for one merged list ("SusDat"). However, as a small network with very limited resources (member contributions only), the burden of curating and merging these lists into a high quality, curated dataset went beyond the capacity and expertise of the network. In 2017 the NORMAN Suspect Exchange and the US EPA CompTox Chemistry Dashboard pooled resources to curate and upload these lists to the Dashboard. This talk will cover the curation and annotation of the lists with unique identifiers (known as DTXSIDs), plus the advantages and drawbacks of these for NORMAN (e.g. creating a registration/resource inter-dependence). It will cover the use of "MS-ready structure forms", with chemical substances provided in the form observed by the mass spectrometer (e.g. desalted, as separate components of mixtures), and how these efforts will support other NORMAN activities. Finally, limitations of existing cheminformatics approaches and future ideas for extending this work will be covered. 
Note: This abstract does not reflect US EPA policy.



PRESENTATION ACS Spring 2018: Curating and Sharing Structures and Spectra for the Environmental Community

Curating and sharing structures and spectra for the environmental community

Presented by Emma Schymanski

The increasing popularity of high mass accuracy non-target mass spectrometry methods has yielded extensive identification efforts based on spectral and chemical compound databases in the environmental community and beyond. Increasingly, new methods are relying on open data resources. Candidate structures are often retrieved with either exact mass or molecular formula from large resources such as PubChem, ChemSpider or the EPA CompTox Chemistry Dashboard. Smaller, selective lists of chemicals (also called "suspect lists") can be used to perform more efficient annotation. Mass spectral libraries can then be used to increase the confidence in tentative identification. Additional metadata (e.g. exposure and hazard information, reference and data source information) can be extremely useful to prioritize substances of high environmental interest. Exchanging information and "sharing structural linkages" between these resources requires extensive curation to ensure that the correct information is shared correctly, yet many valuable datasets arise from scientists and regulators with little official cheminformatics training. This talk will cover curation efforts undertaken to map spectral libraries (e.g. MassBank.EU, mzCloud) and suspect lists from the NORMAN Suspect Exchange to unique chemical identifiers associated with the US EPA CompTox Chemistry Dashboard. The curation workflow takes advantage of years of experience, as well as contact with the original data providers, to enable open access to valuable, curated datasets to support environmental scientists and the broader research community. Note: This abstract does not reflect US EPA policy.



PRESENTATION ACS Spring 2018: Automated Structure Annotation and Curation for MassBank: Potential and Pitfalls

Automated Structure Annotation and Curation for MassBank: Potential and Pitfalls 

Presented by Emma Schymanski

The European MassBank server was founded in 2012 by the NORMAN Network to provide open access to mass spectra of substances of environmental interest contributed by NORMAN members. The automated workflow RMassBank was developed as a part of this effort. This workflow included automated processing of the mass spectral data, as well as automated annotation using the SMILES, names and CAS numbers provided by the user. Cheminformatics toolkits (e.g. Open Babel, rcdk) and web services (e.g. the CACTUS Chemical Identifier Resolver, Chemical Translation Service (CTS), ChemSpider, PubChem) were then used to convert and/or retrieve the remaining information for completion of the MassBank records (additional names, InChIs, InChIKeys, several database identifiers, mol files), to avoid excessive burden on the users and reduce the chance of errors. To date, approximately 16,000 MS/MS spectra (61% of all open data as of Nov. 2016) corresponding to 1,269 (18%) unique chemicals have been uploaded to MassBank.EU via RMassBank. Curating the MassBank.EU records, as part of efforts to provide EPA CompTox Dashboard identifiers (DTXSIDs) for each record, revealed several conflicts in the chemical metadata arising from varying sources. In addition, the representation of "ambiguous substances", for example complex surfactant mixtures of various chain lengths and branching, or incompletely-defined structures of transformation products, is an ongoing challenge. In this work, we report on proof-of-concept solutions for "ambiguous structure" representation, currently unavailable in the majority of cheminformatics tools. This presentation reflects on the effectiveness of the original RMassBank concept, and identifies both the potential and the pitfalls of automated structure annotation with open resources for streamlining spectra contributions from external laboratories and users with widely ranging cheminformatics experience. Note: this work does not necessarily reflect U.S. EPA policy.
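The CACTUS Chemical Identifier Resolver mentioned above is driven by a simple URL scheme: a structure identifier followed by the desired output representation. A minimal sketch of building such request URLs, assuming the standard cactus.nci.nih.gov endpoint (the function name is ours):

```python
from urllib.parse import quote

CACTUS_BASE = "https://cactus.nci.nih.gov/chemical/structure"

def cactus_url(identifier: str, representation: str = "stdinchikey") -> str:
    """Build a CACTUS Chemical Identifier Resolver request URL.
    `representation` can be e.g. 'smiles', 'stdinchi', 'stdinchikey', 'names'."""
    return f"{CACTUS_BASE}/{quote(identifier)}/{representation}"

print(cactus_url("caffeine"))
# https://cactus.nci.nih.gov/chemical/structure/caffeine/stdinchikey
```

Fetching that URL returns the requested representation as plain text, which is what makes the service easy to script into annotation workflows like RMassBank.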



Bias and Data-Driven and Probability-Based Decision Making in Trivia Crack

I've been playing Trivia Crack for a few years now and, as of today, I am at level 326. I am an iPhone/iPad user so I grabbed it from the App Store. In playing I have filled my brain with some useless information, learned a lot of history, geography and sports, and taken advantage of my Science background to win more than a few challenges. In terms of Entertainment it's clear I stopped learning about new music a few years ago; I'm stuck in my musical history with little real interest in the new music scene. I have enjoyed playing Trivia Crack against my girlfriend for over three years and we continue to have regular periods where we actively engage with the game.

Trivia Crack has a lot of downloads, with Forbes reporting over 300 million, and I can hear the theme tune while sitting in restaurants as people get their dose of the day.

There is a lot of advice online for people trying to beat the game, much of it tactics-based. Hackers have even been taking their pokes at it. Looking at some of the analyses that have been made, I am at least in the top 1% for Science and, with a category performance of 86, am better than 99.8% of the people playing Trivia Crack. My weakest category is Sports: not a surprise, as I prefer to play sports rather than watch or read about them. I am generally flat across the board for Entertainment, Art and Literature, and Geography.

As a scientist I am data driven. However, as Louis Pasteur once commented, "In the fields of observation chance favors only the prepared mind". So, while playing over 3500 games I started noticing patterns that helped me play the game. There were numerous patterns I noticed over the years, but I will summarize them here and then share the data.

  • If I did not know the answer to a question and HAD to guess, my guesses generally worked out best when I guessed that the first (top) answer was correct. If I guessed the fourth (bottom) answer I was generally, but not always, wrong.
  • With this observation, which I could reproduce over and over, I decided to gather the data and analyze it statistically. The data is shown below and represents the number of times the correct answer appeared in each position, 1 to 4, top to bottom. Each column corresponds to the frequency of correct answers for a particular grouping. I gathered data over a number of days in five different groups, choosing different group sizes and stopping the gathering of data when there were 25, 50 or 100 answers in position 1.

The data speaks for itself (and is available for download on FigShare). In all five groupings the majority of answers are in position 1; commonly the chance of the answer being in position 1 versus position 4 is about double. This means that if you are lost in terms of answering a question, and have no idea which answer to choose, you should select position 1. The results, over time, will be that you are right more often than not. If you know that positions 2 and 3 are not the correct answers and are trying to choose between positions 1 and 4, then choose position 1. You will be correct about twice as often as if you chose position 4.
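The "about double" claim is easy to check against the per-position totals reported in the analysis quoted below (counts taken from the post, assuming position 1 = "A" through position 4 = "D"):

```python
# Number of times the correct answer appeared in each position (from the post)
counts = {"A": 275, "B": 193, "C": 166, "D": 134}
total = sum(counts.values())  # 768 questions in all

# Observed frequency of the correct answer in each position
for position, n in counts.items():
    print(f"{position}: {n / total:.1%}")

ratio = counts["A"] / counts["D"]
print(f"Position 1 is correct {ratio:.2f}x as often as position 4")  # 2.05x
```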

While I believe the data speaks for itself, a statistical analysis is certainly in order. I've done a lot of stats over the years but I am fortunate enough to know people who are far more proficient than I am. So, I approached my friend John Wambaugh and asked him to apply his preferred approach to data that I would provide. He wrote a little bit of code in R and produced the analysis below, which he concluded with "So, if you don't know the answer, always guess A." I agree: it's a useful strategy and worth trying out in your own Trivia Crack games. That said, I would expect the game to have a random distribution of correct answers, so maybe this is something the developers should address?

“If we assume that each time you answer a question one of the four answers must be right, then there are four probabilities describing the chance that each answer is right. These four probabilities must add up to 100 percent. The number of correct answers observed should follow what is called a Dirichlet distribution. The simplest thing would be for all the answers to be equally likely (25 percent) but we have Tony’s data from 6 groupings in which he got “A” 275 times, “B” 193 times, “C” 166 times, and “D” 134 times.

The probability density for Total given observed likelihood is 23200

While the probability density for Total assuming equal odds is 4.61e-08

But it is unlikely that even 768 questions gives us the exact odds. Instead, let's construct a hypothesis.

We observe that answer “A” was correct 35.8 percent of the time instead of 25 percent (even odds for all answers).

We can hypothesize that 35.8 percent is roughly the “right” number and that the other three answers are equally likely.

The probability density for Total assuming only "A" is more likely is 101.

Our hypothesis that “A” is right 35.8 percent of the time is 2.19e+09 times more likely than “A” being right only 25 percent of the time.

Among the individual games, the hypothesis is not necessarily always more likely:

For Game.1 the hypothesis that “A” is right 35.8 percent of the time is 129 times more likely.

For Game.2 the hypothesis that “A” is right 35.8 percent of the time is 1910 times more likely.

For Game.3 the hypothesis that “A” is right 35.8 percent of the time is 5.25 times more likely.

For Game.4 the hypothesis that “A” is right 35.8 percent of the time is 0.754 times more likely.

This value being less than one indicates that even odds are more likely for Game.4.

For Game.5 the hypothesis that “A” is right 35.8 percent of the time is 32 times more likely.

For Game.6 the hypothesis that “A” is right 35.8 percent of the time is 99.2 times more likely.

So, we might want to consider a range of possible probabilities for “A”.

Unsurprisingly, the density is maximized for probability of “A” being 36 percent.

However, we are 95 percent confident that the true value lies somewhere between 33 and 39 percent.

So, if you don’t know the answer, always guess “A”.
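John's R analysis can be approximated with a short, stdlib-only Python sketch. The counts are from the post; the exact density values differ slightly from his output (they depend on prior and rounding choices), but the conclusion is the same:

```python
from math import lgamma, log, exp

# Correct-answer counts by position across all groupings (from the post)
counts = [275, 193, 166, 134]  # A, B, C, D; 768 questions total

def log_dirichlet_pdf(p, alpha):
    """Log density of a Dirichlet(alpha) distribution evaluated at probability vector p."""
    log_norm = lgamma(sum(alpha)) - sum(lgamma(a) for a in alpha)
    return log_norm + sum((a - 1) * log(pi) for a, pi in zip(alpha, p))

# Posterior over the four answer probabilities under a flat prior: Dirichlet(counts + 1)
alpha = [n + 1 for n in counts]

# Hypothesis 1: all four positions equally likely
p_equal = [0.25] * 4
# Hypothesis 2: "A" right at its observed rate, the other three splitting the rest evenly
p_a = counts[0] / sum(counts)            # ~35.8 percent
p_skewed = [p_a] + [(1 - p_a) / 3] * 3

ratio = exp(log_dirichlet_pdf(p_skewed, alpha) - log_dirichlet_pdf(p_equal, alpha))
print(f"Skewed hypothesis is {ratio:.3g}x more likely than equal odds")
```

The ratio comes out in the billions, the same order of magnitude as the 2.19e+09 John reports, so the "always guess A" conclusion holds.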


Leave a comment

Posted on February 19, 2018 in Uncategorized



GUEST POST by Emma Schymanski: Suspect Screening with MetFrag and the CompTox Chemistry Dashboard

Identifying "known unknowns" via suspect and non-target screening of environmental samples with the in silico fragmenter MetFrag typically relies on the large compound databases ChemSpider and PubChem (see e.g. Ruttkies et al 2016). The size of these databases (over 50 and 90 million structures, respectively) yields many false positive hits: structures that were never produced in sufficient amounts to be realistically found in the environment (e.g. McEachran et al 2016). One motivation behind the US EPA's CompTox Chemistry Dashboard is to provide access to compounds of environmental relevance, currently approx. 760,000 chemicals. While the web services are not yet available to incorporate the Dashboard into MetFrag as a database like ChemSpider and PubChem, there are a number of features in MetFragBeta that enable users to use the CompTox Chemistry Dashboard to perform "known unknown" identification with MetFrag. This post highlights the suspect screening functionality.

First we have our (charged) mass. Take m/z = 256.0153. This was measured in positive mode and we assume (correctly) that it’s [M+H]+. Make sure you set this correctly in MetFrag.
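The neutral monoisotopic mass behind that setting is easy to derive: for an [M+H]+ ion, subtract the mass of a proton from the measured m/z. A minimal sketch covering only the two simplest singly charged adducts:

```python
PROTON_MASS = 1.007276  # monoisotopic mass of a proton, Da

def neutral_mass(mz: float, adduct: str = "[M+H]+") -> float:
    """Recover the neutral monoisotopic mass M from a measured m/z
    for a couple of simple singly charged adducts."""
    shifts = {
        "[M+H]+": -PROTON_MASS,  # protonated: remove the added proton
        "[M-H]-": +PROTON_MASS,  # deprotonated: add the lost proton back
    }
    return mz + shifts[adduct]

print(f"{neutral_mass(256.0153):.4f}")  # 255.0080
```

Getting the adduct wrong shifts the neutral mass by roughly 2 Da, which is why MetFrag asks you to set it explicitly before candidate retrieval.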


Then retrieve your candidates, e.g. using ChemSpider or PubChem and a 5 ppm error margin:

Take the peak list from the MassBank record and copy it into the Fragmentation settings:

You could now process the candidates … but we have not done anything with the Dashboard! This is hidden in the middle in the “Candidate Filter & Score Settings” tab:

You can use the Candidate Filter to process ONLY candidates that are in the CompTox Chemistry Dashboard, excluding all other candidates, by clicking on “Suspect Inclusion Lists” and selecting the “DSSTox” box (see screenshot), which retains (currently) 11 of the 156 ChemSpider candidates:

Once the processing is finished, the plot in the "Statistics" tab should look something like this, depending on what additional scores you selected:

It is also possible to use one (or more!) suspect lists to SCORE the different candidates without excluding any matches from ChemSpider or PubChem, by selecting the same box under the "MetFrag Scoring Terms" part instead (see screenshot). Additional lists like the Swiss Pharma list shown below can be downloaded from the NORMAN Suspect Exchange and also viewed under the Lists tab in the CompTox Chemistry Dashboard. MetFrag only needs a text file containing the InChIKeys of the substances for the upload, which can be obtained from the Dashboard or Suspect Exchange downloads.
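Preparing such an upload file can be scripted. The sketch below assumes the list download is a CSV with an INCHIKEY column (the column name and file layout are assumptions; adjust them to match the actual download):

```python
import csv

def write_inchikey_list(csv_path: str, out_path: str, column: str = "INCHIKEY") -> int:
    """Extract InChIKeys from a CSV list download into a plain text file
    (one key per line) suitable for a MetFrag suspect-list upload.
    Returns the number of keys written."""
    written = 0
    with open(csv_path, newline="") as src, open(out_path, "w") as dst:
        for row in csv.DictReader(src):
            key = (row.get(column) or "").strip()
            if key:  # skip rows with a missing or empty InChIKey
                dst.write(key + "\n")
                written += 1
    return written
```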

Using the Suspect Lists as a “Scoring term”, along with some other criteria and restrictions, will give you a results plot looking more like this:

Curious to find out more? MetFrag comes with a built-in example, and you can try this exact example yourself using the peak list copied from the bottom of the MassBank spectrum record.

There are many more features to discover: try the website, read the paper (Ruttkies et al 2016) and if you have any questions, please comment below!

Author: Emma Schymanski, 21/11/2017

Leave a comment

Posted on December 8, 2017 in MS Structure Identification



The National Chemical Database Service Allowing Depositions

The UK National Chemical Database Service has been online a few years now, since 2012. When I worked at RSC I was intimately involved in writing the technical response to the EPSRC call for the service and, in a blog post at the time, I outlined a lot of intentions for the project. A key part of the project from my point of view was to deliver a repository to store structures, spectra, reactions, CIF files etc., as I outlined in that post.

“Our intention is to allow the repository to host data including chemicals, syntheses, property data, analytical data and various other types of chemistry related data. The details of this will be scoped out with the user-community, prioritized and delivered to the best of our abilities during the lifetime of the tender. With storage of structured data comes the ability to generate models, to deliver reference data as the community contributes to its validation, and to integrate and disseminate the data, as allowed by both licensing and technology, to a growing internet of the chemical sciences.”

In March 2014, at the ACS Meeting in Dallas, I presented on our progress towards providing the repository (see this slide deck). ChemSpider has been online for over ten years; we were accepting structure depositions within the first 3 months and spectra a few weeks later (see blog post). The ability to deposit structures as molfiles or SDF files has been available on ChemSpider for a long time, and we delivered the ability to validate and standardize structures using the CVSP platform, which we submitted for publication three years ago (October 28th, 2014) and which has since been published. With structure and spectra deposition in place for over a decade, a validation and standardization platform made public three years ago, and a lot of experience with depositing data onto ChemSpider, all the building blocks for the repository have been in place.

Today I received an email announcing "Compound and Spectra Deposition into ChemSpider". I read it with interest, as I guessed it meant the capability was "going mainstream" in some way after being around for a decade. Refactoring should be a constant for any mature platform, so my expectation was a more seamless process for depositing various types of data: a more beautiful interface, new whizz-bang visualization widgets building on a decade of legacy development, taking the best of what we built for data registration, structure validation and standardization (and all of its lessons!), and rebuilds of some of the spectral display components that we had. It's not quite what I found when I tested it.

Here’s my review.

My expectation would be to go to the deposition site and deposit data to ChemSpider. The website is simply a blue button with "Log in with your ORCID". There is language recognizing that the Open PHACTS project funded the validation and standardization platform work, which is definitely appropriate, but some MORE GUIDANCE as to what the site is would be good!

“Validation and standardisation of the chemical structures was developed as part of the Open PHACTS project and received support from the Innovative Medicines Initiative Joint Undertaking under grant agreement no. 115191, resources of which are composed of financial contribution from the European Union’s Seventh Framework Programme (FP7/2007-2013) and EFPIA companies’ in-kind contribution.”

This means that it should be possible to deposit a molfile, have it checked (validated) and standardized then deposited into ChemSpider, having passed through CVSP. So what happened?

I downloaded the structure of Chlorothalonil from our dashboard and loaded it. The result is shown below. The structure was standardized and correctly recognized as a V3000 molfile. The original structure was not visible, there were no errors or warnings, and the structure DID standardize.

Deposition into ChemSpider failed with an Oops

Next I tried a structure from ChemSpider because, if the structures are going INTO ChemSpider, then I should be able to load one that comes FROM ChemSpider. I wanted something fun so I grabbed one of the many Taxol-related structures (there are 61 in total) and downloaded the version with multiple 13C labels. It looked like this:

When I uploaded this, a V2000 molfile, the result is as shown below.

The original isotope labels were removed, the layout was recognized as congested, and partially defined stereo was recognized. But it wouldn't deposit. I tried many others and they would not deposit either; I was going to give up but tried benzene, V2000, downloaded from ChemSpider. And….YAY….it went in. The result is below.

A unique DOI is issued to the record, associated with my name. As far as I can tell it is NOT deposited into ChemSpider, because the structure is already in ChemSpider, and there is no link from ChemSpider back to my deposition that I can find. My next try was to find a chemical NOT in ChemSpider and to deposit that. That failed. I tried benzene again and it worked a second time. I judged that maybe a simple alkyl chain would work for deposition. The result is below.

The warning “Contains completely undefined stereo: mixtures” does not make sense at all for this chemical. PLUS it wouldn’t deposit.

I then tried to register a sugar as a projection with the result shown below. I consider this one to have some real errors and do not AT ALL like the standardized version.

I tried a simple inorganic. I think KCl should be recognized as an ionic compound, K+Cl-; at least SOME warning!?

The testing I did took about an hour overall and I identified a LOT of issues. I think this release, while it may be a beta release for feedback, is way premature and needs a lot more testing. I am hopeful that more people will fully test the platform. The ABILITY to deposit data, get a DOI and associate it with your ORCID account is welcome, but it's not obvious that anything is linked back to ORCID; it appears to be nothing more than a login mechanism.

I did NOT test spectral deposition but am concerned that the request seems to be for original data. In binary vendor file format? Uh-oh. That’s not a good idea!

I hope this blog motivates the community to test, give feedback and push the deposition system to deal with complex chemistries. That way at least the boundary conditions of performance for Deposit.ChemSpider.Com (which appears to write a chemical to some other repository, as there is no real connection to ChemSpider that I can find) can be defined, the system can be improved and a community can be built around the functionality.

Building public domain chemistry databases is hard work. User feedback and guidance is essential. Please give your feedback and test the system.


Posted on October 20, 2017 in Uncategorized