The National Chemical Database Service Allowing Depositions

The UK National Chemical Database Service (available here) has been online for a few years now, since 2012. When I worked at RSC I was intimately involved in writing the technical response to the EPSRC call for the service and, in this blog, I outlined a lot of intentions for the project. A key part of the project, from my point of view, was to deliver a repository to store structures, spectra, reactions, CIF files, etc., as I outlined in the blog post.

“Our intention is to allow the repository to host data including chemicals, syntheses, property data, analytical data and various other types of chemistry related data. The details of this will be scoped out with the user-community, prioritized and delivered to the best of our abilities during the lifetime of the tender. With storage of structured data comes the ability to generate models, to deliver reference data as the community contributes to its validation, and to integrate and disseminate the data, as allowed by both licensing and technology, to a growing internet of the chemical sciences.”

In March 2014, at the ACS Meeting in Dallas, I presented on our progress towards providing the repository (see this slide deck). ChemSpider has been online for over ten years; we were accepting structure depositions within the first 3 months and spectra a few weeks later (see blogpost). The ability to deposit structures as molfiles or SDF files has been available on ChemSpider for a long time, and we delivered the ability to validate and standardize them using the CVSP platform (http://cvsp.chemspider.com/), which we submitted for publication three years ago (October 28th, 2014) and which is published here: https://jcheminf.springeropen.com/articles/10.1186/s13321-015-0072-8. With structure and spectra deposition in place for over a decade, a validation and standardization platform made public three years ago, and a lot of experience with depositing data onto ChemSpider, all the building blocks have been in place for the repository.

Today I received an email announcing “Compound and Spectra Deposition into ChemSpider“. I read it with interest, as I guessed it meant deposition was “going mainstream” in some way, given that it has been around for a decade as a capability. Refactoring should be a constant for any mature platform, so I expected a more seamless process for depositing various types of data, a more beautiful interface, and new whizz-bang visualization widgets. I expected these to build on a decade of legacy development, taking the best of what we built for data registration, structure validation and standardization (and all of its lessons!), along with rebuilds of some of the spectral display components that we had. It’s not quite what I found when I tested it.

Here’s my review.

My expectation was that I could go to http://deposit.chemspider.com and deposit data to ChemSpider. The website is simply a blue button with “Log in with your ORCID”. There is language recognizing that the OpenPHACTS project funded the validation and standardization platform work, which is definitely appropriate, but some MORE GUIDANCE as to what the site is would be good!

“Validation and standardisation of the chemical structures was developed as part of the Open PHACTS project and received support from the Innovative Medicines Initiative Joint Undertaking under grant agreement no. 115191, resources of which are composed of financial contribution from the European Union’s Seventh Framework Programme (FP7/2007-2013) and EFPIA companies’ in-kind contribution.”

This means that it should be possible to deposit a molfile, have it checked (validated) and standardized then deposited into ChemSpider, having passed through CVSP. So what happened?

I downloaded the structure of Chlorothalonil from our dashboard and loaded it. The result is shown below. The structure was standardized and correctly recognized as a V3000 molfile. The original structure was not visible, there were no errors or warnings and the structure DID standardize.
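For readers less familiar with the two molfile flavours mentioned in this review, here is a minimal sketch of how a file can be recognized as V2000 or V3000. It relies only on the version tag at the end of the counts line (the fourth line of a molfile). The function name and the tiny sample records are my own illustrations, and this is not a description of how CVSP or the deposition site actually does the detection.

```python
def molfile_version(molfile_text: str) -> str:
    """Return 'V2000', 'V3000', or 'unknown' from the counts line of a molfile."""
    lines = molfile_text.splitlines()
    if len(lines) < 4:
        return "unknown"
    # The header block is three lines; the counts line is the fourth
    # and ends with the version tag.
    tokens = lines[3].split()
    tag = tokens[-1].upper() if tokens else ""
    return tag if tag in ("V2000", "V3000") else "unknown"


# Hypothetical single-carbon records, written by hand for illustration only.
METHANE_V2000 = "\n".join([
    "",
    "  sketch",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
])

METHANE_V3000 = "\n".join([
    "",
    "  sketch",
    "",
    "  0  0  0     0  0            999 V3000",
    "M  V30 BEGIN CTAB",
    "M  V30 COUNTS 1 0 0 0 0",
    "M  V30 BEGIN ATOM",
    "M  V30 1 C 0 0 0 0",
    "M  V30 END ATOM",
    "M  V30 END CTAB",
    "M  END",
])

print(molfile_version(METHANE_V2000))  # V2000
print(molfile_version(METHANE_V3000))  # V3000
```

A check like this only identifies the format; it says nothing about whether the structure inside is chemically sensible, which is the job of the validation step.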

Deposition into ChemSpider failed with an “Oops” error.

Next I tried a structure from ChemSpider, because if the structures are going INTO ChemSpider then I should be able to load one that comes FROM ChemSpider. I wanted to get something fun so grabbed one of the many Taxol-related structures. There are 61 Taxol-related structures in total. I downloaded the version with multiple C13-labels. It looked like this:

When I uploaded this, a V2000 molfile, the result is as shown below.

The original isotope labels were removed, the layout was recognized as congested, and partially defined stereo was recognized. But it wouldn’t deposit. I tried many others and none of them would deposit. I was going to give up but tried Benzene, V2000, downloaded from ChemSpider. And….YAY….it went in. The result is below.

A unique DOI is issued to the record, associated with my name. It is NOT deposited into ChemSpider as far as I can tell, because the structure is already in ChemSpider. There is also no link from ChemSpider back to my deposition that I can find. My next try was to find a chemical NOT in ChemSpider and to deposit that. That failed. I tried Benzene again and it worked a second time. I judged that maybe a simple alkyl chain would work for deposition. The result is below.

The warning “Contains completely undefined stereo: mixtures” does not make sense at all for this chemical. PLUS it wouldn’t deposit.

I then tried to register a sugar as a projection with the result shown below. I consider this one to have some real errors and do not AT ALL like the standardized version.

I tried a simple inorganic. I think KCl should be recognized as an ionic compound as K+Cl-, at least SOME warning!?
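To make the kind of warning I have in mind concrete, here is a minimal sketch of a rule that flags an alkali metal drawn with a covalent bond to a halogen in a V2000 molfile. Everything in it is an assumption for illustration: the function name, the element lists, the warning text and the hand-written KCl records are mine, and it is not how CVSP implements its rules. For brevity it splits lines on whitespace, which only works for small structures; a real parser should read the fixed-width columns of the V2000 format.

```python
ALKALI_METALS = {"Li", "Na", "K", "Rb", "Cs"}
HALOGENS = {"F", "Cl", "Br", "I"}


def covalent_salt_warnings(molfile_text: str) -> list[str]:
    """Flag alkali metal-halogen pairs joined by a bond in a small V2000 molfile."""
    lines = molfile_text.splitlines()
    counts = lines[3].split()
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    # Atom block: x, y, z, then the element symbol.
    symbols = [lines[4 + i].split()[3] for i in range(n_atoms)]
    warnings = []
    for j in range(n_bonds):
        # Bond block: first atom index, second atom index, bond type (1-based indices).
        first, second = lines[4 + n_atoms + j].split()[:2]
        a, b = symbols[int(first) - 1], symbols[int(second) - 1]
        if {a, b} & ALKALI_METALS and {a, b} & HALOGENS:
            metal, halide = (a, b) if a in ALKALI_METALS else (b, a)
            warnings.append(
                f"{metal}-{halide} drawn with a covalent bond; "
                f"consider {metal}+ and {halide}-"
            )
    return warnings


# Hypothetical KCl drawn with a single bond between K and Cl.
KCL_COVALENT = "\n".join([
    "",
    "  sketch",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 K   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.0000    0.0000    0.0000 Cl  0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "M  END",
])

# The same salt drawn as separate ions, with charges on an "M  CHG" line.
KCL_IONIC = "\n".join([
    "",
    "  sketch",
    "",
    "  2  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 K   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.0000    0.0000    0.0000 Cl  0  0  0  0  0  0  0  0  0  0  0  0",
    "M  CHG  2   1   1   2  -1",
    "M  END",
])

print(covalent_salt_warnings(KCL_COVALENT))
print(covalent_salt_warnings(KCL_IONIC))
```

The covalent drawing produces one warning and the ionic drawing produces none, which is the sort of behaviour I would expect from a validation step, even if the real rule set would need to cover far more than alkali halides.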

The testing I did took about an hour overall. I identified a LOT of issues. I think this release, while it may be a beta release for feedback, is way premature and needs a lot more testing. I am hopeful that more people will fully test the platform, as the ABILITY to deposit data, get a DOI, and associate it with your ORCID account is valuable. However, it’s not obvious that anything is linked back to ORCID; it appears to be used for nothing more than login.

I did NOT test spectral deposition but am concerned that the request seems to be for original data. In binary vendor file format? Uh-oh. That’s not a good idea!

I hope this blog motivates the community to test, give feedback and push the deposition system to deal with complex chemistries, so that at least the boundary conditions of performance for Deposit.ChemSpider.Com can be defined, the system can be improved and a community can be built around the functionality. At present the site appears to be more about writing a chemical to some other repository, as there is no real connection to ChemSpider that I can find (?).

Building public domain chemistry databases is hard work. User feedback and guidance are essential. Please give your feedback and test the system.

Posted October 20, 2017 by tony in Uncategorized

4 responses to The National Chemical Database Service Allowing Depositions

  1. I received an email from my husband, sitting in an Open Science workshop, pointing me to this service. He knows I have been a huge ChemSpider fan and user over many years now – and wondered if I knew about it.
    I found the entry screen off-putting (just the login, no clarification of what the service offers, what formats it accepts, etc.) and could only see a contact form, with no details about who was involved (I was curious). I didn’t have my ORCID details on hand and waited until a few days later, on a laptop with the ID integrated, to try deposition. Only one mol file at a time?!?! As a heavy user, I would at least like SDF or SMILES or InChI or some choice. I rarely have just mol files lying around, so I also tried depositing a structure Tony and I helped get into the Dashboard, because I knew it wasn’t in ChemSpider. I received the same “Oops”, tried a couple of times, then used the contact button to report the issue. I received a nice reply saying that the issue is on their side, that they are working on it, and that they will let me know when they know more. A good response.
    One mol at a time will not serve my deposition needs. There are so many other options that could make this more efficient, I would wish for some ability to add more at once.

    PS: I first tried to post this comment ~10 hrs ago and failed twice during identity authorisation (two different methods). Not only structure deposition has its hiccups!

  2. With regards to this comment:
    I did NOT test spectral deposition but am concerned that the request seems to be for original data. In binary vendor file format? Uh-oh. That’s not a good idea!

    If the choice is to offer one or the other as a first step, asking for the raw (original) data is probably the best choice. From our experience with a workflow using mzML, it is often best for us to have the raw data to validate and do the conversion ourselves, with the converter we know and trust. That way one avoids potential losses if the user chose a different converter that lost information. Plus you still have the raw data if converters get better in the future. That being said, there are still some cases where specific vendors offer better conversion options or data quality within their vendor software. Thus, the ability to upload either raw or open (or both) would be beneficial. Requiring both by default is a burden and will put some people off.

  3. Tony (and Emma), thanks for the comprehensive feedback. As you alluded to, it is a pilot system with only basic features. Community feedback will help us to prioritise which features to add in the future.

    We’re working on the deposition problems.

    Mark Archibald (Royal Society of Chemistry)
  4. Hi Mark, it would be great if you would also allow a user to add a comment with the upload, in addition to the DOI or a spectrum as a file. I would have liked to be able to add a URL pointing to more resources for the compound I wished to upload, but there was no place for this. Sometimes we as users also need to communicate something with the deposition. Furthermore, please add the ability to select multiple files at once for the spectra. MS/MS is best communicated with multiple collision energies and thus multiple files…
