The UK National Chemical Database Service (available here) has been online since 2012. When I worked at RSC I was intimately involved in writing the technical response to the EPSRC call for the service and, in this blog, I outlined many of our intentions for the project. A key part of the project, from my point of view, was to deliver a repository to store structures, spectra, reactions, CIF files, etc., as I outlined in the blog post:
“Our intention is to allow the repository to host data including chemicals, syntheses, property data, analytical data and various other types of chemistry related data. The details of this will be scoped out with the user-community, prioritized and delivered to the best of our abilities during the lifetime of the tender. With storage of structured data comes the ability to generate models, to deliver reference data as the community contributes to its validation, and to integrate and disseminate the data, as allowed by both licensing and technology, to a growing internet of the chemical sciences.”
In March 2014 at the ACS Meeting in Dallas I presented on our progress towards providing the repository (see this slide deck). ChemSpider has been online for over ten years; we were accepting structure depositions within the first 3 months and spectra a few weeks later (see blogpost). The ability to deposit structures as molfiles or SDF files has been available on ChemSpider for a long time, and we delivered the ability to validate and standardize using the CVSP platform (http://cvsp.chemspider.com/), which we submitted for publication three years ago (October 28th, 2014) and which is published here: https://jcheminf.springeropen.com/articles/10.1186/s13321-015-0072-8. With structure and spectra deposition in place for over a decade, a validation and standardization platform made public three years ago, and a lot of experience with depositing data onto ChemSpider, all the building blocks for the repository have been in place.
Today I received an email announcing “Compound and Spectra Deposition into ChemSpider”. I read it with interest, as I guessed it meant deposition was “going mainstream” in some way, having been around for a decade as a capability. Refactoring should be a constant for any mature platform, so my expectation was that this would show a more seamless process for depositing various types of data, a more beautiful interface, and new whizz-bang visualization widgets building on a decade of legacy development. I expected it to take the best of what we built for data registration, structure validation and standardization (and all of its lessons!) and to include rebuilds of some of the spectral display components that we had. It’s not quite what I found when I tested it.
Here’s my review.
My expectation was that I could go to http://deposit.chemspider.com and deposit data to ChemSpider. The website is simply a blue button with “Log in with your ORCID”. There is language recognizing that the OpenPHACTS project funded the validation and standardization platform work, which is definitely appropriate, but some MORE GUIDANCE as to what the site is would be good!
“Validation and standardisation of the chemical structures was developed as part of the Open PHACTS project and received support from the Innovative Medicines Initiative Joint Undertaking under grant agreement no. 115191, resources of which are composed of financial contribution from the European Union’s Seventh Framework Programme (FP7/2007-2013) and EFPIA companies’ in-kind contribution.”
This means that it should be possible to deposit a molfile, have it checked (validated) and standardized then deposited into ChemSpider, having passed through CVSP. So what happened?
I downloaded the structure of Chlorothalonil from our dashboard and loaded it. The result is shown below. The structure was standardized and correctly recognized as a V3000 molfile. The original structure was not visible, there were no errors or warnings and the structure DID standardize.
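As an aside on the V2000/V3000 distinction mentioned above: in the CTfile format the version tag sits at the end of the counts line, which is the fourth line of a molfile. The snippet below is a minimal sketch of that check only; the function name `molfile_version` and the sample strings are my own illustrations, and I am not suggesting this is how the deposition site or CVSP does it.

```python
def molfile_version(molfile_text: str) -> str:
    """Return 'V2000', 'V3000', or 'unknown' based on the counts line."""
    lines = molfile_text.splitlines()
    # A molfile starts with three header lines; the counts line is the fourth.
    if len(lines) < 4:
        return "unknown"
    counts_line = lines[3].rstrip()
    for version in ("V2000", "V3000"):
        if counts_line.endswith(version):
            return version
    return "unknown"


# Hypothetical, heavily truncated sample files (illustration only).
V2000_SAMPLE = "\n  example\n\n  1  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n"
V3000_SAMPLE = "\n  example\n\n  0  0  0     0  0            999 V3000\nM  END\n"

print(molfile_version(V2000_SAMPLE))
print(molfile_version(V3000_SAMPLE))
```

A check like this is handy when testing a deposition system, because it lets you confirm that what the site reports matches the file you actually uploaded.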
Deposition into ChemSpider failed with an “Oops”.
Next I tried a structure from ChemSpider, because if the structures are going INTO ChemSpider then I should be able to load one that comes FROM ChemSpider. I wanted to get something fun so grabbed one of the many Taxol-related structures. There are 61 Taxol-related structures in total. I downloaded the version with multiple C13-labels. It looked like this:
When I uploaded this, a V2000 molfile, the result was as shown below.
The original isotope labels were removed, the layout was recognized as congested and partially defined stereo was recognized. But it wouldn’t deposit. I tried many others and none of them would deposit. I was going to give up but tried Benzene, V2000, downloaded from ChemSpider. And….YAY….it went in. The result is below.
A unique DOI is issued to the record, associated with my name. It is NOT deposited into ChemSpider as far as I can tell, because the structure is already in ChemSpider. There is also no link from ChemSpider back to my deposition that I can find. My next try was to find a chemical NOT in ChemSpider and to deposit that. That failed. I tried Benzene again and it worked a second time. I judged that maybe a simple alkyl chain would work for deposition. The result is below.
The warning “Contains completely undefined stereo: mixtures” does not make sense at all for this chemical. PLUS it wouldn’t deposit.
I then tried to register a sugar as a projection with the result shown below. I consider this one to have some real errors and do not AT ALL like the standardized version.
I tried a simple inorganic. I think KCl should be recognized as an ionic compound, K+Cl-, or should at least trigger SOME warning!?
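To make concrete the kind of warning I have in mind for KCl, here is a minimal sketch that reads the fixed-width atom and bond blocks of a V2000 molfile and flags any bond drawn between an alkali metal and a halogen. Everything here is hypothetical: the function name, the element sets, the warning text and the hand-built KCl molfile are my own, and this is not a CVSP rule.

```python
ALKALI_METALS = {"Li", "Na", "K", "Rb", "Cs"}
HALOGENS = {"F", "Cl", "Br", "I"}


def covalent_salt_warnings(molfile_text: str) -> list:
    """Flag bonds between an alkali metal and a halogen in a V2000 molfile."""
    lines = molfile_text.splitlines()
    counts = lines[3]  # counts line: atom count in columns 1-3, bond count in 4-6
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atom_lines = lines[4:4 + n_atoms]
    bond_lines = lines[4 + n_atoms:4 + n_atoms + n_bonds]
    # The element symbol occupies columns 32-34 of each atom line.
    symbols = [line[31:34].strip() for line in atom_lines]
    warnings = []
    for line in bond_lines:
        first, second = int(line[0:3]), int(line[3:6])  # 1-based atom numbers
        pair = {symbols[first - 1], symbols[second - 1]}
        if pair & ALKALI_METALS and pair & HALOGENS:
            warnings.append(
                f"Bond {first}-{second} joins {symbols[first - 1]} and "
                f"{symbols[second - 1]}: consider an ionic representation"
            )
    return warnings


# Hand-built KCl molfile drawn with a single covalent bond (illustration only).
KCL_MOLFILE = "\n".join([
    "",
    "  hypothetical example",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 K   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.5000    0.0000    0.0000 Cl  0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "M  END",
])

for message in covalent_salt_warnings(KCL_MOLFILE):
    print(message)
```

A real validation rule would need to cover far more chemistry than this (other metals, charges already present, V3000 input), but even a check this crude would have told me something about my KCl.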
The testing I did took about an hour overall. I identified a LOT of issues. I think this release, while it may be a beta release for feedback, is way premature and needs a lot more testing. I am hopeful that more people will fully test the platform, as the ABILITY to deposit data, get a DOI, and associate it with your ORCID account is valuable. However, it’s not obvious that anything is linked back to ORCID, which appears to be used for nothing more than login.
I did NOT test spectral deposition but am concerned that the request seems to be for original data. In binary vendor file format? Uh-oh. That’s not a good idea!
I hope this blog motivates the community to test, give feedback and push the deposition system to deal with complex chemistries, so that at least the boundary conditions of performance for Deposit.ChemSpider.Com can be defined, the system can be improved and a community can be built around the functionality. At present it appears to be more a matter of writing a chemical to some other repository, as there is no real connection to ChemSpider that I can find (?).
Building public domain chemistry databases is hard work. User feedback and guidance are essential. Please test the system and give your feedback.