Tag Archives: Peter Murray-Rust

An Invitation to Collaborate on Open Notebook Science for an NMR Study

Peter Murray-Rust and Henry Rzepa have started an Open Notebook Science project on calculating NMR spectra.

In Peter’s own words:

“We are starting an experiment on Open Notebook Science. <..> ONS seems to be the generally agreed term for scientific endeavour where the experiments are rapidly posted in public view, possibly before being exhaustively checked. It takes bravery as it isn’t fun if you goof publicly.”
“The recent controversy over hexacyclinol – where a published structure seems to be “wrong” – has sparked one good development – the realisation that high-quality QM calculations can predict experimental data well enough to show whether the published structure is “correct”.
We’re now starting to do this for NMR spectra. Henry Rzepa has taken Scott Rychnovsky’s methods for calculating 13C spectra and refined the protocol. Christoph Steinbeck has helped us get 20,000 spectra from NMRShiftDB and Nick Day (of crystalEye fame) has amended the protocols so we can run hundreds of jobs per day.”

I commented on Peter’s post: “My intuition is that the HOSE-code approach, neural network approach and LSR approach will outperform the GIAO approach. Certainly these approaches would be much faster I believe. It would be good to compare the outcome of your studies with these other prediction algorithms and if the data is open then it will make for a good study.”

Peter comments “I am a supporter of the HOSE code and NN approach, but I have also been impressed with the GIAO method. The time taken is relatively unimportant. A 20-atom molecule takes a day or so, smaller ones are faster. We can run 100 jobs a day – so 30,000 a month. That’s larger than NMRShiftDB.”

He also comments “I have extended George Whitesides’ ideas of writing papers that in doing research one should write the final paper first (and then of course modify it as you go along). So here’s what we will have accomplished (please correct mistakes)”.

The abstract is below. Rather than correct mistakes, I have added a paragraph (the NON-bolded, indented text). I believe this project offers an opportunity to build a bridge between the Open Data community, the academic community and the commercial software community for the benefit of science. There has never been a study of the magnitude being discussed here comparing quantum-mechanical NMR prediction methods with the methods represented by commercial software products. I look forward to it!

PMR> “We adapted Rychnovsky’s method of calculating 13C NMR shifts by adding (XXX) basis set and functionals (Henry has done this). We extracted 1234 spectra with predicted 3D geometries for rigid molecules in NMRShiftDB (no acyclic-acyclic bond nodes for heavy atoms). Molecules had <= 21 heavy atoms (<= Cl). These were optimised using Gaussian XXX and the isotropic magnetic tensors calculated using correction for the known solvent. The shift was subtracted from the calculated TMS shift (in the same solvent) and the predicted shift compared with the observed.
Initially the RMS deviation was xxx. This was due to a small number of structures where there appeared to be gross errors of assignment. These were exposed to the community who agreed that these should be removed. The RMS dropped to yyy. The largest deviations were then due to Y-C-X systems, where a correction was applied (with theoretical backing). The RMS then dropped to zzz. The main outliers then appeared to be from laboratory AAA to whom we wrote and they agreed that their output format introduced systematic errors. They have now corrected this. The RMS was now zzz. The deviations were analysed by standard chemoinformatics methods and were found to correlate with the XC(=Z)Y group which probably has two conformations. A conformational analysis of the system was undertaken for any system with this group and the contributions from different conformers averaged. The RMS now dropped to vvv.
This established a protocol for predicting NMR spectra to 99.3% confidence. We then applied this to spectra published in 2007 in major chemical journals. We found that aa% of spectra appeared to be misassigned, and that bb% of suggested structures were “wrong” – i.e. the reported chemical shifts did not fit the reported spectra values.

    The results from the GIAO calculations were compared with three other prediction approaches provided by Advanced Chemistry Development. These algorithms were not limited in the number of heavy atoms they could handle. The algorithms were a HOSE-code based approach, a neural network approach and an “increment approach”. A distinct advantage of these approaches is their prediction time relative to the quantum-mechanical calculations. The QM calculation took a number of weeks to perform on the dataset of 23475 structures on a cluster of computers. However, a standard PC enabled the HOSE code based predictions to be performed in a few hours, the Neural Net predictions in about 4 minutes and the Increment based predictions in less than 3 minutes.
    A comparison of the approaches gave statistics for the non-QM approaches superior to those of the QM approach. Outliers were observed in both cases and were traced to misassignments. QM approaches were generally more capable of predicting exotic structures, while for the majority of the NMRShiftDB, which is made up of general organic chemicals, non-QM approaches were superior.

We argue that if spectra and compounds were published in CMLSpect in the supplemental data it would be possible for reviewers and editors to check the “correctness” on receipt of the manuscript. We wrote to all major editors. aa% agreed this was a good idea and asked us to help. bb% said they had no plans and the community liked things the way it is. cc% said that if we extracted data they would sue us. dd% failed to reply.
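
The referencing step in the abstract (subtracting each computed shielding from the computed TMS shielding, then comparing predicted and observed shifts by RMS deviation) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names and every number are hypothetical, and nothing here comes from the actual Gaussian or NMRShiftDB workflow.

```python
from math import sqrt

def predicted_shift(sigma_iso, sigma_tms):
    """Predicted shift (ppm) = computed TMS shielding - computed nucleus shielding."""
    return sigma_tms - sigma_iso

def rms_deviation(predicted, observed):
    """Root-mean-square deviation between predicted and observed shifts (ppm)."""
    if not predicted or len(predicted) != len(observed):
        raise ValueError("need two non-empty lists of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return sqrt(sum(squared) / len(squared))

# Hypothetical values, chosen only to make the arithmetic easy to follow.
SIGMA_TMS = 190.0                   # assumed 13C shielding of TMS (ppm)
shieldings = [160.0, 60.0, 20.0]    # assumed computed isotropic shieldings (ppm)
observed = [29.0, 131.0, 171.0]     # assumed experimental shifts (ppm)

predicted = [predicted_shift(s, SIGMA_TMS) for s in shieldings]
rms = rms_deviation(predicted, observed)
print(predicted, rms)
```

Each refinement step in the abstract (removing misassigned structures, applying a correction, averaging conformers) would then amount to recomputing the same RMS figure on the revised data.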

This has the potential to be a very exciting project. While I wouldn’t write the paper myself without doing the work I’ll certainly try the approach. Let’s see what the truth is. The challenge now is to get to agreement on how to compare the performance of the algorithms. We are comparing very different beasts with the QM vs. non-QM approaches so, in many ways, this should be much easier than the challenges discussed so far around comparing non-QM approaches between vendors.
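
One possible way to put different predictors on the same footing is to score them only on the atoms that every method predicted and that have an observed shift, and to report several error measures together. The sketch below assumes a simple dictionary layout keyed by atom label; the method names, labels and numbers are made up for illustration and say nothing about how any real algorithm performs.

```python
from math import sqrt

def error_stats(predicted, observed):
    """Return (mean absolute error, RMS deviation, max absolute error) in ppm."""
    errors = [abs(predicted[atom] - observed[atom]) for atom in predicted]
    n = len(errors)
    mae = sum(errors) / n
    rms = sqrt(sum(e * e for e in errors) / n)
    return mae, rms, max(errors)

def compare_methods(predictions, observed):
    """Score every method on the atoms common to all methods and to the observed set."""
    common = set(observed)
    for shifts in predictions.values():
        common &= set(shifts)
    if not common:
        raise ValueError("no atoms are shared by all methods")
    return {
        name: error_stats({atom: shifts[atom] for atom in common}, observed)
        for name, shifts in predictions.items()
    }

# Hypothetical data: atom label -> 13C shift in ppm.
observed = {"C1": 20.0, "C2": 130.0, "C3": 170.0}
predictions = {
    "method_A": {"C1": 21.0, "C2": 128.0, "C3": 170.0},
    "method_B": {"C1": 20.0, "C2": 131.0},  # no prediction for C3
}

results = compare_methods(predictions, observed)
for name, (mae, rms, worst) in results.items():
    print(f"{name}: MAE={mae:.2f} RMS={rms:.2f} max={worst:.2f}")
```

Restricting the comparison to the shared atoms keeps a method from looking better simply because it declined to predict the hard cases.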


Posted on October 18, 2007 in ChemSpider Chemistry, Community Building



Another Response to Constructive Feedback from Peter Murray-Rust…

Since ChemSpider went live in March of this year we have received a lot of feedback and questions regarding our understanding of science, our purpose and our passions. We have an excellent Advisory Group who participate in dialogs and constructive discussions. Much of the feedback we have received has been from one individual, Peter Murray-Rust (PMR).

Before proceeding with this post I want to clarify my perceptions. I believe PMR brings a lot of value to the Chemistry Blogosphere. Over the past decade I have watched Peter’s activities with interest as he has participated with many other evangelists to pursue the cause of ODOSOS (Open Data, Open Source and Open Standards). Over the years I will confess to a level of hero-worship. I have enjoyed watching what he was doing to enable the web for chemists. He is prolific; I don’t know where he finds the time to write so much. He travels the world and informs us all of what is going on “out there”. He does a great service. In contrast to these positive traits, which I honor, I am of the opinion that Peter is overly harsh and judgmental in some cases. Often he posts without the necessary research and his perceptions become the “truth”. This is dangerous when he has such a public profile and such influence. For evidence of influence visit the graph here and notice the incredible spike in traffic resulting from his post about the Monkeys at ChemZoo in April of this year. It is unlikely those visitors ever returned to our site or blog to hear our comments. Potential damage was done. This blog post concerns his most recent judgments of ChemSpider.

When ChemSpider was set up for the benefit of the chemistry community I had assumed that this humble effort by a small group of dedicated individuals would be welcomed by PMR and other Open Access advocates. In general I believe that’s true. Our actions, policies and status have drawn a significant amount of feedback from PMR on his blog. New feedback was posted late last week and I’ll get to that shortly. As a review, in keeping with the trend being set by Rich Apodaca (1,2,3), I am listing what’s happened to date.

“Constructive Feedback” for Newbies

The Challenge to ChemSpider Chemistry

When Sodium chloride dimers are bad science..but are on NIST Webbook and PubChem

Calcium Carbonate is not soluble and can’t have a logP PLUS Lipinski says Calcium Carbonate CAN have a logP

Prussian Blue on ChemSpider is Terrible…but still as good as Pubchem and Emolecules.

Is Stereochemistry on Taxol important? Should the public data be curated?

ChemSpider VERSUS PubChem or ChemSpider SUPPORTS PubChem

ChemSpider ripped off PubChem…damn them.

ChemSpider and Their Openness and non-Web 2.0

ChemSpider don’t understand what Web 2.0 is.

ChemSpider contribute to the community…and support PubChem

Spectral Data are Declared Open Data

Helping out the community with Web Services

There are a lot more…and so to the latest. I’ll identify the recent post comments in italics.

PMR> Recently the Chemspider company has announced an “Open Chemistry Web” which in my opinion misuses the word “Open”.

Open Chemistry Web is the name of a new blog set up and hosted by Will Griffiths. It’s not ODOSOS. It’s a NAME of a blog. If we are in an environment where the name of a blog cannot include the word “Open” then we are living in sad times. Will’s passion is in text-mining OPEN ACCESS Chemistry Articles..or others if people will allow it. Can he not name his own blog? Hmmm….

PMR> and its associates are commercial organization which have aggregated a large number of chemical connection tables and have started by calculating their properties and extracting literature references which they make freely accessible but not Open. The freedom is for an unspecified timescale and you cannot download significant amounts of the data and you cannot re-use it without permission.

Yes we are “commercial”. I dealt with this same comment previously. If you have interest in this please browse it. A later post outlines the present status of the project and whether or not it will survive.

Yes, we have aggregated a large number of connection tables and have started by calculating their properties and extracting literature references, which we make freely accessible. We have done a lot more. We have made multiple services available to the community (1,2,3,4) but, unsurprisingly, have received no acknowledgment.

Regarding “not open“. We are giving away the ChemSpider database to those who ask for it. It will be published in PubChem. We USE Open Source components (1,2,3,4). We have not generated any Open Source components yet and our source code is not Open. We index Open Access articles on ChemRefer. We work with the Open Source data community to help.

Regarding “you cannot download significant amounts of the data and you cannot re-use it without permission“. We are giving away the ChemSpider database to those who ask for it. We do NOT have a server farm to support downloads. The FAQ page says

“May I download the data and use it in my own database(s)?

You have limited rights in this regard. You can only assemble a database of 5000 structures or less, and their associated properties, from our database without our permission. You can download up to 1000 structures per day from the website. Please contact us at feedbackATchemspiderDOTcom to request an extension outside this constraint. We are willing to provide the ENTIRE database of ChemSpider structures at your request – the file will consist of InChI Strings, InChIKeys and ChemSpider IDs. These constraints are under regular review so please feel free to engage us in conversation.”

PMR>”Initially I was concerned about the complete lack of quality in these calculations and said so – I believe there has been some improvement in quality but I do not check and do not intend to do so. I do not follow Chemspider regularly but they appear to have added the ability for anyone to add annotations and curation. I have serious concerns about the lack of thought given to metadata and I do not expect Chemspider to be able to scale or to compete against modern approaches.”

I acknowledge the judgments and opinions. A question…in terms of online data sources for chemistry I believe that approximately 20 million structures ranks in the top 3. We have about 1500 chemists per day using the site with thousands of transactions including text and structure/substructure searching. Please compare with other services in this domain and, if you do this, provide quantitative information. We welcome any feedback on metadata. We are presently working on RDF’ing ChemSpider thanks to the guidance and support of Egon Willighagen. I have dealt with the metadata discussion previously here and abstracted below.

“Other comments include “I see very little difference between Chemfinder and Chemspider. They are both closed, proprietary, do not expose data, or metadata, or algorithms; have closed code, do not allow downloads or re-use. They lose metadata in their aggregation process. I have nothing personal against Chemspider (or, if they are associated, ACDLabs) – I just think the Web 1.0 model is out of date for chemistry.”

To respond…yes, the code is proprietary and closed..we don’t know of any Open Source code that would quickly search >10 million structures by structure and substructure (that will be covered in a separate blog as I have the utmost respect for the commercial entities that do this well! It’s DIFFICULT.) Oh…but Open Source isn’t part of the Web 2.0 definition. We don’t expose algorithms…correct…many are provided by collaborators and we do not have the right to expose their code. But that isn’t part of Web 2.0 either.

And next…the beloved “metadata” term. What exactly IS metadata? Let’s refer again to our web-friendly Wikipedia regarding metadata. In brief it’s “data about data” and a perfect example is an XML schema vs XML. An XML schema is metadata. According to my interpretation this means InChI and SMILES are not metadata since these data can be interchanged with the structure itself. I may be wrong. The hypothetical entity describing what data can be bound to a structure would be metadata not necessarily data related somehow to the structure, but rather more general data describing the datamodel – for example the source of the data – this IS metadata. ChemSpider doesn’t lose the metadata…we retain the only metadata currently available, the data source, and use it as our link out to the provider. Our primary role again, for now, is to connect silos of information via chemical structures.”

PMR> Chemspider also encourages Uploading Spectra Onto ChemSpider. These spectra by default all belong to Chemspider. They are not Open. If you can convince the world at large to donate IPR to you for free, you deserve some form of congratulations for sheer bravado. Note that even if you upload data and metadata you are not allowed to download it (there is a limit of 100 structures).

Thanks, again, for the judgments. We have been testing out the system with two of our advisory group and myself. Only JC Bradley’s Lab and Bob Lancashire have deposited, and with the understanding, I believe, that the data would be “Open”. Since PMR’s blog posts continue to do damage to our reputation we have no choice but to respond. We do this with coding. Within 24 hours of his comments Open Data was declared and spectra can be downloaded. The intention was always there to do this; we just had higher priorities.

PMR>”We have ca. 250,000 calculations on molecules and 130,000 crystal structures which Chemspider have suggested we upload to them. I’m not yet sure why we should do this.”

Well, if they are Open Data, as marked at the CrystalEye website, and seeing as people would like to access the data via ChemSpider, we should just be able to download. But, we don’t want all the data..we just want the structures and the appropriate URL structure to link back to CrystalEye. This is what we do with all data sources including NMRShiftDB.

PMR>”Chemrefer appears to allow searching of Open chemistry articles by keyword. Unexceptional, but why shouldn’t we simply use Pubchem? AFAIK it will index all these journals.”

PubChem indexes these journals? No, I think it’s PubMed. We’ll check on whether everything ChemRefer indexes is in PubMed. However, what they don’t do, yet, or ever, is connect the chemical names in those journals to chemical structures. That’s what’s been done for patents.

“PMR> The IPR model of Chemspider seems clear. No data, metadata and author contributions are Open.


“PMR>That allows them, at some stage in the future to close some or all of the site and to charge for data and services”

The site, as it exists today, is intended to stay free for all. We may, OPENLY acknowledged, open services that are for charge.

“PMR> and – like eMolecules and their tie-up with Wiley (Wiley and eMolecules: unacceptable; an explanation would be welcome) – I predict this will happen within 5 years (unless Chemspider fails to survive in its current form).

I have posted on what I believe is an inappropriate judgment by Peter that the data on Chemgate is extracted from the journals. I put a trackback to Peter’s original post. He never responded. He did comment separately though about busyness and commenting. Unfortunately Wiley and Chemgate now show up again…with no effort to clean up the previous comments and, unfortunately, more incorrect information about ChemSpider.

“PMR> So all the authors who are contributing metadata are, in effect, donating IP to Chemspider. I have no moral objection to this – it just seems retrograde when we have Open collections of molecules such as PubChem and our own crystalEye.”

ChemSpider data will all go onto PubChem shortly. This was announced at the recent PubChem meeting. I have asked PMR to point me to where I can download the CrystalEye collection if it is indeed Open Data.

“PMR>But a number of my friends in the Open Chemistry area are on the Chemspider advisory board, so I must be missing something. Perhaps they can show how donating IP to a commercial closed company advances the cause of Open Chemistry.”

I hope they discuss with you. This group is a powerful team of intellect, capabilities, insight and support. I value the opportunity to work with them.

“PMR> And I applaud Chemspiderman’s efforts to clean up chemistry. Sometimes this gets muddled with the association with a commercial organisation based on possessing chemical IP so sometimes my messages have been less than generous and I apologized.”

Yes, you did. And I accept it willingly. It was very gentlemanly.

“PMR> I am not anti-capitalist – I do not attack companies per se. But I do attack people who use the word “Open” incorrectly and to promote themselves. I have done this when publishers come up with “Open Access” offerings which appear to be less than satisfactory ( see “open access products” at Nature obscures the debate, Why Open Access metrics are necessary) and for which the community has to pay. “Open” is now used by commercial organisations in the same way as “healthy” – please feel good about us and our activities as we use the word “Open”. We know it’s meaningless, but it makes us look good. Well, it isn’t meaningless. A number of people are trying carefully to describe what is meant by Open access, Open Data, Open source and Open Services. And when others use it to mean something less, I take exception. If nothing else it makes our job much harder.”

I will comment on this in a couple of later posts. I do not support the “marketing” use of Open and do not believe we are doing so. However, I want to comment more on this, but at a later date. Marketing statements bug me too. You’d think that “…the world’s most comprehensive openly accessible search engine for chemical structures” would be PubChem. But it’s not according to this marketing statement …who is it?

There have been comments about PubChem being the model of Openness. I think the effort is great. FULLY support it. But let’s wake up. If funding ceases then PubChem could go away. The data is Open. The software is NOT. PubChem is built around some home-built services and on top of commercial modules such as CACTVS and OpenEye. I discussed it here and it has not been challenged. Am I wrong?

“PMR>: There is nothing Open about this. Even the blog is not Open (it does not carry a CC licence). The services may be free, and they may be useful, but they are not Open. The text that they index may indeed be Open Access in its own right (and probably is because otherwise the publishers will sue them) but this is no especial credit to Chemrefer. We also index Open resources but we make our results Open. Chemrefer could disappear tomorrow. Only if the data, and the source code are made Openly available under licence can they be called Open.”

There is a CC license on the page. Peter acknowledged this. Who said the services were Open? If we did, point me to it and we will rectify it. I have asked Peter separately whether all articles linked to CrystalEye are Open Access or some with permission from the publishers. This is very interesting.

This has been a long post. I understand I have likely added fuel to the fire. I have done it in a public way. I judge that ChemSpider is being harmed by the ongoing misinformation. I wish it to stop. What I want is advice and support to make this a better service for our users. However, I refuse to make it my personal mission to satiate PMR’s requests and objectives. ChemSpider is developed for its users and the community in general, NOT for its non-users. PMR is not a user. Not everything has to be Open for it to be of high value. I believe we deliver value.


Posted on October 15, 2007 in ChemSpider Chemistry, Community Building

