Tag Archives: Open Notebook Science

Beautifying Data in the Real World: Beautiful Data and O’Reilly

I was recently privileged to co-author a book chapter entitled “Beautifying Data in the Real World” in a book called Beautiful Data, available shortly from O’Reilly. The list of authors is likely known to readers interested in Open Data and Open Notebook Science: Jean-Claude Bradley, Rajarshi Guha, Andrew Lang, Pierre Lindenbaum, Cameron Neylon, Egon Willighagen and myself.

This was a great example of “distant collaboration”. We never got on the phone to talk about the manuscript, and we never connected via a conferencing system. Cameron brought us together as a group of individuals interested in contributing to a chapter about the work we’d done together on crowdsourced solubility measurements and the handling of that data. We collaborated via a wiki, with a few emails here and there. I believe the result speaks for itself: an excellent article on “Beautifying Data in the Real World”.

The book can be pre-ordered here. I’ve browsed through some of the articles already and I don’t think you’ll be disappointed by the contents. The content is diverse: “Join 39 contributors as they explain how they developed simple and elegant solutions on projects ranging from the Mars lander to a Radiohead video.”


Posted by on June 14, 2009 in Book Reviews



An Invitation to Collaborate on Open Notebook Science for an NMR Study

Peter Murray-Rust and Henry Rzepa have started an Open Notebook Science project around calculating NMR spectra.

In Peter’s own words:

“We are starting an experiment on Open Notebook Science. <..> ONS seems to be the generally agreed term for scientific endeavour where the experiments are rapidly posted in public view, possibly before being exhaustively checked. It takes bravery as it isn’t fun if you goof publicly.”
“The recent controversy over hexacyclinol – where a published structure seems to be “wrong” – has sparked one good development – the realisation that high-quality QM calculations can predict experimental data well enough to show whether the published structure is “correct”.
We’re now starting to do this for NMR spectra. Henry Rzepa has taken Scott Rychnovsky’s methods for calculating 13C spectra and refined the protocol. Christoph Steinbeck has helped us get 20,000 spectra from NMRShiftDB and Nick Day (of crystalEye fame) has amended the protocols so we can run hundreds of jobs per day.”

I commented on Peter’s post: “My intuition is that the HOSE-code approach, the neural-network approach and the LSR approach will outperform the GIAO approach. These approaches would certainly be much faster, I believe. It would be good to compare the outcome of your studies with these other prediction algorithms, and if the data is open then it will make for a good study.”

Peter comments: “I am a supporter of the HOSE code and NN approach, but I have also been impressed with the GIAO method. The time taken is relatively unimportant. A 20-atom molecule takes a day or so; smaller ones are faster. We can run 100 jobs a day – so 30,000 a month. That’s larger than NMRShiftDB.”

He also comments: “I have extended George Whitesides’ idea of writing papers – that in doing research one should write the final paper first (and then, of course, modify it as you go along). So here’s what we will have accomplished (please correct mistakes).”

The abstract is below. Rather than correct mistakes I have added a paragraph (NON-bolded). I believe this project offers an opportunity to build a bridge between the Open Data community, the academic community and the commercial software community, for the benefit of science. There has never been a study of the magnitude discussed here comparing quantum-mechanical NMR prediction methods with the methods represented by commercial software products. I look forward to it!

PMR> “We adapted Rychnovsky’s method of calculating 13C NMR shifts by adding (XXX) basis set and functionals (Henry has done this). We extracted 1234 spectra with predicted 3D geometries for rigid molecules in NMRShiftDB (no acyclic-acyclic bond nodes for heavy atoms). Molecules had <= 21 heavy atoms (<= Cl). These were optimised using Gaussian XXX and the isotropic magnetic tensors calculated using correction for the known solvent. The shift was subtracted from the calculated TMS shift (in the same solvent) and the predicted shift compared with the observed.
Initially the RMS deviation was xxx. This was due to a small number of structures where there appeared to be gross errors of assignment. These were exposed to the community who agreed that these should be removed. The RMS dropped to yyy. The largest deviations were then due to Y-C-X systems, where a correction was applied (with theoretical backing). The RMS then dropped to zzz. The main outliers then appeared to be from laboratory AAA to whom we wrote and they agreed that their output format introduced systematic errors. They have now corrected this. The RMS was now zzz. The deviations were analysed by standard chemoinformatics methods and were found to correlate with the XC(=Z)Y group which probably has two conformations. A conformational analysis of the system was undertaken for any system with this group and the contributions from different conformers averaged. The RMS now dropped to vvv.
This established a protocol for predicting NMR spectra to 99.3% confidence. We then applied this to spectra published in 2007 in major chemical journals. We found that aa% of spectra appeared to be misassigned, and that bb% of suggested structures were “wrong” – i.e. the reported chemical shifts did not fit the reported spectra values.

    The results from the GIAO calculations were compared with three other prediction approaches provided by Advanced Chemistry Development. These algorithms were not limited in the number of heavy atoms that could be handled. The algorithms were a HOSE-code-based approach, a neural-network approach and an “increment approach”. A distinct advantage of these approaches is the prediction time relative to the quantum-mechanical calculations. The QM calculations took a number of weeks to perform on the dataset of 23,475 structures on a cluster of computers. On a standard PC, however, the HOSE-code-based predictions were performed in a few hours, the neural-network predictions in about 4 minutes and the increment-based predictions in less than 3 minutes.
    A comparison of the approaches gave statistics for the non-QM approaches superior to those of the QM approach. Outliers were observed in both cases and were traced to misassignments. The QM approach was generally more capable of predicting exotic structures, while for the majority of NMRShiftDB, made up of general organic chemicals, the non-QM approaches were superior.

We argue that if spectra and compounds were published in CMLSpect in the supplemental data, it would be possible for reviewers and editors to check their “correctness” on receipt of the manuscript. We wrote to all major editors. aa% agreed this was a good idea and asked us to help. bb% said they had no plans and that the community liked things the way they are. cc% said that if we extracted data they would sue us. dd% failed to reply.
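The core of the protocol quoted above – converting calculated isotropic shieldings to chemical shifts via a TMS reference computed at the same level of theory, then scoring against observed values by RMS deviation – can be sketched in a few lines. This is only an illustrative sketch; the numeric values and function names below are hypothetical and are not taken from the actual Gaussian workflow.

```python
import math

def predicted_shifts(sigma_tms, sigma_calc):
    """Convert calculated isotropic shieldings to chemical shifts (ppm).

    The predicted shift for each carbon is the TMS reference shielding
    (computed with the same basis set, functional and solvent correction)
    minus the shielding calculated for that atom.
    """
    return [sigma_tms - s for s in sigma_calc]

def rms_deviation(predicted, observed):
    """Root-mean-square deviation between predicted and observed shifts (ppm)."""
    return math.sqrt(
        sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)
    )

# Hypothetical values (ppm): TMS shielding and per-carbon calculated shieldings.
sigma_tms = 189.7
sigma_calc = [60.3, 115.2, 168.9]
observed = [128.5, 77.0, 21.4]

pred = predicted_shifts(sigma_tms, sigma_calc)
print(rms_deviation(pred, observed))
```

In the study described above, this RMS figure is the quantity that drops at each stage (the xxx, yyy, zzz placeholders) as misassignments and systematic errors are removed from the dataset.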

This has the potential to be a very exciting project. While I wouldn’t write the paper myself without doing the work, I’ll certainly try the approach. Let’s see what the truth is. The challenge now is to agree on how to compare the performance of the algorithms. We are comparing very different beasts with the QM vs. non-QM approaches, so in many ways this should be much easier than the challenges discussed so far around comparing non-QM approaches between vendors.
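Whatever comparison protocol is agreed on, once each method’s predictions are on a common footing the bookkeeping reduces to per-method error statistics plus a pass over the large deviations, which (as the draft abstract notes) usually trace back to misassigned spectra rather than to the algorithms. A minimal sketch of that bookkeeping, with made-up method names and numbers, might look like:

```python
from statistics import mean

def error_stats(predicted, observed, outlier_ppm=10.0):
    """RMSD, mean absolute error, and flagged outliers for one method.

    Deviations larger than outlier_ppm are flagged for manual review
    as likely misassignments rather than prediction failures.
    """
    diffs = [p - o for p, o in zip(predicted, observed)]
    rmsd = mean(d * d for d in diffs) ** 0.5
    mae = mean(abs(d) for d in diffs)
    outliers = [i for i, d in enumerate(diffs) if abs(d) > outlier_ppm]
    return rmsd, mae, outliers

# Hypothetical predictions (ppm) from two methods for the same observed shifts.
observed = [128.5, 77.0, 21.4, 170.2]
methods = {
    "GIAO": [130.1, 75.8, 22.9, 155.0],  # last value is a deliberate outlier
    "HOSE": [128.9, 77.3, 21.0, 169.5],
}
for name, pred in methods.items():
    rmsd, mae, outliers = error_stats(pred, observed)
    print(name, round(rmsd, 2), round(mae, 2), outliers)
```

Reporting RMSD alongside MAE matters here: RMSD is dominated by the outliers, so the two statistics diverge sharply before the misassignments are removed and converge afterwards.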


Posted by on October 18, 2007 in ChemSpider Chemistry, Community Building

