Previously I blogged about “An Invitation to Collaborate on Open Notebook Science for an NMR Study”. I judged it was a great opportunity to “help build a bridge between the Open Data community, the academic community and the commercial software community for the benefit of science.” In particular, I believed the project offered an opportunity to answer a longstanding question of mine. In recent years I have seen a lot of publications utilizing complex, time-consuming GIAO NMR predictions. Having been involved with the development of NMR prediction algorithms for the past few years (while working with the scientists at ACD/Labs), my judgment is that these complex calculations can be replaced by calculations that take just a couple of seconds on a standard PC. I believe this to be true for most organic molecules. I do not believe such calculations would outperform GIAO predictions for inorganic molecules, organometallic complexes or solid-state shift tensors. However, there has never been a rigorous examination comparing the performance of the two approaches. I believed this project offered an excellent opportunity to test the hypothesis that HOSE code/Neural Network/Increment based predictions could, in general, outperform GIAO predictions.
The study was to be performed on the NMRShiftDB now available on ChemSpider. I’ve blogged previously about the validation of the database (1,2). The conversation about the NMR project has continued, and Peter has talked about some of the challenges of Open Notebook Science based on Cameron Neylon’s comments. I’ve posted the comments below to the post and they will likely be moderated in shortly. I post them here by way of conclusion, since I don’t think my original hopes will come to fruition. Thanks to those of you who have been engaged both on and off blog. I suggest we all help with Peter’s intention to help explain identifiers that are being extracted in the work.
“Can you provide some more details regarding your concerns here: “it would be possible for someone to replicate the whole work in a day and submit it for publication (on the same day) and ostensibly legitimately claim that they had done this independently. They might, of course use a slightly different data set, and slightly different tweaks.”
I have two interpretations:
1) Someone could repeat the GIAO calculations in a day and identify outliers and submit for publication
2) Someone could do the calculations using other algorithms and identify outliers etc and submit for publication
Maybe you mean something else?
For 1) the GIAO calculations CANNOT be repeated since no one has access to Henry
For 2) using other algorithms on the same dataset is valid and appropriate science. This is what people do with logP prediction (or MANY other parameters)..they validate their algorithms on the same dataset many times over. It’s one of the most common activities in the QSAR and modeling world in my opinion. And people do use slightly different tweaks…it
Returning to the comment “it would be possible for someone to replicate the whole work in a day and submit it for publication (on the same day) and ostensibly legitimately claim that they had done this independently.”
Wolfgang Robien has taken the NMRShiftDB dataset and performed an analysis. It
This was heated and opinionated for sure. STRONG scientific wills and GREAT scientists defending their approaches and performance. Wolfgang is NOT an enemy of ACD/Labs…he has made some of the greatest contributions to the domain of NMR prediction and, in many ways, has been one to emulate in terms of his approach to quality and innovation to create breakthroughs in performance. He is a worthy colleague and drives improvement by his ongoing search for improvements in his own algorithms. I honor him.
The bottom line is this: approaches for the identification of outliers in NMRShiftDB have been DONE already. It
1) To perform Open Notebook Science
2) quote “To show that the philosophy works, that the method works, and that NMRShiftDB has a measurable high-quality.”
1) has already changed and is an appropriate outcome from the work (http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=743).
2) The method of NMR prediction applied to NMRShiftDB to prove quality..high or not…has been done already. Wolfgang and ACD/Labs did it already. I judge you
Stated here http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=737 is “We shall continue on the project, one of whose purposes is to investigate the hypothesis that QM calculations can be used to evaluate the quality of NMR spectra to a useful level.” It
1) Using NMR predictions to identify outliers – already done (Robien and ACD/Labs)
2) Validating that GIAO predictions are useful to validate structures – already done (hexacyclinol study)
3) Validating the quality of NMRShiftDB – already done (Robien, ACD/Labs)
All this brings me down to what I “think” are the intentions or outcomes for the project at this point..but I likely have missed something..
1) Identify more outliers that were not identified by the studies of others
2) Deliver back to Christoph and the NMRShiftDB team a list of outliers/concerns/errors with annotations/metadata in order to improve the Open Data source of NMRShiftDB
3) Allow Nick Day to use a lot of what was learned delivering CrystalEye for a second application around NMR and useful for his thesis (A VERY valid goal..good luck Nick)
4) Show the power of blogging to drive Collaboration via Open Collaborative NMR
Some additional project deliverables I think include:
1) Make online GIAO NMR predictions available
The project deliverables you are working on are defined here and I believe are consistent: http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=742
* create a small subset of NMRShiftDB which has been freed from the main errors we – and hopefully the community – can identify.
* Use this to estimate the precision and variance of our QM-based protocol for calculating shifts.
* refine the protocol in the light of variance which can be scientifically explained.
What I still would like to see, BUT this project belongs to you/Henry/Nick of course and you define what it is, is:
1) to “help build a bridge between the Open Data community, the academic community and the commercial software community for the benefit of science.” Wolfgang is in academia, so are you, ACD/Labs is commercial and I
2) To validate the performance of GIAO vs HOSE/NN/Inc by providing the final dataset that you used and statistics of performance for GIAO on that dataset. I
3) To identify where GIAO can outperform the HOSE/NN/Inc approaches
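To make the comparison I am asking for concrete, here is a minimal sketch of the kind of statistics I have in mind: mean absolute error, RMSD and largest error for each method against the experimental shifts, plus a simple threshold-based outlier list. The function names, the 5 ppm threshold and all shift values are hypothetical and are there only for illustration; they are not taken from NMRShiftDB, from the GIAO protocol, or from any ACD/Labs or Robien analysis.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import mean


@dataclass
class ShiftStats:
    mae: float            # mean absolute error, ppm
    rmsd: float           # root-mean-square deviation, ppm
    max_abs_error: float  # largest single deviation, ppm


def shift_stats(experimental, predicted):
    """Summarise how far predicted shifts deviate from experimental ones."""
    if len(experimental) != len(predicted) or not experimental:
        raise ValueError("need two non-empty lists of equal length")
    errors = [p - e for e, p in zip(experimental, predicted)]
    abs_errors = [abs(x) for x in errors]
    return ShiftStats(
        mae=mean(abs_errors),
        rmsd=sqrt(mean([x * x for x in errors])),
        max_abs_error=max(abs_errors),
    )


def flag_outliers(experimental, predicted, threshold_ppm):
    """Return indices of shifts whose absolute error exceeds the threshold."""
    return [
        i
        for i, (e, p) in enumerate(zip(experimental, predicted))
        if abs(p - e) > threshold_ppm
    ]


if __name__ == "__main__":
    # Hypothetical 13C shifts (ppm) for one structure; illustrative only.
    experimental = [128.5, 77.2, 21.0, 170.3]
    predictions = {
        "GIAO": [130.1, 79.0, 22.5, 168.0],
        "HOSE/NN/Inc": [128.9, 76.8, 21.3, 185.0],
    }
    for method, predicted in predictions.items():
        stats = shift_stats(experimental, predicted)
        suspects = flag_outliers(experimental, predicted, threshold_ppm=5.0)
        print(method, stats, "suspect shift indices:", suspects)
```

A shift flagged by one method but not by the other is exactly the sort of case worth a second look, since it may point to a misassignment in the database or to a weakness in that prediction method.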
Wolfgang also has thoughts based on http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=742#comment-63089 where he says “What would be great to the scientific community: Do calculations on compounds where sophisticated NMR-techniques either fail or are very difficult to perform – e.g. proton-poor compounds or simply ask for a list of compounds which are really suspicious (either the structure is wrong or the assignment is strange, but the puzzle can’t be solved, because the compound is not available for additional measurements).