
An Invitation to Collaborate on Open Notebook Science for an NMR Study

18 Oct

Peter Murray-Rust and Henry Rzepa have started an Open Notebook Science project on calculating NMR spectra.

In Peter’s own words:

“We are starting an experiment on Open Notebook Science. <..> ONS seems to be the generally agreed term for scientific endeavour where the experiments are rapidly posted in public view, possibly before being exhaustively checked. It takes bravery as it isn’t fun if you goof publicly.”
“The recent controversy over hexacyclinol – where a published structure seems to be “wrong” – has sparked one good development – the realisation that high-quality QM calculations can predict experimental data well enough to show whether the published structure is “correct”.
We’re now starting to do this for NMR spectra. Henry Rzepa has taken Scott Rychnovsky’s methods for calculating 13C spectra and refined the protocol. Christoph Steinbeck has helped us get 20,000 spectra from NMRShiftDB and Nick Day (of crystalEye fame) has amended the protocols so we can run hundreds of jobs per day.”

I commented on Peter’s post: “My intuition is that the HOSE-code approach, neural network approach and LSR approach will outperform the GIAO approach. Certainly these approaches would be much faster I believe. It would be good to compare the outcome of your studies with these other prediction algorithms and if the data is open then it will make for a good study.”

Peter replied: “I am a supporter of the HOSE code and NN approach, but I have also been impressed with the GIAO method. The time taken is relatively unimportant. A 20-atom molecule takes a day or so, smaller ones are faster. We can run 100 jobs a day – so 30,000 a month. That’s larger than NMRShiftDB.”
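The throughput figures in this exchange are easy to sanity-check. The sketch below is only back-of-the-envelope arithmetic; the 30-day month and the one-job-per-spectrum mapping are my assumptions, not something stated in Peter's post.

```python
# Back-of-the-envelope check of the quoted throughput figures.
JOBS_PER_DAY = 100            # rate quoted in the comment above
DAYS_PER_MONTH = 30           # assumption: a 30-day month
NMRSHIFTDB_SPECTRA = 20_000   # number of spectra quoted earlier in the post

# Jobs completed in one month at the quoted daily rate.
jobs_per_month = JOBS_PER_DAY * DAYS_PER_MONTH

# Days needed to cover the NMRShiftDB set, assuming one job per spectrum.
days_for_nmrshiftdb = NMRSHIFTDB_SPECTRA / JOBS_PER_DAY

# Daily rate that would be needed to reach 30,000 jobs in a month.
rate_for_30000_per_month = 30_000 / DAYS_PER_MONTH

print(f"jobs per month at {JOBS_PER_DAY}/day: {jobs_per_month}")
print(f"days to cover {NMRSHIFTDB_SPECTRA} spectra: {days_for_nmrshiftdb:.0f}")
print(f"jobs/day needed for 30,000 a month: {rate_for_30000_per_month:.0f}")
```

Under these assumptions, 100 jobs a day comes to about 3,000 jobs a month, and 30,000 a month would need about 1,000 jobs a day. Which of the two figures was intended is not clear from the comment.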

He also wrote: “I have extended George Whitesides’ ideas of writing papers that in doing research one should write the final paper first (and then of course modify it as you go along). So here’s what we will have accomplished (please correct mistakes).”

The abstract is below. Rather than correct mistakes, I have added a passage (the indented paragraphs). I believe this project offers the ability to help build a bridge between the Open Data community, the academic community and the commercial software community for the benefit of science. There has never been a study of the magnitude being discussed here comparing quantum-mechanical NMR prediction methods with the methods represented by commercial software products. I look forward to it!

PMR> “We adapted Rychnovsky’s method of calculating 13C NMR shifts by adding (XXX) basis set and functionals (Henry has done this). We extracted 1234 spectra with predicted 3D geometries for rigid molecules in NMRShiftDB (no acyclic-acyclic bond nodes for heavy atoms). Molecules had <= 21 heavy atoms (<= Cl). These were optimised using Gaussian XXX and the isotropic magnetic tensors calculated using correction for the known solvent. The shift was subtracted from the calculated TMS shift (in the same solvent) and the predicted shift compared with the observed.
Initially the RMS deviation was xxx. This was due to a small number of structures where there appeared to be gross errors of assignment. These were exposed to the community who agreed that these should be removed. The RMS dropped to yyy. The largest deviations were then due to Y-C-X systems, where a correction was applied (with theoretical backing). The RMS then dropped to zzz. The main outliers then appeared to be from laboratory AAA to whom we wrote and they agreed that their output format introduced systematic errors. They have now corrected this. The RMS was now zzz. The deviations were analysed by standard chemoinformatics methods and were found to correlate with the XC(=Z)Y group which probably has two conformations. A conformational analysis of the system was undertaken for any system with this group and the contributions from different conformers averaged. The RMS now dropped to vvv.
This established a protocol for predicting NMR spectra to 99.3% confidence. We then applied this to spectra published in 2007 in major chemical journals. We found that aa% of spectra appeared to be misassigned, and that bb% of suggested structures were “wrong” – i.e. the reported chemical shifts did not fit the reported spectra values.

    The results from the GIAO calculations were compared with three other prediction approaches provided by Advanced Chemistry Development. These algorithms were not limited in the number of heavy atoms that could be handled. The algorithms were a HOSE-code based approach, a neural network approach and an “increment approach”. A distinct advantage of these approaches is the time for prediction relative to the quantum-mechanical calculations. The QM calculation took a number of weeks to perform on the dataset of 23475 structures on a cluster of computers. However, a standard PC enabled the HOSE code based predictions to be performed in a few hours, the Neural Net predictions in about 4 minutes and the Increment based predictions in less than 3 minutes.
    A comparison of the approaches gave statistics for the non-QM approaches superior to those of the QM approach. Outliers were observed in both cases and were traced to misassignments. QM approaches were generally more capable of predicting exotic structures, while for the majority of the NMRShiftDB, made up of general organic chemicals, non-QM approaches were superior.

We argue that if spectra and compounds were published in CMLSpect in the supplemental data it would be possible for reviewers and editors to check the “correctness” on receipt of the manuscript. We wrote to all major editors. aa% agreed this was a good idea and asked us to help. bb% said they had no plans and the community liked things the way it is. cc% said that if we extracted data they would sue us. dd% failed to reply.
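The core bookkeeping in the draft abstract (reference each calculated shielding to TMS, compare with the observed shift, track the RMS deviation, and look at the largest deviations as candidate misassignments) can be sketched in a few lines. This is only an illustration under my own assumptions: the function names, the 10 ppm threshold and all the numbers are invented for the example and are not from Peter's protocol or from NMRShiftDB.

```python
import math

def shift_from_shielding(sigma_tms, sigma_atom):
    """Predicted shift (ppm): calculated TMS shielding minus the atom's shielding."""
    return sigma_tms - sigma_atom

def rms_deviation(predicted, observed):
    """Root-mean-square deviation between paired predicted and observed shifts."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))

def flag_outliers(predicted, observed, threshold_ppm):
    """Indices whose absolute deviation exceeds the threshold (candidate misassignments)."""
    return [i for i, (p, o) in enumerate(zip(predicted, observed))
            if abs(p - o) > threshold_ppm]

# Illustrative numbers only (hypothetical shieldings and observed shifts, in ppm).
SIGMA_TMS = 185.0
shieldings = [155.0, 60.0, 10.0, 165.0]
observed = [29.0, 127.0, 175.0, 60.0]   # the last value mimics a gross misassignment

predicted = [shift_from_shielding(SIGMA_TMS, s) for s in shieldings]
rms_all = rms_deviation(predicted, observed)

# Set aside flagged entries (in the abstract, only after community review) and recompute.
outliers = flag_outliers(predicted, observed, threshold_ppm=10.0)
kept = [i for i in range(len(predicted)) if i not in outliers]
rms_clean = rms_deviation([predicted[i] for i in kept], [observed[i] for i in kept])

print(f"RMS (all): {rms_all:.2f} ppm, RMS (outliers removed): {rms_clean:.2f} ppm")
```

A single gross error dominates the RMS, which is why the abstract expects the figure to drop sharply once misassigned structures are reviewed and removed.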

This has the potential to be a very exciting project. While I wouldn’t write the paper myself without doing the work, I’ll certainly try the approach. Let’s see what the truth is. The challenge now is to reach agreement on how to compare the performance of the algorithms. We are comparing very different beasts with the QM vs. non-QM approaches so, in many ways, this should be much easier than the challenges discussed so far around comparing non-QM approaches between vendors.
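One way to make "how to compare" concrete is to score every method on exactly the same set of assigned atoms and report more than one statistic. The sketch below is a minimal illustration under my own assumptions: the method names, atom labels and shift values are all invented, and the choice of MAE, RMS and maximum error is just one reasonable option, not an agreed protocol.

```python
import math

def compare_methods(predictions, observed):
    """Score each method on the atoms that every method (and the experiment) covers.

    predictions: {method_name: {atom_id: predicted_shift_ppm}}
    observed:    {atom_id: observed_shift_ppm}
    Returns {method_name: {"n", "mae", "rms", "max"}}.
    """
    # Restrict to the common atoms so no method is scored on an easier subset.
    common = set(observed)
    for shifts in predictions.values():
        common &= set(shifts)
    if not common:
        raise ValueError("no atoms are covered by every method")

    summary = {}
    for name, shifts in predictions.items():
        errors = [abs(shifts[atom] - observed[atom]) for atom in sorted(common)]
        summary[name] = {
            "n": len(errors),
            "mae": sum(errors) / len(errors),
            "rms": math.sqrt(sum(e * e for e in errors) / len(errors)),
            "max": max(errors),
        }
    return summary

# Illustrative numbers only (hypothetical shifts in ppm).
observed = {"C1": 20.0, "C2": 130.0, "C3": 170.0}
predictions = {
    "method_A": {"C1": 21.0, "C2": 128.0, "C3": 170.0},
    "method_B": {"C1": 20.0, "C2": 133.0},          # C3 not predicted
}

summary = compare_methods(predictions, observed)
for name, stats in summary.items():
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

In this toy case the two methods have the same mean absolute error but different RMS and maximum errors, which is the kind of ambiguity that needs to be settled before any method is declared "superior".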

 

About tony

Antony (Tony) J. Williams received his BSc in 1985 from the University of Liverpool (UK) and PhD in 1988 from the University of London (UK). His PhD research interests were in studying the effects of high pressure on molecular motions within lubricant-related systems using Nuclear Magnetic Resonance. He moved to Ottawa, Canada to work for the National Research Council performing fundamental research on the electron paramagnetic resonance of radicals trapped in single crystals. Following his postdoctoral position he became the NMR Facility Manager for Ottawa University. Tony joined the Eastman Kodak Company in Rochester, New York as their NMR Technology Leader. He led the laboratory to develop quality control across multiple spectroscopy labs and helped establish walk-up laboratories providing NMR, LC-MS and other forms of spectroscopy to hundreds of chemists across multiple sites. This included the delivery of spectroscopic data to the desktop, automated processing and his initial interests in computer-assisted structure elucidation (CASE) systems. He also worked with a team to develop the world’s first web-based LIMS system, WIMS, capable of allowing chemical structure searching and spectral display. With his developing cheminformatic skills and passion for data management he left corporate America to join a small start-up company working out of Toronto, Canada. He joined ACD/Labs as their NMR Product Manager and held various roles, including Chief Science Officer, during his 10 years with the company. His responsibilities included managing over 50 products at one time prior to developing a product management team, managing sales, marketing, technical support and technical services. ACD/Labs was one of Canada’s Fast 50 Tech Companies, and Forbes Fast 500 companies in 2001. His primary passions during his tenure with ACD/Labs were the continued adoption of web-based technologies and developing automated structure verification and elucidation platforms.
While at ACD/Labs he suggested the possibility of developing a public resource for chemists attempting to integrate internet available chemical data. He finally pursued this vision with some close friends as a hobby project in the evenings and the result was the ChemSpider database (www.chemspider.com). Even while running out of a basement on hand built servers the website developed a large community following that eventually culminated in the acquisition of the website by the Royal Society of Chemistry (RSC) based in Cambridge, United Kingdom. Tony joined the organization, together with some of the other ChemSpider team, and became their Vice President of Strategic Development. At RSC he continued to develop cheminformatics tools, specifically ChemSpider, and was the technical lead for the chemistry aspects of the Open PHACTS project (http://www.openphacts.org), a project focused on the delivery of open data, open source and open systems to support the pharmaceutical sciences. He was also the technical lead for the UK National Chemical Database Service (http://cds.rsc.org/) and the RSC lead for the PharmaSea project (http://www.pharma-sea.eu/) attempting to identify novel natural products from the ocean. He left RSC in 2015 to become a Computational Chemist in the National Center of Computational Toxicology at the Environmental Protection Agency where he is bringing his skills to bear working with a team on the delivery of a new software architecture for the management and delivery of data, algorithms and visualization tools. The “Chemistry Dashboard” was released on April 1st, no fooling, at https://comptox.epa.gov, and provides access to over 700,000 chemicals, experimental and predicted properties and a developing link network to support the environmental sciences. Tony remains passionate about computer-assisted structure elucidation and verification approaches and continues to publish in this area. 
He is also passionate about teaching scientists to benefit from the developing array of social networking tools for scientists and is known as the ChemConnector on the networks. Over the years he has had adjunct roles at a number of institutions and presently enjoys working with scientists at both UNC Chapel Hill and NC State University. He is widely published with over 200 papers and book chapters and was the recipient of the Jim Gray Award for eScience in 2012. In 2016 he was awarded the North Carolina ACS Distinguished Speaker Award.

Posted on October 18, 2007 in ChemSpider Chemistry, Community Building

 


3 Responses to An Invitation to Collaborate on Open Notebook Science for an NMR Study

  1. Brent Lefebvre

    October 19, 2007 at 10:35 am

    Peter and Tony,

    I think this is a fantastic project and am very keen to see how accurate the QM techniques prove to be for the subset of structures that you choose from the NMRShiftDB, and then how helpful they can be in improving the accuracy of experimental shifts in this wonderful resource.

    For the purposes of this work, we would be willing to provide the chemical shift predictions from the ACD/Labs software if you would like to use them in your comparison. If, for instance, they prove to be accurate enough to find many of these problems without the need for time-consuming QM calculations, it may be preferable to use the faster calculation algorithms that are available in our software. It may turn out that the ACD/Labs predictions could serve as a pre-filter to define which structures need the QM calculations and which don’t. Many variations on this theme come to mind, but we won’t know which are useful until we do the work.

    Sincerely,

    Brent Lefebvre
    NMR Product Manager
    Advanced Chemistry Development, Inc.

     
  2. Antony Williams

    October 23, 2007 at 8:45 pm

    Response to blog posting by Peter Murray-Rust posted (http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=715) here for completeness

    Peter, I’ll comment in more detail on this after your readers have a chance to comment. But, for clarity, I point your readers to http://www.chemspider.com/blog/?p=213 where I took your “George Whitesides approach to writing papers” and added to the conclusion. Here’s the piece I added.

    “The results from the GIAO calculations were compared with three other prediction approaches provided by Advanced Chemistry Development. These algorithms were not limited in the number of heavy atoms that could be handled. The algorithms were a HOSE-code based approach, a neural network approach and an “increment approach”. A distinct advantage of these approaches is the time for prediction relative to the quantum-mechanical calculations. The QM calculation took a number of weeks to perform on the dataset of 23475 structures on a cluster of computers. However, a standard PC enabled the HOSE code based predictions to be performed in a few hours, the Neural Net predictions in about 4 minutes and the Increment based predictions in less than 3 minutes.
    A comparison of the approaches gave statistics for the non-QM approaches superior to those of the QM approach. Outliers were observed in both cases and were traced to misassignments. QM approaches were generally more capable of predicting exotic structures while for the majority of the NMRShiftDB made up of general organic chemicals non-QM approaches were superior.”

    There should be no surprise to you that ACD/Labs stepped forward to participate. I declared it explicitly in my blog posting.

    I also posted in that blog the following statement “I believe this project offers the ability to help build a bridge between the Open Data community, the academic community and the commercial software community for the benefit of science. There has never been a study of the magnitude being discussed here comparing quantum-mechanical NMR prediction methods with the methods represented by commercial software products. I look forward to it!”

    I fully acknowledge your stance on commercial software companies. Also on publishers. And many other areas. You’re not shy with your judgments. Having worked in academia, a Fortune 500 company and in a commercial software company I can comment that all three have good science going on, some excellent people in their organizations and certainly people committed to their roles and to science. I ask the question: why not help build the bridge rather than maintain the distance?

    Further clarification: ChemSpider does not have access to any NMR prediction algorithms. However, they would be willing to work on this project for the science. The hypothesis under question is whether HOSE, Neural Net or Increment based algorithms can outperform GIAO predictions. It is already known that they are faster. These are real numbers, already generated on the dataset of 23475 structures: “a standard PC enabled the HOSE code based predictions to be performed in a few hours, the Neural Net predictions in about 4 minutes and the Increment based predictions in less than 3 minutes.” What is the statement on accuracy? I believe it’s a valid scientific question to be answered.

    We have just submitted a publication regarding one aspect of this, validated on NMRShiftDB, with your collaborator, Christoph Steinbeck, as our collaborator. The title and authors are below; it should be in JCIM shortly. I have already sent you a copy, I believe.

    The Performance Validation of Neural Network Based 13C NMR Prediction
    Using a Publicly Available Data Source.
    K.A. Blinov§, Y.D. Smurnyy§, M.E. Elyashberg§, T.S. Churanova§, M. Kvasha§, C. Steinbeck#, B.A. Lefebvre† and A.J. Williams‡
    § Advanced Chemistry Development, Moscow Department, 6 Akademik Bakulev Street, Moscow 117513, Russian Federation
    † Advanced Chemistry Development, Inc., 110 Yonge Street, 14th floor, Toronto, Ontario, Canada, M5C 1T4
    # Steinbeck Molecular Informatics, Franz-John-Str. 10, 77855 Achern, Germany.
    ‡ ChemZoo Inc., 904 Tamaras Circle, Wake Forest, NC-27587

     
  3. Wolfgang Robien

    October 24, 2007 at 7:30 am

    Dear Tony;

    there is a searchable database of 16.4 million calculated C13-NMR spectra, available for approx. 1 year at http://nmrpredict.orc.univie.ac.at/identify
    (Peter: it’s free of charge ! ;-)) )

    The spectra have been calculated for 16.4 million of the PUBCHEM structures using the CSEARCH NN-approach. The search technology used is a modified SAHO-approach as implemented in CSEARCH.

    If there is more interest in using this, it is no problem to upgrade the data file to the current size of the PUBCHEM collection. The calculation of approx. 40 million spectra can be done in less than one week on a 4-processor box.

    Best regards, Wolfgang Robien

     
