
Further Comments on the Quality of NMRShiftDB and NMR Prediction Algorithm Validation

31 May

Following on from a recent post estimating the number of potential errors in NMRShiftDB, a report has been issued presenting a deeper analysis of the data than the cursory estimates I suggested. The details of that analysis have been posted on Ryan Sasaki’s blog and in a separate report. An examination of algorithm performance carried out by a software vendor will, almost by definition, have its objectivity challenged. So be it. However, if we assume that the examination is objective and driven by quality science, then any other flavors should be distasteful and the science is what it is.

The report analyzing the performance of both the ACD/CNMR predictor and Robien’s algorithm provided the statistics shown below:
[Figure: Comparison of predictors]

Previously I commented in the blog that based on an early analysis “The quality is excellent and there are “large errors” but minimum in number.” This comment seemed to cause some confusion.

So, what do I mean by large error?

What I do NOT mean is that a chemical shift at 120ppm is predicted to be at 80ppm and therefore there is a large error. No, the chemical shift at 120ppm could be experimentally correct but the prediction algorithm could fail to predict it correctly.

What I DO mean is that an assignment of a particular nucleus to 120ppm may be entered into the database when the ACTUAL shift should be 12ppm, with that additional zero creeping in as an error during the submission process. So, the errors I am pointing to are those arising from incorrectly drawn structures, mis-assignments, transcription errors and other potential sources of error. My estimates refer to the number of significant assignment or structural errors that were glaringly incorrect, and I was subjectively thinking of situations where the difference between the actual experimental shift value and the one assigned to the nucleus was >20ppm. This does not mean that mis-assignments of even 1ppm are any less important, just that they are not necessarily as easy to detect and were not part of my subjective criteria.
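To make the ">20ppm" criterion concrete, here is a minimal sketch in Python. It assumes each database entry can be paired with a trusted reference shift for the same nucleus; the tuple layout, the function name and the example values are all hypothetical and are not taken from NMRShiftDB or any ACD/Labs tool.

```python
# Threshold taken from the subjective ">20ppm" criterion described in the text.
GLARING_THRESHOLD_PPM = 20.0


def find_glaring_errors(entries, threshold_ppm=GLARING_THRESHOLD_PPM):
    """Return entries whose assigned shift is more than threshold_ppm from the reference.

    entries: iterable of (atom_label, assigned_ppm, reference_ppm) tuples.
    This layout is an assumption made for illustration only.
    """
    flagged = []
    for atom_label, assigned_ppm, reference_ppm in entries:
        difference = abs(assigned_ppm - reference_ppm)
        if difference > threshold_ppm:
            flagged.append((atom_label, assigned_ppm, reference_ppm, difference))
    return flagged


# Invented example values, for illustration only.
example_entries = [
    ("C-a", 120.0, 12.0),   # an extra zero typed during submission
    ("C-b", 128.4, 129.1),  # small difference, not "glaring"
    ("C-c", 55.0, 35.0),    # exactly 20 ppm: not flagged, the test is strictly greater
]

for row in find_glaring_errors(example_entries):
    print(row)
```

A check like this only catches the obvious cases. Smaller mis-assignments pass straight through it, which is why they need manual examination.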

At present, the data have been examined in more detail and I believe I overestimated. A report of potential glaring errors has been returned to Christoph for him to examine and make changes to the database as he sees fit. Glaring errors number fewer than 250 based on my subjective criteria. Again, this does not mean that there aren’t hundreds or thousands of errors buried in the data; they are simply not obvious errors and require more manual examination.

Now, let’s return to some of the other comments about potential errors in the database. In his post Robien has highlighted some errors that are contained within the data. With his examination he has highlighted the necessary level of care and caution required to create the cleanest database of data possible and willingly admits there are likely still errors in his own database even after rigorous checking procedures. It would be naïve to not admit this for a large database and certainly this is true for some of the NMR data contained in the millions of shifts used as the basis of the ACD/Labs NMR predictions. There are errors in there…we are just not aware of them yet.

Robien comments that “the total number of incorrect assignments exceeds the above mentioned limit of 250 significantly. The intermediate number is at the moment around 300, but about ca. 1,000 pages of printouts are waiting for visual inspection.” This is the nature of building high quality databases: they need to be manually curated, and appropriate processes and tools need to be applied to ensure rigorous examination of the data. Ryan has already commented on the difference between 250 and 300, as has PMR in terms of 1% error.

Robien also comments that “From my experience with C13-NMR databases a reasonable estimate for the number of mis-assigned shift values will be between 500 and 600, which corresponds to ca. 0.25-0,3%. Afterwards the data will reach the usual quality as within CSEARCH and NMRPredict.” Compare this with our experience of building databases of NMR shifts over the past decade. Year after year we have added somewhere close to 200,000 chemical shifts to the carbon NMR database. Each of the structures and individual chemical shifts passes through processes outlined elsewhere (this is an OLD presentation and the processes have since been tweaked for even greater control of data quality). Over the years we have gathered statistics on the quality of data contained within the literature. This is almost exclusively peer-reviewed literature. Our statistics, on an annual basis, show that 8%, EIGHT PERCENT, of the chemical structures and associated shift assignments in the literature are in error. This can mean incorrectly elucidated structures, incorrectly assigned structures, transcription errors and so on. These are the errors we catch.
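The arithmetic behind Robien's percentages can be checked with a short sketch. It assumes his 0.25% and 0.3% figures refer to the total number of shift values being checked, which the quote implies but does not state outright; the function names are made up for this example.

```python
def implied_total_shifts(error_count, error_rate_percent):
    """Total number of shifts implied by an error count and an error rate in percent."""
    return error_count * 100.0 / error_rate_percent


def error_rate_percent(error_count, total_shifts):
    """Error rate in percent for a given error count and total number of shifts."""
    return 100.0 * error_count / total_shifts


# Robien's estimate: 500 to 600 mis-assigned shifts, ca. 0.25 to 0.3%.
low_total = implied_total_shifts(500, 0.25)
high_total = implied_total_shifts(600, 0.30)
print(round(low_total), round(high_total))

# The two counts discussed above (250 and 300) against the same implied total.
for count in (250, 300):
    print(count, round(error_rate_percent(count, low_total), 3))
```

Both ends of Robien's range imply a total of roughly 200,000 shift values. Against that total, 250 and 300 errors come to about 0.125% and 0.15% respectively.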

I reported previously on an example of the type of errors perpetuating in the literature. For a tosyl group as represented below, consider C-1 and C-4.

[Figure: Tosyl structure]

Without going into detail, when chemical shifts are predicted it is possible to examine which structures were used to generate the predictions. There are TWO distinct columns of related structures displayed, one around 132ppm and one around 145ppm. See below.

[Figure: Calculation Protocols]

Conclusions from these data include the fact that ongoing assignments for C-1 and C-4 have been confused in many cases. These confusions did not arise from one particular publication that we can identify. We had to clean these up in our database as Wolfgang likely did! “Error propagation” as he comments, is dangerous! Finding the source of the errors is challenging.
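One way to surface this kind of confusion in a collection of records is to bin each assigned shift to the nearer of the two observed centres (132ppm and 145ppm) and see whether an atom label appears in both bins. The sketch below is only an illustration: the record layout, the function names and the example values are invented, and a majority vote says nothing about which assignment is actually correct.

```python
from collections import Counter

# The two cluster centres mentioned in the text.
CENTERS_PPM = (132.0, 145.0)


def nearest_center(shift_ppm, centers=CENTERS_PPM):
    """Return the cluster centre closest to the given shift."""
    return min(centers, key=lambda center: abs(shift_ppm - center))


def tally_by_cluster(records, centers=CENTERS_PPM):
    """Count how often each atom label falls into each cluster."""
    tally = Counter()
    for atom_label, shift_ppm in records:
        tally[(atom_label, nearest_center(shift_ppm, centers))] += 1
    return tally


def flag_minority(records, centers=CENTERS_PPM):
    """Return records whose (label, cluster) pairing is rarer than the alternative.

    A flagged record is a candidate for manual review, not a proven error.
    """
    records = list(records)
    tally = tally_by_cluster(records, centers)
    flagged = []
    for atom_label, shift_ppm in records:
        own_count = tally[(atom_label, nearest_center(shift_ppm, centers))]
        best_count = max(tally[(atom_label, center)] for center in centers)
        if own_count < best_count:
            flagged.append((atom_label, shift_ppm))
    return flagged


# Invented values: not real data, and not a claim about the right assignment.
example_records = [
    ("C-1", 145.1), ("C-4", 132.3),
    ("C-1", 144.8), ("C-4", 132.0),
    ("C-1", 132.5), ("C-4", 145.0),
]

print(flag_minority(example_records))
```

If the errors have propagated widely enough, the wrong assignment can become the majority, so a tally like this can point to where the confusion is but cannot settle it.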

Ryan Sasaki’s post and Peter Murray Rust’s post have both commented on the fact that NMRShiftDB is a valuable resource. I’ve got my hand up in the air waving wildly with a “Yessir, it is”.

Christoph Steinbeck and the NMRShiftDB team have outlined their intentions admirably in at least two articles (J. Chem. Inf. Comput. Sci. 2003, 43, 1733-1739 and Phytochemistry 65 (2004) 2711–2717). They are well on their way to delivering on their vision. A detailed analysis of the performance of the ACD/Labs C-13 NMR NEURAL NETWORK based predictor using the NMRShiftDB dataset has been performed and a manuscript is presently in preparation. That work included a review of the impacts of overlap in the training set. The numbers are impressive and a teaser is available.

PMR commented on one of the capabilities of NMRShiftDB – “It contains mechanisms for assessing data quality automatically. For example software can be run that will indicate whether values are seriously in error.” NMRShiftDB is a research project and therefore “in development”. I’ve already provided a document outlining what appears to be a series of serious errors in the data (Robien identified some too). There is some work required on these algorithms. An example is shown below…notice that 8.30ppm shift on the carbonyl?

[Figure: Assignment error]

The assignments for this structure have already been reported, twice: once in the solid state and once in DMSO. The screenshot below shows the values. EVERY shift except for one as submitted to NMRShiftDB is in error. But, this is ONE structure from one submitter. Maybe one approach is to have “faith criteria” for submitters after they have proven themselves with the submission of data? Maybe something untoward happened with the system this one time during submission. It does not take away from the effort; it’s just evidence that there is more work to be done. Here at ChemSpider, we know that world!

[Figure: Database entry]

 

About tony

Antony (Tony) J. Williams received his BSc in 1985 from the University of Liverpool (UK) and PhD in 1988 from the University of London (UK). His PhD research interests were in studying the effects of high pressure on molecular motions within lubricant related systems using Nuclear Magnetic Resonance. He moved to Ottawa, Canada to work for the National Research Council performing fundamental research on the electron paramagnetic resonance of radicals trapped in single crystals. Following his postdoctoral position he became the NMR Facility Manager for Ottawa University. Tony joined the Eastman Kodak Company in Rochester, New York as their NMR Technology Leader. He led the laboratory to develop quality control across multiple spectroscopy labs and helped establish walk-up laboratories providing NMR, LC-MS and other forms of spectroscopy to hundreds of chemists across multiple sites. This included the delivery of spectroscopic data to the desktop, automated processing and his initial interests in computer-assisted structure elucidation (CASE) systems. He also worked with a team to develop the world’s first web-based LIMS system, WIMS, capable of allowing chemical structure searching and spectral display. With his developing cheminformatic skills and passion for data management he left corporate America to join a small start-up company working out of Toronto, Canada. He joined ACD/Labs as their NMR Product Manager and held various roles, including Chief Science Officer, during his 10 years with the company. His responsibilities included managing over 50 products at one time prior to developing a product management team, managing sales, marketing, technical support and technical services. ACD/Labs was one of Canada’s Fast 50 Tech Companies, and Forbes Fast 500 companies in 2001. His primary passions during his tenure with ACD/Labs were the continued adoption of web-based technologies and developing automated structure verification and elucidation platforms.
While at ACD/Labs he suggested the possibility of developing a public resource for chemists attempting to integrate internet available chemical data. He finally pursued this vision with some close friends as a hobby project in the evenings and the result was the ChemSpider database (www.chemspider.com). Even while running out of a basement on hand built servers the website developed a large community following that eventually culminated in the acquisition of the website by the Royal Society of Chemistry (RSC) based in Cambridge, United Kingdom. Tony joined the organization, together with some of the other ChemSpider team, and became their Vice President of Strategic Development. At RSC he continued to develop cheminformatics tools, specifically ChemSpider, and was the technical lead for the chemistry aspects of the Open PHACTS project (http://www.openphacts.org), a project focused on the delivery of open data, open source and open systems to support the pharmaceutical sciences. He was also the technical lead for the UK National Chemical Database Service (http://cds.rsc.org/) and the RSC lead for the PharmaSea project (http://www.pharma-sea.eu/) attempting to identify novel natural products from the ocean. He left RSC in 2015 to become a Computational Chemist in the National Center of Computational Toxicology at the Environmental Protection Agency where he is bringing his skills to bear working with a team on the delivery of a new software architecture for the management and delivery of data, algorithms and visualization tools. The “Chemistry Dashboard” was released on April 1st, no fooling, at https://comptox.epa.gov, and provides access to over 700,000 chemicals, experimental and predicted properties and a developing link network to support the environmental sciences. Tony remains passionate about computer-assisted structure elucidation and verification approaches and continues to publish in this area. 
He is also passionate about teaching scientists to benefit from the developing array of social networking tools for scientists and is known as the ChemConnector on the networks. Over the years he has had adjunct roles at a number of institutions and presently enjoys working with scientists at both UNC Chapel Hill and NC State University. He is widely published with over 200 papers and book chapters and was the recipient of the Jim Gray Award for eScience in 2012. In 2016 he was awarded the North Carolina ACS Distinguished Speaker Award.

Posted by on May 31, 2007 in ChemSpider Chemistry

 
