Following on from a recent post regarding an estimate of the number of potential errors in the NMRShiftDB, a recent report was issued regarding a deep analysis of the data rather than the cursory estimates that I suggested. The details of that analysis have been exposed on Ryan Sasaki’s blog and in a separate report. By definition an objective examination of the performance of algorithms by a software vendor is generally challenged. So be it. However, if we assume that the examination is objective and driven by quality science, then any other flavors should be distasteful and the science is what it is.
Previously I commented in the blog that, based on an early analysis, “The quality is excellent and there are “large errors” but minimum in number.” This comment seemed to cause some confusion.
So, what do I mean by large error?
What I do NOT mean is that a chemical shift at 120ppm is predicted to be at 80ppm and therefore there is a large error. No, the chemical shift at 120ppm could be experimentally correct but the prediction algorithm could fail to predict it correctly.
What I DO mean is that an assignment of a particular nucleus to 120ppm may be entered into the database but the ACTUAL shift should be 12ppm…that additional zero just showing up as an error during the submission process. So, the errors I am pointing to are those of incorrectly drawn structures, mis-assignments, transcription errors and other potential sources of error. My estimates refer to the number of significant assignment or structural errors that were glaringly incorrect, and I was subjectively thinking of situations where the difference between the actual experimental shift value and the one assigned to the nucleus was >20ppm. This does not mean that mis-assignments of even 1ppm are any less important, just that they are not necessarily as easy to detect and were not part of my subjective criteria.
At present, the data have been examined in more detail and I believe I overestimated…a report of potential glaring errors has been returned to Christoph for him to examine and make changes to the database as he sees fit. Glaring errors are fewer than 250 in number based on my subjective criteria. Again, this does not mean that there aren’t hundreds or thousands of errors buried in the data…they are not obvious errors and require more manual examination.
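As a rough illustration of the subjective criterion above, the sketch below flags entries where the shift entered in the database and the actual shift differ by more than 20 ppm, and notes when the discrepancy looks like a factor-of-ten slip (120 entered for 12). The record layout, the function name, and the 5% tolerance on the factor-of-ten test are assumptions made for illustration; they are not part of NMRShiftDB or of any ACD/Labs tool.

```python
from dataclasses import dataclass

GLARING_PPM = 20.0  # subjective threshold taken from the text


@dataclass
class Assignment:
    """Hypothetical record: one nucleus, the shift as entered, and the actual shift."""
    atom: str
    entered_ppm: float
    actual_ppm: float


def classify(a: Assignment, threshold: float = GLARING_PPM, rel_tol: float = 0.05) -> str:
    """Label an assignment as 'not glaring', 'glaring', or a likely decimal slip."""
    diff = abs(a.entered_ppm - a.actual_ppm)
    if diff <= threshold:
        # Smaller errors still matter; they are just harder to spot this way.
        return "not glaring"
    # Does the entered value look like the actual value scaled by ten (or one tenth)?
    for factor in (10.0, 0.1):
        if abs(a.entered_ppm - a.actual_ppm * factor) <= rel_tol * abs(a.entered_ppm):
            return "glaring, possible decimal slip"
    return "glaring"


if __name__ == "__main__":
    examples = [
        Assignment("C-7", 120.0, 12.0),   # the extra-zero case described above
        Assignment("C-2", 131.5, 132.0),  # small difference
        Assignment("C-3", 145.0, 110.0),  # large difference, no obvious slip
    ]
    for a in examples:
        print(a.atom, classify(a))
```

In practice the actual shift is not known in advance, so a check like this can only be run against a trusted reference value, and anything it flags still needs a person to look at it.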
Now, let’s return to some of the other comments about potential errors in the database. In his post Robien has highlighted some errors that are contained within the data. With his examination he has highlighted the level of care and caution required to create the cleanest database possible, and he willingly admits there are likely still errors in his own database even after rigorous checking procedures. It would be naïve not to admit this for a large database, and certainly this is true for some of the NMR data contained in the millions of shifts used as the basis of the ACD/Labs NMR predictions. There are errors in there…we are just not aware of them yet.
Robien comments that “the total number of incorrect assignments exceeds the above mentioned limit of 250 significantly. The intermediate number is at the moment around 300, but about ca. 1,000 pages of printouts are waiting for visual inspection.” This is the nature of building high quality databases: they need to be manually curated, and appropriate processes and tools need to be applied to ensure rigorous examination of the data. Ryan’s already commented on the difference between 250 and 300…as has PMR in terms of 1% error.
Robien also comments that “From my experience with C13-NMR databases a reasonable estimate for the number of mis-assigned shift values will be between 500 and 600, which corresponds to ca. 0.25-0,3%. Afterwards the data will reach the usual quality as within CSEARCH and NMRPredict.” Compare this with our experience of building databases of NMR shifts over the past decade. Year after year we have added somewhere close to 200,000 chemical shifts to the carbon NMR database. Each of the structures and individual chemical shifts passes through processes outlined elsewhere (this is an OLD presentation and processes have been tweaked for even greater control of data quality…). Over the years we have gathered statistics on the quality of data contained within the literature. This is almost exclusively peer-reviewed literature. Our statistics, on an annual basis, are that 8%, EIGHT PERCENT, of the chemical structures and associated shift assignments are in error in the literature. This can mean incorrectly elucidated structures or incorrectly assigned structures, transcription errors and so on. These are the errors we catch.
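The percentages quoted above can be sanity-checked with simple arithmetic. Taking Robien's figures at face value, 500 to 600 errors at 0.25 to 0.3% imply a total of roughly 200,000 shift values in the dataset. That total is inferred here from the quote, not stated in it. With the same total, 250 to 300 glaring errors come out at about 0.125 to 0.15%. The function names below are purely illustrative.

```python
def implied_total(n_errors: float, rate_percent: float) -> float:
    """Total number of shifts implied by an error count and an error rate in percent."""
    return n_errors / (rate_percent / 100.0)


def error_rate_percent(n_errors: float, n_total: float) -> float:
    """Error rate in percent for a given error count and dataset size."""
    return 100.0 * n_errors / n_total


if __name__ == "__main__":
    # Both ends of Robien's estimate point to the same dataset size (about 200,000).
    print(round(implied_total(500, 0.25)))
    print(round(implied_total(600, 0.3)))

    # Assumed total, inferred from the two lines above.
    total = 200_000
    for n in (250, 300):
        print(n, "glaring errors:", round(error_rate_percent(n, total), 3), "%")
```

By either count the glaring errors are a fraction of a percent of the dataset, well under the 8% error rate we see in the literature, although the two figures measure different things (individual shifts versus structures with their assignments).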
I reported previously on an example of the type of errors perpetuating in the literature. For a tosyl group as represented below, consider C-1 and C-4.
Without going into detail, when chemical shifts are predicted it is possible to examine what structures are used to generate the predictions. There are TWO distinct columns of related structures displayed, one around 132ppm and one around 145ppm. See below.
Conclusions from these data include the fact that ongoing assignments for C-1 and C-4 have been confused in many cases. These confusions did not arise from one particular publication that we can identify. We had to clean these up in our database, as Wolfgang likely did! “Error propagation”, as he comments, is dangerous! Finding the source of the errors is challenging.
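One simple way to surface this kind of confusion is to group records by which of the two carbons carries the higher shift and then look at the smaller group. The sketch below does only that. The record identifiers and shift values are invented for illustration, the function names are hypothetical, and the code does not decide which orientation is chemically correct; that call stays with the curator.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record: (record id, shift assigned to C-1, shift assigned to C-4), in ppm.
Record = Tuple[str, float, float]


def split_by_orientation(records: Iterable[Record]) -> Dict[str, List[str]]:
    """Group record ids by which carbon was assigned the higher shift."""
    groups: Dict[str, List[str]] = {"c1_higher": [], "c4_higher": []}
    for rec_id, c1_ppm, c4_ppm in records:
        key = "c1_higher" if c1_ppm > c4_ppm else "c4_higher"
        groups[key].append(rec_id)
    return groups


def minority_records(records: Iterable[Record]) -> List[str]:
    """Return ids in the less common orientation (the first group wins a tie)."""
    groups = split_by_orientation(records)
    minority_key = min(groups, key=lambda k: len(groups[k]))
    return groups[minority_key]


if __name__ == "__main__":
    # Invented values clustered near the two columns mentioned above.
    data = [
        ("rec-001", 132.4, 144.9),
        ("rec-002", 132.9, 145.3),
        ("rec-003", 145.1, 132.6),  # opposite orientation to the others
        ("rec-004", 131.8, 144.6),
    ]
    print(split_by_orientation(data))
    print("review first:", minority_records(data))
```

A minority orientation is only a hint. If the error has propagated widely enough, the wrong assignment may well be the majority, which is exactly why error propagation is dangerous.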
Christoph Steinbeck and the NMRShiftDB team have outlined their intentions admirably in at least two articles (J. Chem. Inf. Comput. Sci. 2003, 43, 1733-1739 and Phytochemistry 65 (2004) 2711–2717). They are well on their way to delivering on their vision. A detailed analysis of the performance of the ACD/Labs C-13 NMR NEURAL NETWORK based predictor using the NMRShiftDB dataset has been performed and a manuscript is presently in preparation. That work included a review of the impacts of overlap in the training set. The numbers are impressive and a teaser is available.
PMR commented on one of the capabilities of NMRShiftDB: “It contains mechanisms for assessing data quality automatically. For example software can be run that will indicate whether values are seriously in error.” NMRShiftDB is a research project and therefore “in development”. I’ve already provided a document outlining what appears to be a series of serious errors in the data (Robien identified some too). There is some work required on these algorithms. An example is shown below…notice that 8.30ppm shift on the carbonyl?
The assignments for this structure have already been reported, twice. Once in the solid state, once in DMSO. The screenshot below shows the values. EVERY shift except for one as submitted to NMRShiftDB is in error. But, this is ONE structure from one submitter. Maybe one approach is to have “faith criteria” for submitters after they have proven themselves with the submission of data? Maybe something untoward happened with the system this one time during submission. It does not take away from the effort…it’s just data showing that there is more work to be done. Here at ChemSpider…we know that world!