There are a few areas of cheminformatics that I watch out of professional interest but, if the truth be known, more out of passion. As an NMR spectroscopist I still watch NMR processing and prediction software, CASE systems (Computer-Assisted Structure Elucidation), structure drawing and databasing, and, given our recent interest over at ChemSpider in chemical name and structure image recognition, I watch OSR software developments. OSR is Optical Structure Recognition, the equivalent of OCR for chemical structure images. (Egon and I are both interested in OSR, it seems…)
Probably the best-known OSR system on the market for the past few years is CLiDE, and I have had a chance to work with it as discussed here. There are now others available, though, specifically ChemOCR from the Fraunhofer Institute, OSRA from the National Cancer Institute and ChemReader from the University of Michigan. I can’t find it now, but there was also Kekule, also funded by the NCI.
As with all software focused on a particular problem, these packages share the same intention…convert structure images into machine-readable chemical structure formats. The technology approaches are similar but differ, of course, in their implementation. This blog isn’t about those differences; it is about how they can be compared.
Recently a gauntlet was thrown down in regards to solubility prediction. The question asked was “Can You Predict Solubilities of Thirty-Two Molecules Using a Database of One Hundred Reliable Measurements?”. The details of the challenge are here. What was nice was that the results could be judged by independent parties: experts in the field got to review the data and comment. This is very different from chemistry software vendors comparing each other’s products and standing by their own opinions. I’ve been involved with this myself in terms of NMR prediction comparisons, and those discussions can get rather heated. There was similar “warmth” in the air about a year ago in the OSR domain, as discussed here.
So, with so many efforts in the area of OSR, how can we get independent testing of multiple OSR packages and a true representation of their performance characteristics? Since some packages are commercial while others are Open Source, we would need to separate “packaging” from performance: a set of objective criteria separating usability, workflows and interface from algorithms. This doesn’t mean that the former are not important, nay critical, to the success of a software package, BUT the algorithms, the science, the technology should be the focus of the study.
I suggest taking 100-200 images from different sources and applying the various software packages to validate performance in a neutral way. The study should be conducted by neutral parties…not so neutral that they don’t care about the work, but neutral in that their only stake is in an objective comparison of the OSR algorithms. I have an interest in this so will throw my hat in the ring…I have already done some work on CLiDE and OSRA (1, 2, 3, 4). Who else would be interested?
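The scoring mechanics of such a study could be quite simple. Here is a minimal Python sketch, and I should stress everything in it is my assumption rather than anyone’s actual protocol: the function name, the made-up tool and image names, the use of a canonical structure string (say, canonical SMILES or InChI generated by one agreed toolkit) as the comparison key, and a bare exact-match criterion for “correct”.

```python
# Hypothetical sketch: score OSR tools against a curated ground-truth set.
# All names and data below are invented for illustration.

def score_osr_results(ground_truth, predictions):
    """Return per-tool accuracy: the fraction of images whose predicted
    structure string exactly matches the curated ground truth.

    ground_truth: dict of image name -> canonical structure string
    predictions:  dict of tool name  -> {image name -> predicted string}
    """
    scores = {}
    for tool, results in predictions.items():
        correct = sum(
            1 for image, truth in ground_truth.items()
            if results.get(image) == truth
        )
        scores[tool] = correct / len(ground_truth)
    return scores

# Made-up example: two hypothetical tools on three images.
truth = {"img1.png": "c1ccccc1", "img2.png": "CCO", "img3.png": "CC(=O)O"}
preds = {
    "ToolA": {"img1.png": "c1ccccc1", "img2.png": "CCO", "img3.png": "CCC"},
    "ToolB": {"img1.png": "c1ccccc1", "img2.png": "CCO", "img3.png": "CC(=O)O"},
}
print(score_osr_results(truth, preds))
```

In practice raw string comparison is too brittle…the same molecule can be written as many different SMILES strings, so both the predicted and ground-truth structures would need to be canonicalized by a single neutral toolkit before comparison, and partial-credit measures (atom/bond overlap, stereochemistry handled separately) would likely be fairer than all-or-nothing matching.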
The challenges…there are a few:
1) Would all of the OSR producers share their software packages with a neutral panel of reviewers?
2) Who would fund the work? The Solubility Challenge appears to have been funded by Pfizer. How quickly would it get done without funding…everyone’s busy.
3) How would the panel be selected?
4) Would the work be conducted without all OSR producers participating?
5) About a dozen more concerns…probably Jonathan Goodman, Robert Glen and John Mitchell could give some great advice based on their experience with the Solubility Challenge.
I think this type of comparison needs doing…you?