Optical Structure Recognition, Solubility Prediction and Neutral Parties

There are a few areas of cheminformatics that I watch out of professional interest but, if the truth be known, more out of passion. As an NMR spectroscopist I still watch NMR processing and prediction software, CASE (Computer-Assisted Structure Elucidation) systems, structure drawing and databasing, and, given our recent interest over at ChemSpider in chemical name and structure image recognition, OSR software developments. OSR is Optical Structure Recognition, the equivalent of OCR for chemical structure images. (Egon and I are both interested in OSR, it seems…)

Probably the best known OSR system on the market for the past few years is CLiDE, and I have had a chance to work with it as discussed here. There are now others available, though, specifically ChemOCR from the Fraunhofer Institute. There is also OSRA from the National Cancer Institute and ChemReader from the University of Michigan. I can’t find it now but there was also Kekule, also funded by the NCI.

As with all software focusing on a particular problem, these packages share the same intention: convert structure images into machine-readable chemical structure formats. The technology approaches are similar but differ, of course, in their implementation. This post isn’t about those differences; it is about how the packages can be compared.

Recently a gauntlet was thrown down in regard to solubility prediction. The question asked was “Can You Predict Solubilities of Thirty-Two Molecules Using a Database of One Hundred Reliable Measurements?”. The details of the challenge are here. What was nice about this is the fact that the results could be judged by independent parties. What made it objective, at least from where I’m sitting, is that experts in the field got to review the data and comment. This is very different from chemistry software vendors comparing each other’s products and standing by their own opinions. I’ve been involved with this myself in terms of NMR prediction comparisons and these discussions can get rather heated. There was similar “warmth” in the air about a year ago in the OSR domain as discussed here.

So, with so many efforts in the area of OSR, how can we get independent testing of multiple OSR packages and a true representation of their performance characteristics? Since some packages are commercial while others are Open Source we would need to separate the distinctions of “packaging” from performance. We would need a set of objective criteria separating usability, workflows and interface from algorithms. This doesn’t mean that the former are not important, indeed they are critical to the success of a software package, BUT the algorithms, the science, the technology should be the focus of the study.

I suggest taking 100-200 images from different sources and applying the various software packages to validate performance in a neutral way. The study should be conducted by neutral parties…not so neutral that they don’t care about the work, but neutral in the sense that they are wed only to the outcome of an objective comparison of the OSR algorithms. I have an interest in this so will throw my hat in the ring…I have already done some work on CLiDE and OSRA (1, 2, 3, 4). Who else would be interested?
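To make the idea of a neutral comparison a little more concrete, here is a minimal sketch in Python of how such a test could be scored. It assumes each OSR package can be wrapped as a function that takes an image path and returns a canonical structure identifier (for example an InChI string), or nothing when conversion fails. The names `score_packages`, `package_a` and `package_b` are hypothetical stand-ins, not the interface of any real OSR product, and exact identifier match is only one of several possible scoring criteria.

```python
from typing import Callable, Dict, Optional

# Hypothetical type: a recognizer maps an image path to a canonical
# structure identifier (e.g. an InChI string), or None on failure.
Recognizer = Callable[[str], Optional[str]]


def score_packages(
    reference: Dict[str, str],
    packages: Dict[str, Recognizer],
) -> Dict[str, Dict[str, float]]:
    """Score every package on the same image set by exact identifier match."""
    results: Dict[str, Dict[str, float]] = {}
    total = len(reference)
    for name, recognize in packages.items():
        correct = 0
        failed = 0
        for image_path, expected in reference.items():
            try:
                predicted = recognize(image_path)
            except Exception:
                predicted = None  # a crash counts as a failed conversion
            if predicted is None:
                failed += 1
            elif predicted == expected:
                correct += 1
        results[name] = {
            "accuracy": correct / total if total else 0.0,
            "failure_rate": failed / total if total else 0.0,
        }
    return results


# Toy reference set and toy "packages" (purely illustrative).
reference = {
    "img1.png": "InChI=1S/CH4/h1H4",   # methane
    "img2.png": "InChI=1S/H2O/h1H2",   # water
}


def package_a(path: str) -> Optional[str]:
    return reference[path]  # pretends to recognize every image correctly


def package_b(path: str) -> Optional[str]:
    return reference[path] if path == "img1.png" else None  # fails on img2


scores = score_packages(reference, {"A": package_a, "B": package_b})
print(scores)
```

Scoring every package against one shared reference set, with failures counted separately from wrong answers, keeps the comparison about the algorithms rather than the packaging.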

The challenges…there are a few:

1) Would all of the OSR producers share their software packages with a neutral panel of reviewers?

2) Who would fund the work? The Solubility Challenge appears to have been funded by Pfizer. How quickly would it get done without funding…everyone’s busy.

3) How would the panel be selected?

4) Would the work be conducted without all OSR producers participating?

5) About a dozen more concerns….probably Jonathan Goodman, Robert Glen and John Mitchell could give some great advice based on their experience with the Solubility Challenge.

I think this type of comparison needs doing…you?

Coming Soon - The Journal of Cheminformatics

A cheer went up today in my office when I saw that David Wild and the group at Chemistry Central would soon be releasing an Open Access journal dedicated to Cheminformatics.

The intent of the journal is clear: Chemistry Central announces the imminent launch of the Journal of Cheminformatics.

The Journal of Cheminformatics is devoted to the dissemination of new and original knowledge in all branches of cheminformatics and molecular modelling including:

  • chemical information systems, software and databases, and molecular modelling
  • chemical structure representations and their use in structure, substructure, and similarity searching of chemical substance and chemical reaction databases
  • computer and molecular graphics, computer-aided molecular design, expert systems, QSAR, and data mining techniques

I judge that with David at the helm and with Christoph Steinbeck as Editor in Chief in Europe the quality will be very high and there will not be some of the issues I have commented on previously regarding quality in Open Access journals. I had been thinking of getting together some of my own colleagues to start an Open Access journal about cheminformatics but there is just no time left in my day so I am glad to see that this is being done. Great news for our community.

A New Publication Regarding Indirect Covariance NMR

A new publication for which I am a co-author is out on the application of covariance NMR processing:

Multistep correlations via covariance processing of COSY/GCOSY spectra: opportunities and artifacts

 

Long-range homonuclear coupling pathways can be observed in COSY or GCOSY spectra by the acquisition of spectra with larger numbers of increments of the evolution period, t1, than would normally be used. Alternatively, covariance processing of COSY-type spectra acquired with modest numbers of t1 increments allows the observation of multistage correlations. In this work results obtained from covariance-processed GCOSY spectra are fully analyzed and compared to normally processed COSY and 80 ms TOCSY spectra.
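For readers who have not met covariance processing, here is a toy sketch in Python of why it can produce multistage correlations. It is not the processing used in the paper. It computes only the core product step, S^T S, on a made-up 3x3 "COSY" matrix for a linear spin system A-B-C, and it leaves out the matrix square root that is commonly applied afterwards. The function name `covariance_product` and all intensities are illustrative assumptions.

```python
def covariance_product(spectrum):
    """Return S^T S for a 2D spectrum given as a list of rows (F1 x F2)."""
    n_rows = len(spectrum)
    n_cols = len(spectrum[0])
    return [
        [
            sum(spectrum[k][i] * spectrum[k][j] for k in range(n_rows))
            for j in range(n_cols)
        ]
        for i in range(n_cols)
    ]


# Toy COSY for spins A, B, C: diagonal peaks plus A-B and B-C cross peaks.
# There is no direct A-C cross peak in the input.
cosy = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.5],
    [0.0, 0.5, 1.0],
]

cov = covariance_product(cosy)
for row in cov:
    print(row)
```

Columns A and C both carry intensity in row B, so their product is non-zero and an A-C response appears even though the input had none. The same mechanism hints at where artifacts come from: any two columns that share intensity in a row, for example through resonance overlap, will also correlate.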

A Week of Writing NMR Publications

I am a week behind on submitting a Chapter for publication in the Second Edition of the Encyclopedia of Spectroscopy. Too much work…not enough time. In the meantime I’ve co-authored a couple of publications with friends from ACD/Labs…one of these addresses an issue discussed on the ChemSpider Blog…a longstanding wish to compare empirical and quantum-mechanical NMR prediction approaches.

A Systematic Approach for the Generation and Verification of Structural Hypotheses.

During the process of molecular structure elucidation, selection of the most probable structural hypothesis may be based on chemical shift prediction. The prediction is carried out either by empirical or quantum-mechanical (QM) methods. When QM methods are used NMR prediction commonly utilizes the GIAO option of the DFT approximation. In this approach the structural hypotheses are expected to be investigated by the scientist. In this article we hope to show that the most rational manner by which to create structural hypotheses is actually by the application of an expert system capable of deducing all potential structures consistent with the experimental spectral data and specifically using 2D NMR data. When an expert system is used the best structure(s) can be distinguished using chemical shift prediction using either incremental or neural net algorithm. The time-consuming quantum-mechanical calculations can then be applied, if necessary, to one or more of the “best” structures to confirm the suggested solution.
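As a rough illustration of the ranking step described in this abstract, here is a minimal sketch in Python that orders candidate structures by how closely their predicted shifts match experiment. It assumes the shifts are already assigned atom by atom and uses mean absolute deviation as the score, which is my choice for the sketch rather than the metric from the paper. The candidate names and all shift values are invented.

```python
def mean_abs_deviation(predicted, experimental):
    """Mean absolute deviation (ppm) between assigned shift lists."""
    if len(predicted) != len(experimental):
        raise ValueError("shift lists must have the same length")
    if not experimental:
        raise ValueError("shift lists must not be empty")
    total = sum(abs(p - e) for p, e in zip(predicted, experimental))
    return total / len(experimental)


def rank_candidates(candidates, experimental):
    """Return (deviation, name) pairs sorted from best to worst match."""
    scored = [
        (mean_abs_deviation(shifts, experimental), name)
        for name, shifts in candidates.items()
    ]
    return sorted(scored)


# Invented 13C shifts (ppm), for illustration only.
experimental = [14.1, 22.7, 31.9, 62.5]
candidates = {
    "structure_1": [14.0, 23.0, 32.0, 62.0],
    "structure_2": [18.5, 25.0, 40.2, 70.1],
}

ranking = rank_candidates(candidates, experimental)
for deviation, name in ranking:
    print(f"{name}: {deviation:.2f} ppm")
```

Only the top-ranked one or two candidates would then need the time-consuming quantum-mechanical calculation.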

The Application of Empirical Methods of NMR Chemical Shift Prediction to Determine Relative Stereochemistry.

The reliable determination of stereostructures contained within chemical structures usually requires utilization of NMR data, chemical derivatization, molecular modeling, quantum-mechanical calculations and, if available, X-ray analysis. In this article we show that the number of stereoisomers which need to be thoroughly verified can be significantly reduced by the application of NMR chemical shift calculation to the full stereoisomer set of possibilities using a fragmental approach based on HOSE codes. The usefulness of suggested method is illustrated using experimental data published for artarborol.

Chem4Word Project from Microsoft and Murray-Rust

Following on from my presentation regarding text-mining and document mark-up at the ACS meeting in Philly it was interesting to see the announcement about the Chem4Word project from Microsoft. The project is a collaboration with the Unilever School of Informatics at Cambridge University, specifically working with Peter Murray-Rust and some of his team. The website announcement states: “Microsoft Research is investigating the introduction of chemistry-related features in Microsoft Office Word, including authoring and semantic annotations. Our approach to chemistry authoring will be modeled after the mathematic equation authoring in Word 2007 and will leverage many of the user-interface and XML extensibility options that are provided by Office 2007.

The goal of the Chem4Word project is to enable similar authoring, display, and mining scenarios for chemistry-related information within Office Word. Specifically, we aim to:

  • Provide easy authoring of chemical information within Microsoft Office Word 2007 documents
  • Allow end-user denotation of inline “chemical zones”
  • Render high-quality, print-ready visual depictions of chemical structures
  • Store and expose chemical information in a semantically rich manner to support publishing and mining scenarios, for authors, readers, publishers, and other vendors across the broad chemical information community”

This will be very useful in terms of supporting our efforts to enable the publication process for chemists and we will be watching this project with interest and hope to be engaged in early testing if we are invited.

Retrosynthetic Analysis Presentation at ACS-Philly

I had the pleasure of representing ARChem Route Designer, a retrosynthetic analysis tool from SimBioSys at the American Chemical Society meeting in Philadelphia last week. While I am not an organic synthetic chemist I have done my fair share of syntheses during my BSc and PhD and actually had a bit of a green thumb when it came to purity and yield. When given the opportunity I ran instead at exciting nuclear spins in large magnets (and enjoyed my choice for many years).

Since I was in the commercial sector for over a decade managing the development of chemistry software I have always had an interest in the development of a retrosynthetic analysis tool. It’s a lot of work and requires a deep understanding of organic chemistry. ARChem results from combining the deep understanding of Peter Johnson with the software development skills of SimBioSys. Peter was not able to make it to the ACS meeting in Philadelphia and, since I have had some experience of ARChem as a result of working with SimBioSys over the past few months, I was asked to step in and present on the product.

A link to the presentation is given here. A paper on ARChem has also been submitted if anyone is interested.

The Network of Antony Williams ChemSpiderman

As with most people in the blogosphere I am happily dabbling with different social networking tools and specifically those enabling the scientific community. I have a LinkedIn profile and am starting to participate on ResearchGate and SciLink, as well as others. Tonight I looked at BiomedExperts and created my profile. It’s a very easy-to-use site: it was simple to bring together my publication history from the past few years, and I enjoyed the visualization tools enabling me to see my network (an example is shown below). Check it out.

Publication for People Interested in Computer Assisted Structure Elucidation

Recently I took delivery of a box of reprints of a review article written by our team of Mikhail Elyashberg (ACD/Labs), Gary Martin (Schering-Plough) and myself. It was a major undertaking and took two years of work to final release. The typeset article runs to more than 100 pages. The title is

“Computer Assisted Structure verification and elucidation tools in NMR-based structure elucidation” (doi:10.1016/j.pnmrs.2007.04.003)

The article outline is posted here.

If anyone would like a copy of the article please send me an email at antonyDOTwilliamsATchemspiderDOTcom. I will ask you to cover the costs of shipping via PayPal.

Reviewing Galley Proofs for a Book on Chemical Information Mining

I’ve reviewed the galley proofs for our chapter in a forthcoming book. The chapter is entitled “Automated Identification and Conversion of Chemical Names to Structure Searchable Information” and was co-authored with my friend Andrey Yerin, the project manager for nomenclature products at ACD/Labs. The book is going to be an excellent contribution to this domain and the list of contributing authors includes some of the leaders in this area. I’ll continue to post information as the book gets close to press. It is going to be a highly recommended volume in my opinion.

My most colorful NMR publication ever

I’ve had the pleasure of working with my close friend Gary Martin and my old colleagues from ACD/Labs for the past couple of years on a processing technique for NMR called indirect covariance. Our latest publication was just accepted to the Journal of Heterocyclic Chemistry.

Unsymmetrical Indirect Covariance Processing of Hyphenated and Long-Range Heteronuclear 2D NMR Spectra – Enhanced Visualization of 2JCH and 4JCH Correlation Responses

Abstract

Recent reports have demonstrated the unsymmetrical indirect covariance combination of discretely acquired 2D NMR experiments into spectra that provide an alternative means of accessing the information content of these spectra. The method can be thought of as being analogous to the Fourier transform conversion of time domain data into the more readily interpreted frequency domain. Hyphenated 2D-NMR spectra such as GHSQC-TOCSY, when available, provide an investigator with the means of sorting proton-proton homonuclear connectivity networks as a function of the 13C chemical shift of the carbon directly bound to the proton from which propagation begins. Long-range heteronuclear chemical shift correlation experiments establish proton-carbon correlations via heteronuclear coupling pathways, most commonly across three bonds (3JCH), but in more general terms across two (2JCH) to four bonds (4JCH). In many instances 3JCH correlations dominate GHMBC spectra. We demonstrate in this report the improved visualization of 2JCH and 4JCH correlations through the unsymmetrical indirect covariance processing of GHSQC-TOCSY and GHMBC 2D spectra.

It’s also one of the most colorful publications, with the most correlation arrows (!) I’ve ever been involved with.
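For anyone curious what "unsymmetrical indirect covariance" looks like numerically, here is a toy sketch in Python. It assumes the usual way of writing the idea: two different spectra that share a proton axis are combined as the product of one with the transpose of the other, leaving a carbon-by-carbon map. The matrix orientation (rows as 13C, columns as 1H), the function name `unsymmetrical_indirect_covariance` and every intensity below are assumptions made for illustration, not data or code from the paper.

```python
def unsymmetrical_indirect_covariance(spec_a, spec_b):
    """Return A . B^T for two spectra sharing the same column (1H) axis."""
    if len(spec_a[0]) != len(spec_b[0]):
        raise ValueError("spectra must share the same proton axis length")
    n_protons = len(spec_a[0])
    return [
        [sum(row_a[h] * row_b[h] for h in range(n_protons)) for row_b in spec_b]
        for row_a in spec_a
    ]


# Rows: carbons Ca, Cb, Cc. Columns: protons H1, H2, H3. Values are invented.
hsqc_tocsy = [
    [1.0, 0.6, 0.0],   # Ca: direct response at H1, relayed response at H2
    [0.6, 1.0, 0.0],   # Cb: direct response at H2, relayed response at H1
    [0.0, 0.0, 1.0],   # Cc: direct response at H3 only
]
hmbc = [
    [0.0, 0.0, 0.8],   # Ca: long-range response from H3
    [0.7, 0.0, 0.0],   # Cb: long-range response from H1
    [0.0, 0.5, 0.0],   # Cc: long-range response from H2
]

carbon_carbon = unsymmetrical_indirect_covariance(hsqc_tocsy, hmbc)
for row in carbon_carbon:
    print(row)
```

Because the two input spectra are different, the resulting carbon-carbon map is not symmetric about its diagonal, which is where the "unsymmetrical" in the name comes from.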