Archive Info

You are currently browsing The ChemConnector Blog by Antony Williams weblog archives for the 'Community Building' category.

Optical Structure Recognition, Solubility Prediction and Neutral Parties

There are a few areas of cheminformatics that I watch out of professional interest but, if the truth be known, more out of passion. As an NMR spectroscopist I still follow NMR processing and prediction software, CASE (Computer-Assisted Structure Elucidation) systems, and structure drawing and databasing. Given our recent interest over at ChemSpider in chemical name and structure image recognition, I also watch OSR software developments. OSR is Optical Structure Recognition, the equivalent of OCR for chemical structure images. (Egon and I are both interested in OSR, it seems…)

Probably the best-known OSR system on the market for the past few years is CLiDE, and I have had a chance to work with it, as discussed here. Others are now available, though, specifically ChemOCR from the Fraunhofer Institute. There is also OSRA from the National Cancer Institute and ChemReader from the University of Michigan. I can’t find it now, but there was also Kekule, also funded by the NCI.

As with all software focused on a particular problem, these packages share the same intention…to convert structure images into machine-readable chemical structure formats. The technology approaches are broadly similar but differ, of course, in their implementation. This post isn’t about those differences; it is about how the packages can be compared.

Recently a gauntlet was thrown down with regard to solubility prediction. The question asked was “Can You Predict Solubilities of Thirty-Two Molecules Using a Database of One Hundred Reliable Measurements?”. The details of the challenge are here. What was nice about this is that the results could be judged by independent parties. What made it objective, at least from where I’m sitting, is that experts in the field got to review the data and comment. This is very different from chemistry software vendors comparing each other’s products and standing by their own opinions. I’ve been involved with this myself in terms of NMR prediction comparisons, and these discussions can get rather heated. There was similar “warmth” in the air about a year ago in the OSR domain, as discussed here.

So, with so many efforts in the area of OSR, how can we get independent testing of multiple OSR packages and a true representation of their performance characteristics? Since some packages are commercial while others are Open Source, we would need to separate “packaging” from performance. That calls for a set of objective criteria that separates usability, workflows and interface from algorithms. This doesn’t mean that the former are not important, nay critical, to the success of a software package, BUT the algorithms, the science, the technology should be the focus of the study.

I suggest taking 100-200 images from different sources and applying the various software packages to them to validate performance in a neutral way. The study should be conducted by neutral parties…not so neutral that they don’t care about the work, but neutral in the sense that they are wed only to the outcome of an objective comparison of the OSR algorithms. I have an interest in this so will throw my hat in the ring…I have already done some work on CLiDE and OSRA (1, 2, 3, 4). Who else would be interested?
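To make the idea concrete, here is a minimal sketch of how the scoring step of such a study might look in Python. Everything in it is an assumption for illustration: the package names (package_a, package_b) and image file names are hypothetical, and it assumes each package's output has already been converted by one common tool into a canonical identifier string (InChI is used here) so that a simple string comparison against a curated reference set is fair. It counts only exact matches, which is just one possible criterion.

```python
from dataclasses import dataclass


@dataclass
class Score:
    """Exact-match tally for one OSR package over the reference image set."""
    package: str
    correct: int
    total: int

    @property
    def accuracy(self) -> float:
        return self.correct / self.total if self.total else 0.0


def score_packages(reference, predictions):
    """Rank packages by the fraction of images recognized exactly.

    reference:   image id -> curated identifier string
    predictions: package name -> {image id -> identifier string or None}
    An image the package failed on (missing or None) counts as incorrect.
    """
    scores = []
    for package, results in predictions.items():
        correct = sum(
            1
            for image_id, expected in reference.items()
            if results.get(image_id) == expected
        )
        scores.append(Score(package, correct, len(reference)))
    return sorted(scores, key=lambda s: s.accuracy, reverse=True)


# Hypothetical three-image reference set (ethanol, methane, water).
reference = {
    "img001.png": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "img002.png": "InChI=1S/CH4/h1H4",
    "img003.png": "InChI=1S/H2O/h1H2",
}

# Hypothetical outputs from two packages; package_b failed on one image.
predictions = {
    "package_b": {
        "img001.png": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
        "img002.png": None,
        "img003.png": "InChI=1S/H2O/h1H2",
    },
    "package_a": dict(reference),
}

ranking = score_packages(reference, predictions)
for score in ranking:
    print(f"{score.package}: {score.correct}/{score.total} exact matches")
```

A real study would likely want finer-grained measures as well, for example partial credit for a structure that is wrong by a single atom or bond, but an exact-match tally keeps the comparison independent of each package's interface and workflow.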

The challenges…there are a few:

1) Would all of the OSR producers share their software packages with a neutral panel of reviewers?

2) Who would fund the work? The Solubility Challenge appears to have been funded by Pfizer. How quickly would the work get done without funding…everyone’s busy.

3) How would the panel be selected?

4) Would the work be conducted without all OSR producers participating?

5) About a dozen more concerns…Jonathan Goodman, Robert Glen and John Mitchell could probably give some great advice based on their experience with the Solubility Challenge.

I think this type of comparison needs doing…you?

Chem4Word Project from Microsoft and Murray-Rust

Following on from my presentation regarding text-mining and document mark-up at the ACS meeting in Philly, it was interesting to see the announcement about the Chem4Word project from Microsoft. The project is a collaboration with the Unilever School of Informatics at Cambridge University, specifically with Peter Murray-Rust and some of his team. The website announcement states: “Microsoft Research is investigating the introduction of chemistry-related features in Microsoft Office Word, including authoring and semantic annotations. Our approach to chemistry authoring will be modeled after the mathematic equation authoring in Word 2007 and will leverage many of the user-interface and XML extensibility options that are provided by Office 2007.

The goal of the Chem4Word project is to enable similar authoring, display, and mining scenarios for chemistry-related information within Office Word. Specifically, we aim to:

  • Provide easy authoring of chemical information within Microsoft Office Word 2007 documents
  • Allow end-user denotation of inline “chemical zones”
  • Render high-quality, print-ready visual depictions of chemical structures
  • Store and expose chemical information in a semantically rich manner to support publishing and mining scenarios, for authors, readers, publishers, and other vendors across the broad chemical information community”

This will be very useful in terms of supporting our efforts to enable the publication process for chemists and we will be watching this project with interest and hope to be engaged in early testing if we are invited.

The Network of Antony Williams ChemSpiderman

As with most people in the blogosphere I am happily dabbling with different social networking tools, specifically those enabling the scientific community. I have a LinkedIn profile and am starting to participate on ResearchGate and SciLink, as well as others. Tonight I looked at BiomedExperts and created my profile. It’s a very easy-to-use site. It was simple to bring together my publication history from the past few years, and I enjoyed the visualization tools enabling me to see my network (an example is shown below). Check it out.

Invited Symposium Speaker at a Fortune 500 Company

I’m excited to speak next week at a “by invitation only” symposium at one of the top Fortune 500 companies. The focus of the gathering for the 350 attendees will be “Networks” and I will be speaking about “Crowd-sourcing to Build A Structure-centric Community for Chemists”. I will of course talk about ChemSpider but also about my experiences with Wikipedia Chemistry and other general and scientific networks I have become involved with over the years. I will be speaking alongside invited speakers from organizations such as Yahoo, MIT, General Electric, Brookhaven, Harvard University, etc., so I am quite humbled not only by the invitation but also by the chance to network (appropriate for a gathering about “networks”) with such a diverse group of people. I’m not sure what the situation is regarding releasing the presentation publicly after the gathering, but I will do so following discussions with the organizers. I’m sure it will be acceptable.

