Archive Info

You are currently browsing the ChemConnector Blog by Antony Williams weblog archives for the 'Software' category.

Chemicalize From ChemAxon

If you are a chemist and looking for some useful internet tools to assist you in your work I recommend Chemicalize from ChemAxon. It’s a great addition to the suite of tools that chemists can bring to bear on their problems. The fastest way to learn about Chemicalize is to watch the YouTube video here, and embedded below.

This is a great way to mark up compounds in web pages and then move over to the data pages for predicted properties. The predicted property capabilities are a great offering to the community. The predictions are licensed under Creative Commons “Attribution-NonCommercial-ShareAlike 3.0 Unported”. The site has a few teething troubles, especially in terms of layout on IE8, but this should not detract from the value of the predictions. I am not aware of any other site that will provide free access to pKa predictions as shown below. This will really commoditize the market at this point and shake up the other vendors in this domain. ChemSpider has recently integrated Chemicalize as discussed on the ChemSpider blog.

The Messy World of Even Curated Chemistry on the Internet

Recently I have been spending my night hours looking into the nature of curated chemistry on the internet. Three years ago I made some assumptions around the quality of certain online datasets when they were deposited onto ChemSpider. It was clear that a lot of internet chemistry datasets were “impure”…I think messy, untrustworthy and confused would be a fairer statement! However, there were a number of datasets that were manually curated and, at initial viewing, were higher quality. With time, however, I have become increasingly concerned with some of the datasets that I had originally cited as high quality. Over the next few days/weeks I will examine some of these in detail and highlight some of the issues I am seeing. I want to clarify that all chemical compounds, in terms of their connection tables, their stereochemistry and the association between the compound and the name(s), are assertions. However, there are “norms” for these structures…we would expect a particular structure for aspirin (acetylsalicylic acid), a single structure for Cholesterol and a single structure for Taxol. By the way, the links to Wikipedia are not assertions that the structures that are presently on Wikipedia are correct representations…but I can confirm that PREVIOUSLY I did work to confirm that every one of these was consistent with my investigations to assert the association between the chemical name and the structure. Since then it is possible that someone edited the structure…such is the world of Wikipedia!

Two of the linked data sources I have been investigating of late are DrugBank and the Protein Databank. Both of these are manually curated and are expected to be of high quality. In my discussions with various members of the Life Science industry I have heard many positive comments about these data sources as being trustworthy and high quality. I recently downloaded the DrugBank small molecule set and started looking at it. Let’s take one example…

The DrugBank record DB02309 has the chemical name “5-Monophosphate-9-Beta-D-Ribofuranosyl Xanthine”. The structure on DrugBank is shown below.

The chemical name above is inconsistent with the structure…there is no stereochemistry in the molecule displayed despite the “-D-” in the name. The IUPAC name listed in the DrugBank record is “[(2R,3S,4R,5R)-5-(2,6-dioxo-3,7-dihydropurin-9-ium-9-yl)-3,4-dihydroxyoxolan-2-yl]methyl dihydrogen phosphate” and this clearly does not agree with the displayed structure.

The InChI listed on the record does not include a stereo layer (InChI=1/C10H13N4O9P/c15-5-3(1-22-24(19,20)21)23-9(6(5)16)14-2-11-4-7(14)12-10(18)13-8(4)17/h2-3,5-6,9,15-16H,1H2,(H2,19,20,21)(H2,12,13,17,18)/p+1/fC10H14N4O9P/h11-13,19-20H/q+1). The InChIKey is listed as:

DCTLYFZHFGENCW-NIVOTTSGCB

The Drugbank record links to a structure with full explicit stereochemistry on PubChem here and to the ligand on the PDB ligand database hosted by ChEBI here.

The molfile downloaded from DrugBank has no stereochemistry but lists both Canonical and Isomeric SMILES

Isomeric SMILES O[C@H]1[C@H](COP(O)(O)=O)O[C@H]([C@@H]1O)N1C=[NH+]C2=C1NC(=O)NC2=O
Canonical SMILES OC1C(COP(O)(O)=O)OC(C1O)N1C=[NH+]C2=C1NC(=O)NC2=O
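A quick string-level check of the two SMILES above is sketched here in Python. It tests whether the canonical SMILES is simply the isomeric SMILES with the tetrahedral stereo marks removed. This is a naive regular-expression sketch, not a real SMILES parser: the helper name `strip_tetrahedral_stereo` is my own, and it only handles simple bracket atoms such as `[C@H]` and `[C@@H]`. A cheminformatics toolkit would be the robust way to do this.

```python
import re

# Matches simple chiral bracket atoms such as [C@H] or [C@@H].
# Assumption: no charges, isotopes or atom maps inside the chiral brackets.
STEREO_ATOM = re.compile(r"\[([A-Z][a-z]?)@{1,2}H?\]")

def strip_tetrahedral_stereo(smiles):
    """Replace each chiral bracket atom with its bare element symbol."""
    return STEREO_ATOM.sub(r"\1", smiles)

def count_stereo_atoms(smiles):
    """Count the atoms carrying an @ or @@ mark."""
    return len(STEREO_ATOM.findall(smiles))

# The two strings listed on the DrugBank record.
isomeric = "O[C@H]1[C@H](COP(O)(O)=O)O[C@H]([C@@H]1O)N1C=[NH+]C2=C1NC(=O)NC2=O"
canonical = "OC1C(COP(O)(O)=O)OC(C1O)N1C=[NH+]C2=C1NC(=O)NC2=O"

print(count_stereo_atoms(isomeric))
print(strip_tetrahedral_stereo(isomeric) == canonical)
```

If the comparison prints True, the two strings describe the same connectivity and differ only in the four marked stereocenters, which is what you would expect if the image was drawn from the canonical string.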

It is clear what has happened, I believe…the DrugBank record has used the canonical SMILES to generate the structure image and has neglected the stereochemistry. However, the names carry the original stereochemistry information while the InChI comes from the structure with no stereo. I think that’s what happened. Let’s confirm.

ASSUMING that the isomeric SMILES string carries the appropriate stereochemistry, I can convert it and get the following InChIKey (generated using ACD/ChemSketch) and, using ACD/Name, the name below. I trust ChemSketch and ACD/Name to generate both appropriately as I managed these products while at ACD/Labs for over a decade.

DCTLYFZHFGENCW-NSVMUQOTBF

9-{(2R,3R,4R,5S)-3,4-dihydroxy-5-[(phosphonooxy)methyl]tetrahydrofuran-2-yl}-2,6-dioxo-2,3,6,9-tetrahydro-1H-purin-7-ium

On Drugbank the chemical name listed is:

[(2R,3S,4R,5R)-5-(2,6-dioxo-3,7-dihydropurin-9-ium-9-yl)-3,4-dihydroxyoxolan-2-yl]methyl dihydrogen phosphate

Okay…the names are subtly different. There are three R centers and one S center in each name but they differ, assuming that the nomenclature programs are using consistent numbering schemes. See below.

Name generated from Isomeric SMILES on DrugBank: 2R,3R,4R,5S

Chemical Name on DrugBank: 2R,3S,4R,5R
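The comparison of the two descriptor sets can be sketched in Python as below. The helper names are hypothetical. One assumption is labeled loudly: reading the locants in the two names, the ACD/Name output attaches the purine at ring position 2 ("tetrahydrofuran-2-yl") while the DrugBank name attaches it at position 5 ("5-(…purin…-9-yl)…oxolan-2-yl"), so the ring appears to be numbered from opposite ends. The `renumber_reversed` helper applies that reading (2↔5, 3↔4). Treat it as my interpretation, not something either database states.

```python
import re

def stereo_descriptors(name):
    """Return {locant: 'R' or 'S'} from the first '(2R,3S,...)' group in a name."""
    match = re.search(r"\((\d+[RS](?:,\d+[RS])*)\)", name)
    if match is None:
        return {}
    return {int(tok[:-1]): tok[-1] for tok in match.group(1).split(",")}

def renumber_reversed(desc, low=2, high=5):
    """Assumption: the ring is numbered from the other end (2<->5, 3<->4)."""
    return {low + high - loc: rs for loc, rs in desc.items()}

def differing_locants(a, b):
    """Locants whose R/S label differs between the two descriptor sets."""
    return sorted(loc for loc in a if a[loc] != b.get(loc))

acd_name = ("9-{(2R,3R,4R,5S)-3,4-dihydroxy-5-[(phosphonooxy)methyl]"
            "tetrahydrofuran-2-yl}-2,6-dioxo-2,3,6,9-tetrahydro-1H-purin-7-ium")
drugbank_name = ("[(2R,3S,4R,5R)-5-(2,6-dioxo-3,7-dihydropurin-9-ium-9-yl)"
                 "-3,4-dihydroxyoxolan-2-yl]methyl dihydrogen phosphate")

acd = stereo_descriptors(acd_name)
drugbank = stereo_descriptors(drugbank_name)

# Same numbering assumed, then reversed numbering assumed.
print(differing_locants(acd, drugbank))
print(differing_locants(renumber_reversed(acd), drugbank))
```

Under either numbering assumption two of the four centers disagree, so the mismatch does not look like a pure numbering artifact. It could still be tool-based, so the check is worth repeating with independent software.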

More on this later. Looking at the linked PubChem record gives the following name: [(2R,3S,4R,5R)-5-(2,6-dioxo-3,7-dihydropurin-9-ium-9-yl)-3,4-dihydroxyoxolan-2-yl]methyl dihydrogen phosphate, exactly the same one as listed on DrugBank…so one assumes that the chemical names on DrugBank come from PubChem. Downloading the molfile from PubChem into the same software used to generate InChIs and chemical names gives:

XHDARDSMKMUDDI-XWTUZWARBP

9-{(2R,3R,4S,5R)-3,4-dihydroxy-5-[(phosphonooxy)methyl]tetrahydrofuran-2-yl}-2,6-dioxo-2,3,6,7-tetrahydro-1H-purin-9-ium

DrugBank is linked out to the PDB ligands hosted by ChEBI and looking at the XMP ligand here we see:

[(2R,3S,4R,5R)-5-(2,6-dioxo-3H-purin-7-ium-9-yl)-3,4-dihydroxy-oxolan-2-yl]methyl dihydrogen phosphate

This is the SAME stereochemistry in the chemical name as on DrugBank, but actually a different chemical name. It is definitely possible, and common, for different systematic names to exist for the same chemical but it does indicate the challenges of linking based on different identifiers.

DrugBank: [(2R,3S,4R,5R)-5-(2,6-dioxo-3,7-dihydropurin-9-ium-9-yl)-3,4-dihydroxyoxolan-2-yl]methyl dihydrogen phosphate

PDBeChem: [(2R,3S,4R,5R)-5-(2,6-dioxo-3H-purin-7-ium-9-yl)-3,4-dihydroxy-oxolan-2-yl]methyl dihydrogen phosphate

The InChIKeys between the different databases/tools are:

PDBeChem: DCTLYFZHFGENCW-KWDNBKPHDV

DrugBank: DCTLYFZHFGENCW-NIVOTTSGCB

PubChem: XHDARDSMKMUDDI-UUOKFMHZSA-O (this is a StdInChIKey)

ChemSketch: DCTLYFZHFGENCW-NSVMUQOTBF

ALL four are inconsistent.

If I convert the SMILES string listed on the PDBeChem ligand database using ACD/ChemSketch then

O[C@H]1[C@@H](O)[C@@H](O[C@@H]1CO[P](O)(O)=O)n2c[nH+]c3C(=O)NC(=O)Nc23

produces a structure with stereochemistry of 2R,3R,4S,5R and the InChIKey DCTLYFZHFGENCW-XWTUZWARBW.

The stereochemistry on PDBeChem agrees with that on PubChem (based on the name). The connectivity part of the InChIKey is consistent with all other systems (except PubChem) but the full key is different from all other InChIKeys. It is also possible to download “ideal” and “representative” molfiles from the PDBeChem database.

The InChIKeys between the different databases/tools are now:

PDBeChem: DCTLYFZHFGENCW-KWDNBKPHDV

DrugBank: DCTLYFZHFGENCW-NIVOTTSGCB

PubChem: XHDARDSMKMUDDI-UUOKFMHZSA-O (this is a StdInChIKey)

ChemSketch: DCTLYFZHFGENCW-NSVMUQOTBF

PDBeChem: DCTLYFZHFGENCW-XWTUZWARBW (from Isomeric SMILES converted via ChemSketch)

PDBeChem: DCTLYFZHFGENCW-RDKRKOOMBN (from molfile from PDBeChem, 2R,3R,4S,5R)

DrugBank also links to the Protein Databank here. XMP is listed as a ligand as shown.

The XMP ligand links here to the detailed page containing the information linked below.

Name XANTHOSINE-5′-MONOPHOSPHATE
5′-xanthylic acid
[(2R,3S,4R,5R)-5-(2,6-dioxo-3H-purin-7-ium-9-yl)-3,4-dihydroxy-oxolan-2-yl]methyl dihydrogen phosphate
Synonyms 5-MONOPHOSPHATE-9-BETA-D-RIBOFURANOSYL XANTHINE
Formula C10 H14 N4 O9 P
Molecular Weight 365.21 g/mol
Type NON-POLYMER
Isomeric SMILES (OpenEye) c1[nH+]c2c(n1[C@H]3[C@@H]([C@@H]([C@H](O3)COP(=O)(O)O)O)O)NC(=O)NC2=O
InChI InChI=1/C10H13N4O9P/c15-5-3(1-22-24(19,20)21)23-9(6(5)16)14-2-11-4-7(14)12-10(18)13-8(4)17/h2-3,5-6,9,15-16H,1H2,(H2,19,20,21)(H2,12,13,17,18)/p+1/t3-,5-,6-,9-/m1/s1/fC10H14N4O9P/h11-13,19-20H/q+1
InChI key DCTLYFZHFGENCW-KWDNBKPHDV
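The difference between this InChI and the one on the DrugBank record can be checked mechanically by splitting the strings into layers. The sketch below is a minimal, string-only illustration in Python (the function names are mine). It relies only on the fact that InChI layers are separated by "/" and that the tetrahedral stereo information sits in the layers beginning with t, m and s. It does not validate the InChI in any way.

```python
def inchi_layers(inchi):
    """Split an InChI into its '/'-separated layers, skipping version and formula."""
    parts = inchi.split("/")
    return [layer for layer in parts[2:] if layer]

def stereo_layers(inchi):
    """Return the /t, /m and /s layers as a dict; empty if there is no stereo."""
    return {layer[0]: layer[1:] for layer in inchi_layers(inchi) if layer[0] in "tms"}

def count_stereocenters(inchi):
    """Number of atoms listed in the /t layer (0 if the layer is absent)."""
    t_layer = stereo_layers(inchi).get("t", "")
    return len(t_layer.split(",")) if t_layer else 0

# InChI as listed on the DrugBank record (no stereo layer).
DRUGBANK_INCHI = (
    "InChI=1/C10H13N4O9P/c15-5-3(1-22-24(19,20)21)23-9(6(5)16)14-2-11-4-7(14)"
    "12-10(18)13-8(4)17/h2-3,5-6,9,15-16H,1H2,(H2,19,20,21)(H2,12,13,17,18)"
    "/p+1/fC10H14N4O9P/h11-13,19-20H/q+1"
)
# InChI as listed on the PDB ligand page.
PDB_INCHI = (
    "InChI=1/C10H13N4O9P/c15-5-3(1-22-24(19,20)21)23-9(6(5)16)14-2-11-4-7(14)"
    "12-10(18)13-8(4)17/h2-3,5-6,9,15-16H,1H2,(H2,19,20,21)(H2,12,13,17,18)"
    "/p+1/t3-,5-,6-,9-/m1/s1/fC10H14N4O9P/h11-13,19-20H/q+1"
)

print(stereo_layers(DRUGBANK_INCHI))
print(stereo_layers(PDB_INCHI))
print(count_stereocenters(PDB_INCHI))
```

Apart from the stereo layers the two strings are identical, which fits the idea that the DrugBank InChI was generated from a structure with the stereochemistry stripped.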

The InChIKeys between the different databases/tools are now:

PDBeChem: DCTLYFZHFGENCW-KWDNBKPHDV

DrugBank: DCTLYFZHFGENCW-NIVOTTSGCB

PubChem: XHDARDSMKMUDDI-UUOKFMHZSA-O (this is a StdInChIKey)

ChemSketch: DCTLYFZHFGENCW-NSVMUQOTBF

PDBeChem: DCTLYFZHFGENCW-RDKRKOOMBN (from molfile from PDBeChem)

PDB_ligand: DCTLYFZHFGENCW-KWDNBKPHDV

Aagghhhhh…InChIKeys get very convoluted! What we see is that the chemical structure on PDB and on PDBeChem are the same. This is good news at least! There is a difference in the InChIKeys when I download the molfile but this can be explained easily…and in a later blog post.
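To keep track of which keys agree, it helps to split each InChIKey at the first hyphen and group the sources by the first block and by the full key. The Python sketch below does just that with the six keys listed above. The source labels are my own shorthand, and note the caveat that the PubChem key is a StdInChIKey while the others are not, so comparing its first block with the rest may not be comparing like with like.

```python
from collections import defaultdict

# Keys exactly as listed in the post; labels are shorthand for the sources.
inchikeys = {
    "PDBeChem (record)": "DCTLYFZHFGENCW-KWDNBKPHDV",
    "DrugBank": "DCTLYFZHFGENCW-NIVOTTSGCB",
    "PubChem (StdInChIKey)": "XHDARDSMKMUDDI-UUOKFMHZSA-O",
    "ChemSketch": "DCTLYFZHFGENCW-NSVMUQOTBF",
    "PDBeChem (molfile)": "DCTLYFZHFGENCW-RDKRKOOMBN",
    "PDB ligand": "DCTLYFZHFGENCW-KWDNBKPHDV",
}

def first_block(key):
    """The part of the key before the first hyphen."""
    return key.split("-", 1)[0]

def group_sources(keys, by):
    """Group source labels by by(key); by is a function applied to each key."""
    groups = defaultdict(list)
    for source, key in keys.items():
        groups[by(key)].append(source)
    return dict(groups)

by_first_block = group_sources(inchikeys, first_block)
by_full_key = group_sources(inchikeys, lambda key: key)

for block, sources in by_first_block.items():
    print(block, sources)
for key, sources in by_full_key.items():
    if len(sources) > 1:
        print("identical full key:", key, sources)
```

Grouping by first block separates PubChem from everything else, and grouping by full key shows that only the PDBeChem record and the PDB ligand share an identical key.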

We believe that the structure on PDB should be expected to be correct. We will assert this.

We expect that DrugBank is sourcing the chemical from PDB to add to their database. The chemical structure on DrugBank should coincide with that from PDB. Unfortunately the SMILES on PDB and DrugBank differ in two stereocenters. We don’t know why. Why the inconsistency? If the DrugBank data aren’t from PDB for the XMP ligand where did they come from?

Did PubChem pick up the structure of XMP from the PDB Database or from DrugBank? Let’s see. If I download the 2D molfile from PubChem and generate the chemical name and InChIs I get consistency…PubChem IS consistent with PDB. It is NOT consistent with DrugBank despite the fact that DrugBank is linked into this PubChem record.

This is a very convoluted, and maybe confusing analysis of ONE compound on DrugBank. I have looked at dozens and see similar issues. Assuming that PDB is the source database for data on DrugBank why are the structures differing so much? There are worse examples to come…the linking together of data on the web between even curated databases is an incredible mess.

Caveat: This is detailed and challenging work. I encourage anyone to check my work, see if I missed anything, and confirm or challenge the observations, as some of the issues I am seeing can be tool-based…the software tools I use may have issues with SMILES conversion, molfile or SDF reading etc. It is exacting to check chemical structures…

Fail Fast Despite the Hype – A Model from Google Wave

I’ve been to SciFoo twice. Both times were great. I didn’t get to go this year…and I am sad, not angry, that I wasn’t invited. It is terrific that other people, new and old attendees, got to share in the wealth of experience that makes up SciFoo. I hope that it continues and I hope I get to go again.

The first time I went the Google Datasets project was announced. It seemed like a great offer to make to the scientific community. There clearly wasn’t enough participation for the effort as the project was promptly killed.

The next time I went back to SciFoo, Google Wave received a lot of attention. Cameron Neylon helped integrate ChemSpider into Wave with ChemSpidey and the potential of Google Wave exploded across the internet as Google’s next big win. I thought the technology was “cool” and interesting, but also a technology looking for a problem, and “noisy”…it was very distracting and difficult for me personally to adopt into my daily work. I did play with it, worked on a couple of projects with some colleagues and conceived of how we would use some of the functions.

And now Google Wave is winding down…and I take this comment to heart: “…despite these wins, and numerous loyal fans, Wave has not seen the user adoption we would have liked. We don’t plan to continue developing Wave as a standalone product.” Basically they have learned some lessons, probably got some very nice capabilities to plug in elsewhere later, and have decided to stop investing. I’d love to know what their process was to come up with this decision. Wave was a massive story in the media…and well executed in terms of marketing the story up. How many companies are this clean with an announcement in terms of killing a project of this size, making a tight blog post on the company blog? It’s surprising to see it happen this way, but I have to respect them for the style of pulling the plug and failing fast. There are lots of other companies who would continue to invest, fearful of the fallout of pulling the plug on a high profile project. Good for you Google…it’s a shame it didn’t work…I DID like pieces of the technology but overall I wasn’t an adopter. But thanks for this: “The central parts of the code, as well as the protocols that have driven many of Wave’s innovations, like drag-and-drop and character-by-character live typing, are already available as open source, so customers and partners can continue the innovation we began”. The community will probably take them!

How are NMR Prediction Algorithms and AFM Related?

There’s a really nice News piece over on Nature News regarding “Feeling the Shapes of Molecules”. The work reports on how Atomic Force Microscopy is being used to deduce chemical structure directly, one molecule at a time. It is, quite simply, stunning. This work is an extension of the original work reported on pentacene that many scientists thought was spectacular. This work is even one step closer to the dream of single molecule structure identification. The work is entitled “Organic structure determination using atomic-resolution scanning probe microscopy” and, as well as the IBM group responsible for the AFM work, involves Marcel Jaspars, someone whose work I have watched for many years as I am trained as an NMR spectroscopist and have spent a lot of time working on computer-assisted structure elucidation (CASE) approaches to examine natural product structures (see references in here…).

The molecule that they studied was cephalandole A, which had previously been mis-assigned. Interestingly my old colleagues from ACD/Labs, where I worked for over a decade, and I had published an article in RSC’s Natural Product Reviews where we studied “Structural revisions of natural products by Computer-Assisted Structure Elucidation (CASE) systems”. The basic premise of the article is that there are incorrect structures making it into the literature because of the misinterpretation of the analytical data and that computer algorithms, specifically NMR prediction and CASE algorithms, can be used to rule out structures elucidated by the scientists. It is hard to do justice to the entire review article as we detail the approaches to CASE and NMR prediction, and doing it in a blog post is tough. So, I do recommend reading the NPR article. However, I am extracting the part that applies to the elucidation of the structure of cephalandole A and how algorithms would be of value in negating the incorrect structure.

“In 2006 Wu et al isolated a new series of alkaloids, particularly cephalandole A, 16. Using 2D NMR data (not tabulated in the article) they performed a full 13C NMR chemical shift assignment as shown on structure 16.

Mason et al synthesized compound 16 and after inspection of the associated 1H and 13C NMR data concluded that the original structure assigned to cephalandole A was incorrect. The synthetic compound displayed significantly different data from those given by Wu et al. The 13C chemical shifts of the synthetic compound are shown on structure 16A.

Cephalandole A was clearly a closely related structure with the same elemental composition as 16, and structure 17 was hypothesized as the most likely candidate. Compound 17 was described in the mid 1960s and this structure was synthesized by Mason et al. The spectral data of the reaction product fully coincided with those reported by Wu et al. The true chemical shift assignment is shown in structure 17. For clarity the differences between the original and revised structures are shown in Figure 17.


We expect that 13C chemical shift prediction, if originally performed for structure 16, would encourage caution by the researchers (we found dA=3.02 ppm). Figure 18 presents the correlation plots of the 13C chemical shift values predicted for structure 16 by both the HOSE and NN methods versus experimental shift values obtained by Wu et al. The large point scattering, the regression equation, the low R2 =0.932 value (an acceptable value is usually R2 = 0.995) and the significant magnitude of the g-angle between the correlation plot and the 45-grade line (a visual indication for disagreement between the experiment and model) could indicate inconsistencies with the proposed structure and should encourage close consideration of the structure. Our experience has demonstrated that a combination of warning attributes can serve to detect questionable structures even in those cases when the StrucEluc system is not used for structure elucidation.

Figure 18. Correlation plots of the 13C chemical shift values predicted for structure 16 by HOSE and NN methods versus experimental shift values obtained by Wu et al. Extracted statistical parameters: R2(HOSE)=0.932, dHOSE=1.20dexp-25.6.
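For readers who want to run this kind of sanity check themselves, here is a minimal Python sketch of the statistics mentioned in the excerpt: the regression line of predicted versus experimental shifts, R2, the angle between that line and the 45-degree line, and a mean absolute deviation. The function names are mine and the shift values in the example are made up purely for illustration. They are NOT the cephalandole A data, and the exact definitions used in the review (for dA, for instance) may differ from these textbook ones.

```python
import math

def linear_fit(x, y):
    """Least-squares line y = slope * x + intercept, plus R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

def angle_from_diagonal(slope):
    """Angle in degrees between the fitted line and the 45-degree line."""
    return math.degrees(math.atan(slope)) - 45.0

def mean_abs_deviation(predicted, experimental):
    """Average of |predicted - experimental| over all assigned carbons."""
    pairs = list(zip(predicted, experimental))
    return sum(abs(p - e) for p, e in pairs) / len(pairs)

# Made-up 13C shifts in ppm, for illustration only.
experimental = [20.0, 55.0, 110.0, 128.0, 160.0]
predicted = [22.0, 52.5, 113.0, 126.0, 164.5]

slope, intercept, r2 = linear_fit(experimental, predicted)
print(f"d_pred = {slope:.2f} * d_exp + {intercept:.1f}, R2 = {r2:.3f}")
print(f"angle from 45-degree line: {angle_from_diagonal(slope):.2f} degrees")
print(f"mean absolute deviation: {mean_abs_deviation(predicted, experimental):.2f} ppm")
```

A slope far from 1, a large intercept or a low R2 are the kind of warning attributes the excerpt describes. None of them proves a structure wrong on its own.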

So, for those NMR jocks who don’t have access to the genius of IBM scientists performing AFM, and yet want to have tools to help in the elucidation process, you’d be doing well to use NMR prediction algorithms and CASE systems to help…it’s rather embarrassing to have to issue a retraction on a paper with your name on it.

Meanwhile I am in awe of the work reported by Marcel and his colleagues at IBM. Clearly there’s a long way to go before such approaches are mainstream but the flag is in the sand…this is where things will speed up and we are surely destined, I hope (!) to see many more reports of this type of work and how it is progressing. Let’s hope. Feedback on the NPR article welcomed!!!

Organic structure determination using atomic-resolution scanning probe microscopy

Good Science Takes Time: 16 months to examine NMR Prediction Performance

In October 2007 I got involved in an exchange with Peter Murray-Rust from Cambridge University about Open Notebook NMR. The original post is here and my response is here. The basic premise of the exchange was that I believed that quantum-mechanical NMR predictions had a lot of limitations relative to empirical predictions. I made the comment based on over two decades working in NMR – the first decade managing a number of NMR laboratories and the second decade involved in the delivery of commercial software solutions, including NMR predictions, to the marketplace.

In my original response I stated “This has the potential to be a very exciting project. While I wouldn’t write the paper myself without doing the work I’ll certainly try the approach. Let’s see what the truth is. The challenge now is to get to agreement on how to compare the performance of the algorithms. We are comparing very different beasts with the QM vs. non-QM approaches so, in many ways, this should be much easier than the challenges discussed so far around comparing non-QM approaches between vendors.” and asked Peter to participate in a collaboration with us to do the comparison.

I then posted the blogpost below. It is included in its entirety as it defines what my thought process was almost two years ago and the approach that could be taken. In the blogpost I address a post directly to Peter. If you know the story then go past the history to the conclusions where I discuss the conclusion of the work we have done since this discussion started.

“Previously I blogged about “An Invitation to Collaborate on Open Notebook Science for an NMR Study“. I judged it was a great opportunity to “help build a bridge between the Open Data community, the academic community and the commercial software community for the benefit of science.” In particular I believe the project offers an opportunity to answer a longstanding question I have had. Specifically, I have seen a lot of publications in recent years utilizing complex, time-consuming GIAO NMR predictions. Having been involved with the development of NMR prediction algorithms for the past few years (while working with the scientists at ACD/Labs) my judgment is that these complex calculations can be replaced by calculations which can take just a couple of seconds on a standard PC. I believe this to be true for most organic molecules. I do not believe such calculations would outperform GIAO predictions for inorganic molecules or organometallic complexes or solid state shift tensors. However, there has never been a rigorous examination comparing performance differences. I believe this project offered an excellent opportunity to validate the hypothesis that HOSE code/Neural Network/Increment based predictions could, in general, outperform GIAO predictions.

The study was to be performed on the NMRShiftDB now available on ChemSpider. I’ve blogged previously about the validation of the database (1,2). The conversation about the NMR project has continued and Peter has talked about some of the challenges about open Notebook Science based on Cameron Neylon’s comments. I’ve posted the comments below to the post and they will likely be moderated in shortly. I post them here for the purpose of conclusion since I don’t think my original hopes will come to fruition. Thanks to those of you who have been engaged both on and off blog. I suggest we all help with Peter’s intention to help explain identifiers that are being extracted in the work.

“Can you provide some more details regarding your concerns here: “it would be possible for someone to replicate the whole work in a day and submit it for publication (on the same day) and ostensibly legitimately claim that they had done this independently. They might, of course use a slightly different data set, and slightly different tweaks.”

I have two interpretations:

1) Someone could repeat the GIAO calculations in a day and identify outliers and submit for publication

2) Someone could do the calculations using other algorithms and identify outliers etc and submit for publication

Maybe you mean something else?

For 1) the GIAO calculations CANNOT be repeated since no one has access to Henry’s algorithms and based on your comments he is modifying them on an ongoing basis as a result of this work. Even if they did have their own GIAO calculations, unless they have improved the performance dramatically or have access to a “boat load” of computers the calculations will take weeks (based on your own estimates). That said, comparing one GIAO algorithm to another is valid science and absolutely appropriate and publishable. Also, if they had used the same dataset as you, with another algorithm to check prediction and identify outliers, it WOULD be independent. Related to the work you are doing for sure but independent.

For 2) using other algorithms on the same dataset is valid and appropriate science. This is what people do with logP prediction (or MANY other parameters)…they validate their algorithms on the same dataset many times over. It’s one of the most common activities in the QSAR and modeling world in my opinion. And people do use slightly different tweaks…it’s one of the primary manners to shift the algorithms. Henry’s doing this right now to deal with halogens according to your earlier post. Wolfgang Robien at University of Vienna, ACD/Labs and others use their own approaches but both at a minimum can use HOSE code and Neural Networks. Same general approaches with tweaks. They give different results…all is appropriate science.

Returning to the comment “it would be possible for someone to replicate the whole work in a day and submit it for publication (on the same day) and ostensibly legitimately claim that they had done this independently.”

Wolfgang Robien has taken the NMRShiftDB dataset and performed an analysis. It’s posted here. ACD/Labs performed a similar analysis as discussed on Ryan’s blog here. One of the outputs is this document. This resulted in further exchanges and dialog. The parties have discussed this on the phone and face to face with Ryan talking with Wolfgang recently in Europe at a conference.

This was heated and opinionated for sure. STRONG scientific wills and GREAT scientists defending their approaches and performance. Wolfgang is NOT an enemy for ACD/Labs…he has made some of the greatest contributions to the domain of NMR prediction and, in many ways, has been one to emulate in terms of his approach to quality and innovation to create breakthroughs in performance. He is a worthy colleague and drives improvement by his ongoing search for improvements in his own algorithms. I honor him.

The bottom line is this: approaches for the identification of outliers in NMRShiftDB have been DONE already. It’s been discussed online for months…just do a search on “Robien NMRshiftDB” on google or “ACD/Labs nmrshiftdb”. There are hundreds of pages. We/I just published on the validation of the NMRShiftDB. I blogged about it and you posted it here. Feedback on outliers has been returned to Christoph and changes made already. So in many ways you are doing repeat work, just using a different algorithm and identifying new outliers. Neither ACD/Labs’ nor Wolfgang’s work was exhaustive. It was very much a first cut but did help edit many records already. NO DOUBT you will find new outliers.

I’ve gone back to the original post and extracted two purposes to the work:

1) To perform Open Notebook Science

2) quote “To show that the philosophy works, that the method works, and that NMRShiftDB has a measurable high-quality.”

1) has already changed and is an appropriate outcome from the work. (http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=743)

2) The method of NMR prediction applied to NMRShiftDB to prove quality…high or not…has been done already. Wolfgang and ACD/Labs did it already. I judge you’ll have similar conclusions…it’s the same dataset.

Stated here http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=737 is “We shall continue on the project, one of whose purposes is to investigate the hypothesis that QM calculations can be used to evaluate the quality of NMR spectra to a useful level.” It’s a valid investigation and this is testing whether QM can provide good predictions. This is of course known already from the work done by Rychnovsky on hexacyclinol.

To summarize:

1) Using NMR predictions to identify outliers – already done (Robien and ACD/Labs)

2) Validating that GIAO predictions are useful to validate structures – already done (hexacyclinol study)

3) Validating the quality of NMRShiftDB – already done (Robien, ACD/Labs)

All this brings me down to what I “think” are the intentions or outcomes for the project at this point..but I likely have missed something..

1) Identify more outliers that were not identified by the studies of others

2) Deliver back to Christoph and the NMRShiftDB team a list of outliers/concerns/errors with annotations/metadata in order to improve the Open Data source of NMRShiftDB

3) Allow Nick Day to use a lot of what was learned delivering CrystalEye for a second application around NMR and useful for his thesis (A VERY valid goal..good luck Nick)

4) Show the power of blogging to drive Collaboration via Open Collaborative NMR

Some additional project deliverables I think include:

1) make online GIAO NMR predictions available

The project deliverables you are working on are defined here and I believe are consistent: http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=742

* create a small subset of NMRShiftDB which has been freed from the main errors we – and hopefully the community – can identify.

* Use this to estimate the precision and variance of our QM-based protocol for calculating shifts.

* refine the protocol in the light of variance which can be scientifically explained.

What I still would like to see, BUT this project belongs to you/Henry/Nick of course and you define what it is, is:

1) to “help build a bridge between the Open Data community, the academic community and the commercial software community for the benefit of science.” Wolfgang is in academia, so are you, ACD/Labs is commercial and I’m independent (but of course am associated with ChemSpider…I am an NMR spectroscopist…it’s why I’m interested)

2) To validate the performance of GIAO vs HOSE/NN/Inc by providing the final dataset that you used and statistics of performance for GIAO on that dataset. I’d like to publish the results jointly, if you would be willing to work with the “dark side”

3) To identify where GIAO can outperform the HOSE/NN/Inc approaches

Wolfgang also has thoughts where he says “What would be great to the scientific community: Do calculations on compounds where sophisticated NMR-techniques either fail or are very difficult to perform – e.g. proton-poor compounds or simply ask for a list of compounds which are really suspicious (either the structure is wrong or the assignment is strange, but the puzzle can’t be solved, because the compound is not available for additional measurements).”

I’ve put a lot of effort into blogging onto this project over the past few days. I’m about to invest some time in making sure that you get information about outliers so you are not doing repeat work. I judge that my hopes for deeper collaboration will remain unfulfilled so I’ll give up on asking.

I’ll do what I can to help from this point forward and keep my own rhetoric off of this blog and restrain it to ChemSpider so as to not distract your readers. I look forward to helping for the benefit of the community.

While I was at ACD/Labs I worked with a number of truly excellent scientists. These people were at the forefront of developing NMR prediction technologies as well as Computer Assisted Structure Elucidation (CASE) software. Over the past year and a half I have had the privilege of continuing some of the work I was involved with while at ACD/Labs and our publication regarding “Empirical and DFT GIAO quantum-mechanical methods of 13C chemical shifts prediction: competitors or collaborators?” was released recently. The abstract states:

“The accuracy of 13C chemical shift prediction by both DFT-GIAO quantum-mechanical (QM) and empirical methods was compared using 205 structures for which experimental and QM-calculated chemical shifts were published in the literature. For these structures, 13C chemical shifts were calculated using HOSE code and neural network (NN) algorithms developed within our laboratory. In total, 2531 chemical shifts were analyzed and statistically processed. It has been shown that, in general, QM methods are capable of providing similar but inferior accuracy to the empirical approaches, but quite frequently they give larger mean average error values. For the structural set examined in this work, the following mean absolute errors (MAEs) were found: MAE(HOSE) = 1.58 ppm, MAE(NN) = 1.91 ppm and MAE(QM) = 3.29 ppm. A strategy of combined application of both the empirical and DFT GIAO approaches is suggested. The strategy could provide a synergistic effect if the advantages intrinsic to each method are exploited.”

The conclusion includes the following statements: “It has been shown that, in general, QM methods are capable of providing similar but inferior accuracy to the empirical approaches, but quite frequently they give larger mean average error values. This is accounted for mainly with difficulties in selecting the appropriate calculation protocols and difficulties arising from molecular flexibility. The data show that the average accuracy of the QM methods is 1.5–2 times lower than the accuracy shown by the empirical methods. For the structural set examined in this work, the following MAEs were found: MAE(HOSE) = 1.58 ppm, MAE(NN) = 1.91 ppm, MAE(QM) = 3.29 ppm.”
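For readers who want to see how the MAE figures quoted above are formed, here is a minimal Python sketch. The function name and the sample shifts are my own, made up for illustration, and are not data from the paper. The last lines simply take the ratios of the three MAE values quoted in the conclusion.

```python
def mean_absolute_error(predicted, experimental):
    """Average of |predicted - experimental| over paired chemical shifts (ppm)."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    total = sum(abs(p - e) for p, e in zip(predicted, experimental))
    return total / len(predicted)


# Made-up 13C shifts (ppm) for illustration only; NOT data from the paper.
experimental = [128.5, 77.0, 21.5, 170.5]
predicted = [129.5, 75.0, 21.5, 171.5]
example_mae = mean_absolute_error(predicted, experimental)
print(f"example MAE = {example_mae:.2f} ppm")

# MAE values as quoted from the publication.
quoted_mae = {"HOSE": 1.58, "NN": 1.91, "QM": 3.29}
ratios = {m: quoted_mae["QM"] / quoted_mae[m] for m in ("HOSE", "NN")}
for method, ratio in ratios.items():
    print(f"MAE(QM) / MAE({method}) = {ratio:.2f}")
```

With the quoted values, the QM error comes out at roughly 1.7 times the NN error and roughly 2.1 times the HOSE error.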

In order to demonstrate that empirical approaches outperform QM methods in general we examined 2531 chemical shifts associated with 205 molecules. It was a rather complete study! It took a long time to do the work but it wasn’t done as Open Notebook NMR. It’s published in Magnetic Resonance in Chemistry here: DOI: 10.1002/mrc.2571. Enjoy!

Optical Structure Recognition, Solubility Prediction and Neutral Parties

There are a few areas of cheminformatics that I watch out of professional interest but more out of passion if the truth be known. As an NMR spectroscopist I still watch NMR processing and prediction software, CASE systems (Computer assisted structure elucidation), structure drawing and databasing, and, in regards to our recent interest over at ChemSpider regarding chemical name and structure image recognition, I watch OSR software developments. OSR is Optical Structure Recognition, the equivalent of OCR for chemical structure images. (Egon and I are both interested in OSR it seems…)

Probably the best known OSR system on the market for the past few years is CLiDE and I have had a chance to work with it as discussed here. There are now others available, though, specifically ChemOCR from the Fraunhofer Institute. There is also OSRA from the National Cancer Institute and ChemReader from the University of Michigan. I can’t find it now but there was also Kekule, also funded by the NCI.

As with all software focused on a particular problem, these packages share the same intention…to convert structure images into machine-readable chemical structure formats…but they differ in their technology and implementation. This post isn’t about those differences; it is about how the packages can be compared.

Recently a gauntlet was thrown down in regards to solubility prediction. The question asked was “Can You Predict Solubilities of Thirty-Two Molecules Using a Database of One Hundred Reliable Measurements?” The details of the challenge are here. What was nice about this is the fact that the results could be judged by independent parties. What was objective, at least from where I’m sitting, is that experts in the field got to review the data and comment. This is very different from chemistry software vendors comparing each other’s products and standing by their own opinions. I’ve been involved with this myself in terms of NMR prediction comparisons and these discussions can get rather heated. There was similar “warmth” in the air about a year ago in the OSR domain as discussed here.

So, with so many efforts in the area of OSR, how can we get independent testing of multiple OSR packages and a true representation of their performance characteristics? Since some packages are commercial while others are Open Source we would need to separate “packaging” from performance. That calls for a set of objective criteria separating usability, workflows and interface from algorithms. This doesn’t mean that the former are not important, indeed they are critical to the success of a software package, BUT the algorithms, the science, the technology should be the focus of the study.

I suggest taking 100-200 images from different sources and applying the various software packages to validate performance in a neutral way. The study should be conducted by neutral parties…not so neutral that they don’t care about the work but neutral in the sense that they are committed to an objective comparison of the OSR algorithms. I have an interest in this so will throw my hat in the ring…I have already done some work on CLiDE and OSRA (1, 2, 3, 4). Who else would be interested?
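To make the proposal concrete, here is a rough Python sketch of how such a comparison could be scored once each package has been run over the image set. Everything in it is an assumption for illustration: the package names are placeholders, and it assumes each package’s output has already been converted to a canonical identifier (here a standard InChI string) so that a plain string comparison is meaningful.

```python
def score_packages(reference, results):
    """Return {package: fraction of images recognised correctly}.

    reference: {image_id: identifier of the correct structure}
    results:   {package: {image_id: identifier produced, or None on failure}}
    A missing or None entry counts as a failure.
    """
    if not reference:
        raise ValueError("reference set is empty")
    scores = {}
    for package, output in results.items():
        correct = sum(
            1 for image_id, truth in reference.items()
            if output.get(image_id) == truth
        )
        scores[package] = correct / len(reference)
    return scores


# Tiny illustrative reference set (a real study would use 100-200 images).
reference = {
    "img001": "InChI=1S/CH4/h1H4",                  # methane
    "img002": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",  # ethanol
    "img003": "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H", # benzene
}

# Hypothetical outputs; "package_a" and "package_b" are placeholder names.
results = {
    "package_a": dict(reference),
    "package_b": {
        "img001": "InChI=1S/CH4/h1H4",
        "img002": "InChI=1S/C2H6/c1-2/h1-2H3",  # misread as ethane
        "img003": None,                         # no structure returned
    },
}

scores = score_packages(reference, results)
for package, fraction in sorted(scores.items()):
    print(f"{package}: {fraction:.0%} correct")
```

An exact-match score like this deliberately ignores usability and workflow, and a fuller study would probably also want partial credit for near misses such as a single wrong stereocentre.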

The challenges…there are a few:

1) Would all of the OSR producers share their software packages with a neutral panel of reviewers?

2) Who would fund the work? The Solubility Challenge appears to have been funded by Pfizer. How quickly would it get done without funding…everyone’s busy.

3) How would the panel be selected?

4) Would the work be conducted without all OSR producers participating?

5) About a dozen more concerns….probably Jonathan Goodman, Robert Glen and John Mitchell could give some great advice based on their experience with the Solubility Challenge.

I think this type of comparison needs doing…you?

Chem4Word Project from Microsoft and Murray-Rust

Following on from my presentation regarding text-mining and document mark-up at the ACS meeting in Philly it was interesting to see the announcement about the Chem4Word project from Microsoft, a collaboration with the Unilever School of Informatics at Cambridge University, specifically working with Peter Murray-Rust and some of his team. The website announcement states: “Microsoft Research is investigating the introduction of chemistry-related features in Microsoft Office Word, including authoring and semantic annotations. Our approach to chemistry authoring will be modeled after the mathematic equation authoring in Word 2007 and will leverage many of the user-interface and XML extensibility options that are provided by Office 2007.

The goal of the Chem4Word project is to enable similar authoring, display, and mining scenarios for chemistry-related information within Office Word. Specifically, we aim to:

  • Provide easy authoring of chemical information within Microsoft Office Word 2007 documents
  • Allow end-user denotation of inline “chemical zones”
  • Render high-quality, print-ready visual depictions of chemical structures
  • Store and expose chemical information in a semantically rich manner to support publishing and mining scenarios, for authors, readers, publishers, and other vendors across the broad chemical information community”

This will be very useful in terms of supporting our efforts to enable the publication process for chemists and we will be watching this project with interest and hope to be engaged in early testing if we are invited.

Publication for People Interested in Computer Assisted Structure Elucidation

Recently I took delivery of a box of reprints of a review article written by our team of Mikhail Elyashberg (ACD/Labs), Gary Martin (Schering-Plough) and myself. It was a major undertaking and took two years of work to final release. The typeset article runs to more than 100 pages. The title is

“Computer Assisted Structure verification and elucidation tools in NMR-based structure elucidation” (doi:10.1016/j.pnmrs.2007.04.003)

The article outline is posted here.

If anyone would like a copy of the article please send me an email at antonyDOTwilliamsATchemspiderDOTcom. I will ask you to cover the costs of shipping via paypal.

Hamburger PDFs and Making Them Structure Searchable

There have been numerous conversations about “Hamburger PDFs” over the months and the most recent exchange is that between Chris Rusbridge and Peter Murray-Rust. Another conversation that I have seen go on has been about making Word documents structure searchable (I cannot track down the appropriate blog postings at present).

This is just an FYI comment for the community really, since there is a general assumption that Word documents and PDFs cannot be made structure-searchable. The truth is that both can be. How? Well, you need to write the correct information into the file to enable it, but it’s possible. There are a number of solutions out there allowing structure-based searching of Word document files. I believe the first one was originally from Oxford Molecular before it was acquired by Accelrys. I think there are now several, including, I believe, CambridgeSoft, ACD/Labs and probably others.

The only PDF structure searching capability I am aware of is that created by ACD/Labs a few years ago. Their website states “Our Search for Structure system allows you to seek out chemical structures in various file formats throughout your computer’s file systems. These formats include: SK2, MOL, SDF, SKC, CHM, CDX, RXN, and PDF (Adobe Acrobat); DOC (Microsoft Word), XLS (Microsoft Excel), and PPT (Microsoft PowerPoint), and ACD/Labs databases: CUD, HUD, CFD, NDB, ND5, and INT.”

For PDF it was required that structure files were “tagged” appropriately when written to PDF by an embedded PDF generation capability. Since the PDF format can be extended, ACD/Labs did so. If we wanted to make the majority of PDF files structure searchable then it seems as if the appropriate thing to do would be to extend the general PDF format for Life Sciences, talk to Adobe about including the capabilities in their tools and get the publishers to support it. OK, there are details…but why isn’t anyone talking about extending PDF to support structures in this way? It was already proven, years ago.
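As a toy illustration of what “writing the correct information into the file” can buy you, the Python sketch below scans raw file bytes for embedded InChI strings. This is NOT the ACD/Labs tagging scheme, which is not described here; the `%STRUCT` marker is a hypothetical layout of my own, and the sketch assumes the identifiers are stored as plain, uncompressed, whitespace-delimited text, which is often not true of real PDF streams.

```python
import re

# Matches standard ("1S") or non-standard ("1") InChI strings in raw bytes.
INCHI_PATTERN = re.compile(rb"InChI=1S?/[A-Za-z0-9.+\-/(),;*?]+")


def find_inchis(data: bytes):
    """Return the sorted, de-duplicated InChI strings found in a byte blob."""
    return sorted({m.group(0).decode("ascii") for m in INCHI_PATTERN.finditer(data)})


def contains_structure(data: bytes, inchi: str) -> bool:
    """Exact-structure lookup: is this InChI embedded in the file contents?"""
    return inchi in find_inchis(data)


# Hypothetical file contents with two tagged structures (made-up layout).
sample = (
    b"some document header\n"
    b"%STRUCT InChI=1S/CH4/h1H4\n"
    b"body text about ethanol\n"
    b"%STRUCT InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3\n"
)

found = find_inchis(sample)
print(found)
print(contains_structure(sample, "InChI=1S/CH4/h1H4"))
```

This only covers exact-structure lookup. Substructure searching of the kind ACD/Labs describes would need a real cheminformatics toolkit behind it.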

Next thing will be that structures will be getting embedded into Word documents and made searchable as if it is something novel. It’s been done many times already. The ACD/Labs website states “Microsoft Word documents with structures created in ChemDraw or MDL ISIS can also be retrieved. Not only can you perform exact structure searches, but you can also search by substructure. Added options allow you to preview search results, open search result documents in ChemSketch as well as in other applications, and store search results for later access.” There are other products doing this too.

Strangely people don’t seem to know about these capabilities. They will…as we move forward to index the web for structures we hope to build the capabilities to search structures inside Word documents directly.

Spaces, Dashes and Issues with Nomenclature Conversion

I’ve been involved with nomenclature in one way or another for well over a decade. While I’m an NMR spectroscopist by training (as evidenced by the >100 publications in this area), during my decade-long tenure at ACD/Labs I learned a lot about PhysChem parameters and their prediction, systematic nomenclature, structure drawing and databasing, chemometrics, LC-MS data analysis and so on. As the product manager for many of these products I was dropped in the deep end. Nomenclature was something I really enjoyed. While I am not a nomenclature specialist at the “generate a perfect systematic name for Taxol” level, I have a decade of experience working with nomenclature software both for generating names from structures and for generating structures from names. Having worked with 100s of customers and their needs I’ve dealt with a lot of beliefs around nomenclature and perceptions of how to use the tools.

Having just spent the week at Bio-IT and having been engaged with a number of conversations about Name to Structure conversion, it became clear that one of the prevailing beliefs for users of name to structure conversion packages is that spaces in systematic names can be disregarded. It appears that members of the text-mining for chemistry community are using one or more of the commercial name to structure software programs to convert chemical names to structures and, prior to feeding the algorithms, they are removing all white spaces from the names. They are also doing the same, in some cases, with dashes. How well is that going to work? Is it safe to remove spaces from chemical names and assume this has no effect? Is consideration being given more to the accuracy of the text-mining than to the nature of systematic nomenclature?

Let’s look at some examples of the result of removing spaces from chemical names. Consider the different results just from moving a space.

The impact of spaces on naming

Single structure to separate components based on a space.

Another example of multiple to single component structure.

Another example of space-collapsing structure searching

Clearly there is an impact of removing spaces from systematic names. The same is true of random removal and insertion of dashes. The generation of systematic names by chemists is far from ideal as discussed by Gernot Eller here. The mishandling of correct names when reverting back to structures is one more problem layer. There are many of us using text mining and name to structure conversion to link between documents and structures. It is far from a minor undertaking.
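The effect is easy to demonstrate in a few lines of Python. The sketch below mimics the whitespace-stripping preprocessing step and reports which distinct names collapse into the same string. The example names are my own rather than the ones in the figures above: “phenyl acetate” is the acetate ester of phenol, whereas “phenylacetate” refers to the anion (or ester stem) of phenylacetic acid, so the two describe different structures.

```python
import re
from collections import defaultdict


def strip_spaces(name):
    """Mimic the preprocessing step: remove all whitespace from a name."""
    return re.sub(r"\s+", "", name)


def find_collisions(names):
    """Group names that become identical once whitespace is removed."""
    groups = defaultdict(set)
    for name in names:
        groups[strip_spaces(name).lower()].add(name)
    return {key: sorted(variants)
            for key, variants in groups.items() if len(variants) > 1}


names = ["phenyl acetate", "phenylacetate", "benzene"]
collisions = find_collisions(names)
for collapsed, variants in collisions.items():
    print(f"{collapsed!r} is what remains of: {variants}")
```

Any name-to-structure converter fed the collapsed string can return at most one of the two structures, so at least one of the original names is silently mis-converted.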