Spaces, Dashes and Issues with Nomenclature Conversion


I’ve been involved with nomenclature in one way or another for well over a decade. While I’m an NMR spectroscopist by training (as evidenced by the >100 publications in this area), during my decade-long tenure at ACD/Labs I learned a lot about PhysChem parameters and their prediction, systematic nomenclature, structure drawing and databasing, chemometrics, LC-MS data analysis and so on. As the product manager for many of these products I was dropped in the deep end. Nomenclature was something I really enjoyed. While I am not a nomenclature specialist at the “generate a perfect systematic name for Taxol” level, I have a decade of experience working with nomenclature software, both for generating names from structures and for generating structures from names. Having worked with hundreds of customers and their needs, I’ve dealt with a lot of beliefs around nomenclature and perceptions of how to use the tools.

Having just spent the week at Bio-IT and having been engaged in a number of conversations about name to structure conversion, it became clear to me that one of the prevailing beliefs among users of name to structure conversion packages is that spaces in systematic names can be disregarded. It appears that members of the text-mining for chemistry community are using one or more of the commercial name to structure software programs to convert chemical names to structures and, before feeding the names to the algorithms, they are removing all white space from them. In some cases they are doing the same with dashes. How well is that going to work? Is it safe to remove spaces from chemical names and assume this has no effect? Is more consideration being given to the accuracy of the text-mining than to the nature of systematic nomenclature?

Let’s look at some examples of the result of removing spaces from chemical names. Consider the different results just from moving a space.

The impact of spaces on naming

Single structure to separate components based on a space.

Another example of multiple to single component structure.

Another example of space-collapsing structure searching

Clearly, removing spaces from systematic names has an impact. The same is true of the random removal and insertion of dashes. The generation of systematic names by chemists is far from ideal, as discussed by Gernot Eller here. The mishandling of correct names when converting them back to structures adds one more layer of problems. There are many of us using text mining and name to structure conversion to link between documents and structures. It is far from a minor undertaking.
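To make the space-removal problem concrete, here is a minimal Python sketch of the kind of preprocessing described above. It is an illustration only, not the code of any particular text-mining pipeline or name to structure package: the function names `strip_whitespace` and `find_collisions` are hypothetical, and the sample names are my own choice rather than the ones in the figures. It flags names that become indistinguishable once white space is removed.

```python
import re
from collections import defaultdict


def strip_whitespace(name: str) -> str:
    """Remove all white space from a name (the preprocessing step in question)."""
    return re.sub(r"\s+", "", name)


def find_collisions(names):
    """Group names that become identical, ignoring case, once white space is removed.

    Returns a dict mapping each collapsed string to the sorted list of
    original names that produced it, for collapsed strings with more than
    one distinct original.
    """
    groups = defaultdict(set)
    for name in names:
        groups[strip_whitespace(name).lower()].add(name)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


# Illustrative input (an assumption, not data from the post):
names = [
    "phenyl acetate",   # the phenyl ester of acetic acid
    "phenylacetate",    # the anion/ester fragment of phenylacetic acid
    "methyl benzoate",
]

collisions = find_collisions(names)
for collapsed, originals in collisions.items():
    print(f"{collapsed!r} could have come from any of {originals}")
```

A check like this only detects collisions within the set of names it is given. It cannot tell which of the original names was intended, which is exactly the information that is lost when the space is discarded before conversion.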


  1. #1 by Dave Bower on June 3, 2008 - 7:36 am

    Dear Antony,

    I can most certainly empathize with your concerns over the seemingly off-the-cuff notion that simply removing dashes and spaces from valid chemical names has no impact (I was the Fullerene nomenclature expert at CAS from 1996-2000). To attempt this, I would think it necessary to know what type of systematic name (IUPAC or CAS) you’re dealing with, because the methodology for resolving it may differ. There is also the notion of inverted and uninverted systematic names. Take your last example (1st structure): Benzoic acid, methyl ester and methyl benzoic acid respectively. Strip the spaces out of these and you start to have even more problems (an unknown substitution position).

    In general, identifying chemical structures by “names” has become increasingly unreliable. The key here is the differentiation between “good/valid” systematic names and “bad/invalid” synonyms and other name types. The world of chemical nomenclature seems to loosely parallel that of spoken languages, in so far as you have the proper dialect along with numerous colloquialisms. Take for example the term “house”: in slang you can also use the term “crib”, which, to complicate things, has its own proper definition. The parallel in chemical nomenclature can be seen in the different “types” of names, like tradenames (Tylenol), systematic names (Acetamide, N-(4-hydroxyphenyl)-), etc.

    Dave
