Chemical Information Mining Book – A Moment of Pride

As I recall, last Christmas I was finishing up a chapter in a book that is now on the market, called Chemical Information Mining. What’s amazing is that I just found that the first 21 pages are ALREADY on Google Books. I could’ve seen the cover there before waiting to receive a copy…hell, I could’ve read UP to my chapter…that’s where it stops.

The product description on Amazon is: “This book focuses on information extraction issues, highlights available solutions, and underscores the value of these solutions to academic and commercial scientists. After introducing the drivers behind chemical text mining, it discusses chemical semantics. The contributors describe the tools that identify and convert chemical names and images to structure-searchable information. They also explain natural language processing, name entity recognition concepts, and semantic web technologies. Following a section on current trends in the field, the book looks at where information mining approaches fit into the research needs within the life sciences.”

I’m rather proud of the contribution Andrey Yerin and I made to the book. I worked with Andrey while I was at ACD/Labs and learned all about nomenclature from him. He’s one of the nicest, most competent and focused specialists in the domain of systematic nomenclature in the world. The book chapter contents are listed below. Makes for good Xmas reading if you care about that type of thing…

Spaces, Dashes and Issues with Nomenclature Conversion

I’ve been involved with nomenclature in one way or another for well over a decade. While I’m an NMR spectroscopist by training (as evidenced by the >100 publications in this area), during my decade-long tenure at ACD/Labs I learned a lot about PhysChem parameters and their prediction, systematic nomenclature, structure drawing and databasing, chemometrics, LC-MS data analysis and so on. As the product manager for many of these products I was dropped in the deep end. Nomenclature was something I really enjoyed. While I am not a nomenclature specialist at the “generate a perfect systematic name for Taxol” level, I have a decade of experience working with nomenclature software for both the generation of names from structures and the generation of structures from names. Having worked with hundreds of customers and their needs, I’ve dealt with a lot of beliefs around nomenclature and perceptions of how to use the tools.

Having just spent the week at Bio-IT and having been engaged in a number of conversations about name-to-structure conversion, it became clear to me that one of the prevailing beliefs among users of name-to-structure conversion packages is that spaces in systematic names can be disregarded. It appears that members of the text-mining for chemistry community are using one or more of the commercial name-to-structure software programs to convert chemical names to structures and, prior to feeding the algorithms, they are removing all white space from the names. In some cases they are doing the same with dashes. How well is that going to work? Is it safe to remove spaces from chemical names and assume this has no effect? Is more consideration being given to the accuracy of the text-mining than to the nature of systematic nomenclature?

Let’s look at some examples of the result of removing spaces from chemical names. Consider the different results just from moving a space.
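To make the concern concrete, here is a minimal Python sketch of the whitespace-stripping step described above. It calls no name-to-structure package at all; it only shows that distinct names can collapse to the same input string before any conversion software sees them. The example names are my own illustrative assumptions (simple ester names where the space separates the alcohol part from the acid part), and the helper names `collapse_whitespace` and `find_collisions` are hypothetical.

```python
from collections import defaultdict

# Illustrative names chosen for this sketch (assumptions, not from the figures).
NAMES = [
    "methyl phenylacetate",  # methyl ester of phenylacetic acid
    "methylphenyl acetate",  # acetate ester of a methylphenol
    "phenyl acetate",        # acetate ester of phenol
    "phenylacetate",         # anion (or ester fragment) of phenylacetic acid
]


def collapse_whitespace(name: str) -> str:
    """Naive preprocessing: delete every whitespace character from a name."""
    return "".join(name.split())


def find_collisions(names):
    """Group names by their collapsed form; keep groups with more than one name."""
    groups = defaultdict(list)
    for name in names:
        groups[collapse_whitespace(name)].append(name)
    return {key: originals for key, originals in groups.items() if len(originals) > 1}


collisions = find_collisions(NAMES)
for key, originals in collisions.items():
    print(f"{key!r} <- {originals}")
```

Each group that comes back is a set of chemically different names that a converter would receive as one identical string, so whatever structure it returns can be right for at most one of them.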

The impact of spaces on naming

Single structure to separate components based on a space.

Another example of multiple to single component structure.

Another example of space-collapsing structure searching