Tag Archives: InChIKey

An InChIKey Collision is Discovered and NOT Based on Stereochemistry

InChI strings and InChIKeys are very much the backbone of ChemSpider and have quickly become a way of connecting online databases to one another. The InChIKey is a hash of the InChI string, and when the hash was adopted it was suggested that the likelihood of a collision was very small. The estimate, as quoted from the official InChI site:

“An example of InChI with its InChIKey equivalent is shown below. There is a finite, but very small probability of finding two structures with the same InChIKey. For duplication of only the first block of 14 characters this is 1.3% in 10^9, equivalent to a single collision in one of 75 databases of 10^9 compounds each.”
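As a rough sanity check on that quoted figure, here is a minimal sketch of the standard birthday-bound approximation. It assumes the 14-character first block carries 65 bits of hash and that hash values behave as if uniformly random; both are my assumptions for illustration, not statements from the quote, and the function name is made up.

```python
import math

HASH_BITS = 65          # assumption: bits encoded by the 14-character first block
N_STRUCTURES = 10**9    # database size used in the quoted estimate


def collision_probability(n, bits):
    """Birthday-bound approximation of P(at least one collision)
    among n values drawn uniformly from 2**bits possible hashes."""
    pairs = n * (n - 1) / 2          # number of distinct pairs of structures
    return 1 - math.exp(-pairs / 2**bits)


p = collision_probability(N_STRUCTURES, HASH_BITS)
print(f"P(collision in one database) = {p:.2%}")
print(f"roughly one collision per {1 / p:.0f} such databases")
```

Under those assumptions the result lands a little above 1.3%, or about one collision in every 75 or so databases of a billion compounds, which is in line with the quoted numbers.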

At a previous ACS Meeting, Prof. Jonathan Goodman of the University of Cambridge announced that he had identified a collision. The collision was for two isomers of spongistatin, a rather complex chemical structure with many stereocenters.

Jonathan has “done it again”…what a troublemaker he is (in a supremely gentlemanly way!). I was fortunate enough to receive the news about this collision from him just as I was getting on the flight from ACS Denver to home tonight and asked his permission to blog it as it is both exciting and, I believe, quite surprising news. Why? In this case the collision is for two distinctly different chemicals with totally different formulae and with NO stereochemistry! Very surprising!

As you can see in the figure below the two chemical compounds are simply long branched alkyl chains, one an alcohol and one a ketone.

In case Jonathan’s software tool that he was using to connect to the InChI generation software was doing something untoward with the molfile I confirmed the observation myself by drawing the structures in ACD/ChemSketch and generating the InChIKeys there. And, sure enough…I see exactly the same Standard InChIKeys for both molecules as shown in the movie below. VERY interesting!
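For anyone wanting to run the same kind of check across a whole collection, here is a minimal sketch: group records by InChIKey and flag any key that maps to more than one distinct InChI string. The function name is hypothetical and the records are toy placeholders, not the real InChIs or InChIKeys of the two compounds above; in practice the pairs would come from whatever InChI software you use.

```python
from collections import defaultdict


def find_key_collisions(records):
    """Return {inchikey: sorted distinct InChI strings} for every key
    shared by two or more different InChI strings."""
    by_key = defaultdict(set)
    for inchi, key in records:
        by_key[key].add(inchi)
    return {key: sorted(inchis) for key, inchis in by_key.items() if len(inchis) > 1}


# Toy placeholder records (NOT real InChIs or InChIKeys).
records = [
    ("InChI=1S/placeholder-alcohol", "AAAAAAAAAAAAAA-BBBBBBBBBB-N"),
    ("InChI=1S/placeholder-ketone", "AAAAAAAAAAAAAA-BBBBBBBBBB-N"),
    ("InChI=1S/placeholder-other", "CCCCCCCCCCCCCC-BBBBBBBBBB-N"),
    # Same structure entered twice: a duplicate record, not a collision.
    ("InChI=1S/placeholder-other", "CCCCCCCCCCCCCC-BBBBBBBBBB-N"),
]

collisions = find_key_collisions(records)
for key, inchis in collisions.items():
    print(key, "is shared by", len(inchis), "different InChIs")
```

Comparing the full InChI strings, rather than drawings or names, is what separates a true hash collision from the same structure simply being registered twice.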



Posted on September 1, 2011 in General Communications, InChI



Continuing Conflicts in the Messy World of Internet Chemistry

I have been looking at the state of curated data on the internet and blogged last night about the messy world of curated data. I should emphasize…none of these commentaries are meant to be harsh. Believe me, I’ve gone through the process of validating data and it’s difficult. There will be mistakes, but what we need are processes and systems to clean these data up efficiently. If I see an error I want to annotate it and let people know there is an error. With today’s technologies it is not difficult.

Let’s take another example from DrugBank.

That listed chemical name above the structure doesn’t look very consistent…I don’t see any stereochemistry, certainly no “dihydroxy” and overall…yes, it’s definitely wrong. The actual structure for that name is shown below. Looks like an entire half of the molecule is missing. The InChI and InChIKey are for the molecule shown in DrugBank but the link to KEGG is to the molecule shown below…here.

The links on DrugBank to PubChem and ChEBI are to the molecule to the left. All of the data in the DrugBank record in terms of outlinks are for the structure on the left EXCEPT the actual structure on the record, and its associated SMILES and InChIs are for the “2-amino-3,5-dihydro-4H-pyrrolo[2,3-d]pyrimidin-4-one” moiety. Oops.

Recently I pointed out to David Wishart, host of DrugBank, some of the issues I had been seeing, and it appears there will be a major update to DrugBank in the next few weeks that, in theory, will address some, and hopefully all, of these observations.


Posted on August 16, 2010 in Uncategorized

