An InChIKey Collision is Discovered and NOT Based on Stereochemistry

01 Sep

InChI strings and InChIKeys are very much the backbone of ChemSpider and have quickly become a way of connecting online databases to one another. The InChIKey is a hash of the InChI string, and when the hash was adopted it was suggested that the likelihood of a collision was very small. The estimate, as quoted from the official InChI site:

“An example of InChI with its InChKey equivalent is shown below. There is a finite, but very small probability of finding two structures with the same InChIKey. For duplication of only the first block of 14 characters this is 1.3% in 10^9, equivalent to a single collision in one of 75 databases of 10^9 compounds each.”
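As a rough check on that quoted figure, here is a minimal sketch of the standard birthday-bound approximation. It rests on two assumptions that are not in the quote itself: that the 14-character first block carries 65 bits of hash, and that hash values are spread uniformly at random. The function name `collision_probability` is purely illustrative.

```python
import math


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Approximate P(at least one collision) among n_items uniformly
    random hash values of hash_bits bits (birthday approximation)."""
    pairs = n_items * (n_items - 1) / 2      # number of distinct pairs
    # 1 - exp(-pairs / 2^bits), computed with expm1 for numerical accuracy
    return -math.expm1(-pairs / 2 ** hash_bits)


# Assumption: 65 bits of hash in the first InChIKey block.
p = collision_probability(10 ** 9, 65)
print(f"P(collision in a database of 10^9 compounds) = {p:.2%}")
print(f"Roughly one such database in {1 / p:.0f} would contain a collision")
```

Under those assumptions the result lands a little above 1.3%, in line with the quoted "one of 75 databases". Note that this is the chance of a collision somewhere in a random collection; it says nothing about how easily a collision can be found by deliberately searching for one.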

At a previous ACS Meeting, Prof Jonathan Goodman from the University of Cambridge announced that he had identified a collision. The collision was for two isomers of spongistatin, a rather complex chemical structure with many stereocenters.

Jonathan has “done it again”…what a troublemaker he is (in a supremely gentlemanly way!). I was fortunate enough to receive the news about this collision from him just as I was getting on the flight from ACS Denver to home tonight and asked his permission to blog it as it is both exciting and, I believe, quite surprising news. Why? In this case the collision is for two distinctly different chemicals with totally different formulae and with NO stereochemistry! Very surprising!

As you can see in the figure below the two chemical compounds are simply long branched alkyl chains, one an alcohol and one a ketone.

In case the software tool Jonathan was using to connect to the InChI generation software was doing something untoward with the molfile, I confirmed the observation myself by drawing the structures in ACD/ChemSketch and generating the InChIKeys there. And, sure enough…I see exactly the same Standard InChIKeys for both molecules, as shown in the movie below. VERY interesting!



About tony

Founder of ChemZoo Inc., the host of ChemSpider. ChemSpider is an open access online database of chemical structures and property transaction based services to enable chemists around the world to data mine chemistry databases. The Royal Society of Chemistry acquired ChemSpider in May 2009. Presently working as a consortium member of the OpenPHACTS IMI project. This focuses on how drug discovery can utilize semantic technologies to improve decision making and brings together 22 European team members to develop an infrastructure to link together public and private data for the drug discovery community. I am also involved with the PharmaSea FP7 project, trying to identify new classes of marine natural products with potential pharmacological activity. I am also one of the hosts for three wikis for Science: ScientistsDB, SciMobileApps and SciDBs. Over the past decade I held many responsibilities including the direction of the development of scientific software applications for spectroscopy and general chemistry, directing marketing efforts, sales and business development collaborations for the company. Eight years' experience of analytical laboratory leadership and management. Experienced in experimental techniques, implementation of new NMR technologies, walk-up facility management, research and development, manufacturing support and teaching. Ability to provide situation analysis, creative solutions and establish good working relationships. Prolific author with over 150 peer-reviewed scientific publications, 3 patents and over 300 public presentations. Specialties: leadership in the domain of free access chemistry, product and project management, organizational and leadership development, competitive analysis and business development, entrepreneurship.

Posted on September 1, 2011 in General Communications, InChI



13 Responses to An InChIKey Collision is Discovered and NOT Based on Stereochemistry

  1. Alex Clark

    September 1, 2011 at 8:44 am

    While I’m not usually the first to rush out to defend InChI, there is no reason to be surprised by this. Hashes are supposed to have a purely random chance of colliding. The documentation is clear, and implies that this identifier is not suitable for uniquely identifying a compound, as any computer scientist will confirm when you talk about hash-based lookups… it just narrows down the potential number of matches.

  2. tony

    September 1, 2011 at 8:52 am

    Yes indeed…I agree with everything that you say. That said, it remains surprising (an emotional response) because it’s two collisions in a fairly short time and the original discussions had, I judge, the majority of the community imagining we wouldn’t see one for a long time. It’s probabilities…and just like Lotto 6/49…the chance of drawing the six numbers 1, 2, 3, 4, 5, 6 in the lottery is the same as that of any other set of six numbers between 1 and 49.

  3. Markus Sitzmann

    September 1, 2011 at 10:17 am

    Is there any particular reason why all structure information is completely “buried” in an image file :-)? And did Jonathan tell you how (on earth) he caught that? I can imagine an “obvious” strategy in the case of the previous collision (spongistatin) he reported, but how did he find this one …

  4. tony

    September 1, 2011 at 10:31 am

    I can’t think of an easy way to put the molfiles onto WordPress 🙂

    They are here:


    I’ve asked Jonathan how he has found two collisions…it should be work 🙂

  5. Markus Sitzmann

    September 1, 2011 at 10:37 am

    Thanks for the molfiles – Markus

    • tony

      September 1, 2011 at 10:07 pm

      You are welcome Markus…do you see the same result???

    • Markus Sitzmann

      September 2, 2011 at 2:03 pm

      Yes, I get the same result

      • Markus Sitzmann

        September 2, 2011 at 2:15 pm

        What I was mostly interested in was to see the InChIs (not the Keys) next to each other – well, they look pretty different actually. If Jonathan found this by accident he should consider quitting science and play lotto professionally … we should hear very soon about him as jackpot winner 🙂

  6. M.Karthikeyan

    September 1, 2011 at 10:11 pm

    It is the right time to upgrade the InChIKey with a prefix or suffix for stereoisomers or for the case discussed above. It was a surprise for me too.. thanks to JG for bringing it to our notice.. there is an opportunity for chemoinformaticians to build better algorithms. MK (Denver)

  7. Egon Willighagen

    September 2, 2011 at 1:44 am

    Thanx for the shout out. I hope in the future Prof. Goodman will blog about such finds himself 🙂 I would love to hear, too, whether these compounds are “made up” or actually synthesized and/or measured in the real world. Not that it matters to me, just to have the story complete.

    The importance to me is that it helps raise awareness that databases must never use only the InChIKey, but must always provide the full InChI as well. Then the InChIKey is used for its function, a short hash to easily find hits, and not as a unique identifier.
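To illustrate the point about using the key only to find hits, here is a minimal sketch of a lookup table that indexes by InChIKey but confirms every match against the full InChI. The class `StructureIndex`, its methods, and the identifier strings are all hypothetical placeholders for illustration; this is not how ChemSpider or any other database is known to be implemented.

```python
from collections import defaultdict


class StructureIndex:
    """Toy index: InChIKey -> list of (full InChI, record id) pairs."""

    def __init__(self):
        self._buckets = defaultdict(list)

    def add(self, inchikey, inchi, record_id):
        # Colliding keys simply share a bucket instead of overwriting.
        self._buckets[inchikey].append((inchi, record_id))

    def candidates(self, inchikey):
        """All record ids whose key matches (may be more than one)."""
        return [rid for _, rid in self._buckets.get(inchikey, [])]

    def lookup(self, inchikey, inchi):
        """Record id whose full InChI matches exactly, else None."""
        for stored_inchi, rid in self._buckets.get(inchikey, []):
            if stored_inchi == inchi:
                return rid
        return None


index = StructureIndex()
# Placeholder strings, NOT real identifiers: two structures sharing one key.
index.add("SHARED-KEY-PLACEHOLDER", "InChI=placeholder-alcohol", 1)
index.add("SHARED-KEY-PLACEHOLDER", "InChI=placeholder-ketone", 2)

print(index.candidates("SHARED-KEY-PLACEHOLDER"))
print(index.lookup("SHARED-KEY-PLACEHOLDER", "InChI=placeholder-ketone"))
```

The key narrows the search to a small bucket, and the full string comparison resolves any collision inside it, so the short key stays useful for fast lookup without being trusted as unique.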

    • Egon Willighagen

      September 2, 2011 at 1:46 am

      Oh, and what are the ChemSpider identifiers?

    • Markus Sitzmann

      September 2, 2011 at 2:25 pm

      Well, providing InChIs is one thing – but indexing by them instead of InChIKey (what this problem here kind of implies): not so funny.

      What I find pretty odd currently with this collision:
      – the molecules themselves are kind of similar
      – the InChIs themselves don’t look similar (at first glance)
      – I thought the InChIKeys are just calculated by hashing the InChIs …. so two “similar” molecules are made dissimilar by an identifier whose hashed representation “by accident” makes them identical
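The oddity described above is what one would expect from a cryptographic hash, where similarity of the inputs tells you nothing about similarity of the digests. The sketch below uses plain SHA-256 from Python's standard library on two placeholder strings that differ in one character. It is only an illustration of that hashing behaviour under those assumptions; it does not reproduce the InChIKey algorithm, and the strings are not real InChIs.

```python
import hashlib


def digest_as_int(text: str) -> int:
    """SHA-256 digest of text, interpreted as a 256-bit integer."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest(), "big")


def differing_bits(a: str, b: str) -> int:
    """Number of bit positions in which the two digests differ."""
    return bin(digest_as_int(a) ^ digest_as_int(b)).count("1")


# Placeholder inputs differing only in the final character.
first = "placeholder-structure-A"
second = "placeholder-structure-B"

print(f"{differing_bits(first, second)} of 256 digest bits differ")
```

For a well-behaved hash, about half of the bits should differ even for nearly identical inputs, so a collision between two similar-looking molecules is no more and no less likely than one between two unrelated ones.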

    • Dave

      September 5, 2011 at 5:54 am

      I’d say that the question of whether these compounds are “made up” or not is a difficult one to answer in this case. I’d describe the structures as “virtual” or “meta” molecules. The top structure has 20 stereocentres and therefore represents 2^20 possible isomers; similarly, the second represents 2^15 possible isomers. Of course you can perform a racemic synthesis and generate these mixtures, but other than proving that you can construct this skeleton, it’s difficult to know what else can be meaningfully measured.

