An InChIKey Collision is Discovered and NOT Based on Stereochemistry


InChI Strings and InChIKeys are very much the backbone of ChemSpider and have quickly become a way of connecting online databases to one another. The InChIKey is a hash of the InChI String, and when the hash was adopted the likelihood of a collision was estimated to be very small. As quoted from the official InChI site:

“An example of InChI with its InChKey equivalent is shown below. There is a finite, but very small probability of finding two structures with the same InChIKey. For duplication of only the first block of 14 characters this is 1.3% in 10^9, equivalent to a single collision in one of 75 databases of 10^9 compounds each.”
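The quoted figure can be roughly reproduced with the standard birthday-bound approximation. The sketch below is only a back-of-the-envelope check, not the official derivation: it assumes the 14-character first block carries about 65 bits of hash (a number not given in the quote above), and the function name `collision_probability` is purely illustrative.

```python
import math


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday-bound approximation of the chance that at least two of
    n_items random hashes of hash_bits bits are identical."""
    space = 2 ** hash_bits  # number of possible hash values
    exponent = n_items * (n_items - 1) / (2 * space)
    # 1 - exp(-x), computed with expm1 for accuracy when x is small
    return -math.expm1(-exponent)


# ASSUMPTION: the first InChIKey block encodes roughly 65 bits of hash.
p = collision_probability(10**9, 65)
print(f"Chance of a collision among 10^9 structures: {p:.2%}")
print(f"Roughly one database in {1 / p:.0f} would contain a collision")
```

Under that assumption the result comes out a little over 1.3%, or roughly one database in 74, which is in line with the "one of 75 databases" in the quote.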

At a previous ACS Meeting, Prof Jonathan Goodman from the University of Cambridge announced that he had identified a collision. The collision was for two isomers of spongistatin, a rather complex chemical structure with many stereocenters.

Jonathan has “done it again”…what a troublemaker he is (in a supremely gentlemanly way!). I was fortunate enough to receive the news about this collision from him just as I was getting on the flight from ACS Denver to home tonight and asked his permission to blog it as it is both exciting and, I believe, quite surprising news. Why? In this case the collision is for two distinctly different chemicals with totally different formulae and with NO stereochemistry! Very surprising!

As you can see in the figure below the two chemical compounds are simply long branched alkyl chains, one an alcohol and one a ketone.

In case Jonathan’s software tool that he was using to connect to the InChI generation software was doing something untoward with the molfile I confirmed the observation myself by drawing the structures in ACD/ChemSketch and generating the InChIKeys there. And, sure enough…I see exactly the same Standard InChIKeys for both molecules as shown in the movie below. VERY interesting!

 


  1. #1 by Alex Clark on September 1, 2011 - 8:44 am

    While I’m not usually the first to rush out to defend InChI, there is no reason to be surprised by this. Hash values are supposed to have a purely random chance of colliding. The documentation is clear, and implies that this identifier is not suitable for uniquely identifying a compound, as any computer scientist will confirm when you talk about hash-based lookups… it just narrows down the potential number of matches.

  2. #2 by tony on September 1, 2011 - 8:52 am

    Yes indeed…I agree with everything that you say. That said, it remains surprising (an emotional response) because it’s two collisions in a fairly short time and the original discussions had, I judge, the majority of the community imagining we wouldn’t see one for a long time. It’s probabilities…and just like Lotto 6/49…the chance of getting the 6 numbers 1,2,3,4,5,6 in the lottery is the same as for any other set of 6 numbers between 1 and 49.

  3. #3 by Markus Sitzmann on September 1, 2011 - 10:17 am

    Is there any particular reason why all structure information is completely “buried” in an image file :-)? And did Jonathan tell you how (on earth) he caught that? I can imagine an “obvious” strategy in the case of the previous collision (spongistatin) he reported, but how did he find this one here …

  4. #4 by tony on September 1, 2011 - 10:31 am

    I can’t think of an easy way to put the molfiles onto WordPress 🙂

    They are here:
    https://docs.google.com/leaf?id=0B5BcxqWjYDpnMGE0ODA3ZGUtNGUyYi00MWRjLWJhY2YtODBlM2EyMDNlMmU0&hl=en_US

    and

    https://docs.google.com/leaf?id=0B5BcxqWjYDpnZTUzOTBkY2QtOTE4OS00ZDE4LWJkOTItY2ZhNGJkNjUwNThh&hl=en_US

    I’ve asked Jonathan how he has found two collisions…it should be work 🙂

  5. #5 by Markus Sitzmann on September 1, 2011 - 10:37 am

    Thanks for the molfiles – Markus

    • #6 by tony on September 1, 2011 - 10:07 pm

      You are welcome Markus…do you see the same result???

    • #7 by Markus Sitzmann on September 2, 2011 - 2:03 pm

      Yes, I get the same result

      • #8 by Markus Sitzmann on September 2, 2011 - 2:15 pm

        What I was mostly interested in was to see the InChIs (not the Keys) next to each other – well, they look pretty different actually. If Jonathan found this by accident he should consider quitting science and play lotto professionally … we should hear very soon about him as jackpot winner 🙂

  6. #9 by M.Karthikeyan on September 1, 2011 - 10:11 pm

    It is the right time to upgrade the InChIKey with a prefix or suffix for stereoisomers or for the case discussed above. It was a surprise for me too. Thanks to JG for bringing it to our notice. There is an opportunity for chemoinformaticians to build better algorithms. MK (Denver)

  7. #10 by Egon Willighagen on September 2, 2011 - 1:44 am

    Thanx for the shout out. I hope in the future Prof. Goodman will blog about such finds himself 🙂 I’d love to hear too whether these compounds are “made up” or actually synthesized and/or measured in the real world. Not that it matters to me, just to have the story complete.

    The importance to me is that it helps raise awareness that databases must never use only the InChIKey, but must always provide the full InChI as well. Then the InChIKey is used for its function, a short hash to easily find hits, and not as a unique identifier.

    • #11 by Egon Willighagen on September 2, 2011 - 1:46 am

      Oh, and what are the ChemSpider identifiers?

    • #12 by Markus Sitzmann on September 2, 2011 - 2:25 pm

      Well, providing InChIs is one thing – but indexing by them instead of InChIKey (what this problem here kind of implies): not so funny.

      What I find pretty odd currently with this collision:
      – the molecules themselves are kind of similar
      – the InChIs themselves don’t look similar (at a first glance)
      – I thought the InChIKeys are just calculated by hashing the InChIs …. so two “similar” molecules are made dissimilar by an identifier whose hashed representation “by accident” makes them identical

    • #13 by Dave on September 5, 2011 - 5:54 am

      Egon,
      I’d say that the concept of whether these compounds are “made up” or not is a difficult one to answer in this case. I’d describe the structures as “virtual” or “meta” molecules. The top structure has 20 stereocentres therefore represents 2^20 possible isomers, similarly the second represents 2^15 possible isomers. Of course you can perform a racemic synthesis and generate these mixtures but other than proving that you can construct this skeleton, it’s difficult to know what else can be meaningfully measured.
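The point raised in the comments, that a short hash is fine for finding candidate hits as long as the full InChI is kept to confirm them, can be sketched in a few lines. This is only an illustration under loud assumptions: the class `StructureIndex`, the placeholder identifiers, and the deliberately colliding key function are all hypothetical, and nothing here is the real InChIKey algorithm or how any particular database is built.

```python
from collections import defaultdict
from typing import Callable, Optional


class StructureIndex:
    """Index records by a short key, but confirm hits with the full identifier."""

    def __init__(self, key_func: Callable[[str], str]) -> None:
        self._key_func = key_func          # stand-in for a hashing step
        self._buckets = defaultdict(list)  # key -> [(full_id, record), ...]

    def add(self, full_id: str, record: str) -> None:
        self._buckets[self._key_func(full_id)].append((full_id, record))

    def candidates(self, full_id: str) -> int:
        """Number of stored entries sharing this identifier's short key."""
        return len(self._buckets.get(self._key_func(full_id), []))

    def find(self, full_id: str) -> Optional[str]:
        """Narrow down by key, then compare full identifiers to rule out collisions."""
        for stored_id, record in self._buckets.get(self._key_func(full_id), []):
            if stored_id == full_id:
                return record
        return None


# HYPOTHETICAL: a key function that always collides, to mimic the case in this post.
index = StructureIndex(key_func=lambda full_id: "SAMEKEY")
index.add("full-identifier-A", "record for the alcohol")
index.add("full-identifier-B", "record for the ketone")

print(index.candidates("full-identifier-A"))  # both entries share one key
print(index.find("full-identifier-A"))
print(index.find("full-identifier-B"))
```

The key only narrows the search to a bucket; the comparison on the full identifier is what tells the two colliding entries apart.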
