
An InChIKey Collision is Discovered and NOT Based on Stereochemistry

01 Sep

InChI Strings and InChIKeys are very much the backbone of ChemSpider and have quickly become a way of connecting online databases to one another. The InChIKey is a hash of the InChI String, and when the hash was adopted it was suggested that the likelihood of a collision was very small. The estimate, as quoted from the official InChI site, was:

“An example of InChI with its InChIKey equivalent is shown below. There is a finite, but very small probability of finding two structures with the same InChIKey. For duplication of only the first block of 14 characters this is 1.3% in 10^9, equivalent to a single collision in one of 75 databases of 10^9 compounds each.”
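The quoted figure can be sanity-checked with the standard birthday approximation. The sketch below assumes the first 14-character block carries 65 bits of hash; that number comes from my reading of the InChI technical documentation, not from the quote itself, so treat it as an assumption. The function name `collision_probability` is purely illustrative.

```python
import math


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday approximation for at least one collision among n_items
    values drawn uniformly from a space of 2**hash_bits values:
    P ~= 1 - exp(-n(n-1) / (2 * 2**hash_bits))."""
    space = 2 ** hash_bits
    exponent = n_items * (n_items - 1) / (2 * space)
    # -expm1(-x) equals 1 - exp(-x) but keeps precision for small x
    return -math.expm1(-exponent)


# Assumption: 65 hash bits in the first block, 10^9 compounds per database
p_first_block = collision_probability(10**9, 65)
print(f"P(collision in one database) = {p_first_block:.2%}")
print(f"About one database in {1 / p_first_block:.0f} would contain a collision")
```

Under those assumptions the result comes out a little over 1.3%, or roughly one database in 75, which is in line with the quoted estimate. Note that this is the chance of at least one collision somewhere inside a database of that size; it says nothing about how hard it is to find a colliding pair when someone goes looking for one.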

At a previous ACS Meeting, Prof Jonathan Goodman from the University of Cambridge announced that he had identified a collision. The collision was for two isomers of spongistatin, a rather complex chemical structure with many stereocenters.

Jonathan has “done it again”…what a troublemaker he is (in a supremely gentlemanly way!). I was fortunate enough to receive the news about this collision from him just as I was getting on the flight from ACS Denver to home tonight and asked his permission to blog it as it is both exciting and, I believe, quite surprising news. Why? In this case the collision is for two distinctly different chemicals with totally different formulae and with NO stereochemistry! Very surprising!
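To see why a collision without stereochemistry is a different beast from the spongistatin case, it helps to look at how a key is laid out. As I understand the format, the first 14 letters hash the molecular skeleton, the next 8 letters hash the remaining layers (stereochemistry, isotopes and so on), then come a standard/non-standard flag, a version letter and a protonation character. The sketch below only splits a key into those parts; it does not compute any hash. The names `split_inchikey` and `KeyParts` are made up for illustration, and the example key is the one for water.

```python
import re
from typing import NamedTuple

# Layout assumed here: 14 letters, hyphen, 8 letters + S/N flag + version
# letter, hyphen, 1 protonation character (27 characters in total).
KEY_PATTERN = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


class KeyParts(NamedTuple):
    skeleton_block: str   # hash of connectivity (first block)
    detail_block: str     # hash of stereo, isotopic and other layers
    is_standard: bool     # True when the flag character is "S"
    version: str          # InChI version letter
    protonation: str      # protonation indicator ("N" means neutral)


def split_inchikey(key: str) -> KeyParts:
    """Split an InChIKey into its blocks; raise ValueError if malformed."""
    match = KEY_PATTERN.match(key.strip())
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    skeleton, detail, flag, version, protonation = match.groups()
    return KeyParts(skeleton, detail, flag == "S", version, protonation)


water = split_inchikey("XLYOFNOQVPJJNP-UHFFFAOYSA-N")
print(water)
```

Two stereoisomers share a skeleton, so they are expected to share the first block and differ only in the second. Two compounds with different formulae should differ in the first block, which is why a full-key match between them is the more striking result.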

As you can see in the figure below the two chemical compounds are simply long branched alkyl chains, one an alcohol and one a ketone.

In case the software tool Jonathan was using to connect to the InChI generation software was doing something untoward with the molfile, I confirmed the observation myself by drawing the structures in ACD/ChemSketch and generating the InChIKeys there. And, sure enough…I see exactly the same Standard InChIKeys for both molecules, as shown in the movie below. VERY interesting!

 

 

About tony

Antony (Tony) J. Williams received his BSc in 1985 from the University of Liverpool (UK) and PhD in 1988 from the University of London (UK). His PhD research interests were in studying the effects of high pressure on molecular motions within lubricant related systems using Nuclear Magnetic Resonance. He moved to Ottawa, Canada to work for the National Research Council performing fundamental research on the electron paramagnetic resonance of radicals trapped in single crystals. Following his postdoctoral position he became the NMR Facility Manager for Ottawa University. Tony joined the Eastman Kodak Company in Rochester, New York as their NMR Technology Leader. He led the laboratory to develop quality control across multiple spectroscopy labs and helped establish walk-up laboratories providing NMR, LC-MS and other forms of spectroscopy to hundreds of chemists across multiple sites. This included the delivery of spectroscopic data to the desktop, automated processing and his initial interests in computer-assisted structure elucidation (CASE) systems. He also worked with a team to develop the world’s first web-based LIMS system, WIMS, capable of allowing chemical structure searching and spectral display. With his developing cheminformatic skills and passion for data management he left corporate America to join a small start-up company working out of Toronto, Canada. He joined ACD/Labs as their NMR Product Manager and held various roles, including Chief Science Officer, during his 10 years with the company. His responsibilities included managing over 50 products at one time prior to developing a product management team, managing sales, marketing, technical support and technical services. ACD/Labs was one of Canada’s Fast 50 Tech Companies, and Forbes Fast 500 companies in 2001. His primary passions during his tenure with ACD/Labs were the continued adoption of web-based technologies and developing automated structure verification and elucidation platforms.
While at ACD/Labs he suggested the possibility of developing a public resource for chemists attempting to integrate internet available chemical data. He finally pursued this vision with some close friends as a hobby project in the evenings and the result was the ChemSpider database (www.chemspider.com). Even while running out of a basement on hand built servers the website developed a large community following that eventually culminated in the acquisition of the website by the Royal Society of Chemistry (RSC) based in Cambridge, United Kingdom. Tony joined the organization, together with some of the other ChemSpider team, and became their Vice President of Strategic Development. At RSC he continued to develop cheminformatics tools, specifically ChemSpider, and was the technical lead for the chemistry aspects of the Open PHACTS project (http://www.openphacts.org), a project focused on the delivery of open data, open source and open systems to support the pharmaceutical sciences. He was also the technical lead for the UK National Chemical Database Service (http://cds.rsc.org/) and the RSC lead for the PharmaSea project (http://www.pharma-sea.eu/) attempting to identify novel natural products from the ocean. He left RSC in 2015 to become a Computational Chemist in the National Center of Computational Toxicology at the Environmental Protection Agency where he is bringing his skills to bear working with a team on the delivery of a new software architecture for the management and delivery of data, algorithms and visualization tools. The “Chemistry Dashboard” was released on April 1st, no fooling, at https://comptox.epa.gov, and provides access to over 700,000 chemicals, experimental and predicted properties and a developing link network to support the environmental sciences. Tony remains passionate about computer-assisted structure elucidation and verification approaches and continues to publish in this area. 
He is also passionate about teaching scientists to benefit from the developing array of social networking tools for scientists and is known as the ChemConnector on the networks. Over the years he has had adjunct roles at a number of institutions and presently enjoys working with scientists at both UNC Chapel Hill and NC State University. He is widely published with over 200 papers and book chapters and was the recipient of the Jim Gray Award for eScience in 2012. In 2016 he was awarded the North Carolina ACS Distinguished Speaker Award.

Posted on September 1, 2011 in General Communications, InChI

 


13 Responses to An InChIKey Collision is Discovered and NOT Based on Stereochemistry

  1. Alex Clark

    September 1, 2011 at 8:44 am

    While I’m not usually the first to rush out to defend InChI, there is no reason to be surprised by this. Hashes are supposed to have a purely random chance of colliding. The documentation is clear, and implies that this identifier is not suitable for uniquely identifying a compound, as any computer scientist will confirm when you talk about hash-based lookups… it just narrows down the potential number of matches.

     
  2. tony

    September 1, 2011 at 8:52 am

    Yes indeed…I agree with everything that you say. That said, it remains surprising (an emotional response) because it’s two collisions in a fairly short time and the original discussions had, I judge, the majority of the community imagining we wouldn’t see one for a long time. It’s probabilities…and just like lotto 649…the chance of getting the 6 numbers 1,2,3,4,5,6 in the lottery is the same as for any random set of 6 between 1 and 49.

     
  3. Markus Sitzmann

    September 1, 2011 at 10:17 am

    Is there any particular reason why all structure information is completely “buried” in an image file :-)? And did Jonathan tell you how (on earth) he caught that? I can imagine an “obvious” strategy in the case of the previous collision (spongistatin) he reported, but how did he find this one …

     
  4. tony

    September 1, 2011 at 10:31 am

    I can’t think of an easy way to put the molfiles onto WordPress 🙂

    They are here:
    https://docs.google.com/leaf?id=0B5BcxqWjYDpnMGE0ODA3ZGUtNGUyYi00MWRjLWJhY2YtODBlM2EyMDNlMmU0&hl=en_US

    and

    https://docs.google.com/leaf?id=0B5BcxqWjYDpnZTUzOTBkY2QtOTE4OS00ZDE4LWJkOTItY2ZhNGJkNjUwNThh&hl=en_US

    I’ve asked Jonathan how he has found two collisions…it should be work 🙂

     
  5. Markus Sitzmann

    September 1, 2011 at 10:37 am

    Thanks for the molfiles – Markus

     
    • tony

      September 1, 2011 at 10:07 pm

      You are welcome Markus…do you see the same result???

       
    • Markus Sitzmann

      September 2, 2011 at 2:03 pm

      Yes, I get the same result

       
      • Markus Sitzmann

        September 2, 2011 at 2:15 pm

        What I was mostly interested in was to see the InChIs (not the Keys) next to each other – well, they look pretty different actually. If Jonathan found this by accident he should consider quitting science and playing lotto professionally … we should hear very soon about him as a jackpot winner 🙂

         
  6. M.Karthikeyan

    September 1, 2011 at 10:11 pm

    It is the right time to upgrade the InChIKey with a prefix or suffix for stereoisomers or for the case discussed above. It was a surprise for me too.. thanks to JG for bringing it to our notice.. there is an opportunity for chemoinformaticians to build better algorithms. MK (denver)

     
  7. Egon Willighagen

    September 2, 2011 at 1:44 am

    Thanx for the shout out. I hope in the future Prof. Goodman will blog about such finds himself 🙂 I love to hear too if these compounds are “made up” or actually synthesized and/or measured in the real world. Not that it matters to me, just to have the story complete.

    The importance to me is that it helps raise awareness that databases must never use only the InChIKey, but must always provide the full InChI as well. Then the InChIKey is used for its function, a short hash to easily find hits, and not as a unique identifier.

     
    • Egon Willighagen

      September 2, 2011 at 1:46 am

      Oh, and what are the ChemSpider identifiers?

       
    • Markus Sitzmann

      September 2, 2011 at 2:25 pm

      Well, providing InChIs is one thing – but indexing by them instead of InChIKey (what this problem here kind of implies): not so funny.

      What I find pretty odd currently with this collision:
      – the molecules themselves are kind of similar
      – the InChIs themselves don’t look similar (at first glance)
      – I thought the InChIKeys are just calculated by hashing the InChIs …. so two “similar” molecules are made dissimilar by an identifier whose hashed representation “by accident” makes them identical

       
    • Dave

      September 5, 2011 at 5:54 am

      Egon,
      I’d say that the question of whether these compounds are “made up” or not is a difficult one to answer in this case. I’d describe the structures as “virtual” or “meta” molecules. The top structure has 20 stereocentres and therefore represents 2^20 possible isomers; similarly the second represents 2^15 possible isomers. Of course you can perform a racemic synthesis and generate these mixtures but, other than proving that you can construct this skeleton, it’s difficult to know what else can be meaningfully measured.
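Several commenters in this thread make the same practical point: use the InChIKey as a short hash to find candidate hits, and use the full InChI to decide whether two records are really the same compound. A minimal sketch of that idea follows. The class `StructureIndex`, the record ids, the key and the InChI strings are all hypothetical placeholders, not the real values for the two colliding compounds.

```python
from collections import defaultdict
from typing import Dict, List, Optional


class StructureIndex:
    """Toy index: InChIKey narrows the search, full InChI confirms the hit."""

    def __init__(self) -> None:
        # inchikey -> {full inchi -> record id}
        self._buckets: Dict[str, Dict[str, int]] = defaultdict(dict)

    def add(self, inchikey: str, inchi: str, record_id: int) -> None:
        self._buckets[inchikey][inchi] = record_id

    def find(self, inchikey: str, inchi: str) -> Optional[int]:
        # Look up by key first, then confirm with the full InChI string
        return self._buckets.get(inchikey, {}).get(inchi)

    def collisions(self) -> Dict[str, List[str]]:
        # Keys that map to more than one distinct InChI are true collisions
        return {
            key: sorted(bucket)
            for key, bucket in self._buckets.items()
            if len(bucket) > 1
        }


# Placeholder data only: one shared key, two different InChI strings
SHARED_KEY = "AAAAAAAAAAAAAA-UHFFFAOYSA-N"
index = StructureIndex()
index.add(SHARED_KEY, "InChI=1S/placeholder-alcohol", record_id=1)
index.add(SHARED_KEY, "InChI=1S/placeholder-ketone", record_id=2)

print(index.find(SHARED_KEY, "InChI=1S/placeholder-ketone"))
print(index.collisions())
```

With this arrangement a collision costs one extra string comparison inside a bucket rather than a wrong match, and `collisions()` gives a cheap way to report any colliding pairs a database already holds.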

       
