Posts Tagged InChI

An InChIKey Collision is Discovered and NOT Based on Stereochemistry

InChI Strings and InChIKeys are very much the backbone of ChemSpider and have quickly become a way by which online databases are being connected. The InChIKey is a hash of the InChI String, and when the hash was adopted it was suggested that the likelihood of a collision was very small. The estimate, as quoted from the official InChI site, was:

“An example of InChI with its InChKey equivalent is shown below. There is a finite, but very small probability of finding two structures with the same InChIKey. For duplication of only the first block of 14 characters this is 1.3% in 10^9, equivalent to a single collision in one of 75 databases of 10^9 compounds each.”
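The figure in that quote can be sanity-checked with the standard birthday-bound approximation. The sketch below is mine, not taken from the InChI site, and it assumes that the 14-letter first block carries 65 bits of hash; the function name is purely illustrative.

```python
import math

def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday-bound approximation: P ~ 1 - exp(-n^2 / (2 * 2^bits))."""
    space = 2 ** hash_bits
    exponent = (n_items ** 2) / (2 * space)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for small x.
    return -math.expm1(-exponent)

# Assumption: the first block of the InChIKey encodes 65 bits of hash.
p = collision_probability(10**9, 65)
print(f"P(at least one collision among 1e9 structures) = {p:.2%}")
print(f"i.e. roughly one database in {1 / p:.0f}")
```

Under that assumption the probability comes out at about 1.3%, or roughly one database in 74, which is close to the quoted estimate.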

At a previous ACS Meeting Prof Jonathan Goodman from University of Cambridge announced that he had identified a collision. The collision was for two isomers of spongistatin, a rather complex chemical structure with many stereocenters.

Jonathan has “done it again”…what a troublemaker he is (in a supremely gentlemanly way!). I was fortunate enough to receive the news about this collision from him just as I was getting on the flight from ACS Denver to home tonight and asked his permission to blog it as it is both exciting and, I believe, quite surprising news. Why? In this case the collision is for two distinctly different chemicals with totally different formulae and with NO stereochemistry! Very surprising!

As you can see in the figure below the two chemical compounds are simply long branched alkyl chains, one an alcohol and one a ketone.

In case Jonathan’s software tool that he was using to connect to the InChI generation software was doing something untoward with the molfile I confirmed the observation myself by drawing the structures in ACD/ChemSketch and generating the InChIKeys there. And, sure enough…I see exactly the same Standard InChIKeys for both molecules as shown in the movie below. VERY interesting!

 


Continuing Conflicts in the Messy World of Internet Chemistry

I have been looking at the state of curated data on the internet and blogged last night about the messy world of curated data. I should emphasize…none of these commentaries are meant to be harsh. Believe me, I’ve gone through the process of validating data and it’s difficult. There will be mistakes but what we need are processes and systems to clean these data up efficiently. If I see an error I want to annotate it and let people know there is an error. With today’s technologies it is not difficult.

Let’s take another example from DrugBank.

That listed chemical name above the structure doesn’t look very consistent…I don’t see any stereochemistry, certainly no “dihydroxy” and overall…yes, it’s definitely wrong. The actual structure for that name is shown below. Looks like an entire half of the molecule is missing. The InChI and InChIKey are for the molecule shown in DrugBank but the link to KEGG is to the molecule shown below…here.

The links on DrugBank to PubChem and ChEBI are to the molecule to the left. All of the outlinks in the DrugBank record are for the structure on the left, EXCEPT that the actual structure on the record, and its associated SMILES and InChIs, are for the “2-amino-3,5-dihydro-4H-pyrrolo[2,3-d]pyrimidin-4-one” moiety. Oops.

Recently I pointed out to David Wishart, host of DrugBank, some of the issues I had been seeing and it appears there will be a major update to DrugBank in the next few weeks that, in theory, will address some, and hopefully all of these observations.


The Messy World of Even Curated Chemistry on the Internet

Recently I have been spending my night hours looking into the nature of curated chemistry on the internet. Three years ago I made some assumptions around the quality of certain online datasets when they were deposited onto ChemSpider. It was clear that a lot of internet chemistry datasets were “impure”…I think messy, untrustworthy and confused would be a fairer statement! However, there were a number of datasets that were manually curated and, at initial viewing, were higher quality. With time, however, I have become increasingly concerned with some of the datasets that I had originally cited as high quality. Over the next few days/weeks I will examine some of these in detail and highlight some of the issues I am seeing. I want to clarify that all chemical compounds, in terms of their connection tables, their stereochemistry and the association between the compound and the name(s), are assertions. However, there are “norms” for these structures…we would expect a particular structure for aspirin (acetylsalicylic acid), a single structure for Cholesterol and a single structure for Taxol. By the way, the links to Wikipedia are not assertions that the structures that are presently on Wikipedia are correct representations…but I can confirm that PREVIOUSLY I did work to confirm that every one of these was consistent with my investigations to assert the association between the chemical name and the structure. Since then it is possible that someone edited the structure…such is the world of Wikipedia!

Two of the linked data sources I have been investigating of late are DrugBank and the Protein Data Bank. Both of these are manually curated and are expected to be of high quality. In my discussions with various members of the Life Science industry I have heard many positive comments about these data sources being trustworthy and high quality. I recently downloaded the DrugBank small molecule set and started looking at it. Let’s take one example…

The DrugBank record DB02309 has the chemical name “5-Monophosphate-9-Beta-D-Ribofuranosyl Xanthine”. The structure on DrugBank is shown below.

The chemical name above is inconsistent with the structure…there is no stereochemistry in the molecule displayed despite the “-D-” in the name. The IUPAC name listed in the DrugBank record is “[(2R,3S,4R,5R)-5-(2,6-dioxo-3,7-dihydropurin-9-ium-9-yl)-3,4-dihydroxyoxolan-2-yl]methyl dihydrogen phosphate” and this clearly does not agree with the displayed structure.

The InChI listed on the record does not include a stereo layer (InChI=1/C10H13N4O9P/c15-5-3(1-22-24(19,20)21)23-9(6(5)16)14-2-11-4-7(14)12-10(18)13-8(4)17/h2-3,5-6,9,15-16H,1H2,(H2,19,20,21)(H2,12,13,17,18)/p+1/fC10H14N4O9P/h11-13,19-20H/q+1). The InChIKey is listed as:

DCTLYFZHFGENCW-NIVOTTSGCB
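Whether an InChI carries stereo information can be checked mechanically, because each layer after the formula starts with a one-letter prefix. The sketch below is a minimal illustration rather than a full InChI parser: it splits on "/" and looks for layers beginning with "t" (tetrahedral) or "b" (double bond). The two strings are copied from this post (the DrugBank record above and the PDB ligand page quoted further down); the function and constant names are my own.

```python
# InChI strings copied from this post: the DrugBank record, and the PDB ligand
# page quoted further down.
DRUGBANK_INCHI = (
    "InChI=1/C10H13N4O9P/c15-5-3(1-22-24(19,20)21)23-9(6(5)16)14-2-11-4-7(14)"
    "12-10(18)13-8(4)17/h2-3,5-6,9,15-16H,1H2,(H2,19,20,21)(H2,12,13,17,18)"
    "/p+1/fC10H14N4O9P/h11-13,19-20H/q+1"
)
PDB_INCHI = (
    "InChI=1/C10H13N4O9P/c15-5-3(1-22-24(19,20)21)23-9(6(5)16)14-2-11-4-7(14)"
    "12-10(18)13-8(4)17/h2-3,5-6,9,15-16H,1H2,(H2,19,20,21)(H2,12,13,17,18)"
    "/p+1/t3-,5-,6-,9-/m1/s1/fC10H14N4O9P/h11-13,19-20H/q+1"
)

# Layer prefixes that carry stereo descriptors: /b (double bond), /t (tetrahedral).
STEREO_PREFIXES = ("b", "t")

def stereo_layers(inchi: str) -> list[str]:
    """Return the '/'-separated layers of an InChI that start with a stereo prefix."""
    # Skip the version prefix ("InChI=1") and the formula, which carry no prefix letter.
    layers = inchi.split("/")[2:]
    return [layer for layer in layers if layer.startswith(STEREO_PREFIXES)]

for label, inchi in (("DrugBank", DRUGBANK_INCHI), ("PDB", PDB_INCHI)):
    found = stereo_layers(inchi)
    print(label, "stereo layers:", found if found else "none")
```

For these two strings the DrugBank InChI has no such layer, while the PDB one has a single "t" layer covering four centers.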

The DrugBank record links to a structure with full explicit stereochemistry on PubChem here and to the ligand on the PDB ligand database hosted by ChEBI here.

The molfile downloaded from DrugBank has no stereochemistry, but the record lists both Canonical and Isomeric SMILES:

Isomeric SMILES O[C@H]1[C@H](COP(O)(O)=O)O[C@H]([C@@H]1O)N1C=[NH+]C2=C1NC(=O)NC2=O
Canonical SMILES OC1C(COP(O)(O)=O)OC(C1O)N1C=[NH+]C2=C1NC(=O)NC2=O

It is clear what has happened, I believe…the DrugBank record has used the canonical SMILES to generate the structure image and has neglected the stereochemistry. However, the names carry the original stereochemistry information while the InChI comes from the structure with no stereo. I think that’s what happened. Let’s confirm.
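One quick check of the first part of that hypothesis is to strip the chirality marks from the isomeric SMILES and see whether what is left is the canonical SMILES. The sketch below is deliberately naive string handling, not a SMILES parser: it assumes the only stereo marks are on neutral carbon atoms written as [C@H], [C@@H], [C@] or [C@@], plus "/" and "\" bond marks. The function name is my own.

```python
import re

# The two SMILES strings as listed on the DrugBank record.
ISOMERIC = "O[C@H]1[C@H](COP(O)(O)=O)O[C@H]([C@@H]1O)N1C=[NH+]C2=C1NC(=O)NC2=O"
CANONICAL = "OC1C(COP(O)(O)=O)OC(C1O)N1C=[NH+]C2=C1NC(=O)NC2=O"

def strip_stereo(smiles: str) -> str:
    """Naively remove stereo marks from a SMILES string.

    Assumption: chiral centers are neutral carbons, so the bracket atom can be
    replaced by a plain 'C'. Anything else (charges, isotopes, other elements)
    is left untouched.
    """
    without_centers = re.sub(r"\[C@{1,2}H?\]", "C", smiles)
    # Drop cis/trans bond marks as well.
    return without_centers.replace("/", "").replace("\\", "")

print(strip_stereo(ISOMERIC))
print("matches canonical SMILES:", strip_stereo(ISOMERIC) == CANONICAL)
```

For this record the stripped string is character-for-character the canonical SMILES, so the two listed strings describe the same connectivity and differ only in the stereo marks.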

ASSUMING that the isomeric SMILES string carries the appropriate stereochemistry, I can convert it and get the following InChIKey (generated using ACD/ChemSketch) and, using ACD/Name, the name below. I trust the ChemSketch and ACD/Name products to generate both appropriately as I managed these products while at ACD/Labs for over a decade.

DCTLYFZHFGENCW-NSVMUQOTBF

9-{(2R,3R,4R,5S)-3,4-dihydroxy-5-[(phosphonooxy)methyl]tetrahydrofuran-2-yl}-2,6-dioxo-2,3,6,9-tetrahydro-1H-purin-7-ium

On Drugbank the chemical name listed is:

[(2R,3S,4R,5R)-5-(2,6-dioxo-3,7-dihydropurin-9-ium-9-yl)-3,4-dihydroxyoxolan-2-yl]methyl dihydrogen phosphate

Okay…the names are subtly different. Each name contains three R centers and one S center, but the descriptors fall on different positions, assuming that the nomenclature programs are using consistent numbering schemes. See below.

Name generated from Isomeric SMILES on DrugBank: 2R,3R,4R,5S

Chemical Name on DrugBank: 2R,3S,4R,5R
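The numbering assumption can be examined from the names themselves. In the DrugBank name the ring is an “oxolan-2-yl” with the base at position 5, while in the ACD/Name output it is a “tetrahydrofuran-2-yl” with the phosphonooxymethyl group at position 5, so the ring appears to be numbered in the opposite direction. The sketch below is my own reading, not something either tool reports: it pulls the descriptors out of each name and maps locant k to 7 - k for the DrugBank name before comparing. R/S labels belong to atoms, so only the locants move.

```python
import re

DRUGBANK_NAME = (
    "[(2R,3S,4R,5R)-5-(2,6-dioxo-3,7-dihydropurin-9-ium-9-yl)"
    "-3,4-dihydroxyoxolan-2-yl]methyl dihydrogen phosphate"
)
ACD_NAME = (
    "9-{(2R,3R,4R,5S)-3,4-dihydroxy-5-[(phosphonooxy)methyl]"
    "tetrahydrofuran-2-yl}-2,6-dioxo-2,3,6,9-tetrahydro-1H-purin-7-ium"
)

def stereo_descriptors(name: str) -> dict[int, str]:
    """Extract the first '(2R,3S,...)' group from a name as {locant: 'R' or 'S'}."""
    match = re.search(r"\((\d+[RS](?:,\d+[RS])*)\)", name)
    if match is None:
        return {}
    return {int(token[:-1]): token[-1] for token in match.group(1).split(",")}

def renumber_ring(descriptors: dict[int, str], first: int = 2, last: int = 5) -> dict[int, str]:
    """Reverse the ring numbering: locant k becomes first + last - k (assumed mapping)."""
    return {first + last - locant: label for locant, label in descriptors.items()}

drugbank = renumber_ring(stereo_descriptors(DRUGBANK_NAME))
acd = stereo_descriptors(ACD_NAME)
differing = sorted(locant for locant in acd if acd[locant] != drugbank.get(locant))

print("DrugBank name, renumbered:", ",".join(f"{k}{drugbank[k]}" for k in sorted(drugbank)))
print("From isomeric SMILES:     ", ",".join(f"{k}{acd[k]}" for k in sorted(acd)))
print("Locants that differ:", differing)
```

If that mapping is right, the DrugBank name corresponds to 2R,3R,4S,5R in the ACD/Name numbering, and the name generated from the DrugBank isomeric SMILES differs from it at positions 4 and 5.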

More on this later. Looking at the linked PubChem record gives the following name: [(2R,3S,4R,5R)-5-(2,6-dioxo-3,7-dihydropurin-9-ium-9-yl)-3,4-dihydroxyoxolan-2-yl]methyl dihydrogen phosphate, exactly the same one as listed on DrugBank…so one assumes that the chemical names on DrugBank come from PubChem. Downloading the molfile from PubChem into the same software used to generate InChIs and chemical names gives:

XHDARDSMKMUDDI-XWTUZWARBP

9-{(2R,3R,4S,5R)-3,4-dihydroxy-5-[(phosphonooxy)methyl]tetrahydrofuran-2-yl}-2,6-dioxo-2,3,6,7-tetrahydro-1H-purin-9-ium

DrugBank is linked out to the PDB ligands hosted by ChEBI and looking at the XMP ligand here we see:

[(2R,3S,4R,5R)-5-(2,6-dioxo-3H-purin-7-ium-9-yl)-3,4-dihydroxy-oxolan-2-yl]methyl dihydrogen phosphate

This is the SAME stereochemistry in the chemical name as on DrugBank, but actually a different chemical name. It is definitely possible, and common, for different systematic names to exist for the same chemical but it does indicate the challenges of linking based on different identifiers.

DrugBank: [(2R,3S,4R,5R)-5-(2,6-dioxo-3,7-dihydropurin-9-ium-9-yl)-3,4-dihydroxyoxolan-2-yl]methyl dihydrogen phosphate

PDBeChem: [(2R,3S,4R,5R)-5-(2,6-dioxo-3H-purin-7-ium-9-yl)-3,4-dihydroxy-oxolan-2-yl]methyl dihydrogen phosphate

The InChIKeys between the different databases/tools are:

PDBeChem: DCTLYFZHFGENCW-KWDNBKPHDV

DrugBank: DCTLYFZHFGENCW-NIVOTTSGCB

PubChem: XHDARDSMKMUDDI-UUOKFMHZSA-O (this is a StdInChIKey)

ChemSketch: DCTLYFZHFGENCW-NSVMUQOTBF

ALL four are inconsistent.
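Since the keys are hyphen-separated blocks, a few lines of Python make it easier to see where they agree and where they do not. This is a minimal sketch using only the four keys listed above; the helper name is my own, and it simply treats everything after the first hyphen as “the rest” without interpreting it.

```python
from collections import defaultdict

# The four InChIKeys listed above.
KEYS = {
    "PDBeChem": "DCTLYFZHFGENCW-KWDNBKPHDV",
    "DrugBank": "DCTLYFZHFGENCW-NIVOTTSGCB",
    "PubChem": "XHDARDSMKMUDDI-UUOKFMHZSA-O",
    "ChemSketch": "DCTLYFZHFGENCW-NSVMUQOTBF",
}

def split_key(key: str) -> tuple[str, str]:
    """Split an InChIKey into its first block and everything after the first hyphen."""
    first, _, rest = key.partition("-")
    return first, rest

# Group the sources by the first block of their key.
by_first_block = defaultdict(list)
for source, key in KEYS.items():
    first, rest = split_key(key)
    by_first_block[first].append((source, rest))

for first, members in by_first_block.items():
    print(first)
    for source, rest in members:
        print(f"    {source:<10} {rest}")
```

Three of the four share the first block and differ only after the hyphen; the PubChem Standard InChIKey differs in the first block as well.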

If I convert the SMILES string listed on the PDBeChem ligand database using ACD/ChemSketch then

O[C@H]1[C@@H](O)[C@@H](O[C@@H]1CO[P](O)(O)=O)n2c[nH+]c3C(=O)NC(=O)Nc23

produces a structure with stereochemistry of 2R,3R,4S,5R and the InChIKey: DCTLYFZHFGENCW-XWTUZWARBW.

The stereochemistry on PDBeChem agrees with that on PubChem (based on the name). The first, connectivity block of this InChIKey (DCTLYFZHFGENCW) matches every other system except PubChem, but the second block is different from all of the other InChIKeys. It is also possible to download “ideal” and “representative” molfiles from the PDBeChem database.

The InChIKeys between the different databases/tools are now:

PDBeChem: DCTLYFZHFGENCW-KWDNBKPHDV

DrugBank: DCTLYFZHFGENCW-NIVOTTSGCB

PubChem: XHDARDSMKMUDDI-UUOKFMHZSA-O (this is a StdInChIKey)

ChemSketch: DCTLYFZHFGENCW-NSVMUQOTBF

PDBeChem: DCTLYFZHFGENCW-XWTUZWARBW (from Isomeric SMILES converted via ChemSketch)

PDBeChem: DCTLYFZHFGENCW-RDKRKOOMBN (from molfile from PDBeChem, 2R,3R,4S,5R)

DrugBank also links to the Protein Data Bank here. XMP is listed as a ligand as shown.

The XMP ligand links here to the detailed page containing the information linked below.

Name: XANTHOSINE-5′-MONOPHOSPHATE
      5′-xanthylic acid
      [(2R,3S,4R,5R)-5-(2,6-dioxo-3H-purin-7-ium-9-yl)-3,4-dihydroxy-oxolan-2-yl]methyl dihydrogen phosphate
Synonyms: 5-MONOPHOSPHATE-9-BETA-D-RIBOFURANOSYL XANTHINE
Formula: C10 H14 N4 O9 P
Molecular Weight: 365.21 g/mol
Type: NON-POLYMER
Isomeric SMILES (OpenEye): c1[nH+]c2c(n1[C@H]3[C@@H]([C@@H]([C@H](O3)COP(=O)(O)O)O)O)NC(=O)NC2=O
InChI: InChI=1/C10H13N4O9P/c15-5-3(1-22-24(19,20)21)23-9(6(5)16)14-2-11-4-7(14)12-10(18)13-8(4)17/h2-3,5-6,9,15-16H,1H2,(H2,19,20,21)(H2,12,13,17,18)/p+1/t3-,5-,6-,9-/m1/s1/fC10H14N4O9P/h11-13,19-20H/q+1
InChI key: DCTLYFZHFGENCW-KWDNBKPHDV

The InChIKeys between the different databases/tools are now:

PDBeChem: DCTLYFZHFGENCW-KWDNBKPHDV

DrugBank: DCTLYFZHFGENCW-NIVOTTSGCB

PubChem: XHDARDSMKMUDDI-UUOKFMHZSA-O (this is a StdInChIKey)

ChemSketch: DCTLYFZHFGENCW-NSVMUQOTBF

PDBeChem: DCTLYFZHFGENCW-RDKRKOOMBN (from molfile from PDBeChem)

PDB_ligand: DCTLYFZHFGENCW-KWDNBKPHDV

Aagghhhhh…InChIKeys get very convoluted! What we see is that the chemical structure on PDB and on PDBeChem are the same. This is good news at least! There is a difference in the InChIKeys when I download the molfile but this can be explained easily…and in a later blog post.

We believe that the structure on PDB should be expected to be correct. We will assert this.

We expect that DrugBank is sourcing the chemical from PDB to add to their database, so the chemical structure on DrugBank should coincide with that from PDB. Unfortunately the SMILES on PDB and DrugBank differ in two stereocenters, and we don’t know why. If the DrugBank data for the XMP ligand aren’t from PDB, where did they come from?

Did PubChem pick up the structure of XMP from the PDB Database or from DrugBank? Let’s see. If I download the 2D molfile from PubChem and generate the chemical name and InChIs I get consistency…PubChem IS consistent with PDB. It is NOT consistent with DrugBank despite the fact that DrugBank is linked into this PubChem record.

This is a very convoluted, and maybe confusing analysis of ONE compound on DrugBank. I have looked at dozens and see similar issues. Assuming that PDB is the source database for data on DrugBank why are the structures differing so much? There are worse examples to come…the linking together of data on the web between even curated databases is an incredible mess.

Caveat: This is detailed and challenging work. I encourage anyone to check my work, see if I missed anything, and confirm or challenge the observations, as some of the issues I am seeing may be tool-based…the software tools I use may have issues with SMILES conversion, molfile or SDF reading etc. It is exacting to check chemical structures…


Three Presentations to give at ACS Spring, Salt Lake City

I’ll be giving three papers at the ACS meeting in Salt Lake City in Spring of next year. It seems way in the distance but as usual that time will come way too quickly. I’ve accepted invitations to write four papers before the end of January so it will be the usual crunch. See you in Salt Lake City!?

PAPER ID: 1212659
PAPER TITLE: “Going a mile InChI by InChI: Enabling online chemistry at ChemSpider”

DIVISION: Division of Chemical Information
SESSION: The Adoption and Use of the IUPAC InChI/InChIKey
SESSION START TIME: Sunday, March 22, 2009, 9:00 AM

PRESENTATION FORMAT: Oral
DAY & TIME OF PRESENTATION: Sunday, March 22, 2009 from 9:35 AM to 10:05 AM

PAPER ID: 1238487
PAPER TITLE: “Text mining for chemistry and building a public platform for document markup”

DIVISION: Division of Chemical Information
SESSION: General Papers
SESSION START TIME: Wednesday, March 25, 2009, 2:00 PM

PRESENTATION FORMAT: Oral
DAY & TIME OF PRESENTATION: Wednesday, March 25, 2009 from 2:05 PM to 2:30 PM

PAPER ID: 1243060
PAPER TITLE: “Cleaning up chemistry for the pharma industry: Delivering a flexible platform for interrogating the FDA DailyMed website”

DIVISION: Division of Chemical Information
SESSION: General Papers
SESSION START TIME: Wednesday, March 25, 2009, 2:00 PM

PRESENTATION FORMAT: Oral
DAY & TIME OF PRESENTATION: Wednesday, March 25, 2009 from 3:55 PM to 4:20 PM
