I am presently working on a couple of articles, book chapters and guest blog posts regarding quality in public domain chemistry databases. In so doing I have continued to work through the data contained within the NPC Browser, which I have blogged about many times before. I have been adding curation comments to the data as I have worked through them and have removed inappropriately associated chemical names. Eventually it became too much of a burden relative to getting my work done, as there are so many edits required. What I have been looking for specifically is examples of something I thought would exist in the database: a failure to deduplicate. Deduplication, in terms of chemistry databases, is collapsing together records that are based on the same chemical structure. This sounds easy but it isn’t necessarily so. Consider some of the complexities of collapsing tautomers. Simple collapsing can be done by generating InChIKeys and deduplicating on them, but InChI tautomer detection is imperfect and this approach will fail regularly. The majority of the cheminformatics toolkits have their own ways of generating fingerprints to deal with this issue of deduplication.
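To make the idea of key-based deduplication concrete, here is a minimal sketch in Python. It is not how the NPC Browser or any particular toolkit does it. It assumes each record already carries a precomputed structure key (for example an InChIKey produced by a cheminformatics toolkit). The field names `record_id`, `name` and `structure_key`, the helper functions, and the placeholder keys `KEY-A` and `KEY-B` are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List


def group_by_structure_key(
    records: Iterable[dict], key_of: Callable[[dict], Hashable]
) -> Dict[Hashable, List[dict]]:
    """Collect records that share the same structure key into one group."""
    groups: Dict[Hashable, List[dict]] = defaultdict(list)
    for record in records:
        groups[key_of(record)].append(record)
    return dict(groups)


def find_duplicates(groups: Dict[Hashable, List[dict]]) -> Dict[Hashable, List[dict]]:
    """Keep only the groups holding more than one record."""
    return {key: members for key, members in groups.items() if len(members) > 1}


# Hypothetical records: the keys are placeholders, not real InChIKeys.
records = [
    {"record_id": "rec-1", "name": "Ranitidine", "structure_key": "KEY-A"},
    {"record_id": "rec-2", "name": "Ranitidine", "structure_key": "KEY-A"},
    {"record_id": "rec-3", "name": "Another compound", "structure_key": "KEY-B"},
]

groups = group_by_structure_key(records, lambda r: r["structure_key"])
duplicates = find_duplicates(groups)

# Report every key that is shared by more than one record.
for key, members in duplicates.items():
    print(key, [m["record_id"] for m in members])
```

The weakness described above carries straight through: grouping is only as good as the key. If two tautomers of the same compound are given different keys, they land in different groups and are never flagged as duplicates.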
While browsing the database I came across Ranitidine, the active component of the well-known drug Zantac. I found two records for it in the database. They are shown below, numbered 1/2 and 2/2.
Ranitidine record 1.
Ranitidine record 2.
I have compared these records as molfiles. I have compared the SMILES strings (below).
I have compared the InChIs.
Try as I might, I don’t see a difference between these structures. Why were they not deduplicated? This leads to the question of how many more duplicates are in the database, and why. I have no idea… just an observation.