Markush Misrepresentations in ChemSpider


Following on from the many comments on the recent post about the NPC Browser, Markus Sitzmann highlighted a “fun molecule” that he found on ChemSpider. It was here as ChemSpiderID 19053748, shown below, but it has now been deprecated…I logged in and deprecated it.

A "fun structure" on ChemSpider

Markus also commented on Sean Ekins’ blog here:

“Well, particularly ChemSpider belongs to the group of “polluters” in PubChem. Count the number of Aspirin, Benzene or Ethanol structures submitted by ChemSpider to PubChem (only linking to a “deprecated” ChemSpider record). Or make an advanced search for ChemSpider records containing also Argon, here is an example:

http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=20187034&loc=ec_rcs

There are many other examples.”

Markus is CORRECT. I have commented on this publicly myself on a number of occasions, and many people have noticed that there are data in PubChem that are in error and originally came from ChemSpider. There’s no point denying it as it’s there for all to see! We have intended for a LONG time to deprecate these data from PubChem and replace them with an updated deposition of cleaner data. The intention remains but the challenge is finding the time to do it. We will do it.

Where did the data come from? These “argon” issues are really NOT argon issues…they are the result of molfiles finding their way into ChemSpider from “patent molecules” where -Ar is meant to represent part of a Markush structure, with Ar meaning “Aryl”. This is like -Alk meaning alkyl. Similar issues arise when molecules are drawn with -X, -Y and -Z and lists of X, Y, Z substitutions are given, for example X = CH3, C2H5; Y = F, Br; and Z = Br, Cl. Unfortunately Y is not only a substituent label, it’s also an element: Yttrium. So when a molecule is drawn with a supposed Markush bond to -Y we end up with a REAL molecule with Yttrium attached. Agh.
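To make the collision concrete, here is a minimal sketch of the kind of check that could flag these records. It is not how ChemSpider curates its data: the function name, the label table and the toy atom/bond lists are all made up for illustration, and the rule (any Ar or Y atom that has a bond is suspicious) is a deliberately crude heuristic that would also flag genuine yttrium compounds for review.

```python
from collections import Counter

# Labels that are both a Markush abbreviation and a real element symbol.
# Hypothetical table: only the two collisions discussed in this post are listed.
AMBIGUOUS_LABELS = {
    "Ar": ("argon", "aryl group"),
    "Y": ("yttrium", "generic substituent Y"),
}


def flag_markush_suspects(atoms, bonds):
    """Return (atom_index, symbol, note) for bonded atoms with an ambiguous label.

    atoms: list of atom symbols as read from a structure file, e.g. ["C", "Ar"]
    bonds: list of (i, j) index pairs into atoms
    """
    # Count how many bonds each atom takes part in.
    degree = Counter()
    for i, j in bonds:
        degree[i] += 1
        degree[j] += 1

    suspects = []
    for idx, symbol in enumerate(atoms):
        # An isolated Ar or Y atom is left alone; a bonded one is flagged for review.
        if symbol in AMBIGUOUS_LABELS and degree[idx] > 0:
            element, markush = AMBIGUOUS_LABELS[symbol]
            note = f"parsed as {element}, possibly meant {markush}"
            suspects.append((idx, symbol, note))
    return suspects


# Toy record: a carbon bonded to an "Ar" label, as a patent drawing might encode it.
toy_atoms = ["C", "C", "O", "Ar"]
toy_bonds = [(0, 1), (1, 2), (1, 3)]
for hit in flag_markush_suspects(toy_atoms, toy_bonds):
    print(hit)
```

A check like this only produces candidates for review; deciding whether a flagged record is a real organometallic compound or a misread Markush label still needs a curator or a proper atom-typing step.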

A list of examples of “interesting Ar molecules” is shown below.

At this point these have all been deprecated…it takes about 30 seconds per molecule…but if they were in our original deposition to PubChem they will remain there until we deprecate them. Ahh…the ongoing joys of data curation.

 

  1. #1 by Markus Sitzmann on April 29, 2011 - 10:38 pm

    Well, from my analysis with our local copy of ChemSpider here (taken from PubChem) I estimate the number of problematic/fishy structures in ChemSpider at between 1 and 2 million CSIDs (in this regard 30 sec/molecule is a long time ;-) ). I spot-checked maybe a few hundred of them with the ChemSpider web page – most of them are marked as “deprecated”. Since doing things like this with a web page is cumbersome, it would be great if the information “deprecated CSID” were available in an easier manner (e.g. as a URL API or something like this).

  2. #2 by tony on April 29, 2011 - 10:43 pm

    Yup…many of those marked as deprecated on our site, and therefore necessary to deprecate from PubChem, are from a historic definition of patent data. These will all get removed when we deprecate and when we redeposit will be gone from the set. NONE of the Argon related compounds that I deprecated tonight came from that dataset. If you look at my post on Mercury Argon for example that originated with PubChem as did some of the others. But now they are gone….from our site at least.
    Watch for the news shortly about the work we are doing to share deprecation information out with appropriate feeds. You should be able to use these directly when we expose!

  3. #3 by Egon Willighagen on April 30, 2011 - 2:44 am

    Antony, the CDK project has a fairly decent atom typing framework, which would catch many issues like this. Now, if the ChemSpider data would have been Open, then you would have seen alerts on structures like these at least two years ago.

    • #4 by tony on April 30, 2011 - 9:52 pm

      Egon….despite the fact that PubChem is not “open” per se, as we have discussed before, the PubChem data is available for download so it would be good to see how the atom typing framework would perform on PubChem. Is it feasible to take on their 30 million structures and check them?
