Markush Misrepresentations in ChemSpider

29 Apr

Following on from the many comments made about the recent post about the NPC Browser, Markus Sitzmann highlighted a “fun molecule” that he found on ChemSpider. It was here as ChemSpiderID 19053748, shown below, but it has now been deprecated…I logged in and deprecated it.

A "fun structure" on ChemSpider

Markus also commented on Sean Ekins’ blog here:

“Well, particularly ChemSpider belongs to the group of “polluters” in PubChem. Count the number of Aspirin, Benzene or Ethanol structures submitted by ChemSpider to PubChem (only linking to a “deprecated” ChemSpider record). Or make an advanced search for ChemSpider records containing also Argon, here is an example:

There are many other examples.”

Markus is CORRECT. I have commented on this publicly myself on a number of occasions, and many people have noticed that there are data in PubChem that are in error and originally came from ChemSpider. There’s no point denying it as it’s there for all to see! We have intended for a LONG time to deprecate these data from PubChem and replace them with an updated deposition of cleaner data. The intention remains, but the challenge is finding the time to do it. We will do it.

Where did the data come from? These “argon” issues are really NOT argon issues…they are the result of molfiles finding their way into ChemSpider from “patent molecules” where the -Ar is expected to represent a Markush structure, with Ar meaning “Aryl”. This is like -Alk meaning alkyl. Similar issues arise when molecules are drawn with -X, -Y and -Z and lists of X, Y, Z substitutions are given. For example X = CH3, C2H5; Y = F, Br; and Z = Br, Cl. Unfortunately Y is not only a substituent label, it’s also an element, Yttrium. So when a molecule is drawn with a supposed Markush bond to -Y, we end up with a REAL molecule with Yttrium attached. Agh.
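To make the problem concrete, here is a minimal Python sketch of how such records might be flagged automatically. It is not how ChemSpider does its curation; it simply assumes the input is a V2000 molfile and reports any atom whose symbol is one of the two ambiguous labels discussed above (Ar and Y) and that has at least one bond. The function names `flag_markush_suspects` and `build_molblock` and the `AMBIGUOUS` table are made up for this illustration, and the helper that builds a molblock exists only so the example can run without an input file.

```python
# Element symbols that double as Markush shorthand in patent drawings.
# Only the two cases named in the post are listed; the table is an
# illustrative assumption, not an exhaustive or official list.
AMBIGUOUS = {
    "Ar": "aryl group or argon",
    "Y": "variable substituent or yttrium",
}


def flag_markush_suspects(molblock):
    """Return (atom_index, symbol, reason) for bonded atoms with ambiguous symbols.

    Assumes a V2000 molfile: three header lines, a counts line, then the
    atom block and the bond block in fixed-width columns.
    """
    lines = molblock.splitlines()
    counts = lines[3]
    n_atoms = int(counts[0:3])
    n_bonds = int(counts[3:6])
    atom_lines = lines[4:4 + n_atoms]
    bond_lines = lines[4 + n_atoms:4 + n_atoms + n_bonds]

    # The atom symbol occupies columns 32-34 (0-based slice 31:34).
    symbols = [ln[31:34].strip() for ln in atom_lines]

    # Count how many bonds touch each atom (atom numbers are 1-based).
    degree = [0] * n_atoms
    for ln in bond_lines:
        first, second = int(ln[0:3]), int(ln[3:6])
        degree[first - 1] += 1
        degree[second - 1] += 1

    # A lone argon atom is plausibly real argon; a bonded one is suspicious.
    return [
        (i + 1, sym, AMBIGUOUS[sym])
        for i, sym in enumerate(symbols)
        if sym in AMBIGUOUS and degree[i] > 0
    ]


def build_molblock(symbols, bonds):
    """Hypothetical helper: build a tiny V2000 molblock for demonstration.

    `bonds` is a list of (first_atom, second_atom, order) with 1-based atoms.
    All coordinates are set to zero because only connectivity matters here.
    """
    counts = f"{len(symbols):3d}{len(bonds):3d}  0  0  0  0  0  0  0  0999 V2000"
    atoms = [
        f"{0.0:10.4f}{0.0:10.4f}{0.0:10.4f} {sym:<3} 0  0  0  0  0  0  0  0  0  0  0  0"
        for sym in symbols
    ]
    bond_rows = [f"{a:3d}{b:3d}{order:3d}  0" for a, b, order in bonds]
    return "\n".join(["demo", "  sketch", "", counts, *atoms, *bond_rows, "M  END"])


if __name__ == "__main__":
    # A carbon bonded to "Ar": almost certainly a mis-parsed aryl placeholder.
    print(flag_markush_suspects(build_molblock(["C", "Ar"], [(1, 2, 1)])))
```

A check like this only raises a flag for a human to look at. Yttrium compounds do exist, so a hit on Y in particular is a prompt for review, not proof of a Markush artifact.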

A list of examples of “interesting Ar molecules” is shown below.

At this point these have all been deprecated…it takes about 30 seconds per molecule…but if they were in our original deposition to PubChem they are still there until we deprecate them. Ahh…the ongoing joys of data curation.



About tony

Founder of ChemZoo Inc., the host of ChemSpider. ChemSpider is an open access online database of chemical structures and property transaction based services to enable chemists around the world to data mine chemistry databases. The Royal Society of Chemistry acquired ChemSpider in May 2009. Presently working as a consortium member of the OpenPHACTS IMI project. This focuses on how drug discovery can utilize semantic technologies to improve decision making and brings together 22 European team members to develop an infrastructure to link together public and private data for the drug discovery community. I am also involved with the PharmaSea FP7 project, trying to identify new classes of marine natural products with potential pharmacological activity. I am also one of the hosts for three wikis for Science: ScientistsDB, SciMobileApps and SciDBs. Over the past decade I held many responsibilities including the direction of the development of scientific software applications for spectroscopy and general chemistry, directing marketing efforts, sales and business development collaborations for the company. Eight years experience of analytical laboratory leadership and management. Experienced in experimental techniques, implementation of new NMR technologies, walk-up facility management, research and development, manufacturing support and teaching. Ability to provide situation analysis, creative solutions and establish good working relationships. Prolific author with over 150 peer-reviewed scientific publications, 3 patents and over 300 public presentations. Specialties: Leadership in the domain of free access Chemistry, Product and project management, Organizational and Leadership development, Competitive analysis and Business Development, Entrepreneurial.

4 Responses to Markush Misrepresentations in ChemSpider

  1. Markus Sitzmann

    April 29, 2011 at 10:38 pm

    Well, from my analysis with our local copy of ChemSpider here (taken from PubChem) I estimate the number of problematic/fishy structures in ChemSpider at between 1 and 2 million CSIDs (in this regard 30 sec/molecule is a long time 😉 ). I spot-checked maybe a few hundred of them with the ChemSpider web page – most of them are marked as “deprecated”. Since doing things like this with a web page is cumbersome, it would be great if the information “deprecated CSID” were available in an easier manner (e.g. as URL API or something like this).

  2. tony

    April 29, 2011 at 10:43 pm

    Yup…many of those marked as deprecated on our site, and therefore necessary to deprecate from PubChem, are from a historic definition of patent data. These will all get removed when we deprecate and when we redeposit will be gone from the set. NONE of the Argon related compounds that I deprecated tonight came from that dataset. If you look at my post on Mercury Argon for example that originated with PubChem as did some of the others. But now they are gone….from our site at least.
    Watch for the news shortly about the work we are doing to share deprecation information out with appropriate feeds. You should be able to use these directly when we expose!

  3. Egon Willighagen

    April 30, 2011 at 2:44 am

    Antony, the CDK project has a fairly decent atom typing framework, which would catch many issues like this. Now, if the ChemSpider data would have been Open, then you would have seen alerts on structures like these at least two years ago.

    • tony

      April 30, 2011 at 9:52 pm

      Egon….despite the fact that PubChem is not “open” per se, as we have discussed before, the PubChem data is available for download so it would be good to see how the atom typing framework would perform on PubChem. Is it feasible to take on their 30 million structures and check them?

