
Tag Archives: PubChem

Terminal Dimethyl means Death by Methane twice

When writing talks I try to find interesting (and, where possible, fun) examples of how challenging it is to manage chemistry data, for all of us who maintain tens of thousands or, in our case, millions of compound pages for the community to use. I have told many stories over the past few years about the challenges we collectively face with data quality and how errors flow between our databases unabated. My latest example, used at the recent talk at the EBI (ChemSpider – An Online Database and Registration System Linking the Web), was the structure known as Terminal Dimethyl, presently on PubChem, DrugBank, Wolfram Alpha and PDBe. It was originally inherited into ChemSpider too but has since been deprecated. I left a comment on DrugBank a couple of weeks ago but it hasn’t been published yet…generally such errors are removed VERY quickly by the DrugBank hosts. I added a comment to Wolfram Alpha and received a canned response, with no changes to the record as yet.

There ARE ways to communally resolve these issues and I will blog about that shortly.

 
Posted on October 31, 2011 in Data Quality, Humor, Quality and Content

Navigating an Internet of Chemistry via ChemSpider

The internet is a rich source of chemistry related data and, nowadays, if a chemist knows how to initiate a search, data can be sourced for millions of chemicals online. The nature of online data varies from simple molecule diagrams, to experimental and predicted properties, encyclopedic articles, synthetic routes, analytical data, patents and publications. The array of information now accessible is distributed across thousands of sites giving rise to the information overload commonly associated with the Google-type searches on the internet. In addition the purest language of chemistry, that of chemical structures, is not fully supported on the web as yet. This presentation will provide an overview of how the internet is being meshed together using data aggregation and standardization approaches to enable a structure-searchable internet for chemistry. The speaker will present an overview of the ChemSpider platform (http://www.chemspider.com), the challenges of linking together over 400 internet resources and 26 million unique chemicals, and discuss how members of the chemistry community can directly contribute to enhancing the availability of quality data online.

This is a movie of the talk I gave using the BigBlueButton platform to students and faculty at the University of Arkansas, Little Rock.

 
Posted on October 19, 2011 in Publications and Presentations

Presentation at the 5th Meeting on U.S. Government Chemical Databases and Open Chemistry

Yesterday I gave a talk at the 5th Meeting on U.S. Government Chemical Databases and Open Chemistry hosted by Mark Nicklaus. It was a great meeting, with a lot of like-minded people and some great work going on to provide access to chemical databases. I’ll blog more when I get back from the ACS Denver meeting this coming week. For now I am simply putting up a copy of the talk I gave.

“ChemSpider is a structure centric database hosted by the Royal Society of Chemistry and integrating over 25 million chemical compounds to over 400 internet-based resources including many public domain databases, Wikipedia, chemical vendors, patents, publications and other web-based services. The intention is for ChemSpider to become one of the primary online hubs for chemists to source chemistry related data. During the development of the ChemSpider database we have utilized numerous approaches to standardizing, curating and validating the data supplied to us for hosting and integration. This presentation will provide an overview of our initial development of the ChemSpider database and provide an overview of our present processes and procedures for handling incoming data depositions. We will also discuss how crowdsourcing can help to expand, curate and validate the data on the ChemSpider database.”

 

 
Posted on August 27, 2011 in Publications and Presentations

Searching for “Complete Synonyms” in PubChem and the NPC Browser

I am interested in feedback from online databases as to the expected behavior of a search. PubChem has a Complete Synonym search that limits a chemical-name-based search to the synonym field. Without that fielded search the search is across all text in a record, I assume. The difference in the results is shown below. The top image shows a search for Taxol returning 59 results.

A search for Taxol in PubChem

Below is a search on Taxol[completesynonym]. This search returns 5 hits for Taxol.

I wonder whether most users of PubChem know that they need to add the [completesynonym] definition to limit the search? You might want to try Diamond and Diamond[completesynonym] as searches and look at the results.
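To make the two searches concrete, here is a minimal sketch that builds both queries as URLs for the NCBI E-utilities ESearch service against the PubChem Compound database (`pccompound`). It assumes that the `[completesynonym]` field qualifier used on the web page is accepted unchanged by ESearch; the helper name `pubchem_name_search_url` is made up for this illustration, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubchem_name_search_url(name, complete_synonym=False):
    """Build an ESearch URL for a name search of PubChem Compound.

    Assumption: the [completesynonym] qualifier behaves here as it does
    in the PubChem web search box.
    """
    term = f"{name}[completesynonym]" if complete_synonym else name
    return ESEARCH_URL + "?" + urlencode({"db": "pccompound", "term": term})


# Unrestricted text search versus the fielded synonym search.
print(pubchem_name_search_url("Taxol"))
print(pubchem_name_search_url("Taxol", complete_synonym=True))
```

The only difference between the two URLs is the qualifier appended to the search term, which is exactly the detail a casual user is unlikely to know about.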

I am assuming that a similar type of search can be conducted on the NPC Browser to limit results, as a search on the drug Lidocaine returns 14 chemicals…all of them different. If this search exists I have missed it. Can anyone comment?

With ChemSpider we do our utmost to return a single structure for a clearly unique name such as Taxol and Lidocaine. We believe that’s what most people would expect. Thoughts and comments welcome.

 


How Dominant is PubMed for Chemistry

Recently I moved this blog to WordPress hosting and started using a new Theme. This is work in progress. Many of the original image associations still need to be remade as the blog went from www.chemconnector.com/chemunicating to simply www.chemconnector.com. With the new theme I decided to start managing my CV, presentations and publications online too. I’ve had it staggered across various sites such as Mendeley but having it managed on my own blog just made more sense. In particular, what I have been doing is spending half an hour per night creating links between the papers on the My Curriculum Vitae page using the DOI and associated CrossRef Resolver to do the linking. It makes sense to go this path.
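The DOI-to-link step is mechanical, so a small sketch may help. It assumes the general-purpose resolver at `https://doi.org/` (the post only says "CrossRef Resolver", so the exact resolver address is an assumption), and the DOI used below is an obvious placeholder, not one of the papers on the CV.

```python
from urllib.parse import quote


def doi_link(doi, resolver="https://doi.org/"):
    """Return a resolver URL for a DOI, tolerating a leading 'doi:' prefix."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # Keep '/' unescaped: it separates the DOI prefix from the suffix.
    return resolver + quote(doi, safe="/")


# Placeholder DOI for illustration only.
print(doi_link("doi:10.1000/xyz123"))
```

Characters that are unsafe in a URL are percent-encoded, which matters for older DOIs that contain angle brackets or semicolons.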

In order to do the linking I first have to find the DOI. To find it I search for the paper title on Google, or for the reference where necessary. It’s had some interesting results already as I detailed here. While linking up the papers…75 done and about 30 to go…I observed an increasingly obvious trend. It was an unexpected trend based on what I had been told. The trend? PubMed is not just about medicine and the life sciences.

Wikipedia declares “PubMed is a free database accessing primarily the MEDLINE database of references and abstracts on life sciences and biomedical topics. The United States National Library of Medicine (NLM) at the National Institutes of Health maintains the database as part of the Entrez information retrieval system.”

However, time after time, as I searched for the titles of my articles PubMed kept coming out on top, ABOVE the actual publisher. Let’s take an example. On my CV is a paper referenced as:

G. E. Martin, B. D. Hilton, K. A. Blinov, and A. J. Williams, Multistep correlations via covariance processing of COSY/GCOSY spectra: opportunities and artifacts. Magn. Reson. Chem. 46, 997-1002, (2008). Here’s PART of the abstract…the entire abstract is marked as copyrighted but I was involved in writing it…

“Long-range homonuclear coupling pathways can be observed in COSY or GCOSY spectra by the acquisition of spectra with larger numbers of increments of the evolution period, t1, than would normally be used. Alternatively, covariance processing of COSY-type spectra acquired with modest numbers of t1 increments, allows the observation of multistage correlations. In this work results obtained from covariance-processed GCOSY spectra are fully analyzed and compared to normally processed COSY and 80 ms TOCSY spectra.”

I’m sure you’d agree it’s NOT very “medical”, “biomedical” or “life sciences”. Yet…if I do a Google search we find:

[Screenshot: Google search results for the article title]

The full reference is here on PubMed.

As can be seen, PubMed returns the reference above Wiley, the publisher of the article. I saw this for many, many of the publications listed on my CV. Most of them are based on NMR spectroscopy data processing approaches, so why would they be in PubMed? I am assuming this is simply because the journal itself has been identified as a journal that is “acceptable” to PubMed. Now, I’m a chemist…and it would be super if there was a PubMed for the whole of chemistry…of course we cannot call it PubChem…that’s already taken. But I wonder what is standing in the way of PubMed simply becoming all-encompassing…why can’t it accept all chemistry papers, for example? It’s clearly accepting some (many!) that I have authored/co-authored. Why not more? Is it policy? Is it resources? Can anyone comment?

 
Posted on April 20, 2011 in General Communications

Community Views and Trust in Public Domain Chemistry Resources

Over the past 4 weeks I have been involved with some new and old friends in the world of chemistry to initiate an analysis of “quality” in public chemistry resources. This is work in progress and involves 3 separate groups (let’s call them labs) looking at various resources. Here’s a short description of the project.

The questions we are attempting to answer are:

Core question : what is the quality of data online in public chemistry databases? How accurate and unambiguous is the representation of chemical structures and their measured properties in public chemistry databases?

How capable are the present cheminformatics tools of handling the complexities of structure representations (limited to “small” organic molecules)?
How hard is it to generate a reference set of highly curated, “gold standard” data (chemical structure and activity) for a database of “known drugs”?

We’ve started with the top 200 selling drugs. The three labs had to come to an agreement about which of the top 200 drugs were small molecules (many of the top 200 are monoclonal antibodies or polymers, for example). We then had to decide whether we would deal with mixtures and combination drugs. Just identifying the list of NAMES of drugs we wanted to deal with was an iterative process and a negotiation.

Then we decided that each lab would work independently and that each lab would have at least two members working on the same problem independently. We would have both intra-lab and inter-lab comparisons. We decided to start on a set of 10 drug names, using the GENERIC name as the name to work from. I started my part of the work just before I had to give a presentation at the EBI last week and was able to gather a lot of the data before the talk.

Starting with a chemical name, how do you determine what the “correct” structure for that drug is? Think it’s easy? Try it! Where would you look? How would you confirm? What would the iterative loop look like in order for YOU to assert the chemical structure(s) for the drug “Vytorin”?

For me the process looks something like this. Use a level of “trust and experience” with previously used resources as a starting point and declare “This is the structure of X based on searching on the drug name for X”. Now, cross-reference, iterate, reiterate, find consistencies and collisions in order to come up with a final assertion, a list of consistent structures and the associated sources, and a set of other resources with inconsistent structures and a list of why they differ. Where possible, and if necessary, make edits to change the information (e.g. ChemSpider and Wikipedia). You can see an example of this for Vitamin K in the talk I gave at EBI here. Working through the ten drugs I came up with a number of observations. The screenshot below summarizes some of the results (Click on the image to see the detail).

Represented in the table is the following information, true at the time of the search but possibly already out of date:

1) A search for thalidomide in ChEBI gives no hits.

2) The structures of Zocor and Crestor on DrugBank are incorrect.

3) There are no hits for Voglibose and Crestor on Common Chemistry.

4) There were 3 incorrect structures on ChemSpider (now edited of course).

5) For most searches on a drug name on PubChem there are MULTIPLE hits and, for the set examined, the correct structure is in the results set. For example, there are 44 structures of Taxol retrieved with the search and the one I assert to be correct is there.

6) There were two incorrect structures on Wikipedia and one drug without an associated structure.
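The cross-referencing loop described above (collect what each source asserts, then separate consistencies, collisions and misses) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `cross_reference` is invented here, each source is reduced to a single structure identifier such as an InChIKey, and the input uses placeholder values rather than real search results.

```python
from collections import defaultdict


def cross_reference(assertions):
    """Group sources by the structure identifier each one asserts.

    `assertions` maps source name -> structure identifier (for example an
    InChIKey), or None when the name search returned no hit.
    """
    by_structure = defaultdict(list)
    no_hit = []
    for source, identifier in assertions.items():
        if identifier is None:
            no_hit.append(source)
        else:
            by_structure[identifier].append(source)
    # The most widely shared structure is only a *candidate*: a curator
    # still has to confirm it against a trusted reference.
    ranked = sorted(by_structure.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    consensus = ranked[0] if ranked else None
    collisions = ranked[1:]
    return consensus, collisions, sorted(no_hit)


# Hypothetical input: placeholder identifiers, not real search results.
example = {
    "Source A": "STRUCTURE-1",
    "Source B": "STRUCTURE-1",
    "Source C": "STRUCTURE-2",
    "Source D": None,
}
consensus, collisions, no_hit = cross_reference(example)
print(consensus, collisions, no_hit)
```

Majority agreement is deliberately treated as a starting point rather than an answer, since errors copied between databases can easily outvote the correct structure.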

When I started the work I had a “trust” level for a number of the databases. My basic position at that time was as follows. I could rarely find the correct structure for a drug based on a text-based search of PubChem. I would generally find a set of hits and it was a lot of work to determine what was correct. Common Chemistry was excellent…but limited. DailyMed was generally good but structure representations could be abysmal. ChEBI, DrugBank, ChemIDplus and Wikipedia were generally VERY good. Of all of the sources I used, despite the rich data on PubChem, I struggled most with this resource to find the correct structure. The results started to show that my trust perceptions were being challenged.

In parallel with the work to prepare this small dataset for the presentation at EBI I decided that it was appropriate to ask the community for their views on some of the databases I was looking at in this work…specifically asking for how much they “trusted” a resource. Trust means different things to different people. The word, and the question I was asking in the survey, would be interpreted in different ways. But that’s the way we work…so why fight it? The survey is online here…and if you haven’t filled it in PLEASE DO!

The answers to date, from the 46 responders, are below (Click on the image to see the detail):

There are some very interesting results in here…and, I willingly admit, some I find VERY surprising. 1 person has no experience with Wikipedia? Wow. The majority of people have not heard of PDSP, ChemIDplus, DailyMed or DrugBank…but without knowing who the people providing feedback are, I should not be shocked. Most of these resources are not for chemists per se but for life scientists. The number of votes for “Always Trust” for ChemSpider and PubChem is very high and, one might say, a compliment. The results are clearly ChemSpider-biased since I put the question to my social network. The difference between the number of people who Always Trust PubChem and Commonly Trust PubChem is only one person. This is wildly different from my own views. I have heard people say that PubChem is equivalent in quality to CAS except that it’s free. Sorry folks…afraid not! (I have since heard at the EBI meeting from one of the people from PubChem that it is possible to do searches in certain ways to limit hits. It should be noted that this does not guarantee that the correct structure is retrieved.) On the flip side, the number of people rarely trusting PubChem is also quite high, so someone has had some interesting experiences!

There are a small number of people who NEVER Trust the resources, and early on one person declared they trusted none of them. I trusted myself to tell a colleague “that will be Egon Willighagen”, and this was later confirmed in his blog post “Trust has No Place in Science”. That may be true, and is the topic of a separate post, but my judgment is pretty good!

How would I fill in the questionnaire? I would NEVER flag “Always Trust” for any of the databases. I would be able to rank order the databases in terms of my perceptions/experience and the resulting trust in the quality of results I would find. The answers WOULD be different before I had conducted the work on the first 10 drugs compared with now, after the pre-work.

As the host of ChemSpider I would prefer that no one “Always Trusts” the resource, as that will stop people from taking care with the data and thereby remove the possible value of them curating the data. However I am more than happy to have it Commonly Trusted and we have been working hard to gain the community’s trust in this area.

This work has triggered a number of responses….I’ll make my own comments on their positions separately… but their opinions are worth reading:

Egon Willighagen: Trust has no place in Science

Egon Willighagen: Trust has no place in Science Part 2

Christina Pikas: The role of trust in science. Christina comments: “I think that Anthony (sp.) could have chosen a better word than trust in his survey. ‘which of these have you evaluated and decided you could use? which of these would you prefer to use based on your evaluations of their merit?’” Christina is right…I could have chosen a different word but I judge (chosen carefully!) that the responses would not have differed much.

There is also a healthy exchange happening on Friendfeed.

This work has only just started. An examination of >150 “small molecule drugs” by three labs is going to provide a lot of data. The work isn’t over and we have much to do. We’re learning a lot in the process about assertional loops, iterative process, collaboration and agreement. It’s a great adventure.

 
Posted on December 11, 2010 in Community Building, Data Quality

Presentation at European Bioinformatics Institute

Last week was quite the trip to the United Kingdom…I was hit by a flu that put me in bed without a voice for an entire day, and then gave the rescheduled talk the next day feeling a little beaten up. The talk discussed the recently conducted survey of public domain databases that I initiated last week (results embedded in the talk) as well as some of the observations comparing data for 10 drugs across a series of public domain databases. The meeting was a good chance to meet some of the hosts of some of the databases including PubChem, DrugBank, ChEBI/ChEMBL and SureChem. I’m sorry I missed the first day…

 
 


Finding the Structure of Vitamin K1 Online

You would think that finding the correct structure of Vitamin K1 online in public domain resources would be an easy exercise. But not so fast. Using the assertion that the chemical structure is correct in the Merck Index, and then wandering through CAS’s Common Chemistry to validate this assumption, this short movie takes us through Wikipedia, Wolfram Alpha, KEGG, DrugBank, PubChem and other online resources to show how complex and impure the public domain databases are as sources of good quality name-structure associations for chemicals. Vitamin K1 is actually a rather simple chemical structure. Finding the correct chemical structure online…not so simple.

 


The Messy World of Even Curated Chemistry on the Internet

Recently I have been spending my night hours looking into the nature of curated chemistry on the internet. 3 years ago I made some assumptions about the quality of certain online datasets when they were deposited onto ChemSpider. It was clear that a lot of internet chemistry datasets were “impure”…I think messy, untrustworthy and confused would be a fairer statement! However, there were a number of datasets that were manually curated and, at initial viewing, were higher quality. With time, however, I have become increasingly concerned with some of the datasets that I had originally cited as high quality. Over the next few days/weeks I will examine some of these in detail and highlight some of the issues I am seeing. I want to clarify that all chemical compounds, in terms of their connection tables, their stereochemistry and the association between the compound and the name(s), are assertions. However, there are “norms” for these structures…we would expect a particular structure for aspirin (acetylsalicylic acid), a single structure for Cholesterol and a single structure for Taxol. By the way, the links to Wikipedia are not assertions that the structures that are presently on Wikipedia are correct representations…but I can confirm that PREVIOUSLY I did work to confirm that every one of these was consistent with my investigations to assert the association between the chemical name and the structure. Since then it is possible that someone edited the structure…such is the world of Wikipedia!

Two of the linked data sources I have been investigating of late are DrugBank and the Protein Data Bank. Both of these are manually curated and are expected to be of high quality. In my discussions with various members of the Life Science industry I have heard many positive comments about these data sources being trustworthy and high quality. I recently downloaded the DrugBank small molecule set and started looking at it. Let’s take one example…

The DrugBank record DB02309 has the chemical name “5-Monophosphate-9-Beta-D-Ribofuranosyl Xanthine“. The structure on DrugBank is shown below.

The chemical name above is inconsistent with the structure…there is no stereochemistry in the molecule displayed despite the “-D-” in the name. The IUPAC name listed in the DrugBank record is “[(2R,3S,4R,5R)-5-(2,6-dioxo-3,7-dihydropurin-9-ium-9-yl)-3,4-dihydroxyoxolan-2-yl]methyl dihydrogen phosphate” and this clearly does not agree with the displayed structure.

The InChI listed on the record does not include a stereo layer (InChI=1/C10H13N4O9P/c15-5-3(1-22-24(19,20)21)23-9(6(5)16)14-2-11-4-7(14)12-10(18)13-8(4)17/h2-3,5-6,9,15-16H,1H2,(H2,19,20,21)(H2,12,13,17,18)/p+1/fC10H14N4O9P/h11-13,19-20H/q+1). The InChIKey is listed as:

DCTLYFZHFGENCW-NIVOTTSGCB
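Whether an InChI carries a stereo layer can be checked by simple string inspection, without any chemistry toolkit. The sketch below splits an InChI on `/` and looks for the tetrahedral stereo layer (prefix `t`). The two strings are the DrugBank InChI quoted above and the PDB ligand InChI quoted later in this post; they are assembled from shared pieces so that the only difference between them is visible. The helper names are invented for this illustration.

```python
def inchi_layers(inchi):
    """Map each layer prefix (c, h, p, t, m, s, f, q, ...) to its bodies."""
    parts = inchi.split("/")
    layers = {}
    # parts[0] is the version ("InChI=1"), parts[1] the formula (no prefix).
    for part in parts[2:]:
        if part:
            layers.setdefault(part[0], []).append(part[1:])
    return layers


def has_tetrahedral_stereo(inchi):
    """True when the InChI contains a /t (tetrahedral stereo) layer."""
    return "t" in inchi_layers(inchi)


MAIN = ("InChI=1/C10H13N4O9P/c15-5-3(1-22-24(19,20)21)23-9(6(5)16)"
        "14-2-11-4-7(14)12-10(18)13-8(4)17/h2-3,5-6,9,15-16H,1H2,"
        "(H2,19,20,21)(H2,12,13,17,18)/p+1")
FIXED_H = "/fC10H14N4O9P/h11-13,19-20H/q+1"

drugbank_inchi = MAIN + FIXED_H                       # as listed on DrugBank
pdb_inchi = MAIN + "/t3-,5-,6-,9-/m1/s1" + FIXED_H    # as listed on the PDB ligand page

print(has_tetrahedral_stereo(drugbank_inchi))
print(has_tetrahedral_stereo(pdb_inchi))
```

The two strings share formula, connectivity, hydrogen and charge layers; the PDB version simply adds the `/t`, `/m` and `/s` layers that the DrugBank version lacks.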

The DrugBank record links to a structure with full explicit stereochemistry on PubChem here and to the ligand on the PDB ligand database hosted by ChEBI here.

The molfile downloaded from DrugBank has no stereochemistry, but the record lists both Canonical and Isomeric SMILES:

Isomeric SMILES O[C@H]1[C@H](COP(O)(O)=O)O[C@H]([C@@H]1O)N1C=[NH+]C2=C1NC(=O)NC2=O
Canonical SMILES OC1C(COP(O)(O)=O)OC(C1O)N1C=[NH+]C2=C1NC(=O)NC2=O

It is clear what has happened, I believe…the DrugBank record has used the canonical SMILES to generate the structure image and has neglected the stereochemistry. However, the names carry the original stereochemistry information while the InChI comes from the structure with no stereo. I think that’s what happened. Let’s confirm.
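A quick text-level check is consistent with that hypothesis: removing the stereo markers from the isomeric SMILES reproduces the canonical SMILES character for character. This is a crude string sketch rather than a chemical comparison; the regular expression only handles uncharged bracketed carbons such as `[C@H]` and `[C@@H]`, which is all this particular string contains, and a proper toolkit should be used for anything more general.

```python
import re

# The two SMILES strings exactly as listed on the DrugBank record.
ISOMERIC = "O[C@H]1[C@H](COP(O)(O)=O)O[C@H]([C@@H]1O)N1C=[NH+]C2=C1NC(=O)NC2=O"
CANONICAL = "OC1C(COP(O)(O)=O)OC(C1O)N1C=[NH+]C2=C1NC(=O)NC2=O"

# Matches a bracketed, uncharged carbon stereocentre: [C@], [C@H], [C@@H] ...
STEREO_CARBON = re.compile(r"\[C@{1,2}H?\]")


def strip_carbon_stereo(smiles):
    """Replace each marked carbon stereocentre with a plain 'C'."""
    return STEREO_CARBON.sub("C", smiles)


def count_stereo_carbons(smiles):
    """Count the carbon stereocentres marked with @ or @@."""
    return len(STEREO_CARBON.findall(smiles))


print(count_stereo_carbons(ISOMERIC))
print(strip_carbon_stereo(ISOMERIC) == CANONICAL)
```

So the two strings describe the same atoms and bonds, and differ only in the four stereocentres that the canonical form drops.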

ASSUMING that the isomeric SMILES string carries the appropriate stereochemistry, I can convert it and get the following InChIKey (generated using ACD/ChemSketch) and, using ACD/Name, the name below. I trust ChemSketch and ACD/Name to generate both appropriately as I managed these products while at ACD/Labs for over a decade.

DCTLYFZHFGENCW-NSVMUQOTBF

9-{(2R,3R,4R,5S)-3,4-dihydroxy-5-[(phosphonooxy)methyl]tetrahydrofuran-2-yl}-2,6-dioxo-2,3,6,9-tetrahydro-1H-purin-7-ium

On Drugbank the chemical name listed is:

[(2R,3S,4R,5R)-5-(2,6-dioxo-3,7-dihydropurin-9-ium-9-yl)-3,4-dihydroxyoxolan-2-yl]methyl dihydrogen phosphate

Okay…the names are subtly different. Each contains three R centers and one S center, but at different locants, assuming that the nomenclature programs are using consistent numbering schemes. See below.

Name generated from Isomeric SMILES on DrugBank: 2R,3R,4R,5S

Chemical Name on DrugBank: 2R,3S,4R,5R

More on this later. Looking at the linked PubChem record gives the following name: [(2R,3S,4R,5R)-5-(2,6-dioxo-3,7-dihydropurin-9-ium-9-yl)-3,4-dihydroxyoxolan-2-yl]methyl dihydrogen phosphate, exactly the same one as listed on DrugBank…so one assumes that the chemical names on DrugBank come from PubChem. Downloading the molfile from PubChem into the same software used to generate InChIs and chemical names gives:

XHDARDSMKMUDDI-XWTUZWARBP

9-{(2R,3R,4S,5R)-3,4-dihydroxy-5-[(phosphonooxy)methyl]tetrahydrofuran-2-yl}-2,6-dioxo-2,3,6,7-tetrahydro-1H-purin-9-ium
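Comparing stereodescriptors between names is easy to get wrong by eye, so here is a hedged sketch that extracts the `(2R,3S,…)` block from each name and compares locant by locant. One caveat it tries to illustrate: in the ACD/Name output the ring is a tetrahydrofuran-2-yl attached to the purine, with the phosphate-bearing carbon at position 5, whereas in the DrugBank/PubChem name the oxolan-2-yl carries the methyl phosphate and the purine sits at position 5. On that reading the two numbering schemes run in opposite directions, so the locants are mapped (2↔5, 3↔4) before comparing. That mapping is an interpretation of the names, not something stated in the post, and the helper names are invented here.

```python
import re

DESCRIPTOR_BLOCK = re.compile(r"\((\d+[RS](?:,\d+[RS])*)\)")


def ring_descriptors(name):
    """Return {locant: 'R' or 'S'} from the first (2R,3S,...) block in a name."""
    match = DESCRIPTOR_BLOCK.search(name)
    if match is None:
        return {}
    return {int(tok[:-1]): tok[-1] for tok in match.group(1).split(",")}


def reverse_numbering(descriptors):
    """Renumber locants as if the ring were counted from the other end."""
    total = min(descriptors) + max(descriptors)
    return {total - locant: rs for locant, rs in descriptors.items()}


def differing_locants(a, b):
    """Locants at which two descriptor maps disagree."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


ACD_FROM_DRUGBANK_SMILES = ("9-{(2R,3R,4R,5S)-3,4-dihydroxy-5-[(phosphonooxy)methyl]"
                            "tetrahydrofuran-2-yl}-2,6-dioxo-2,3,6,9-tetrahydro-1H-purin-7-ium")
ACD_FROM_PUBCHEM_MOLFILE = ("9-{(2R,3R,4S,5R)-3,4-dihydroxy-5-[(phosphonooxy)methyl]"
                            "tetrahydrofuran-2-yl}-2,6-dioxo-2,3,6,7-tetrahydro-1H-purin-9-ium")
LISTED_NAME = ("[(2R,3S,4R,5R)-5-(2,6-dioxo-3,7-dihydropurin-9-ium-9-yl)"
               "-3,4-dihydroxyoxolan-2-yl]methyl dihydrogen phosphate")

listed = ring_descriptors(LISTED_NAME)
from_smiles = reverse_numbering(ring_descriptors(ACD_FROM_DRUGBANK_SMILES))
from_molfile = reverse_numbering(ring_descriptors(ACD_FROM_PUBCHEM_MOLFILE))

print(differing_locants(from_smiles, listed))
print(differing_locants(from_molfile, listed))
```

If the numbering interpretation is right, the name generated from the PubChem molfile matches the listed name at every centre, while the name generated from the DrugBank isomeric SMILES disagrees at two of the four.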

DrugBank is linked out to the PDB ligands hosted by ChEBI and looking at the XMP ligand here we see:

[(2R,3S,4R,5R)-5-(2,6-dioxo-3H-purin-7-ium-9-yl)-3,4-dihydroxy-oxolan-2-yl]methyl dihydrogen phosphate

This is the SAME stereochemistry in the chemical name as on DrugBank, but actually a different chemical name. It is definitely possible, and common, for different systematic names to exist for the same chemical but it does indicate the challenges of linking based on different identifiers.

DrugBank: [(2R,3S,4R,5R)-5-(2,6-dioxo-3,7-dihydropurin-9-ium-9-yl)-3,4-dihydroxyoxolan-2-yl]methyl dihydrogen phosphate

PDBeChem: [(2R,3S,4R,5R)-5-(2,6-dioxo-3H-purin-7-ium-9-yl)-3,4-dihydroxy-oxolan-2-yl]methyl dihydrogen phosphate

The InChIKeys between the different databases/tools are:

PDBeChem: DCTLYFZHFGENCW-KWDNBKPHDV

DrugBank: DCTLYFZHFGENCW-NIVOTTSGCB

PubChem: XHDARDSMKMUDDI-UUOKFMHZSA-O (this is a StdInChIKey)

ChemSketch: DCTLYFZHFGENCW-NSVMUQOTBF

ALL four are inconsistent.

If I convert the SMILES string listed on the PDBeChem ligand database using ACD/ChemSketch then

O[C@H]1[C@@H](O)[C@@H](O[C@@H]1CO[P](O)(O)=O)n2c[nH+]c3C(=O)NC(=O)Nc23

produces a structure with stereochemistry of 2R,3R,4S,5R and the InChIKey: DCTLYFZHFGENCW-XWTUZWARBW.

The stereochemistry on PDBeChem agrees with that on PubChem (based on the name). The connectivity part of this InChIKey is consistent with all other systems (except PubChem), but the full key is different from all the other InChIKeys. It is also possible to download “ideal” and “representative” molfiles from the PDBeChem database.

The InChIKeys between the different databases/tools are now:

PDBeChem: DCTLYFZHFGENCW-KWDNBKPHDV

DrugBank: DCTLYFZHFGENCW-NIVOTTSGCB

PubChem: XHDARDSMKMUDDI-UUOKFMHZSA-O (this is a StdInChIKey)

ChemSketch: DCTLYFZHFGENCW-NSVMUQOTBF

PDBeChem: DCTLYFZHFGENCW-XWTUZWARBW (from Isomeric SMILES converted via ChemSketch)

PDBeChem: DCTLYFZHFGENCW-RDKRKOOMBN (from molfile from PDBeChem, 2R,3R,4S,5R)

DrugBank also links to the Protein Databank here. XMP is listed as a ligand as shown.

The XMP ligand links here to the detailed page containing the information linked below.

Name: XANTHOSINE-5′-MONOPHOSPHATE
      5′-xanthylic acid
      [(2R,3S,4R,5R)-5-(2,6-dioxo-3H-purin-7-ium-9-yl)-3,4-dihydroxy-oxolan-2-yl]methyl dihydrogen phosphate
Synonyms: 5-MONOPHOSPHATE-9-BETA-D-RIBOFURANOSYL XANTHINE
Formula: C10 H14 N4 O9 P
Molecular Weight: 365.21 g/mol
Type: NON-POLYMER
Isomeric SMILES (OpenEye): c1[nH+]c2c(n1[C@H]3[C@@H]([C@@H]([C@H](O3)COP(=O)(O)O)O)O)NC(=O)NC2=O
InChI: InChI=1/C10H13N4O9P/c15-5-3(1-22-24(19,20)21)23-9(6(5)16)14-2-11-4-7(14)12-10(18)13-8(4)17/h2-3,5-6,9,15-16H,1H2,(H2,19,20,21)(H2,12,13,17,18)/p+1/t3-,5-,6-,9-/m1/s1/fC10H14N4O9P/h11-13,19-20H/q+1
InChI key: DCTLYFZHFGENCW-KWDNBKPHDV

The InChIKeys between the different databases/tools are now:

PDBeChem: DCTLYFZHFGENCW-KWDNBKPHDV

DrugBank: DCTLYFZHFGENCW-NIVOTTSGCB

PubChem: XHDARDSMKMUDDI-UUOKFMHZSA-O (this is a StdInChIKey)

ChemSketch: DCTLYFZHFGENCW-NSVMUQOTBF

PDBeChem: DCTLYFZHFGENCW-RDKRKOOMBN (from molfile from PDBeChem)

PDB_ligand: DCTLYFZHFGENCW-KWDNBKPHDV

Aagghhhhh…InChIKeys get very convoluted! What we see is that the chemical structure on PDB and on PDBeChem are the same. This is good news at least! There is a difference in the InChIKeys when I download the molfile but this can be explained easily…and in a later blog post.
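With six keys in play it helps to compare them mechanically. The sketch below groups the keys listed above by their first hyphen-separated block (the connectivity part) and then by their first two blocks. The source labels are shortened for readability and the function name is invented; note that the PubChem entry is flagged above as a StdInChIKey while the others are not, so its mismatch should be read with that caveat in mind.

```python
from collections import defaultdict

# InChIKeys exactly as listed above, keyed by where they came from.
KEYS = {
    "PDBeChem": "DCTLYFZHFGENCW-KWDNBKPHDV",
    "DrugBank": "DCTLYFZHFGENCW-NIVOTTSGCB",
    "PubChem": "XHDARDSMKMUDDI-UUOKFMHZSA-O",
    "ChemSketch": "DCTLYFZHFGENCW-NSVMUQOTBF",
    "PDBeChem molfile": "DCTLYFZHFGENCW-RDKRKOOMBN",
    "PDB ligand": "DCTLYFZHFGENCW-KWDNBKPHDV",
}


def group_by_blocks(keys, block_count):
    """Group sources whose keys share the first `block_count` blocks."""
    groups = defaultdict(list)
    for source, key in keys.items():
        prefix = "-".join(key.split("-")[:block_count])
        groups[prefix].append(source)
    return dict(groups)


same_connectivity = group_by_blocks(KEYS, 1)
identical = group_by_blocks(KEYS, 2)

for prefix, sources in same_connectivity.items():
    print(prefix, sources)
for prefix, sources in identical.items():
    if len(sources) > 1:
        print("identical:", prefix, sources)
```

Five of the six sources share the connectivity block, and the only pair that agrees on the full key is PDBeChem and the PDB ligand page.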

We believe that the structure on PDB should be expected to be correct. We will assert this.

We expect that DrugBank is sourcing the chemical from PDB to add to their database. The chemical structure on DrugBank should coincide with that from PDB. Unfortunately the SMILES on PDB and DrugBank differ in two stereocenters. We don’t know why. Why the inconsistency? If the DrugBank data aren’t from PDB for the XMP ligand where did they come from?

Did PubChem pick up the structure of XMP from the PDB Database or from DrugBank? Let’s see. If I download the 2D molfile from PubChem and generate the chemical name and InChIs I get consistency…PubChem IS consistent with PDB. It is NOT consistent with DrugBank despite the fact that DrugBank is linked into this PubChem record.
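For anyone wanting to repeat the PubChem download step programmatically, a possible route today is PubChem's PUG REST interface. This is an assumption about present-day tooling, not a description of how the download above was done, and the helper name is invented. The sketch only builds the request URL for the 2D SDF record of the StdInChIKey quoted earlier; it does not fetch anything.

```python
from urllib.parse import quote, urlencode

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubchem_sdf_url(inchikey, record_type="2d"):
    """Build a PUG REST URL for the SDF record matching an InChIKey."""
    path = f"{PUG_REST}/compound/inchikey/{quote(inchikey)}/SDF"
    return path + "?" + urlencode({"record_type": record_type})


# StdInChIKey listed for the PubChem record earlier in this post.
print(pubchem_sdf_url("XHDARDSMKMUDDI-UUOKFMHZSA-O"))
```

The downloaded file could then be loaded into whichever naming and InChI software is being used for the other sources, so that every structure passes through the same tools.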

This is a very convoluted, and maybe confusing, analysis of ONE compound on DrugBank. I have looked at dozens and see similar issues. Assuming that PDB is the source database for data on DrugBank, why are the structures differing so much? There are worse examples to come…the linking together of data on the web between even curated databases is an incredible mess.

Caveat: This is detailed and challenging work. I encourage anyone to check my work, see if I missed anything, and confirm or challenge the observations, as some of the issues I am seeing could be tool-based…the software tools I use may have issues with SMILES conversion, molfile or SDF reading etc. It is exacting to check chemical structures…

 
Posted on August 15, 2010 in Open Science..all its forms, Software

Another Response to Constructive Feedback from Peter Murray-Rust…

Since ChemSpider went live in March of this year we have received a lot of feedback and questions regarding our understanding of science, our purpose and our passions. We have an excellent Advisory Group who participate in dialogs and constructive discussions. Much of the feedback we have received has been from one individual, Peter Murray-Rust (PMR).

Before proceeding with this post I want to clarify my perceptions. I believe PMR brings a lot of value to the Chemistry Blogosphere. Over the past decade I have watched Peter’s activities with interest as he has participated with many other evangelists to pursue the cause of ODOSOS (Open Data, Open Source and Open Standards). Over the years I will confess to a level of hero-worship. I had enjoyed watching what he was doing to enable the web for chemists. He is prolific…I don’t know where he finds the time to write so much. He travels the world and informs us all of what is going on “out there”. He does a great service. In contrast to these positive traits, which I honor, I am of the opinion that Peter is overly harsh and judgmental in some cases. Often he posts without the necessary research and his perceptions become the “truth”. This is dangerous when he has such a public profile and such influence. For evidence of influence visit the graph here and notice the incredible spike in traffic resulting from his post about the Monkeys at ChemZoo in April of this year. It is unlikely those visitors ever returned to our site or blog to hear our comments. Potential damage was done. This blog post concerns his most recent judgments of ChemSpider.

When ChemSpider was set up for the benefit of the chemistry community I had assumed that this humble effort by a small group of dedicated individuals would be welcomed by PMR and other Open Access advocates. In general I believe that’s true. Our actions, policies and status have drawn a significant amount of feedback from PMR on his blog. New feedback was posted late last week and I’ll get to that shortly. As a review, in keeping with the trend being set by Rich Apodaca (1,2,3), I am listing what’s happened to date.

“Constructive Feedback” for Newbies

The Challenge to ChemSpider Chemistry

When Sodium chloride dimers are bad science..but are on NIST Webbook and PubChem

Calcium Carbonate is not soluble and can’t have a logP PLUS Lipinski says Calcium Carbonate CAN have a logP

Prussian Blue on ChemSpider is Terrible…but still as good as Pubchem and Emolecules.

Is Stereochemistry on Taxol important? Should the public data be curated?

ChemSpider VERSUS PubChem or ChemSpider SUPPORTS PubChem

ChemSpider ripped off PubChem…damn them.

ChemSpider and Their Openness and non-Web 2.0

ChemSpider don’t understand what Web 2.0 is.

ChemSpider contribute to the community…and support PubChem

Spectral Data are Declared Open Data

Helping out the community with Web Services

There are a lot more…and so to the latest. I’ll identify the recent post comments in italics.

PMR> Recently the Chemspider company has announced an “Open Chemistry Web” which in my opinion misuses the word “Open”.

Open Chemistry Web is the name of a new blog set up and hosted by Will Griffiths. It’s not ODOSOS. It’s a NAME of a blog. If we are in an environment where the name of a blog cannot include the word “Open” then we are living in sad times. Will’s passion is in text-mining OPEN ACCESS Chemistry Articles..or others if people will allow it. Can he not name his own blog? Hmmm….

PMR> Chemspider.com and its associates are commercial organization which have aggregated a large number of chemical connection tables and have started by calculating their properties and extracting literature references which they make freely accessible but not Open. The freedom is for an unspecified timescale and you cannot download significant amounts of the data and you cannot re-use it without permission.

Yes we are “commercial”. I dealt with this same comment previously. If you have interest in this please browse it. A later post outlines the present status of the project and whether or not it will survive.

Yes, we have aggregated a large number of connection tables and have started by calculating their properties and extracting literature references, which we make freely accessible. We have done a lot more. We have made multiple services available to the community (1,2,3,4) but, unsurprisingly, have received no acknowledgment.

Regarding “not open”. We are giving away the ChemSpider database to those who ask for it. It will be published in PubChem. We USE Open Source components (1,2,3,4). We have not generated any Open Source components yet and our source code is not Open. We index Open Access articles on ChemRefer. We work with the Open Source data community to help.

Regarding “you cannot download significant amounts of the data and you cannot re-use it without permission”. We are giving away the ChemSpider database to those who ask for it. We do NOT have a server farm to support downloads. The FAQ page says

May I download the data and use it in my own database(s)?

You have limited rights in this regard. You can only assemble a database of 5000 structures or less, and their associated properties, from our database without our permission. You can download up to 1000 structures per day from the website. Please contact us at feedbackATchemspiderDOTcom to request an extension outside this constraint. We are willing to provide the ENTIRE database of ChemSpider structures at your request – the file will consist of InChI Strings, InChIKeys and ChemSpider IDs. These constraints are under regular review so please feel free to engage us in conversation.”
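For readers who prefer to see the rule written down, here is a minimal sketch of the limits quoted in the FAQ above. It is an illustration only, not ChemSpider code: the constant names and the `needs_permission` function are hypothetical, and only the two numbers (5000 and 1000) come from the FAQ text.

```python
# Limits quoted in the FAQ text above; the constant and function names are hypothetical.
MAX_DATABASE_STRUCTURES = 5000  # largest database that may be assembled without permission
MAX_DAILY_DOWNLOADS = 1000      # structures that may be downloaded per day


def needs_permission(database_size: int, downloaded_today: int) -> bool:
    """Return True when a planned use falls outside the stated limits."""
    return (
        database_size > MAX_DATABASE_STRUCTURES
        or downloaded_today > MAX_DAILY_DOWNLOADS
    )


print(needs_permission(database_size=4000, downloaded_today=500))  # within both limits
print(needs_permission(database_size=6000, downloaded_today=500))  # database too large
```

Anything for which this check returns True is a case for the feedback address given in the FAQ.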

PMR>”Initially I was concerned about the complete lack of quality in these calculations and said so – I believe there has been some improvement in quality but I do not check and do not intend to do so. I do not follow Chemspider regularly but they appear to have added the ability for anyone to add annotations and curation. I have serious concerns about the lack of thought given to metadata and I do not expect Chemspider to be able to scale or to compete against modern approaches.”

I acknowledge the judgments and opinions. A question…in terms of online data sources for chemistry I believe that approximately 20 million structures ranks in the top 3. We have about 1500 chemists per day using the site with thousands of transactions including text and structure/substructure searching. Please compare with other services in this domain and, if you do this, provide quantitative information. We welcome any feedback on metadata. We are presently working on RDF’ing ChemSpider thanks to the guidance and support of Egon Willighagen. I have dealt with the metadata discussion previously here and abstracted below.

“Other comments include “I see very little difference between Chemfinder and Chemspider. They are both closed, proprietary, do not expose data, or metadata, or algorithms; have closed code, do not allow downloads or re-use. They lose metadata in their aggregation process. I have nothing personal against Chemspider (or, if they are associated, ACDLabs) – I just think the Web 1.0 model is out of date for chemistry.”

To respond…yes, the code is proprietary and closed..we don’t know of any Open Source code that would quickly search >10 million structures by structure and substructure (that will be covered in a separate blog as I have the utmost respect for the commercial entities that do this well! It’s DIFFICULT.) Oh…but Open Source isn’t part of the Web 2.0 definition. We don’t expose algorithms…correct…many are provided by collaborators and we do not have the right to expose their code. But that isn’t part of Web 2.0 either.

And next…the beloved “metadata” term. What exactly IS metadata? Let’s refer again to our web-friendly Wikipedia regarding metadata. In brief it’s “data about data” and a perfect example is an XML schema vs XML. An XML schema is metadata. According to my interpretation this means InChI and SMILES are not metadata since these data can be interchanged with the structure itself. I may be wrong. The hypothetical entity describing what data can be bound to a structure would be metadata not necessarily data related somehow to the structure, but rather more general data describing the datamodel – for example the source of the data – this IS metadata. ChemSpider doesn’t lose the metadata…we retain the only metadata currently available, the data source, and use it as our link out to the provider. Our primary role again, for now, is to connect silos of information via chemical structures.”
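To make the distinction in the abstract above concrete, here is a minimal sketch under stated assumptions. The record layout, the field names and the example.org URL are all hypothetical and are not taken from ChemSpider. It simply separates the interchangeable structure representations (data) from the data source (metadata) and builds the link-out from the metadata alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructureRecord:
    # Data: interchangeable representations of the structure itself.
    inchi: str
    smiles: str
    # Metadata: data about the record, here the provider it came from.
    source_name: str
    source_url: str


def link_out(record: StructureRecord) -> str:
    """Return a link back to the provider, built only from the metadata."""
    return f"{record.source_name}: {record.source_url}"


# Methane, with a made-up provider used purely for illustration.
record = StructureRecord(
    inchi="InChI=1S/CH4/h1H4",
    smiles="C",
    source_name="ExampleSource",
    source_url="https://example.org/compound/1",
)
print(link_out(record))
```

In this picture, swapping the InChI for the SMILES changes nothing about the record's provenance, which is why I treat the source, and not the identifiers, as the metadata.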

PMR> Chemspider also encourages Uploading Spectra Onto ChemSpider. These spectra by default all belong to Chemspider. They are not Open. If you can convince the world at large to donate IPR to you for free, you deserve some form of congratulations for sheer bravado. Note that even if you upload data and metadata you are not allowed to download it (there is a limit of 100 structures).

Thanks, again, for the judgments. We have been testing out the system with two of our advisory group and myself. Only JC Bradley’s Lab and Bob Lancashire have deposited and with the understanding, I believe, that the data would be “Open”. Since PMR’s blog posts continue to do damage to our reputation we have no choice but to respond. We do this with coding. Within 24 hours of his comments Open Data was declared and spectra can now be downloaded. The intention was always there to do this…we just had higher priorities.

PMR>”We have ca. 250,000 calculations on molecules and 130,000 crystal structures which Chemspider have suggested we upload to them. I’m not yet sure why we should do this.”

Well, if they are Open Data, as marked at the CrystalEye website, and seeing as people would like to access the data via ChemSpider, we should just be able to download them. But we don’t want all the data...we just want the structures and the appropriate URL structure to link back to CrystalEye. This is what we do with all data sources including NMRShiftDB.

PMR>”Chemrefer appears to allow searching of Open chemistry articles by keyword. Unexceptional, but why shouldn’t we simply use Pubchem? AFAIK it will index all these journals.”

PubChem indexes these journals? No, I think it’s PubMed. We’ll check on whether everything ChemRefer indexes is in PubMed. However, what they don’t do, yet, or ever, is connect the chemical names in those journals to chemical structures. That’s what’s been done for patents.

“PMR> The IPR model of Chemspider seems clear. No data, metadata and author contributions are Open.

Incorrect.

“PMR>That allows them, at some stage in the future to close some or all of the site and to charge for data and services”

The site, as it exists today, is intended to stay free for all. We may, OPENLY acknowledged, open services that are for charge.

“PMR> and – like eMolecules and their tie-up with Wiley (Wiley and eMolecules: unacceptable; an explanation would be welcome) – I predict this will happen within 5 years (unless Chemspider fails to survive in its current form).

I have posted on what I believe is an inappropriate judgment by Peter that the data on Chemgate is extracted from the journals. I put a trackback to Peter’s original post. He never responded. He did comment separately though about busyness and commenting. Unfortunately Wiley and Chemgate now show up again…with no effort to clean up the previous comments and, unfortunately, more incorrect information about ChemSpider.

“PMR> So all the authors who are contributing metadata are, in effect, donating IP to Chemspider. I have no moral objection to this – it just seems retrograde when we have Open collections of molecules such as PubChem and our own crystalEye.”

ChemSpider data will all go onto PubChem shortly. This was announced at the recent PubChem meeting. I have asked PMR to point me to where I can download the CrystalEye collection if it is indeed Open Data.

“PMR>But a number of my friends in the Open Chemistry area are on the Chemspider advisory board, so I must be missing something. Perhaps they can show how donating IP to a commercial closed company advances the cause of Open Chemistry.”

I hope they discuss with you. This group is a powerful team of intellect, capabilities, insight and support. I value the opportunity to work with them.

“PMR> And I applaud Chemspiderman’s efforts to clean up chemistry. Sometimes this gets muddled with the association with a commercial organisation based on possessing chemical IP so sometimes my messages have been less than generous and I apologized.”

Yes, you did. And I accept it willingly. It was very gentlemanly.

“PMR> I am not anti-capitalist – I do not attack companies per se. But I do attack people who use the word “Open” incorrectly and to promote themselves. I have done this when publishers come up with “Open Access” offerings which appear to be less than satisfactory ( see “open access products” at Nature obscures the debate, Why Open Access metrics are necessary) and for which the community has to pay. “Open” is now used by commercial organisations in the same way as “healthy” – please feel good about us and our activities as we use the word “Open”. We know it’s meaningless, but it makes us look good. Well, it isn’t meaningless. A number of people are trying carefully to describe what is meant by Open access, Open Data, Open source and Open Services. And when others use it to mean something less, I take exception. If nothing else it makes our job much harder.”

I will comment on this in a couple of later posts. I do not support the “marketing” use of Open and do not believe we are doing so. However, I want to comment more on this, but at a later date. Marketing statements bug me too. You’d think that “…the world’s most comprehensive openly accessible search engine for chemical structures” would be PubChem. But it’s not according to this marketing statement …who is it?

There have been comments about PubChem being the model of Openness. I think the effort is great. FULLY support it. But let’s wake up. If funding ceases then PubChem could go away. The data is Open. The software is NOT. PubChem is built around some home-built services and on top of commercial modules such as CACTVS and OpenEye. I discussed it here and it has not been challenged. Am I wrong?

“PMR>: There is nothing Open about this. Even the blog is not Open (it does not carry a CC licence). The services may be free, and they may be useful, but they are not Open. The text that they index may indeed be Open Access in its own right (and probably is because otherwise the publishers will sue them) but this is no especial credit to Chemrefer. We also index Open resources but we make our results Open. Chemrefer could disappear tomorrow. Only if the data, and the source code are made Openly available under licence can they be called Open.”

There is a CC license on the page. Peter acknowledged this. Who said the services were Open? If we did, point me to it and we will rectify it. I have asked Peter separately whether all articles linked to CrystalEye are Open Access or whether some are used with permission from the publishers. This is very interesting.

This has been a long post. I understand I have likely added fuel to the fire. I have done it in a public way. I judge that ChemSpider is being harmed by the ongoing misinformation. I wish it to stop. What I want is advice and support to make this a better service for our users. However, I refuse to make it my personal mission to satisfy PMR’s requests and objectives. ChemSpider is developed for its users and the community in general, NOT for its non-users. PMR is not a user. Not everything has to be Open for it to be of high value. I believe we deliver value.

 

Posted on October 15, 2007 in ChemSpider Chemistry, Community Building

 
