Archive for category Open Science..all its forms
Open Drug Discovery Presentation Session with JC Bradley
Posted by tony in Open Science..all its forms, Publications and Presentations, RSC Publishing on October 17, 2011
I had the pleasure of co-presenting with my friend Jean-Claude Bradley today at the “3rd Annual Drug Discovery Partnership: Filling the Pipeline”. Jean-Claude gave a great talk, available on Slideshare here, and discussed the issue of data quality, how improved data gives improved models, the cross-validation of data and the proliferation of errors. My talk is on Slideshare here and embedded below. In many ways I discussed similar issues, though not focused on melting point data but rather on structures, structure-identifier relationships, the cross-linking of multiple resources on the internet and how online resources can support Open Drug Discovery Systems. In this presentation I discussed some of the work we are doing on Open PHACTS.
Potential Issues with Google Scholar Citations
Posted by tony in Community Building, Open Science..all its forms, Quality and Content on August 7, 2011
I have blogged recently about my experiences with Google Scholar Citations (1, 2). It has been useful in highlighting which of my published science people might find of interest, as well as trends in citation patterns. It has also highlighted some potential issues in the data.
I must admit I was quite surprised to see that the top cited paper was one from Eastman Kodak Company, where we looked at interactions between sodium dodecyl sulfate and gelatin, followed by work I did at the University of Ottawa. These papers date from 1994 and 1991 respectively, almost 20 years ago, so it does make sense that the aggregation of citations over the years might have reached those levels. However, I would have expected that my work in the areas of NMR prediction, Computer-Assisted Structure Elucidation (CASE) and Indirect Covariance would have garnered a lot more citations, but that work did come about 10 years later. It is good to see that the more recent papers, for example the one from 2008 on internet-based tools for communication and collaboration in chemistry, have garnered a following.
Above is shown a list of my papers from as far back as 1990 that appear to not have any citations. There are also a lot of recent papers listed that I KNOW are cited, multiple times, as they have been referred to in some of my own publications. For example, the second one in the list, from 2009, entitled “Computer-assisted methods for molecular structure elucidation: realizing a spectroscopist’s dream” is an Open Access article and according to the journal statistics it is the top read article of all time on the Journal of Cheminformatics as shown here with, as of this writing, 10770 accesses. In fact, if I search the article directly on Google Scholar I find it IS cited 7 times as shown below.
I don’t know why it shows up as cited in Google Scholar but not in Google Scholar Citations. However, the same issue exists for the paper on the Spectral Game. See below: it shows no citations on Google Scholar Citations but 7 on Google Scholar.
Notice that in BOTH cases the article is listed as the Journal of Cheminformatics, not as the title of the paper. Maybe THIS is the reason the citations are missed. Maybe the publisher for the Journal of Cheminformatics is not exposed in a manner that has the publications indexed properly? Maybe….
Encouraging Collaboration in Washington as a Hub for Chemistry Databases
Posted by tony in Community Building, Computing, Data Quality, NPC Browser and NCGC Collection, Open Science..all its forms, Politics, Quality and Content on August 1, 2011
On August 25/26 I will be attending the 5th Meeting on U.S. Government Chemical Databases and Open Chemistry. I will have the opportunity to spend time with people I appreciate for the contributions they are making to chemistry: Martin Walker, JC Bradley, Andy Lang, Markus Sitzmann, Ann Richard, Frank Switzer, Evan Bolton, Marc Zimmermann, Wolf Ihlenfeldt, Steve Heller, John Overington, Noel O’Boyle, and many others. It is surely going to be an excellent meeting. The agenda is given here.
Some of the people listed above are associated with “Washington-based databases”. These are databases developed in or around Washington by government-funded organizations: the FDA, NIH, NCBI/NLM, NCI and NIST. Other government-funded databases that are not Washington-based are also represented, namely those of the EPA and CDC. If you are not sure what all those acronyms stand for then here you go.
FDA – Food and Drug Administration
NIH – National Institutes of Health
NCBI/NLM – National Center for Biotechnology Information/National Library of Medicine
NCI – National Cancer Institute
EPA – Environmental Protection Agency
CDC – Centers for Disease Control and Prevention
NIST – National Institute of Standards and Technology
One chemistry database conspicuous by its absence is the NCGC data collection contained in the NPC Browser. I’ve written a lot about that one on this blog.
NCGC – NIH Chemical Genomics Center
I am hoping to get to talk to some members of the team if they attend the meeting though.
There will be a LOT of government databases represented at this meeting. I have experience with many of the databases provided by these institutions. The DSSTox database is one of the most highly curated databases based on my review of the data. The NCI resolver is an excellent resource with good quality data in terms of the accuracy of name-structure relationships.
The various databases are developed independently of each other. True, some of the databases contain contents from some of the other databases but, as far as I can tell, there is not much collaboration in terms of coordinated curation of data. What would it be like if each of these organizations participated in a roundtable discussion to agree to a process by which to collaboratively validate and curate the data, once and for all? Maybe this meeting can catalyze such a discussion. I would encourage the organizations to take advantage of other data sources that can share their data – ChEBI/ChEMBL is one example! If these various groups coordinate their work then the result could be a massively improved quality dataset to share across the databases and across the community. If this work was done then the group that assembled the NPC Browser would likely have a lot less work to do in terms of assembling the data. The various database providers should certainly have provided clean, curated data for many of the top known drugs. While working on a manuscript reviewing the quality of public domain chemistry databases I assembled a table of 25 of the top selling drugs in the US and checked the data quality in the NPC Browser relative to a gold standard set. The assembly of the data will be discussed in its entirety in a later publication.
The errors listed in the table are:
1 Correct skeleton, No stereochemistry
2 Correct skeleton, Missing stereochemistry
3 Correct skeleton, Incorrect stereochemistry
4 Single component of multicomponent structure
5 Multiple components for single component structure
6 No structure returned based on Name Search
7 Incorrect skeleton
8 Multiple structures based on name search
Clearly there are a lot of errors in the structures associated with 25 of the best selling drugs on the US market. These should be the easy ones to get right as they are so well known!!! Collaboration between the domain’s top database providers would have helped, almost certainly. This would not necessarily be an issue of meshing technologies but of agreeing on a common goal to have the highest quality data available. Since the government puts so much money into the development of these databases it would be appropriate to have some oversight and push for aligning efforts. Collaboration is essential!
With that in mind…a shameless pointer to how Sean Ekins, Maggie Hupcey and I BELIEVE in the need for collaboration…our book. If we can encourage others in the government chemistry databases to adopt active collaborative approaches wonderful things could happen.
Who Could Participate in #NMRCAVES
Posted by admin in Community Building, Computing, Nuclear magnetic resonance, Open Science..all its forms on January 20, 2011
I recently posted about the project that will become known as NMRCAVES, NMR Computer-Assisted Verification and Elucidation Systems. This will be a workshop to be held at SMASH. There will be no workshop without two essential ingredients: participants and data.
The participants will need to be willing to work with us, using their software, algorithms and approaches, to test their systems on data. The data will be supplied by the community and provided to the participants in a blind study.
Populating the workshop is the first challenge. If we cannot get enough participants then, even though we might get an abundance of data, there will be no workshop to hold because there will be no groups engaged to work with it. There are a limited number of groups/individuals working in the areas of computer-assisted structure verification and elucidation by NMR. I have listed them below. No offense meant if I have accidentally missed anyone out. Also, they are listed in alphabetical order so no favoritism either…
Verification
ACD/Labs Structure Verification with NMR Predictors
Bruker Complete Molecular Confidence
Mestre Labs MNova
Sciencesoft VerifyIt
Elucidation
ACD/Labs Structure Elucidator
Jmnsoft LSD
SENECA - package for Computer Assisted Structure Elucidation (CASE) ported into the CDK
Sciencesoft AssembleIT
Can anyone point me to groups or software solutions that I am missing and other potential solutions out in the community that I should approach? I will be approaching the listed groups with an invite to participate in NMRCAVES and then will be asking the community if you are willing to provide data for the project!
#NMRCAVES is NMR Computer Assisted Verification and Elucidation Systems
Posted by admin in Community Building, Computing, Data Quality, Nuclear magnetic resonance, Open Science..all its forms, Software on January 18, 2011
I am honored to have been invited to lead a workshop at the SMASH NMR conference later this year. I will be co-hosting with Michael Bernstein, whom I have known for many years and with whom I have spent many hours (if not days!) discussing the ins and outs of NMR prediction and structure verification by NMR.
The workshop will provide an environment for developers of software packages and associated algorithms allowing for structure verification and elucidation to engage with interested members of the NMR community attending the SMASH NMR meeting. Presenters may include both commercial and non-commercial software packages and the workshop will allow the participants to report on their respective approaches as well as report on the performance of their algorithms against a large set of data provided by the community.
The one day workshop will be separated into Structure Verification and Structure Elucidation segments with participants who have chosen to participate in the project. We are hoping for participants from both the academic and commercial sectors.
I’ve called the workshop NMRCAVES: NMR Computer Assisted Verification and Elucidation Systems. Below is an outline to initiate a conversation with interested parties. It is a suggested outline for the project and I welcome feedback.
The data analysis components of the workshop are outlined below.
CASV: Four sets of data will be made available to the participants.
(1) HNMR only, minimum of 25 spectra and 25 suggested structures (random distribution of correct/incorrect with at least 50% correct)
(2) HNMR and 2D HSQC, minimum of 25 spectra and 25 suggested structures (random distribution of correct/incorrect with at least 50% correct)
(3) HNMR only, minimum of 25 spectra and 25 sets of 3 structures (1 of the 3 in each set is always the correct structure)
(4) HNMR and 2D HSQC (preferably multiplicity-edited HSQC), minimum of 25 sets of spectra and 25 sets of 3 structures (1 of the 3 in each set is always the correct structure)
The participants will receive the data via download from an FTP site with each folder numbered in an ambiguous manner. All structures will be known to only two parties: the laboratories acquiring the data and the host of the workshop (AJW). The participants will have the responsibility to provide a report identifying the correct/incorrect structures in test sets (1) and (2) and identifying the correct structure out of the combination of 3 provided in (3) and (4). When all reports have been submitted each participant will receive a report identifying the correct structures for their review and in order for them to report on their successes and to further review and report on the data during the workshop.
The overall performance statistics comparing the results of the various participants will be reviewed and presented at the workshop by the workshop host.
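As a rough illustration of how the performance statistics for the verification sets might be tallied, here is a minimal Python sketch. The folder labels, the report format and the function names are all hypothetical; the outline above does not prescribe any of them, and treating an unanswered folder as a wrong call is just one possible convention.

```python
from typing import Dict, Optional


def score_verification(answer_key: Dict[str, bool],
                       report: Dict[str, Optional[bool]]) -> Dict[str, float]:
    """Score sets (1) and (2): was each proposed structure called correct or incorrect?

    answer_key maps an anonymized folder label to True if the proposed
    structure is correct. report maps the same labels to the participant's call.
    A folder missing from the report is counted as a wrong call (an assumption).
    """
    true_pos = false_pos = true_neg = false_neg = 0
    for folder, is_correct in answer_key.items():
        call = report.get(folder)
        if is_correct:
            if call is True:
                true_pos += 1
            else:
                false_neg += 1
        else:
            if call is False:
                true_neg += 1
            else:
                false_pos += 1
    n_correct = true_pos + false_neg
    n_incorrect = true_neg + false_pos
    total = n_correct + n_incorrect
    return {
        "accuracy": (true_pos + true_neg) / total if total else 0.0,
        "false_positive_rate": false_pos / n_incorrect if n_incorrect else 0.0,
        "false_negative_rate": false_neg / n_correct if n_correct else 0.0,
    }


def score_pick_one(answer_key: Dict[str, int], report: Dict[str, int]) -> float:
    """Score sets (3) and (4): fraction of folders where the chosen candidate
    (numbered 1 to 3) is the correct structure."""
    if not answer_key:
        return 0.0
    hits = sum(1 for folder, right in answer_key.items()
               if report.get(folder) == right)
    return hits / len(answer_key)


# Made-up folder labels and calls, purely for illustration.
key = {"F07": True, "F12": False, "F31": True, "F44": False}
calls = {"F07": True, "F12": True, "F31": False, "F44": False}
print(score_verification(key, calls))
print(score_pick_one({"G03": 2, "G19": 1}, {"G03": 2, "G19": 3}))
```

Reporting false positives and false negatives separately, rather than accuracy alone, would show whether a system tends to pass wrong structures or to reject right ones.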
CASE: The objective should be to test the ability of algorithms to correctly elucidate the skeletons of unknowns with the provision of “high-quality datasets” where sensitivity is deemed not to be a limitation. While it is acknowledged that sensitivity is an issue in CASE approaches, this particular hurdle should be removed from the challenge posed to the algorithms. Data will be requested from a series of laboratories. The minimum dataset should include high-resolution MS, 1H, COSY and HSQC/HMBC. Additional data can include TOCSY, DEPT-HSQC, HSQC-TOCSY, 1H-15N direct and long-range correlation, and NOESY/ROESY.
The participants will receive the data via download from an FTP site with each folder numbered in an ambiguous manner. All structures will be known to only two parties: the laboratories acquiring the data and the host of the workshop (AJW). All elucidations will be done blind and the participants will have the responsibility to provide a report including a table of the top 3 structures for each dataset, rank-ordered if possible, from most-likely to least-likely. When all reports have been submitted each participant will receive a report containing the correct structures for their review and in order for them to report on their successes and to further review and report on the data during the workshop.
The overall performance statistics comparing the results of the various participants will be reviewed and presented at the workshop by the workshop host.
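For the elucidation segment, each report would contain up to three rank-ordered candidates per dataset, so a natural summary is how often the correct structure is ranked first and how often it appears anywhere in the top three. The sketch below assumes that structures are compared through some identifier string (for example an InChIKey); that choice, the dataset labels and the function name are assumptions of this example and not part of the outline.

```python
from typing import Dict, List


def top_k_hit_rate(answer_key: Dict[str, str],
                   ranked: Dict[str, List[str]],
                   k: int) -> float:
    """Fraction of datasets whose correct structure appears among the
    first k ranked candidates. Datasets with no submission count as misses."""
    if not answer_key:
        return 0.0
    hits = 0
    for dataset, correct_id in answer_key.items():
        candidates = ranked.get(dataset, [])[:k]
        if correct_id in candidates:
            hits += 1
    return hits / len(answer_key)


# Placeholder identifiers, purely for illustration.
key = {"D1": "ID-A", "D2": "ID-B", "D3": "ID-C"}
report = {
    "D1": ["ID-A", "ID-X", "ID-Y"],   # correct structure ranked first
    "D2": ["ID-X", "ID-B", "ID-Y"],   # correct structure ranked second
    "D3": ["ID-X", "ID-Y", "ID-Z"],   # correct structure not proposed
}
print(top_k_hit_rate(key, report, k=1))
print(top_k_hit_rate(key, report, k=3))
```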
Outcome of Project
1) A review of the state of contemporary computer-based structure verification and elucidation
2) All data to be publicly shared and made available as Open Data for download and to become a gold standard reference set of data for the community to utilize for further testing and development
3) All processed spectra to be uploaded and available on a public domain database (e.g. ChemSpider) and associated with the correct chemical structure
4) A minimum of one co-authored publication reviewing the results of the workshop and associated studies
Your feedback, comments and questions are welcomed. We are especially looking for laboratories who are willing to provide sets of data for analysis during the project as well as software groups who develop algorithms for structure verification and elucidation and who wish to participate in the project.
Finding the Structure of Vitamin K1 Online
Posted by admin in Chemicals and our Health, General Communications, Open Science..all its forms, Wikipedia Chemistry on November 9, 2010
Presentation at FACCS2010 in Raleigh
Posted by admin in Community Building, Open Science..all its forms on October 21, 2010
Today I gave a presentation at FACCS 2010 here in Raleigh, NC. The abstract and embedded SlideShare presentation are listed below.
Building a Community Resource of Open Spectral Data
ChemSpider is an online database of almost 25 million chemical compounds sourced from over 300 different sources including government laboratories, chemical vendors, public resources and publications. Developed with the intention of building a community for chemists, ChemSpider allows its users to deposit data including structures, properties, links to external resources and various forms of spectral data. Over the past three years ChemSpider has aggregated almost 3000 spectra including Infrared and Raman Data and continues to expand as the community deposits additional data. The majority of spectral data is licensed as Open Data allowing it to be downloaded and reused in presentations, lesson plans and for teaching purposes. This presentation will provide an overview of our efforts to build a structure-indexed online database of spectral data, initiate a call to action to the community to participate in improving this resource for the community at large and discuss how such a resource could be used as the basis of a spectral game to teach students spectral interpretation.
Presentation at the BAGIM Meeting in Boston
Posted by admin in Community Building, General Communications, Open Science..all its forms, Wikipedia Chemistry on August 25, 2010
Tonight I gave a presentation at the BAGIM meeting in Boston. The abstract is below together with the embedded presentation from Slideshare.
ChemSpider – Is This The Future of Linked Chemistry on the Internet?
ChemSpider was developed with the intention of aggregating and indexing available sources of chemical structures and their associated information into a single searchable repository and making it available to everybody, at no charge. There are now hundreds of chemical structure databases such as literature data, chemical vendor catalogs, molecular properties, environmental data, toxicity data, analytical data etc. and no single way to search across them. Despite the diversity of databases available online their inherent quality, accuracy and completeness is lacking in many regards. ChemSpider was established to provide a platform whereby the chemistry community could contribute to cleaning up the data, improving the quality of data online and expanding the information available to include data such as reaction syntheses, analytical data and experimental properties. ChemSpider has now grown into a database of almost 25 million chemical substances, grows daily, and is integrated with over 400 sources, many of these directly supporting the Life Sciences. This presentation will provide an overview of our efforts to improve the quality of data online, to provide a foundation for a linked web for chemistry and to provide access to a set of online tools and services to support access to these data.
The Messy World of Even Curated Chemistry on the Internet
Posted by admin in Open Science..all its forms, Software on August 15, 2010
Recently I have been spending my night hours looking into the nature of curated chemistry on the internet. Three years ago I made some assumptions around the quality of certain online datasets when they were deposited onto ChemSpider. It was clear that a lot of internet chemistry datasets were “impure”…I think messy, untrustworthy and confused would be a fairer statement! However, there were a number of datasets that were manually curated and, at initial viewing, were higher quality. With time, however, I have become increasingly concerned with some of the datasets that I had originally cited as high quality. Over the next few days/weeks I will examine some of these in detail and highlight some of the issues I am seeing. I want to clarify that all chemical compounds, in terms of their connection tables, their stereochemistry and the association between the compound and the name(s), are assertions. However, there are “norms” for these structures….we would expect a particular structure for aspirin (acetylsalicylic acid), a single structure for Cholesterol and a single structure for Taxol. By the way, the links to Wikipedia are not assertions that the structures that are presently on Wikipedia are correct representations…but I can confirm that PREVIOUSLY I did work to confirm that every one of these was consistent with my investigations to assert the association between the chemical name and the structure. Since then it is possible that someone edited the structure…such is the world of Wikipedia!
Two of the linked data sources I have been investigating of late are DrugBank and the Protein Databank. Both of these are manually curated and are expected to be of high quality. In my discussions with various members of the Life Science industry I have heard many positive comments of these data sources as being trustworthy and high quality. I recently downloaded the drugbank small molecule set and started looking at it. Let’s take one example…
The Drugbank record DB02309 has the chemical name “5-Monophosphate-9-Beta-D-Ribofuranosyl Xanthine“. The structure on Drugbank is shown below.
The chemical name above is inconsistent with the structure…there is no stereochemistry in the molecule displayed despite the “-D-” in the name. The IUPAC name listed in the Drugbank record is “[(2R,3S,4R,5R)-5-(2,6-dioxo-3,7-dihydropurin-9-ium-9-yl)-3,4-dihydroxyoxolan-2-yl]methyl dihydrogen phosphate” and this clearly does not agree with the displayed structure.
The InChI listed on the record does not include a stereo layer (InChI=1/C10H13N4O9P/c15-5-3(1-22-24(19,20)21)23-9(6(5)16)14-2-11-4-7(14)12-10(18)13-8(4)17/h2-3,5-6,9,15-16H,1H2,(H2,19,20,21)(H2,12,13,17,18)/p+1/fC10H14N4O9P/h11-13,19-20H/q+1). The InChIKey is listed as:
DCTLYFZHFGENCW-NIVOTTSGCB
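The absence of a stereo layer can be checked directly on the InChI string, since stereochemistry is carried by layers whose prefixes are "t", "m" and "s" (tetrahedral) or "b" (double bond). The few lines of Python below are a plain string-handling sketch, not a replacement for proper InChI software, and the function name is made up for this example.

```python
DRUGBANK_INCHI = (
    "InChI=1/C10H13N4O9P/c15-5-3(1-22-24(19,20)21)23-9(6(5)16)14-2-11-4-7(14)"
    "12-10(18)13-8(4)17/h2-3,5-6,9,15-16H,1H2,(H2,19,20,21)(H2,12,13,17,18)"
    "/p+1/fC10H14N4O9P/h11-13,19-20H/q+1"
)


def stereo_layers(inchi: str) -> list:
    """Return the stereo-related layers (prefix t, b, m or s) found in an InChI."""
    # Skip the version prefix ("InChI=1") and the molecular formula.
    segments = inchi.split("/")[2:]
    # Layer prefixes are lowercase; the repeated formula after "/f" starts
    # with an uppercase element symbol and so is never matched here.
    return [seg for seg in segments if seg[:1] in ("t", "b", "m", "s")]


print(stereo_layers(DRUGBANK_INCHI))  # an empty list means no stereo layer
```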
The Drugbank record links to a structure with full explicit stereochemistry on PubChem here and to the ligand on the PDB ligand database (PDBeChem) hosted by the EBI here.
The molfile downloaded from DrugBank has no stereochemistry but lists both Canonical and Isomeric SMILES
| Isomeric SMILES | O[C@H]1[C@H](COP(O)(O)=O)O[C@H]([C@@H]1O)N1C=[NH+]C2=C1NC(=O)NC2=O |
| Canonical SMILES | OC1C(COP(O)(O)=O)OC(C1O)N1C=[NH+]C2=C1NC(=O)NC2=O |
It is clear what has happened, I believe….the Drugbank record has used the canonical SMILES to generate the structure image and has neglected the stereochemistry. However, the names carry the original stereochemistry information while the InChI comes from the structure with no stereo. I think that’s what happened. Let’s confirm.
ASSUMING that the isomeric SMILES string carries the appropriate stereochemistry, I can convert it to get the InChIKey below (generated using ACD/ChemSketch) and, using ACD/Name, the name below. I trust ChemSketch and ACD/Name to generate both appropriately as I managed these products while at ACD/Labs for over a decade.
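One quick, text-level way to test the idea that the canonical SMILES is simply the isomeric SMILES with the stereochemistry thrown away is to strip the chirality tags and compare the strings. The Python sketch below only handles bracketed carbons of the form [C@H] and [C@@H], which is all this particular record needs; it is not a general SMILES parser, and a cheminformatics toolkit would be the right tool for anything beyond this one example. The function names are invented for the illustration.

```python
import re

ISOMERIC = "O[C@H]1[C@H](COP(O)(O)=O)O[C@H]([C@@H]1O)N1C=[NH+]C2=C1NC(=O)NC2=O"
CANONICAL = "OC1C(COP(O)(O)=O)OC(C1O)N1C=[NH+]C2=C1NC(=O)NC2=O"

# Bracketed carbon carrying one hydrogen and a chirality tag: [C@H] or [C@@H].
CHIRAL_CH = re.compile(r"\[C@@?H\]")


def strip_carbon_stereo(smiles: str) -> str:
    """Replace [C@H] / [C@@H] with a plain C, discarding the stereo tag."""
    return CHIRAL_CH.sub("C", smiles)


def count_chiral_carbons(smiles: str) -> int:
    """Count the [C@H] / [C@@H] atoms written in the SMILES string."""
    return len(CHIRAL_CH.findall(smiles))


print(count_chiral_carbons(ISOMERIC))
print(strip_carbon_stereo(ISOMERIC) == CANONICAL)
```

For this record the stripped isomeric string is character-for-character identical to the canonical one, which is consistent with the two SMILES describing the same skeleton with and without stereochemistry.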
DCTLYFZHFGENCW-NSVMUQOTBF
9-{(2R,3R,4R,5S)-3,4-dihydroxy-5-[(phosphonooxy)methyl]tetrahydrofuran-2-yl}-2,6-dioxo-2,3,6,9-tetrahydro-1H-purin-7-ium
On Drugbank the chemical name listed is:
[(2R,3S,4R,5R)-5-(2,6-dioxo-3,7-dihydropurin-9-ium-9-yl)-3,4-dihydroxyoxolan-2-yl]methyl dihydrogen phosphate
Okay…the names are subtly different. Each contains three R centers and one S center, but the assignments differ, assuming that the nomenclature programs are using consistent numbering schemes. See below.
Name generated from Isomeric SMILES on DrugBank: 2R,3R,4R,5S
Chemical Name on DrugBank: 2R,3S,4R,5R
More on this later. Looking at the linked PubChem record gives the following name: [(2R,3S,4R,5R)-5-(2,6-dioxo-3,7-dihydropurin-9-ium-9-yl)-3,4-dihydroxyoxolan-2-yl]methyl dihydrogen phosphate, exactly the same one as listed on Drugbank….so one assumes that the chemical names on DrugBank come from PubChem. Downloading the molfile from PubChem into the same software used to generate InChIs and chemical names gives:
XHDARDSMKMUDDI-XWTUZWARBP
9-{(2R,3R,4S,5R)-3,4-dihydroxy-5-[(phosphonooxy)methyl]tetrahydrofuran-2-yl}-2,6-dioxo-2,3,6,7-tetrahydro-1H-purin-9-ium
DrugBank is linked out to the PDB ligands hosted by the EBI (PDBeChem) and looking at the XMP ligand here we see:
[(2R,3S,4R,5R)-5-(2,6-dioxo-3H-purin-7-ium-9-yl)-3,4-dihydroxy-oxolan-2-yl]methyl dihydrogen phosphate
This is the SAME stereochemistry in the chemical name as on DrugBank, but actually a different chemical name. It is definitely possible, and common, for different systematic names to exist for the same chemical but it does indicate the challenges of linking based on different identifiers.
DrugBank: [(2R,3S,4R,5R)-5-(2,6-dioxo-3,7-dihydropurin-9-ium-9-yl)-3,4-dihydroxyoxolan-2-yl]methyl dihydrogen phosphate
PDBeChem: [(2R,3S,4R,5R)-5-(2,6-dioxo-3H-purin-7-ium-9-yl)-3,4-dihydroxy-oxolan-2-yl]methyl dihydrogen phosphate
The InChIKeys between the different databases/tools are:
PDBeChem: DCTLYFZHFGENCW-KWDNBKPHDV
DrugBank: DCTLYFZHFGENCW-NIVOTTSGCB
PubChem: XHDARDSMKMUDDI-UUOKFMHZSA-O (this is a StdInChIKey)
ChemSketch: DCTLYFZHFGENCW-NSVMUQOTBF
ALL four are inconsistent.
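To see where the four keys disagree it helps to split each InChIKey at the first hyphen: the first block of 14 characters is a hash of the connectivity, and the remainder covers the other layers (stereochemistry among them) plus flag characters. The Python sketch below groups the sources by first block; the function names are made up for this example.

```python
from collections import defaultdict

KEYS = {
    "PDBeChem": "DCTLYFZHFGENCW-KWDNBKPHDV",
    "DrugBank": "DCTLYFZHFGENCW-NIVOTTSGCB",
    "PubChem": "XHDARDSMKMUDDI-UUOKFMHZSA-O",
    "ChemSketch": "DCTLYFZHFGENCW-NSVMUQOTBF",
}


def split_key(key: str) -> tuple:
    """Split an InChIKey into (first block, everything after the first hyphen)."""
    first, _, rest = key.partition("-")
    return first, rest


def group_by_first_block(keys: dict) -> dict:
    """Map each first block to the list of sources that share it."""
    groups = defaultdict(list)
    for source, key in keys.items():
        groups[split_key(key)[0]].append(source)
    return dict(groups)


for block, sources in group_by_first_block(KEYS).items():
    print(block, sources)

# How many distinct "rest of key" values exist among the sources?
print(len({split_key(k)[1] for k in KEYS.values()}))
```

PDBeChem, DrugBank and ChemSketch share the first block and differ only in what follows, so for those three the disagreement lies outside the connectivity. The PubChem key is a StdInChIKey, so its different first block may reflect the different generation options rather than a different skeleton.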
If I convert the SMILES string listed on the PDBeChem ligand database using ACD/ChemSketch then
O[C@H]1[C@@H](O)[C@@H](O[C@@H]1CO[P](O)(O)=O)n2c[nH+]c3C(=O)NC(=O)Nc23
produces a structure with stereochemistry of 2R,3R,4S,5R and the InChIKey : DCTLYFZHFGENCW-XWTUZWARBW.
The stereochemistry on PDBeChem agrees with that on PubChem (based on the name), the connectivity part of the InChIKey is consistent with all other systems (except PubChem) but is different to all other InChIKeys. It is also possible to download “ideal” and “representative” molfiles from the PDBeChem database.
The InChIKeys between the different databases/tools are now:
PDBeChem: DCTLYFZHFGENCW-KWDNBKPHDV
DrugBank: DCTLYFZHFGENCW-NIVOTTSGCB
PubChem: XHDARDSMKMUDDI-UUOKFMHZSA-O (this is a StdInChIKey)
ChemSketch: DCTLYFZHFGENCW-NSVMUQOTBF
PDBeChem: DCTLYFZHFGENCW-XWTUZWARBW (from Isomeric SMILES converted via ChemSketch)
PDBeChem: DCTLYFZHFGENCW-RDKRKOOMBN (from molfile from PDBeChem, 2R,3R,4S,5R)
DrugBank also links to the Protein Databank here. XMP is listed as a ligand as shown.
The XMP ligand links here to the detailed page containing the information linked below.
| Name | XANTHOSINE-5′-MONOPHOSPHATE |
| | 5′-xanthylic acid |
| | [(2R,3S,4R,5R)-5-(2,6-dioxo-3H-purin-7-ium-9-yl)-3,4-dihydroxy-oxolan-2-yl]methyl dihydrogen phosphate |
| Synonyms | 5-MONOPHOSPHATE-9-BETA-D-RIBOFURANOSYL XANTHINE |
| Formula | C10 H14 N4 O9 P |
| Molecular Weight | 365.21 g/mol |
| Type | NON-POLYMER |
| Isomeric SMILES (OpenEye) | c1[nH+]c2c(n1[C@H]3[C@@H]([C@@H]([C@H](O3)COP(=O)(O)O)O)O)NC(=O)NC2=O |
| InChI | InChI=1/C10H13N4O9P/c15-5-3(1-22-24(19,20)21)23-9(6(5)16)14-2-11-4-7(14)12-10(18)13-8(4)17/h2-3,5-6,9,15-16H,1H2,(H2,19,20,21)(H2,12,13,17,18)/p+1/t3-,5-,6-,9-/m1/s1/fC10H14N4O9P/h11-13,19-20H/q+1 |
| InChI key | DCTLYFZHFGENCW-KWDNBKPHDV |
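Comparing the InChI in this table with the one listed on the DrugBank record layer by layer shows exactly where they part company. The Python sketch below is simple string handling on the two InChIs quoted in this post; the function names are invented for the example and it makes no attempt to interpret the layers chemically.

```python
PDB_INCHI = (
    "InChI=1/C10H13N4O9P/c15-5-3(1-22-24(19,20)21)23-9(6(5)16)14-2-11-4-7(14)"
    "12-10(18)13-8(4)17/h2-3,5-6,9,15-16H,1H2,(H2,19,20,21)(H2,12,13,17,18)"
    "/p+1/t3-,5-,6-,9-/m1/s1/fC10H14N4O9P/h11-13,19-20H/q+1"
)
DRUGBANK_INCHI = (
    "InChI=1/C10H13N4O9P/c15-5-3(1-22-24(19,20)21)23-9(6(5)16)14-2-11-4-7(14)"
    "12-10(18)13-8(4)17/h2-3,5-6,9,15-16H,1H2,(H2,19,20,21)(H2,12,13,17,18)"
    "/p+1/fC10H14N4O9P/h11-13,19-20H/q+1"
)


def layers(inchi: str) -> list:
    """Split an InChI into its slash-separated layers, dropping the version prefix."""
    return inchi.split("/")[1:]


def extra_layers(a: str, b: str) -> list:
    """Layers that appear in InChI a but not in InChI b, in their original order."""
    other = layers(b)
    return [seg for seg in layers(a) if seg not in other]


print("Only in PDB:", extra_layers(PDB_INCHI, DRUGBANK_INCHI))
print("Only in DrugBank:", extra_layers(DRUGBANK_INCHI, PDB_INCHI))
```

The formula, connectivity, hydrogen and charge layers are identical; the only layers the PDB string has that the DrugBank string lacks are the stereo ones.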
The InChIKeys between the different databases/tools are now:
PDBeChem: DCTLYFZHFGENCW-KWDNBKPHDV
DrugBank: DCTLYFZHFGENCW-NIVOTTSGCB
PubChem: XHDARDSMKMUDDI-UUOKFMHZSA-O (this is a StdInChIKey)
ChemSketch: DCTLYFZHFGENCW-NSVMUQOTBF
PDBeChem: DCTLYFZHFGENCW-RDKRKOOMBN (from molfile from PDBeChem)
PDB_ligand: DCTLYFZHFGENCW-KWDNBKPHDV
Aagghhhhh…InChIKeys get very convoluted! What we see is that the chemical structure on PDB and on PDBeChem are the same. This is good news at least! There is a difference in the InChIKeys when I download the molfile but this can be explained easily…and in a later blog post.
We believe that the structure on PDB should be expected to be correct. We will assert this.
We expect that DrugBank is sourcing the chemical from PDB to add to their database. The chemical structure on DrugBank should coincide with that from PDB. Unfortunately the SMILES on PDB and DrugBank differ in two stereocenters. We don’t know why. Why the inconsistency? If the DrugBank data aren’t from PDB for the XMP ligand where did they come from?
Did PubChem pick up the structure of XMP from the PDB Database or from DrugBank? Let’s see. If I download the 2D molfile from PubChem and generate the chemical name and InChIs I get consistency…PubChem IS consistent with PDB. It is NOT consistent with DrugBank despite the fact that DrugBank is linked into this PubChem record.
This is a very convoluted, and maybe confusing analysis of ONE compound on DrugBank. I have looked at dozens and see similar issues. Assuming that PDB is the source database for data on DrugBank why are the structures differing so much? There are worse examples to come…the linking together of data on the web between even curated databases is an incredible mess.
Caveat: This is detailed and challenging work. I encourage anyone to check my work, see if I missed anything, and confirm or challenge the observations, as some of the issues I am seeing may be tool-based…the software tools I use may have issues with SMILES conversion, molfile or SDF reading etc. It is exacting to check chemical structures…