
Category Archives: Quality and Content

The old adage says that bigger is not always better. So, while we are focused on growing the content of the ChemSpider database we are very concerned with the quality of what is posted and processes by which this can be improved.

Copy Editors Redefine Standard Units During the Proofing Process

I write a lot of publications, averaging about one peer-reviewed publication or book chapter per month. I have published with a number of publishers including my employer (Royal Society of Chemistry), Elsevier, Wiley, Springer, ACS and many others. The experience with each publisher is different but generally pleasant and of high quality. However, once in a while the experience is “interesting”. I have especially had some very interesting peer-review “experiences”. But that is not the point of this post. This post is about the other end of the process…paper reviewed, paper accepted and into the proofing stage.

Last month Sean Ekins and I had a paper accepted and we listed in the paper some physicochemical parameters. These included logP, pKa, Lipinski parameters and Polar Surface Area, commonly known as PSA. When we got the paper back for proofing PSA had been replaced by “Prostate Specific Antigen“. It was a good catch on Sean’s part as first proofreader to spot it! How would that happen? One has to imagine a set of scripts that are searching for abbreviations and doing a find and replace. For PSA clearly context matters. For most biology papers the prostate specific antigen conversion for PSA might make sense. It doesn’t really make sense for chemistry and QSAR modeling. So, it’s all about context.
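To make the "context matters" point concrete, here is a minimal sketch of how an abbreviation script could check the surrounding text before expanding anything. It is not how any publisher's tooling works; the expansion table, the cue words and the function names are all assumptions made up for illustration.

```python
import re

# Hypothetical expansion table: one abbreviation, several domain-specific meanings.
EXPANSIONS = {
    "PSA": {
        "chemistry": "polar surface area",
        "biology": "prostate specific antigen",
    },
}

# Hypothetical cue words used to guess which domain a manuscript belongs to.
DOMAIN_CUES = {
    "chemistry": {"logp", "pka", "lipinski", "qsar", "physicochemical"},
    "biology": {"prostate", "serum", "biopsy", "antigen", "tumor"},
}


def guess_domain(text):
    """Return the domain with the most cue-word hits, or None if unclear."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {domain: len(words & cues) for domain, cues in DOMAIN_CUES.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_domain, best_score = ranked[0]
    # No hits, or a tie between domains: refuse to guess.
    if best_score == 0 or (len(ranked) > 1 and ranked[1][1] == best_score):
        return None
    return best_domain


def expand(abbreviation, text):
    """Expand an abbreviation only when the context is unambiguous."""
    domain = guess_domain(text)
    if domain is None:
        return None  # leave the abbreviation alone and ask the author
    return EXPANSIONS.get(abbreviation, {}).get(domain)
```

The design choice worth noting is the `None` return: when the context does not clearly point one way, the safer behaviour is to leave the abbreviation untouched and flag it for the author rather than replace it.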

We recently submitted an article in relation to our work on Computer Assisted Structure Elucidation. This is at a time when our book on CASE is about to go to the printers! This is one of our most interesting applications of ACD/Structure Elucidator and will be discussed in more detail when the paper is published. The paper is going to be published with Wiley’s Magnetic Resonance in Chemistry. MRC is my favorite NMR journal by far and I am always happy to publish there! After all these years I was shocked when the feedback from the copy-editors for our paper said…

The copy-editor was suggesting that we change all instances of PPM for chemical shift to mg/g. Excuse me? Are you serious? First of all, PPM is THE defined unit for chemical shift. Did IUPAC change it without us knowing? PPM is a dimensionless unit, based on Hz/MHz, thus the 10^-6 dependence. Even if it was in terms of Gauss (another interpretation of the mg/G) it should be microGauss/Gauss, so mcg/G.
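For readers who want the arithmetic behind "Hz/MHz", the usual definition of chemical shift can be written out as below. The numbers in the example (a 2000 Hz offset at a 400 MHz reference frequency) are purely illustrative and are not taken from the paper.

```latex
% Chemical shift as a ratio of frequencies: a difference in Hz divided by a
% reference frequency in MHz, which is why the result is in parts per million.
\delta \;=\; \frac{\nu_{\text{sample}} - \nu_{\text{ref}}}{\nu_{\text{ref}}}

% Illustrative numbers only:
\delta \;=\; \frac{2000\ \text{Hz}}{400 \times 10^{6}\ \text{Hz}}
       \;=\; 5 \times 10^{-6}
       \;=\; 5\ \text{ppm}
```

No mass appears anywhere in the ratio, so mg/g cannot be a substitute for it.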

Anyway, it makes no sense, right? Surely it is just an oversight, just a one-off? Unfortunately no…this entire paper HAS been published with every PPM reference to chemical shift changed into mg/g. How did that happen? We have to imagine a search and replace, acceptance of the “house style” by the author and no oversight by the editor post-proofing. The result: chemical shifts are now quoted in milligrams/gram. Terrific! Surely a context issue of some type…but truth be told, I am not sure for what!

Is this a side effect of non-skilled copy-editors? A result of off-shoring? Whatever the reason it is wrong..unless IUPAC truly decided on a new standard????! NOT….

 

Posted on November 3, 2011 in Quality and Content

 


Terminal Dimethyl means Death by Methane twice

When writing talks I try to find interesting (and where possible fun) examples of how challenging managing chemistry data is for all of us who manage tens of thousands, or in our case millions, of compound pages for the community to use. I have told many stories over the past few years of the challenges we collectively have with regard to data quality and how it flows between our databases unabated. My latest example, used at the recent talk at the EBI (ChemSpider – An Online Database and Registration System Linking the Web), was the structure known as Terminal Dimethyl, presently on PubChem, DrugBank, Wolfram Alpha and PDBe. It was originally inherited into ChemSpider also but has been deprecated. I left a comment on DrugBank a couple of weeks ago but it hasn’t been published yet…generally such errors are removed VERY quickly by the DrugBank hosts. I added a comment to Wolfram Alpha and received a canned response and no changes to the record as yet.

There ARE ways to communally resolve these issues and I will blog about that shortly.

 

Posted on October 31, 2011 in Data Quality, Humor, Quality and Content

 


Open Notebook Science and One Future for Scientific Research

A few weeks ago I was invited to give a presentation to the Board of Directors at Burroughs Wellcome. I was very interested in taking this opportunity to discuss my views on Open Science, Open Notebook Science, Open Data etc with this group of very esteemed scientists. However, it turned out it clashed with a planned vacation. Since my friend and frequent co-author Sean Ekins is an evangelist for open science for drug discovery, improving data quality, and Mobile Apps, and since we think alike on so many levels, I asked Sean whether he’d want to give the presentation. And, always welcoming adventure Sean jumped at the chance to present.

As it turned out Hurricane Rina resulted in us cancelling our vacation so I ended up attending the presentation with Sean. While we had bounced the slides between each other prior to the presentation Sean did a terrific job as the presenter and we had some very interesting questions regarding what is standing in the way of open science, especially around chemistry databases (of compounds), what are good examples of bioinformatics projects that are successful, and whether there are “risks” inherent to Open Science, especially in regards to what is shared online in public compound databases. I thoroughly enjoyed the meeting, short as it was and am glad that we were given the opportunity.

Sean has eloquently outlined the nature of the presentation at his site (he is Collabchem) and the presentation is below for your comments and review. I recommend that you check out Sean’s other presentations too!

 

 

Mobile Chemistry and Generation App

A presentation given today at the ICIC Meeting in Barcelona #icic2011

While the internet has been revolutionizing our access to data and information via our computers, computers have been miniaturizing to the point where a smart phone offers capabilities that many desktops could not deliver less than a decade ago. Mobile browser technology and app-based delivery of software have now put further access to data into our hands via phones, pads and tablets. Whether it be in the form of chemical calculators, accessing publishers’ websites or public domain databases containing millions of chemical structures, mobile chemistry is here and is expanding in capability and coverage at a dramatic rate. This presentation will review the status of mobile devices and how they are being used to enable chemists.

 

 

 


The story behind one publication – Or how not to win friends and influence people

This blog post is co-authored by Sean Ekins and me…we survived the challenges of getting this article published together so we’ll share the blog post together also!

We recently published an editorial in Drug Discovery Today. It is entitled “A Quality Alert and Call for Improved Curation of Public Chemistry Databases” (http://dx.doi.org/10.1016/j.drudis.2011.07.007). The abstract is below:

“In the last ten years, public online databases have rapidly become trusted valuable resources upon which researchers rely for their chemical structures and data for use in cheminformatics, bioinformatics, systems biology, translational medicine and now drug repositioning or repurposing efforts. Their utility depends on the quality of the underlying molecular structures used. Unfortunately, the quality of much of the chemical structure-based data introduced to the public domain is poor. As an example we describe some of the errors found in the recently released NIH Chemical Genomics Center “NPC browser” database as an example. There is an urgent need for government funded data curation to improve the quality of internet chemistry and to limit the proliferation of errors and wasted efforts.”

We certainly welcome your feedback on the article if you have read it. We have the PDF of the final published form of the article and will be able to circulate a limited number to interested parties (please contact us here in the comment section and provide emails).

Like everything we do in this complex-connected-world there is a long and interesting story behind this article. We would like to share it because we think it will be of interest and if nothing else it shows how difficult it is to get an important alert out there. OK, so it’s not on the scale of an asteroid coming towards earth in the next hour but the implications are far reaching and people need to know now to avert future scientific errors. Prior to our article being accepted for publication in DDT the manuscript had taken an interesting trip. Here it is in all its gory detail.

In a conversation one day Sean asked Tony whether or not he had ever written up the observations regarding the quality of data that he had been investigating, and blogging about, for almost 4 years. Tony commented that no, he’d considered it but had never got to it. Sean  suggested that there was likely a wider audience for the data being gathered than just the readers of the blog. Of course he was right…how many people would know Tony’s blog versus the number of people who might read an article in a journal. We have co-authored several articles in recent years and Sean has been a keen recipient of many of the curated datasets that Tony has assembled in order to use them in modeling studies. It was natural that we work together on a manuscript regarding the quality of data in Public Domain databases serving chemistry. We thought everybody would want to hear about it – how wrong we were!

We put an initial manuscript together and submitted it in late December 2010 as a Christmas present to the journal Science. We aimed BIG (why not- it was of general interest as molecule databases are proliferating) as we thought that the issue we had identified was an important one and urgent attention was needed. Simply put, millions of our tax dollars are invested in building public domain databases that are funded by grants to do this both in the US and elsewhere (Pubchem etc). Surely the quality of chemical structure data is important to everyone? Get a structure wrong and it’s no longer the molecule you say it is and that confuses everyone. The response from Science was “Thank you for submitting your manuscript “A Quality Alert for Chemistry Databases” to Science.  We discussed the manuscript extensively-but unfortunately we will not be able to publish it.   Although you have framed this issue as a data management and/or chemical complexity problem, it seems to us that it’s primarily economics-quality control is not  cost-free.” One rejection down.

Now if you were in our shoes what would you do next? That is right…we chose next to submit to Nature, and the paper was again rejected rather quickly. So we regrouped and then chose to submit to one of the top Open Access journals (PLoS Computational Biology) where it was accepted as being of interest and sent out for review (we were hopeful). We received some good feedback from the reviewers, some of which is abstracted below.

Reviewer #1: This manuscript brings forward a critical issue of chemical data quality in publicly available databases of biologically active molecules that has become available relatively recently.  …the authors should spend more time and give more examples from sister disciplines as well as illustrate how the use of wrong structures (or wrong data) affects molecular modeling/cheminformatics investigations. …the authors should provide compelling examples when using wrong structures without prior data curation led to erroneous models or predictions.
In summary, I believe that providing more screaming examples as to how data curaction impacts the outcome of scientific research in Cheminformatics, medicinal chemistry, chemical biology, etc., besides stating the obvious (for any scientific discipline) need to be rigorous and accurate when assembling and curating the data, will add significantly to the appeal of this highly relevant, potentially influential, and timely publication.

Reviewer #2: The quality of data freely available at the internet is more and more becoming a crucial issue in the medicinal chemistry community. This no longer only affects academic institutions which cannot afford paying for commercial databases, but also pharmaceutical industry, which is dumping these data into their in house data warehouses. However,  up to now the issue of quality was almost exclusively addressed from the side of biological data, as everyone assumed that a chemical structure is unambiguous, easy to check and thus the probability of errors is quite low. In addition, most of the data are provided by the owners or come from literature sources, where one can assume that a rigorous quality check has been performed.  In light of this, the present study is a highly valuable contribution to make the community aware of the fact that this might not be the case.… I propose to add several examples where a structure is represented indeed as wrong structure in that way that it would lead to a wrong logP, wrong TPSA or wrong number of H-bond acceptors.

Reviewer #3: This paper focuses on database errors in chemistry-based resources, specifically chemical names being associated with incorrect chemical structures.  It serves as a firm reminder to scientists to check their facts when working with freely available data sources.  It also highlights issues of data aggregation and the flow of chemistry data between resources.…No recent clear examples of publications were provided where where the data obtained from a public resource had an incorrect “dramatic effect” on results or gave “misleading predictions”.

One might even suggest that it seemed more like an “infomercial” geared towards favorable treatment towards the author’s public data resource than an actual scientific publication.  There seems little point to publish this work: without systematic study to show the extent of the issue; its effect on the literature; attempts to help identify the source errors (Lack of rigorous data exchange standards? File conversion issues? Disagreement over the actual chemical structure {as a function of time}?); strong, accurate and exemplary examples to demonstrate the issue; and suggestions on ways to remediate the situation.  As this paper stands, without a serious attempt to address the real issues, this reviewer recommends rejection of this publication.

The paper was rejected. At the same time as this rejection the NCGC data collection was released with the NPC Browser with much fanfare in Science Translational Medicine and Sean passed the dataset along to Tony for review of the data. It was a timely occurrence and a PERFECT example of the issues with data quality that we had been writing about in the articles rejected by all the major journals to this point. We commented back to the PLoS Computational Biology editor as follows “When this perspective was written we wrote it in a manner that it was not a research paper per se. I believe that we can address ALL comments from the reviewers using data that has been assembled in the past two weeks as a result of examining the dataset released by the NIH’s NCGC team. Some very basic comments are made below and I would expand on these issues in the edits to the manuscript. If you could please review the blogposts below and let me know whether you would accept an edited submission. We believe that our review of this particular dataset is a perfect example of the alert to data quality that we are pointing to and gives direct examples. Thank you.

http://www.chemconnector.com/2011/04/28/reviewing-data-quality-in-the-ncgc-pharmaceutical-collection-browser/

http://www.chemconnector.com/2011/05/02/what-is-a-drug-data-quality-in-the-ncgc-pharmaceutical-collection-browser-part-2/

http://www.chemconnector.com/2011/05/08/data-quality-in-the-ncgc-pharmaceutical-collection-browser-part-4/

Unfortunately, we could not sway the editor to accept the commentary despite inserting a lot of details about the NCGC dataset quality and fully addressing, we believe, all reviewers’ comments.

Since the manuscript about the NPC Browser had been published in STM we thought it was appropriate to draw attention to the data quality in the original publication in the same journal. With a newly updated manuscript we then submitted to STM. That rejection was fast: “Thank you for submitting your manuscript “A Quality Alert and Call for Improved Curation of Public Chemistry Databases” to Science Translational Medicine. During the initial screening process your manuscript did not receive a high enough priority rating to warrant in-depth review. We are therefore notifying you that your submitted manuscript has been rejected.”

They did not want to allow us to report on the issue that was perfectly represented in an article published in their journal and, in our opinion, had been inadequately reviewed in regards to the quality of data in the database reported and highlighted in their journal. We did our best to explain the connection of our article to the original paper regarding the NPC Browser but made no progress in having the decision reconsidered.

At this point we made a decision to submit as an editorial to Drug Discovery Today. We have published in this journal before because it has a wide readership in both the industry and elsewhere and covers chemistry and informatics topics. We also have had very positive experience with both the review and publishing processes, being fair and fast. In addition this journal is outside of any potential “political circles of influence” being Europe-based and with a broad international editorial board and audience. It was suggested to us that we were struggling to get the article published as we were highlighting potentially sensitive issues. We’re not saying we agree…but it was suggested. We submitted the balanced article in the form detailing our findings with the NPC Browser dataset, received very positive feedback and the rest is history. The article is now out. Finally. Eight months after it was submitted in its original form.

Publishing this article has been a very enlightening experience. With one journal we had two positive reviewers with suggested changes and one negative reviewer. We addressed all suggested changes but it was still rejected. We submitted to three other journals in total and were rejected outright. Our article is, we believe, an appropriate and well-founded editorial about the quality of data. The details about the NPC browser and NCGC Dataset were knitted into the article based on timing…it was the last public domain dataset to be released after we had written the original article. We have been accused of having something against the NCGC database.  It is but one representative of the data quality issue we have reported on. Many tens of hours have been spent reviewing the data quality and providing feedback and guidance on our findings. We are not against any database…we are for improving data quality.

What has it taught us that might be applicable to other people’s publishing experiences?

1. It appears that pointing out an important problem and offering solutions does not win you friends. Possibly it was too contentious and this is the reason it did not get published in some of the journals it was submitted to? Or maybe it was badly written…you can be the judge.

3. It is possible that doing scientific analyses of database quality is not something reviewers or journals want to hear about because they publish articles that use such databases to create new analyses and therefore new papers that they in turn publish. If you turn around and tell them that these very databases have errors then what is the quality of the output? Certainly not 100% trustworthy. But then again what is?

4. Many scientists TRUST chemistry and biology databases  that are so often reused, reanalyzed and integrated with new cheminformatics or bioinformatics tools. The authors of such articles do not appear to analyze for problems caused by poor data quality or hypotheses that are incorrect due to poor underlying data.

5. We believe that raising an alternative viewpoint that challenges the status quo, even in an area that one would think was solved (e.g. structure quality), does not get a fair or balanced airing in the mainstream media.

Thankfully, the wider dissemination of the DDT editorial should get more people thinking about it. We have prepared an extensive manuscript that provides an even deeper analysis of the topic and we will be submitting that to journals. If any journal editors are reading this blog post and would like to welcome us to submit the complete manuscript we’d love to hear from you.

This is, in our opinion, a newsworthy topic. The very fact that the NPC browser and underlying datasets had issues that illustrated the problems we had alerted Science, Nature and PLoS Computational Biology etc. to months before attests to that. We hope that those journals will agree that this is a serious issue. Who do we turn to to report a problem that affects everyone in the scientific community? The multiple submission processes, with reformatting and reworking, consumed many hours. The blog is certainly a simpler, openly reviewed format for reporting.

It should be noted in the interest of full disclosure that Sean is on the editorial board of the DDT journal, which comes with no financial reward, but that Tony submitted the article. Also, neither Sean nor I have a financial interest in improving database quality, but we believe it is the right thing to do.

 

Potential Issues with Google Scholar Citations

I have blogged recently about my experiences with Google Scholar Citations (1,2). It has been useful in highlighting what science I have published that people might find of interest, as well as trends in citation patterns. It has also highlighted some potential issues in the data.

My Top Citations on Google Scholar Citations

I must admit I was quite surprised to see that the top cited paper was one from Eastman Kodak Company where we looked at interactions between sodium dodecyl sulfate and gelatin, followed by work I did at the University of Ottawa. These papers date from 1994 and 1991 respectively, almost 20 years ago, so it does make sense that the aggregation of citations over the years might have reached those levels. However, I would have expected that my work in the areas of NMR prediction, Computer-Assisted Structure Elucidation (CASE) and Indirect Covariance would have garnered a lot more citations, but that work did come about 10 years later. It is good to see that the more recent papers, for example the one from 2008 on internet-based tools for communication and collaboration in chemistry, have garnered a following.

My papers that supposedly have no citations

Above is shown a list of my papers from as far back as 1990 that appear to not have any citations. There are also a lot of recent papers listed that I KNOW are cited, multiple times, as they have been referred to in some of my own publications. For example, the second one in the list, from 2009, entitled “Computer-assisted methods for molecular structure elucidation: realizing a spectroscopist’s dream” is an Open Access article and according to the journal statistics it is the top read article of all time on the Journal of Cheminformatics as shown here with, as of this writing, 10770 accesses. In fact, if I search the article directly on Google Scholar I find it IS cited 7 times as shown below.

 

The JChemInf Paper on Google Scholar shows 7 Citations

I don’t know why it shows up as cited in Google Scholar but not in Google Scholar Citations. However, the same issue exists for the paper on the Spectral Game. See below: it shows no citations on Google Scholar Citations but shows 7 on Google Scholar.

Citations for the Spectral Game Paper

Notice that in BOTH cases the article is listed as the Journal of Cheminformatics, not as the title of the paper. Maybe THIS is the reason the citations are missed. Maybe the publisher for the Journal of Cheminformatics is not exposed in a manner that has the publications indexed properly? Maybe….

 

 

 


Encouraging Collaboration in Washington as a Hub for Chemistry Databases

On August 25/26 I will be attending the 5th Meeting on U.S. Government Chemical Databases and Open Chemistry. I will have the opportunity to spend time with people I appreciate for the contributions they are making to chemistry: Martin Walker, JC Bradley, Andy Lang, Markus Sitzmann, Ann Richard, Frank Switzer, Evan Bolton, Marc Zimmermann, Wolf Ihlenfeldt, Steve Heller, John Overington, Noel O’Boyle, and many others. It is surely going to be an excellent meeting. The agenda is given here.

Some of the people listed above are associated with “Washington-based databases”. Databases that are developed in or around Washington by government-funded organizations – the FDA, NIH, NCBI/NLM, NCI, NIST. There are also other government funded databases, non-Washington-based, represented – EPA and CDC. If you are not sure what all those three letter acronyms are then here you go.

FDA – Food and Drug Administration

NIH – National Institutes of Health

NCBI/NLM – National Center for Biotechnology Information/National Library of Medicine

NCI – National Cancer Institute

EPA – Environmental Protection Agency

CDC – Centers for Disease Control and Prevention

NIST – National Institute of Standards and Technology

One organization with a chemistry database conspicuous by its absence is the NCGC data collection contained in the NPC Browser. I’ve blogged a lot about this one on this blog.

NCGC – NIH Chemical Genomics Center

I am hoping to get to talk to some members of the team if they attend the meeting though.

There will be a LOT of government databases represented at this meeting. I have experience with many of the databases provided by these institutions. The DSSTox database is one of the most highly curated databases based on my review of the data. The NCI resolver is an excellent resource with good quality data in terms of the accuracy of name-structure relationships.

The various databases are developed independently of each other. True, some of the databases contain contents from some of the other databases but, as far as I can tell, there is not much collaboration in terms of coordinated curation of data. What would it be like if each of these organizations participated in a roundtable discussion to agree to a process by which to collaboratively validate and curate the data, once and for all? Maybe this meeting can catalyze such a discussion. I would encourage the organizations to take advantage of other data sources that can share their data – ChEBI/ChEMBL is one example! If these various groups coordinate their work then the result could be a massively improved quality dataset to share across the databases and across the community. If this work was done then the group that assembled the NPC Browser would likely have a lot less work to do in terms of assembling the data. The various database providers should certainly have provided clean, curated data for many of the top known drugs. While working on a manuscript reviewing the quality of public domain chemistry databases I assembled a table of 25 of the top selling drugs in the US and checked the data quality in the NPC Browser relative to a gold standard set. The assembly of the data will be discussed in its entirety  in a later publication.

25 of the Top Selling Drugs in the USA - Data Quality in the NPC Browser

The errors listed in the table are:

1 Correct skeleton, No stereochemistry
2 Correct skeleton, Missing stereochemistry
3 Correct skeleton, Incorrect stereochemistry
4 Single component of multicomponent structure
5 Multiple components for single component structure
6 No structure returned based on Name Search
7 Incorrect skeleton
8 Multiple structures based on name search
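Several of these categories can be told apart mechanically if both the gold-standard structure and the database structure are available as standard InChI strings. The sketch below is only a rough illustration of that idea and is not the procedure used to assemble the table: it compares InChI layers as plain text, treats codes 6 and 8 (name-search outcomes) as out of scope, uses 0 for "no difference" (my own convention) and ignores isotope and fixed-hydrogen layers. The function names are hypothetical.

```python
import re

PREFIX = "InChI=1S/"
SKELETON_LAYERS = "chqp"  # connectivity, hydrogens, charge, (de)protonation
STEREO_LAYERS = "btms"    # double-bond and tetrahedral stereo plus their flags


def split_layers(inchi):
    """Split a standard InChI into its skeleton layers and stereo layers."""
    if not inchi.startswith(PREFIX):
        raise ValueError("expected a standard InChI string")
    formula, *layers = inchi[len(PREFIX):].split("/")
    skeleton = [formula] + [l for l in layers if l and l[0] in SKELETON_LAYERS]
    stereo = [l for l in layers if l and l[0] in STEREO_LAYERS]
    return skeleton, stereo


def stereo_centres(stereo):
    """Atom numbers mentioned in the /b and /t layers (a crude approximation)."""
    return {n for l in stereo if l[0] in "bt" for n in re.findall(r"\d+", l)}


def classify(reference, candidate):
    """Return an error code from the list above, or 0 if the layers agree."""
    ref_skel, ref_stereo = split_layers(reference)
    cand_skel, cand_stereo = split_layers(candidate)
    if ref_skel != cand_skel:
        # Components are separated by "." in the formula layer.
        ref_parts = ref_skel[0].count(".") + 1
        cand_parts = cand_skel[0].count(".") + 1
        if cand_parts < ref_parts:
            return 4  # fewer components than the reference
        if cand_parts > ref_parts:
            return 5  # more components than the reference
        return 7      # same component count, different skeleton
    if ref_stereo == cand_stereo:
        return 0
    if not cand_stereo:
        return 1      # correct skeleton, no stereochemistry at all
    if stereo_centres(cand_stereo) < stereo_centres(ref_stereo):
        return 2      # correct skeleton, some stereocentres missing
    return 3          # correct skeleton, stereochemistry differs


# L-alanine as the reference, compared with a flat structure and with D-alanine.
L_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
FLAT_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"
D_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"

print(classify(L_ALA, FLAT_ALA))  # 1
print(classify(L_ALA, D_ALA))     # 3
```

A text comparison like this is only a first pass; anything it flags still needs a chemist to look at the structure.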

 

Clearly there are a lot of errors in the structures associated with 25 of the best selling drugs on the US market. These should be the easy ones to get right as they are so well known!!! Collaboration between the domain’s top database providers would have helped, almost certainly. This would not necessarily be an issue of meshing technologies but of agreeing on a common goal to have the highest quality data available. Since the government puts so much money into the development of these databases it would be appropriate to have some oversight and push for aligning efforts. Collaboration is essential!

With that in mind…a shameless pointer to how Sean Ekins, Maggie Hupcey and I BELIEVE in the need for collaboration…our book. If we can encourage others in the government chemistry databases to adopt active collaborative approaches wonderful things could happen.

Collaborative Computational Technologies for Biomedical Research

 

 


Continuing Review of the NPC Browser Content – Most Cleanup is the Responsibility of the Hosts

In the past two weeks I have been in a number of discussions regarding my blog posts about the NPC Browser. My last blog post brought a comment from Ajit Jadhav, one of the authors of the original Science Translational Medicine publication about the NPC Browser. Ajit commended Sean and me on our light-hearted approach to discussing the issues of quality.

Specifically he picked me up on the fact that American Cockroach IS listed on Dailymed as a medication. VERY interesting!  He commented

“Tony, Thanks for the amusing post. See here for more details of one example, american cockroach, which is Antigen Laboratories’ allergenic extract: http://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?id=12809.

And we can go on. But I would rather keep moving in a forward direction in life.

Regarding NPC… in case if it’s not clear yet, the collection is a small subset of HTS amenable compounds. The other content in the NPC Browser is supplementary.

Regarding you and Sean Ekins, you guys should go on the road as a comedic duo act. After all the serious scientific talks, the two of you can be the entertainment. One can be called Spinning and the other can be called Wheel.

I will volunteer to do the drum rolls for you :)

Gents, have fun working. Or… spinning if you enjoy that more. Apparently, the NPC Browser has hit a nerve in each of you. So I will check back on the blog to see what other entertainment you’re dishing out.  The more outrageous, the better! It just reveals more about you than the NPC Browser :)

Ajit”

My response is here and I insert a slice below.

“NPC was not ORIGINALLY described as a small set of HTS amenable compounds according to the Science Trans Med paper that describes it. According to the paper, and I quote “…the NCGC Pharmaceutical Collection (NPC) – a definitive collection of drugs registered or approved for use in humans or animals.” It also states that the “NPC is the most comprehensive and accurate exposition to date of MEs registered or approved for human or veterinary use worldwide.” Having reviewed a subset of structures related to a particular class of compounds, over 140 entities, with a >70% failure in “accuracy”, I have to question this statement. I judge that the Merck Index (book form or electronic form) is a better collection. In case you are not aware of this resource details are:

http://www.merckbooks.com/mindex/referenceset.html

As I blogged in my post “Rabbits, Potatoes and other Vegetables in the NCGC Database” there are some interesting things in the database. Responding to a comment on that post I commented on other things listed in the database.

WATERCRESS
WATERMELON
WHEAT
WHEAT BRAN
WHEAT ENDOSPREM
WHEAT GERM
WHEAT GLUTEN
WHEAT GLUTEN
WHEAT MIDDLINGS
WHEAT MIDDLINGS
WHEAT MIDDLINGS
WHEY
WHITE FISH
WHITE MUSTARD
WHITE OAK BARK
WHITE PEPPER
WHITE WILLOW EXTRACT
WILD ROSE EXTRACT
WINE

I’ve searched for these in DailyMed... not much luck, I’m afraid 🙁

I DO believe that the list below would give me hits in DailyMed, but these members of the NCGC pharmaceutical collection are likely just a little generic!

List of "generics" in the NCGC pharmaceutical collection

It’s likely that almost all DailyMed labels contain “ingredients, water and additives”. I wonder how many of them contain “self heal” though.

As defined in the original paper, “…the NCGC Pharmaceutical Collection (NPC) – a definitive collection of drugs registered or approved for use in humans or animals.” Also, “NPC is the most comprehensive and accurate exposition to date of MEs registered or approved for human or veterinary use worldwide.” I challenge that based on the observations above.

I have to argue that it is time to do some very basic browsing of the entries in the database that are simply text entries with no structures. There are MANY that are distinct chemicals for which the structure can easily be located. There are also many common terms that should simply be deleted from the dataset. Hundreds, in fact. I judge that one good evening of work would catch many of the most obvious terms that are in error. I doubt that a crowdsourcing approach will address this; this very basic cleanup is the responsibility of the database hosts. It’s certainly a reputation issue. Ajit commented “The other content in the NPC Browser is supplementary”. I am trying to understand how. It doesn’t align with my interpretation of the paper or that of many of the people who have been discussing the data set with me in recent weeks.
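A first pass over the text-only entries could be as simple as the sketch below. It is only an illustration, not a description of how the NPC Browser data is stored or curated: the function name, the input format (a plain list of entry names) and the short stoplist of generic terms are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical stoplist of generic terms; a real one would be far longer.
GENERIC_TERMS = {"WATER", "INGREDIENTS", "ADDITIVES"}


def flag_text_entries(names, generic_terms=GENERIC_TERMS):
    """Flag repeated and generic names among text-only database entries.

    `names` is assumed to be a list of entry names that have no structure.
    """
    # Normalize case and whitespace so "water" and "WATER " compare equal.
    normalized = [n.strip().upper() for n in names if n.strip()]
    counts = Counter(normalized)
    duplicates = sorted(n for n, c in counts.items() if c > 1)
    generic = sorted(set(normalized) & set(generic_terms))
    return {"duplicates": duplicates, "generic": generic}


entries = [
    "WHEAT GLUTEN", "WHEAT GLUTEN", "WHEY", "water",
    "WHEAT MIDDLINGS", "WHEAT MIDDLINGS", "WHEAT MIDDLINGS",
]
report = flag_text_entries(entries)
```

A report like this does not decide anything by itself; it simply gives a curator a short list of candidates to review and delete or merge.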

 

 

Searching for “Complete Synonyms” in PubChem and the NPC Browser

I am interested in feedback from online databases as to expected behaviors from a search. PubChem has a Complete Synonym search that limits a chemical name based search to the synonym field. Without that fielded search, I assume the search runs across all text in a record. The difference in the results is shown below. The first image shows a search for Taxol, which returns 59 results.

A search for Taxol in PubChem

Below is a search on Taxol[completesynonym]. This search returns 5 hits for Taxol.

I wonder whether most users of PubChem know that they need to add the [completesynonym] definition to limit the search? You might want to try Diamond and Diamond[completesynonym] as searches and look at the results.
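For anyone who wants to compare the two searches programmatically, here is a minimal sketch using NCBI's E-utilities ESearch service against the PubChem Compound database (`pccompound`). It assumes that the `[completesynonym]` field tag used on the website is accepted in the same way by ESearch; the helper names are made up for this example, and the counts returned today will not match the numbers quoted above, since the database keeps growing.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(name, complete_synonym=False):
    """Return the search term, optionally limited to the synonym field."""
    return f"{name}[completesynonym]" if complete_synonym else name


def esearch_url(name, complete_synonym=False):
    """Build an ESearch URL for the PubChem Compound database."""
    params = {
        "db": "pccompound",
        "term": build_query(name, complete_synonym),
        "retmode": "json",
    }
    return f"{ESEARCH}?{urlencode(params)}"


def hit_count(name, complete_synonym=False):
    """Query ESearch (needs network access) and return the number of hits."""
    with urlopen(esearch_url(name, complete_synonym)) as resp:
        data = json.load(resp)
    return int(data["esearchresult"]["count"])
```

Calling `hit_count("Taxol")` and `hit_count("Taxol", complete_synonym=True)` side by side should show the same kind of gap between the free-text and the fielded search.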

I am assuming that a similar type of search can be conducted on the NPC Browser to limit results, as a search on the drug Lidocaine returns 14 chemicals, all of them different. If this search exists, I have missed it. Can anyone comment?

With ChemSpider we do our utmost to return a single structure for a clearly unique name such as Taxol and Lidocaine. We believe that’s what most people would expect. Thoughts and comments welcome.

 


Integrating and Curating Internet-Based Chemistry Resources to Serve Life Scientists

Today I gave a presentation at PharmSciFair2011 here in Prague. I spoke about ChemSpider, the Open PHACTS project, data quality in public domain databases and the work we will be doing to serve the Open PHACTS project in terms of data mappings, improving data quality and supporting a semantic web for Drug Discovery. The abstract and SlideShare presentation are below.
Title: ChemSpider: Integrating and Curating Internet-Based Chemistry Resources to Serve Life Scientists
The internet now offers access to a myriad of online resources that can be of value to chemists working in the Life Sciences. While finding information online is, in many cases, a simple search away, the accuracy and validity of the associated data and information should be questioned. As more databases and resources are introduced online, and commonly not integrated with other resources, a scientist must perform multiple searches and then undertake the task of meshing and merging data. ChemSpider is a freely accessible online database that has taken on the challenge of meshing together distributed resources across the internet to provide a structure-based hub. It is a crowdsourcing environment hosting over 26 million unique compounds linked out to over 400 data sources. With well-defined programming interfaces for integration, ChemSpider has been integrated with many commercial and open software packages and is presently serving as the chemistry foundation for the IMI Open PHACTS project.
 
 