Search results for ‘NPC Browser’

Continuing Review of the NPC Browser Content – Most Cleanup is the Responsibility of the Hosts

In the past two weeks I have been in a number of discussions regarding my blog posts about the NPC Browser. My last blog post brought a comment from Ajit Jadhav, one of the authors of the original Science Translational Medicine publication about the NPC Browser. Ajit commended Sean and me on our light-hearted approach to discussing the issues of quality.

Specifically, he picked me up on the fact that American Cockroach IS listed on DailyMed as a medication. VERY interesting! He commented:

“Tony, Thanks for the amusing post. See here for more details of one example, american cockroach, which is Antigen Laboratories’ allergenic extract: http://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?id=12809.

And we can go on. But I would rather keep moving in a forward direction in life.

Regarding NPC… in case if it’s not clear yet, the collection is a small subset of HTS amenable compounds. The other content in the NPC Browser is supplementary.

Regarding you and Sean Ekins, you guys should go on the road as a comedic duo act. After all the serious scientific talks, the two of you can be the entertainment. One can be called Spinning and the other can be called Wheel.

I will volunteer to do the drum rolls for you :)

Gents, have fun working. Or… spinning if you enjoy that more. Apparently, the NPC Browser has hit a nerve in each of you. So I will check back on the blog to see what other entertainment you’re dishing out.  The more outrageous, the better! It just reveals more about you than the NPC Browser :)

Ajit”

My response is here and I insert a slice below.

“NPC was not ORIGINALLY described as a small set of HTS amenable compounds according to the Science Trans Med paper that describes it. According to the paper, and I quote, “…the NCGC Pharmaceutical Collection (NPC) – a definitive collection of drugs registered or approved for use in humans or animals.” It also states that the “NPC is the most comprehensive and accurate exposition to date of MEs registered or approved for human or veterinary use worldwide.” Having reviewed a subset of structures related to a particular class of compounds, over 140 entities, with a >70% failure in “accuracy”, I have to question this statement. I judge that the Merck Index (book form or electronic form) is a better collection. In case you are not aware of this resource, details are here:

http://www.merckbooks.com/mindex/referenceset.html

As I blogged in my post “Rabbits, Potatoes and other Vegetables in the NCGC Database” there are some interesting things in the database. Responding to a comment on that post I commented on other things listed in the database.

WATERCRESS
WATERMELON
WHEAT
WHEAT BRAN
WHEAT ENDOSPREM
WHEAT GERM
WHEAT GLUTEN
WHEAT GLUTEN
WHEAT MIDDLINGS
WHEAT MIDDLINGS
WHEAT MIDDLINGS
WHEY
WHITE FISH
WHITE MUSTARD
WHITE OAK BARK
WHITE PEPPER
WHITE WILLOW EXTRACT
WILD ROSE EXTRACT
WINE

I’ve searched these in DailyMed …not much luck I’m afraid 🙁

I DO believe that the list below would give me hits in DailyMed, but these members of the NCGC pharmaceutical collection are likely just a little generic!

List of "generics" in the NCGC pharmaceutical collection

It’s likely that almost all DailyMed labels contain “ingredients, water and additives”. I wonder how many of them contain “self heal” though.

As defined in the original paper, “…the NCGC Pharmaceutical Collection (NPC) – a definitive collection of drugs registered or approved for use in humans or animals.” Also, the “NPC is the most comprehensive and accurate exposition to date of MEs registered or approved for human or veterinary use worldwide.” I challenge that based on the observations above.

I have to argue that it is time to do some very basic browsing of the entries in the database that are simply text entries with no structures. There are MANY that are distinct chemicals for which the chemical can easily be located. There are also many common terms that should simply be deleted out of the dataset. Hundreds in fact. I judge that one good evening of work would catch many of the most obvious terms that are in error. I doubt that a crowdsourcing approach will address this and this very basic clean up is the responsibility of the database hosts. It’s certainly a reputation issue. Ajit commented “The other content in the NPC Browser is supplementary”. I am trying to understand how? It doesn’t align with my interpretation of the paper or that of many of the people who have been discussing the data set with me in recent weeks.

Searching for “Complete Synonyms” in PubChem and the NPC Browser

I am interested in feedback on what behavior users expect from a search in online databases. PubChem has a Complete Synonym search that limits a chemical-name-based search to the synonym field. Without that fielded search the search runs across all text in a record, I assume. The difference in the results is shown below. The top image shows a search for Taxol returning 59 results.

A search for Taxol in PubChem

Below is a search on Taxol[completesynonym]. This search returns 5 hits for Taxol.

I wonder whether most users of PubChem know that they need to add the [completesynonym] definition to limit the search? You might want to try Diamond and Diamond[completesynonym] as searches and look at the results.
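
The difference between the two search modes can be sketched in a few lines of Python. This is a toy illustration only, not how PubChem or the NPC Browser is implemented: the record layout, the `description` field and both function names are assumptions made for the example.

```python
# Toy records: the field names ("synonyms", "description") are hypothetical.
RECORDS = [
    {"name": "Paclitaxel",
     "synonyms": ["Taxol", "Paclitaxel"],
     "description": "Antimitotic agent."},
    {"name": "Ixabepilone",
     "synonyms": ["Ixabepilone"],
     "description": "Like taxol, Ixabepilone binds to tubulin."},
]


def search_all_text(records, query):
    """Return names of records that mention the query in any text field."""
    q = query.lower()
    hits = []
    for rec in records:
        text = " ".join([rec["name"], rec["description"], *rec["synonyms"]])
        if q in text.lower():
            hits.append(rec["name"])
    return hits


def search_complete_synonym(records, query):
    """Return names of records where the query equals a whole synonym."""
    q = query.lower()
    return [rec["name"] for rec in records
            if any(q == syn.lower() for syn in rec["synonyms"])]


print(search_all_text(RECORDS, "Taxol"))          # both records match
print(search_complete_synonym(RECORDS, "Taxol"))  # only the Paclitaxel record
```

The unfielded search picks up any record whose free text happens to mention the name, while the synonym-only search requires the whole synonym to match, which is why the hit counts can differ so much.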

I am assuming that on the NPC Browser a similar type of search can be conducted to limit results, as a search on the drug Lidocaine returns 14 chemicals, all of them different. If this search exists I have missed it. Can anyone comment?

With ChemSpider we do our utmost to return a single structure for a clearly unique name such as Taxol and Lidocaine. We believe that’s what most people would expect. Thoughts and comments welcome.


Thanks for the acknowledgment from the hosts of the NPC Browser

Over the past few weeks I have been reviewing the NPC Browser and NCGC data content and have published a number of posts on this blog. I have exchanged a small number of emails with the team and they have graciously acknowledged my efforts on the NPC webpage as shown below.

“Acknowledgements

The NPC resource has benefited immensely from community feedback since its initial release. We are particularly grateful to the following individuals who have generously donated their time and/or resources in helping us improve the NPC software and database:

  • Tudor Oprea, Oleg Ursu, and Sunset Molecular LLC graciously donated the drug subset of the WOMBAT database. This dataset was instrumental in enabling us to validate a large number of curated structures (i.e., curation of curations so to speak).
  • Manish Sud was an early adopter of the NPC resource. His thorough analysis helped us debugged a number of errors in the software as well as database.
  • Matthew Hall of NCI provided valuable feedback on the handling of metal containing compounds (certainly beyond organometallics).
  • Antony Williams’ critical scrutinies of the compound content revealed numerous errors in the original version of the database. He also provided valuable feedback on other issues related to the software and data curation.

We also would like to extend our gratitude to everyone who has contributed to the curation effort of the NPC database. As a token of our appreciation, we have created, for each curator, a dedication badge within the software to acknowledge his/her contribution.”

Thanks for the acknowledgments guys! I was curating data again yesterday…glad to participate.


Posted on July 26, 2011 in NPC Browser and NCGC Collection

Duplicate compounds in the NPC Browser and NCGC Dataset

I am presently working on a couple of articles, book chapters and guest blog posts regarding quality in public domain chemistry databases. In so doing I have continued to work through the data contained within the NPC Browser that I have blogged about many times before. I HAVE been adding curation comments to the data as I have worked through them and have removed inappropriately associated chemical names. Eventually it became too much of a burden relative to getting my own work done, as there are so many edits required. What I have been looking for specifically are examples of something I expected to exist in the database: a failure to deduplicate. Deduplication, in terms of chemistry databases, is collapsing together records that share the same chemical structure. This sounds easy but it isn’t necessarily so. Consider some of the complexities of collapsing tautomers. SIMPLE collapsing can be done by generating InChIKeys and deduplicating on them, but InChI tautomer detection is imperfect and this approach will fail regularly. The majority of the cheminformatics toolkits have their own ways of generating fingerprints to deal with this issue of deduplication.

While browsing the database I came across Ranitidine, the active component of the well known drug Zantac. I found two records in the database. They are shown below and numbered as 1/2 and 2/2.

Ranitidine record 1.

Ranitidine record 2.

I have compared these records as molfiles. I have compared the SMILES strings (below).

CN\C(NCCSCC1=CC=C(CN(C)C)O1)=C/[N+]([O-])=O
CN\C(NCCSCC1=CC=C(CN(C)C)O1)=C/[N+]([O-])=O

I have compared the InChIs and the InChIKeys:

InChI=1S/C13H22N4O3S/c1-14-13(9-17(18)19)15-6-7-21-10-12-5-4-11(20-12)8-16(2)3/h4-5,9,14-15H,6-8,10H2,1-3H3/b13-9+
InChI=1S/C13H22N4O3S/c1-14-13(9-17(18)19)15-6-7-21-10-12-5-4-11(20-12)8-16(2)3/h4-5,9,14-15H,6-8,10H2,1-3H3/b13-9+

VMXUWOKSQNHOCA-UKTHLTGXSA-N
VMXUWOKSQNHOCA-UKTHLTGXSA-N

Try as I might I don’t see a difference between these structures. Why were they not deduplicated? This leads to the question of how many more duplicates are in the database, and why. I have no idea…just an observation.
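
The SIMPLE form of deduplication described above, grouping records on an identical InChIKey, can be sketched as follows. The two InChIKeys are the ones quoted above for the Ranitidine records; the record IDs, the third entry and the function names are made up for illustration, and nothing here reflects how the NPC Browser is actually built.

```python
from collections import defaultdict

# (record ID, InChIKey) pairs. The IDs are hypothetical labels.
RECORDS = [
    ("ranitidine-1", "VMXUWOKSQNHOCA-UKTHLTGXSA-N"),
    ("ranitidine-2", "VMXUWOKSQNHOCA-UKTHLTGXSA-N"),
    ("other-record", "PLACEHOLDER-KEY-FOR-ILLUSTRATION"),  # not a real InChIKey
]


def group_by_inchikey(records):
    """Map each InChIKey to the list of record IDs that share it."""
    groups = defaultdict(list)
    for record_id, inchikey in records:
        groups[inchikey].append(record_id)
    return dict(groups)


def find_duplicates(records):
    """Return only the groups that contain more than one record."""
    return {key: ids for key, ids in group_by_inchikey(records).items()
            if len(ids) > 1}


print(find_duplicates(RECORDS))
```

Exact-key grouping like this only catches records whose keys are identical, as they are for the two Ranitidine entries. It will not merge tautomers or other cases where the same compound yields different keys.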


Confusing Search Results in the NPC Browser

For the past few days I have been in Prague at the PharmSciFair meeting. Beautiful city (with a little too much graffiti on the glorious architecture), good meeting and way too long a flight. I am traveling alone and despite the networking the disrupted sleep patterns have me working on various projects as I struggle to get more than 4 hours of sleep a night. The joys of sleep deprivation include productivity! After a break from looking at the NPC Browser and my earlier commentaries I did an update of the browser when I restarted it last night as Sean’s Collabchem blog has been keeping me updated on new changes.  Sean had informed me in two separate posts (X, Y) about the new disclaimer that states:

We are very much aware of a number of issues with the database and software. Please bear with us as we work meticulously to resolve them. Below are some of the specific issues we are addressing in the next update (which is likely to be version 1.1.0) of the software in a couple of weeks:

  • A new database release that incorporates a number of improvements to compound records (e.g., stereochemistry annotations, 2D layout, etc.).
  • There exists an indicator associated with each compound record to notify whether the record has or has not been manually curated. If the record has been curated, curators are acknowledged with proper attribution.
  • A simple curation mechanism is integrated with each compound record so as to facilitate error reporting at finer resolutions (e.g., on an attribute basis).
  • Access to all curated records can be done through a single filter mechanism.
  • About ~2,000 QC LC/MS spectra are available for the “NPC screening” subset, courtesy of NCGC’s Bill Leister and the Analytical Chemistry group.
  • Other minor bug fixes and enhancements.

I started to poke around looking for the new features and discuss some below. Some definite moves in the right direction! Like it. I’ll comment on them in a separate post.

As I played more with the system and tried some new searches I started to get very concerned with the results I received. For example, a search on Taxol gives TWO answers. One is Taxol and the other is Ixabepilone, see below. This is weird. They are NOT the same structure at all, so why would a search on one compound name bring back a separate “drug”, when drugs are really what the browser is supposed to provide us access to? The original paper reports “the creation of a definitive, complete, and non-redundant list of all approved molecular entities”. Certainly the two compounds are non-redundant (compare Taxol and Ixabepilone). My first thought was that the search is looking for associated information that has been attached to the compound somehow. I found it under the therapeutic tab where it says “Like taxol, Ixabepilone binds to the αβ-tubulin heterodimer subunit.” If the matching is that subtle it will likely give rise to some very interesting challenges (see below). It means that I might search for a drug and ANY mention of that drug will retrieve hits. I expect that most people would expect to retrieve the drug itself, not every mention of that drug. Maybe I’m wrong.

To take this to an extreme, let’s search for “Manganese” and see what we get in the NPC Browser. For one, we see elemental manganese as the “compound”, but associated with the label for the ionic form.

We also see a LOT of organic compounds, one of which is shown below. This is not an inherently obvious result, but maybe it is exactly as it is expected to be.

Of the 17 compounds retrieved with a search on Manganese only 5 actually have manganese in the formula. Do you find these results confusing? I would expect a synonym-only search to retrieve just Mn++ (and maybe Mn if that is distinguished from Mn++ as a drug).
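
Counting how many hits actually contain an element is easy to automate once the molecular formulae are in hand. Below is a minimal sketch, assuming simple formula strings such as `KMnO4`; the hit list is hypothetical and is not the NPC Browser output, and the function names are my own.

```python
import re

# One element symbol followed by an optional count, e.g. "Mn", "O4", "C13".
ELEMENT_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def elements_in_formula(formula):
    """Return the set of element symbols in a simple formula such as 'MnCl2'."""
    return {symbol for symbol, _count in ELEMENT_TOKEN.findall(formula)}


def contains_element(formula, symbol):
    """True when the given element symbol occurs in the formula."""
    return symbol in elements_in_formula(formula)


# Hypothetical hit list of (name, molecular formula) pairs.
HITS = [
    ("Manganese(II) chloride", "MnCl2"),
    ("Potassium permanganate", "KMnO4"),
    ("Hypothetical organic hit", "C13H22N4O3S"),
]

with_mn = [name for name, formula in HITS if contains_element(formula, "Mn")]
print(f"{len(with_mn)} of {len(HITS)} hits contain Mn: {with_mn}")
```

Run over a real result set, a check like this would reproduce the kind of tally given above, where only a minority of the hits contain the element that was searched for.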

Review of NCGC Dataset in the NPC Browser Finished

For the past couple of weeks I have been looking at the NPC browser and the dataset contained within it. I am using it as an example of what type of data is finding its way into the public domain for use by Life Scientists. I had the “opportunity” to take a couple of LONG flights to and from Europe last week and spend late nights in hotels. During the trip I finished my review of the data. This does NOT mean that I have a fully curated dataset…no chance. That would take a few weeks to assemble! However, it is enough data to insert some of the conclusions into a paper that has just returned from review as well as provide data for a paper presently being assembled. With that said I’m unlikely to report much more on the data until that paper is through review.

What I can say is that the dataset does not seem to align with a lot of the comments in the original paper listed below.

R. Huang, N. Southall, Y. Wang, A. Yasgar, P. Shinn, A. Jadhav, D.-T. Nguyen, C. P. Austin. The NCGC Pharmaceutical Collection: A Comprehensive Resource of Clinically Approved Drugs Enabling Repurposing and Chemical Genomics. Science Translational Medicine, 2011; 3 (80): 80ps16 DOI: 10.1126/scitranslmed.3001862

The data has hardly been curated and many of the suggested heuristics applied to the assembly of the dataset failed, judging by what came through in the set that was issued. One of my favorite “drugs” in the screening set is shown below. I doubt Mn2+ is easily marketed as a drug, and having Mn2+ labeled as Selenium oxide, cadmium salt (1:1) seems a little strange. Having it labeled as Strontium tetraborate or barium tetraborate seems just as weird. This is one of many…many others will be discussed in a publication presently in development. Watch this space.

One of the "drugs" from the HTS Screening Set in the NPC Browser

Support for Common Compounds in the NPC Browser. Data Quality Part 3

Time for bed, but first a simple observation. There is no reason that potassium dichromate cannot be accurately represented in a database. We do it on ChemSpider with no issue here. But for some reason it equates to Cr(I) hydroxide in the NPC Browser? Ummm….nope. Definitely not.

The structure for potassium dichromate....it is not Cr(I) hydroxide

Data Quality in the NCGC Pharmaceutical Collection Browser Part 4

I am now back in the US after a week in Europe. Late at night, with a disrupted sleep pattern and looking for something to soothe me to sleep, I have been wandering through the rest of the NPC Browser data set [1,2,3] looking for more patterns in the data to see if I can come up with some general advice and cautions about how to build datasets of chemistry. I already have my “conclusion” in my head as to the best advice I can give to any government organization, and others, who are trying to build chemical databases, but I will save this for later in the week. For now I want to highlight some of the issues to be careful of. Tonight’s focus is “structural depictions”.

Achieving high-quality algorithmic 2D structural layout across large databases is difficult. All of the cheminformatics vendors have layout tools whereby a structure can be “cleaned” so that the layout on the page is visually appealing. If you have ever used any of the cleaning tools you might have discovered that they fail dismally with certain structures and you have to perform a layout manually. You might have discovered that when you clean the structure the stereocenters flip (something that should NEVER happen but, believe me, does!). However, there are some good tools available. OpenEye provide layout algorithms as part of their cheminformatics toolkit and it is certainly one I have experience of.

In any case, when creating a database of chemicals it makes good sense to use algorithms for layout rather than accept what is submitted. PubChem and ChemSpider, with tens of millions of structures, have to use algorithmic cleaning, but if you have a small database it won’t take long to visually inspect it and spot errors very quickly. There are many examples that SHOULD have been caught with the recent NCGC HTS screening set.

The worst one of these that I found while simply browsing was that associated with “Silidianin” as shown below.

The "0D" Structure of Silidianin in the NPC Browser

Here we see the structure of a bare hydroxyl group, but without a negative charge. However, the CAS number and various synonyms certainly don’t support the compound being a hydroxyl group. Indeed, this is a “0D structure” with all xy coordinates set to 0,0, and if the structure is cleaned in a drawing package then you see the structure shown on the left below. While it is the connection table for Silidianin, it does not have the appropriate stereochemistry for Silidianin encoded into the structure with wedge and dashed-wedge bonds. The structure on the right has that shown.
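
A 0D structure of this kind is cheap to detect automatically, because every atom in the molfile sits at the origin. The sketch below builds two tiny V2000 molblocks by hand and reads the fixed-width coordinate columns back. The helper names are my own, the molblocks are bond-free toys rather than real database records, and it is offered as an illustration, not as the check the NPC team uses.

```python
def format_atom_line(x, y, z, symbol):
    """Build a V2000 atom line: three 10-character coordinates, then the symbol."""
    return f"{x:10.4f}{y:10.4f}{z:10.4f} {symbol:<3} 0  0  0  0  0  0  0  0  0  0  0  0"


def make_molblock(atoms):
    """Assemble a minimal, bond-free V2000 molblock from (x, y, z, symbol) tuples."""
    header = ["example", "  hand-made", ""]
    counts = f"{len(atoms):3d}{0:3d}  0  0  0  0  0  0  0  0999 V2000"
    atom_lines = [format_atom_line(*atom) for atom in atoms]
    return "\n".join(header + [counts] + atom_lines + ["M  END"])


def atom_coordinates(molblock):
    """Read (x, y, z) for every atom using the fixed-width V2000 columns."""
    lines = molblock.splitlines()
    n_atoms = int(lines[3][0:3])  # the counts line is the fourth line
    return [(float(line[0:10]), float(line[10:20]), float(line[20:30]))
            for line in lines[4:4 + n_atoms]]


def is_zero_dimensional(molblock):
    """True when there is more than one atom and all of them sit at the origin."""
    coords = atom_coordinates(molblock)
    return len(coords) > 1 and all(c == (0.0, 0.0, 0.0) for c in coords)


stacked = make_molblock([(0, 0, 0, "C"), (0, 0, 0, "O")])
laid_out = make_molblock([(0, 0, 0, "C"), (1.5, 0, 0, "O")])
print(is_zero_dimensional(stacked), is_zero_dimensional(laid_out))
```

With all atoms stacked on one point a viewer draws only a single atom label, which would explain why a multi-atom record can show up looking like a lone hydroxyl group.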

Silidianin: Left Structure is Cleaned and Right Structure is with Stereocenters Added

I think you would agree it is more aesthetically pleasing and does communicate the bridged nature of the compound and carries the stereochemical information. However, there is something wrong with even this picture. Can anyone say what’s wrong?

Accurate structural representation in any database takes time, effort and, often, a skilled and careful eye to get right. Clearly the 0D structure is simply wrong and should have been caught. There are other offenders in the database that should have been caught also as shown below. There are LOTS more.

Various Examples of Structures Requiring Layout Improvements

Building high quality chemical databases certainly is tricky…and can be very time-consuming to solve all of these issues.

What is a Drug? Data Quality in the NCGC Pharmaceutical Collection Browser Part 2

My initial post on the data quality in the NCGC Pharmaceutical Collection Browser (the NPC Browser) drew some interesting comments. I have continued to review the data and will post various aspects of this to be digested. Because I want to provide feedback to the hosts of the database, and am committed to helping with review of the data, I want to determine whether the chemical structure provided is consistent with the descriptors associated with the name. In this case by descriptor I mean the chemical name(s) and the CAS number(s) associated with a particular drug. In a number of cases, however, it is difficult to understand which drug a record is meant to represent. If the database is to be curated, some clarity is needed around what a particular record is meant to be curated against. For example, if we were given the synonym aspirin then we would know that the associated chemical structure is meant to be aspirin. We could also validate the CAS numbers provided for that chemical. In the case of Picobenzide below, this is easy as I am looking for consistency between resources of the chemical structure as shown, the CAS number and the list of chemical names. There is certainly enough information online to suggest that Picobenzide is a drug.
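
One part of validating CAS numbers can be done without any lookup at all: the last digit of a CAS Registry Number is a check digit computed from the digits before it. The sketch below applies that rule. The function name is my own, and the first two numbers tested are taken from the CAS lists quoted in this post.

```python
import re

# Two to seven digits, two digits, then a single check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_check_digit_ok(cas_number):
    """Return True when the final digit matches the CAS checksum rule."""
    match = CAS_PATTERN.match(cas_number.strip())
    if not match:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the body digits 1, 2, 3, ... starting from the rightmost digit.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == check_digit


for cas in ["124-04-9", "9004-32-4", "124-04-8"]:
    print(cas, cas_check_digit_ok(cas))
```

A passing check digit only shows that the number is well formed. It says nothing about whether the number belongs with the structure or the names in the record, which is the harder question here.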

However, in some cases the synonyms and associated CAS numbers are meshed together from multiple sources and it is difficult to distinguish what the chemical is meant to be. For example, see the two records below.

1) “Gulose”

The structure of a sugar and the long list of associated names and CAS Numbers

A search on Gulose brings up the record above with the list of synonyms and CAS numbers below.

Synonyms: C.m.c.; Aquaplast; Gulose; Sodium carboxymethylcellulose; Idose, (l)-isomer; Polycell; Carmellose sodium; Carboxymethylcellulose, sodiu; Polymannose; Cellulose, carboxymethy; Croscarmelosa; Carboxymethylcellulose sodium; Carmellosum; Croscarmellose sodium; Who no. 4950; Idose, 14c-labeled, (alpha-d)-isomer; Ruspol; Allose, (l)-isomer; Cethylose; Croscarmellose; Celluvisc; Allose; Mannose homopolymer; Gulose, (d)-isomer; Cellolax; Carmelosa; Idose, (alpha-d)-isomer; Carboxymethylcellulose; Gulose, (l)-isomer; Orabase; Cellulose gum; Croscarmellosum; Idose; Poly(mannose); Idopyranose; Allose, (d)-isomer; Sodium, croscarmellos; Sodium, carboxymethylcellulos
CAS Numbers : 9000-11-7; 9004-32-4; 2152-76-3; 5934-56-5; 7282-82-8; 771-89-1; 30142-85-9; 81209-86-1; 2595-97-3; 7635-11-2; 6038-51-3; 4205-23-6; 6027-89-0; 19163-87-2; 1990-29-0; 10030-80-5; 15572-79-9; 23567-25-1; 2595-98-4; 1949-88-8; 5978-95-0; 68400-63-5; 62057-26-5; 25191-16-6; 37370-41-5; 117385-93-0; 12624-09-8; 198084-97-8; 37231-14-4; 37231-15-5; 50642-44-9; 54018-17-6; 55607-96-0; 73699-63-5; 80296-93-1; 82197-79-3; 9045-95-8; 9085-26-1; 177317-30-5; 191616-54-3; 196886-89-2; 204336-41-4; 56727-45-8; 3573-62-4; 56050-40-9; 28823-03-2; 815-92-9; 2535-38-8; 68951-61-1; 42396-95-2; 3615-68-7; 4005-41-8; 10326-73-5; 14049-06-0; 19030-38-7; 22348-49-8; 68784-18-9; 33417-97-9; 68784-15-6

Following any of the CAS Numbers out to PubChem gives a record that tends to have distinct stereochemistry. For example clicking on one of the CAS numbers takes us to this record on PubChem that has names including “Polymannose, Poly(mannose),Mannose homopolymer, D-Mannose polymers” and specific stereochemistry for D-Mannose. The associated molecule is the monomer. Similarly, on ChemSpider we do not host polymers at present so I understand why this would be. This record was sourced from ChemIDPlus which also doesn’t support polymers. By following the rules of data assembly the associated stereochemistry was lost. But many of the names collide also as we can see both the L- and D-isomers listed. So, it’s hard to confirm what the chemical structure should be as I don’t know what to validate against.
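
Name collisions like the L- and D-isomers above can be flagged mechanically before anyone looks at a structure. Here is a minimal sketch, assuming the synonyms follow the "Name, (d)-isomer" pattern seen in the list above; the regular expression and function name are my own, and the synonym list is a subset of the one quoted above.

```python
import re

# Matches descriptors such as "(d)-isomer", "(l)-isomer" and "(alpha-d)-isomer".
ISOMER_TAG = re.compile(r"^(.*?),\s*\((?:alpha-|beta-)?([dl])\)-isomer$",
                        re.IGNORECASE)


def conflicting_stereo_names(synonyms):
    """Return base names that appear with both D and L descriptors."""
    seen = {}
    for synonym in synonyms:
        match = ISOMER_TAG.match(synonym.strip())
        if match:
            base = match.group(1).strip().lower()
            seen.setdefault(base, set()).add(match.group(2).lower())
    return sorted(base for base, forms in seen.items() if forms == {"d", "l"})


SYNONYMS = [
    "Gulose", "Idose, (l)-isomer", "Allose, (l)-isomer", "Gulose, (d)-isomer",
    "Idose, (alpha-d)-isomer", "Gulose, (l)-isomer", "Allose, (d)-isomer",
    "Carmellose sodium",
]

print(conflicting_stereo_names(SYNONYMS))
```

Any record that comes back from a check like this carries names for both enantiomers, so a single structure with defined stereochemistry cannot be correct for all of its synonyms.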

This list is fairly long but is much longer for adipic acid as can be seen below. Read BELOW the list for more comments, and yes, I know it’s a long list.

2) Synonyms: Sodium hydrogen adipate-adipic acid; Camin ap; Piperazine adipinate; Piperazine adipate; Amphetamine adipate; Adipic acid; Adipate-adipic acid; Poly(propylene adipate)
CAS Numbers: 124-04-9; 142-88-1; 22322-28-7; 25666-61-9; 3385-41-9; 40975-75-5; 51137-10-1; 5683-79-4; 7486-38-6; 134886-82-1; 68258-78-6; 67905-77-5; 137315-44-7; 40959-29-3; 41366-44-3; 59518-84-2; 60865-39-6; 61630-89-5; 62271-81-2; 62548-83-8; 63410-51-5; 66787-20-0; 67859-52-3; 25101-03-5; 41366-45-4; 42767-90-8; 50981-28-7; 55157-42-1; 52496-38-5; 53123-38-9; 53989-20-1; 54335-11-4; 54688-53-8; 55012-14-1; 127195-46-4; 161865-24-3; 162281-11-0; 55398-96-4; 56551-71-4; 51178-67-7; 73817-40-0; 9046-11-1; 68411-87-0; 68368-50-3; 67989-20-2; 67953-53-1; 60608-99-3; 141490-13-3; 221695-34-7; 60961-73-1; 92680-63-2; 180324-85-0; 61256-56-2; 61680-38-4; 61840-27-5; 62118-43-8; 62197-02-8; 62942-09-0; 64365-96-4; 65970-51-6; 29534-39-2; 66525-94-8; 67784-94-5; 67939-68-8; 67989-19-9; 19147-16-1; 68937-27-9; 56509-15-0; 74350-54-2; 52627-55-1; 68894-40-6; 73018-29-8; 21697-94-9; 93203-03-3; 15511-81-6; 16031-83-7; 19628-28-5; 19628-29-6; 23311-84-4; 7486-39-7; 24938-37-2; 89468-83-7; 105866-32-8; 25103-87-1; 73816-44-1; 31699-72-6; 31699-74-8; 13425-34-8; 764-65-8; 160886-56-6; 3323-53-3; 94289-34-6; 93505-75-0; 60368-40-3; 41222-49-5; 42133-47-1; 42603-22-5; 49792-84-9; 64927-24-8; 50327-77-0; 51601-35-5; 51912-17-5; 52235-79-7; 52349-42-5; 7486-40-0; 19584-53-3; 40798-45-6; 40989-36-4; 52656-15-2; 52738-38-2; 53351-10-3; 55231-26-0; 58891-19-3; 55447-58-0; 56816-51-4; 73018-30-1; 68238-77-7; 72270-78-1; 159309-70-3; 64091-34-5; 233661-81-9; 55636-50-5; 56266-32-1; 35919-04-1; 67989-13-3; 68389-68-4; 68527-44-6; 68583-79-9; 68583-87-9; 69011-30-9; 69331-29-9; 70775-82-5; 94167-19-8; 26702-48-7; 26780-60-9; 26876-10-8; 103439-11-8; 27925-07-1; 29403-67-6; 29408-39-7; 30110-00-0; 87397-36-2; 179809-38-2; 30376-45-5; 32505-78-5; 163205-75-2; 195889-46-4; 24937-93-7; 86438-03-1; 63149-70-2; 31698-46-1; 112651-27-1; 25931-01-5; 37129-62-7; 39389-41-8; 53302-95-7; 119471-35-1; 126879-92-3; 26375-23-5; 50938-99-3; 65916-90-7; 76199-80-9; 79230-10-7; 103842-92-8; 
106097-11-4; 118817-20-2; 26570-73-0; 66167-60-0; 72993-61-4; 73070-76-5; 97621-66-4; 5423-61-0; 9011-80-7; 167856-49-7; 9017-08-7; 11116-57-7; 151486-17-8; 37324-51-9; 51281-06-2; 53241-28-4; 64927-22-6; 64972-64-1; 9019-92-5; 9052-53-3; 42610-80-0; 9019-93-6; 9019-94-7; 52213-54-4; 9063-78-9; 39277-72-0; 9068-94-4; 9068-96-6; 37277-51-3; 85138-64-3; 9044-95-5; 9080-04-0; 18621-94-8; 19090-60-9; 100359-19-1; 11139-74-5; 149984-55-4; 157971-18-1; 178252-44-3; 24993-04-2; 27030-82-6; 51248-46-5; 51555-86-3; 71119-52-3; 72506-60-6; 9049-00-7; 9049-01-8; 25212-06-0; 511272-89-2; 25212-19-5; 37280-34-5; 51329-77-2; 52932-31-7; 72246-34-5; 74504-46-4; 9036-70-8; 110737-13-8; 25214-14-6; 82785-46-4; 156014-73-2; 25214-18-0; 25748-37-2; 333388-26-4; 25950-35-0; 26140-99-8; 26141-00-4; 26523-14-8; 9036-87-7; 32732-51-7; 33338-25-9; 34012-85-6; 35164-40-0; 36089-13-1; 37310-98-8; 52504-11-7; 66593-97-3; 39281-13-5; 55231-08-8; 28132-94-7; 83890-02-2; 202974-01-4; 28209-35-0; 28301-90-8; 63623-33-6; 63623-34-7; 247906-35-0; 28407-73-0; 30525-45-2; 30580-35-9; 31587-43-6; 76649-35-9; 76649-45-1; 58481-42-8; 63549-52-0; 85169-08-0; 68298-57-7; 68212-31-7; 68140-61-4; 68133-07-3; 68855-39-0; 68954-46-1; 68956-51-4; 68937-26-8; 68989-90-2; 65916-86-1; 27083-55-2; 53526-58-2; 64873-15-0; 76199-81-0; 79921-25-8; 9087-79-0; 32238-28-1; 133544-04-4; 37208-77-8; 68212-32-8; 85646-06-6; 9017-16-7; 97649-50-8; 150747-01-6; 25053-13-8; 73561-43-0; 74083-22-0; 12619-99-7; 12688-24-3; 25191-90-6; 27082-56-0; 37228-90-3; 37275-97-1; 39470-93-4; 51258-14-1; 52228-27-0; 52350-25-1; 52932-19-1; 55777-57-6; 62253-12-7; 70213-58-0; 9038-28-2; 9048-01-5; 9050-55-9; 25464-21-5; 25950-34-9; 26282-28-0; 127004-49-3; 26777-62-8; 39385-68-7; 51822-29-8; 60605-01-8; 61673-82-3; 65357-52-0; 68859-52-9; 83712-77-0; 102561-56-8; 26936-72-1; 27417-33-0; 28430-17-3; 212271-21-1; 28472-89-1; 28477-54-5; 29295-79-2; 30662-91-0; 31048-26-7; 152103-09-8; 31075-20-4; 34313-71-8; 34557-94-3; 67892-88-0; 
35561-07-0; 38702-16-8; 38702-18-0; 40471-09-8; 74748-99-5; 50821-59-5; 112310-22-2; 51293-82-4; 85441-42-5; 51365-12-9; 83740-03-8; 52004-58-7; 52247-59-3; 139989-39-2; 52270-22-1; 53184-55-7

The “adipate” appears to be the counterion in some cases (amphetamine adipate) and, based on the definition in Wikipedia, comes along for the ride in most cases as a coating. It is not a drug per se. I am not sure whether this should be removed from the list for screening, as I could see the value in screening such a common chemical, but I don’t know if it would be an ME based on the definitions of the paper: “the term drug refers to a molecular entity (ME) that interacts with one or more molecular targets and effects a change in biological state.”

“Adipic acid has been incorporated into controlled-release formulation matrix tablets to obtain pH-independent release for both weakly basic and weakly acidic drugs. It has also been incorporated into the polymeric coating of hydrophilic monolithic systems to modulate the intragel pH, resulting in zero-order release of a hydrophilic drug. The disintegration at intestinal pH of the enteric polymer shellac has been reported to improve when adipic acid was used as a pore-forming agent without affecting release in the acidic media. Other controlled-release formulations have included adipic acid with the intention of obtaining a late-burst release profile”

The NPC Browser Result for Adipic Acid. Notice the label (there is no sodium associated compound) but this is due to the challenges of aggregating synonyms.

The NCGC have taken on an enormous challenge aggregating these data and will need our feedback on the system moving forward as they intend to improve it based on crowdsourced feedback. Please don’t let them down. If you see an issue in the data use the feedback box!

Reviewing Data Quality in the NCGC Pharmaceutical Collection Browser

I wasn’t aware of the NCGC Pharmaceutical Collection Browser until today. The work behind the development of the database and the browser is discussed in the publication here:

R. Huang, N. Southall, Y. Wang, A. Yasgar, P. Shinn, A. Jadhav, D.-T. Nguyen, C. P. Austin. The NCGC Pharmaceutical Collection: A Comprehensive Resource of Clinically Approved Drugs Enabling Repurposing and Chemical Genomics. Science Translational Medicine, 2011; 3 (80): 80ps16 DOI: 10.1126/scitranslmed.3001862

As is usual with new databases that come online I always concern myself with data quality. In order to take a look at the data quality I looked at the HTS amenable compounds subset of data. It’s a dataset of >7600 compounds. I ran a couple of very simple filters to try and identify potential issues with the data. In particular I was looking for presence/absence/confusions in stereochemistry. The filters also checked for valency issues and charge imbalance. Based on these checks my estimates are, for the HTS amenable compounds at least, the errors in the data amount to a minimum of 5% and probably over 10%. This is an estimate of course and it would be a lot of work to clean it all up. I’ll try and take a look at the entire database shortly.
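
To give a flavor of what such a filter looks like, here is a minimal sketch of a charge-balance check. It is not the filter I ran: the records are hypothetical, each is written as a list of (element, formal charge) pairs with hydrogens left out, and the function names are made up for the example.

```python
def net_charge(atoms):
    """Sum the formal charges of a record's atoms, given as (symbol, charge) pairs."""
    return sum(charge for _symbol, charge in atoms)


def flag_charge_imbalance(records):
    """Return the names of records whose formal charges do not sum to zero."""
    return [name for name, atoms in records.items() if net_charge(atoms) != 0]


# Hypothetical records, not entries from the NCGC dataset.
RECORDS = {
    "sodium chloride": [("Na", +1), ("Cl", -1)],
    "sodium, counterion missing": [("Na", +1)],
    "acetate, counterion missing": [("C", 0), ("C", 0), ("O", 0), ("O", -1)],
}

print(flag_charge_imbalance(RECORDS))
```

A non-zero net charge is not automatically an error, since some records are deliberately ions, so a filter like this produces a list for manual review rather than a list of confirmed mistakes.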

Some examples of the errors I saw are below. Unfortunately there are many hundreds of errors in just this subset of the database. We keep creating databases, and in this case a 90 Mbyte desktop browser solution, but WHO is curating and checking the data? How does the money that keeps being invested in developing software compare with what is spent on building quality datasets to use in the various systems? And so it continues….

Charge Balance Issues

NCGC Browser Charge Balance Issues - Screenshots from Browser

Imperfect, Absent and Incorrect Stereochemistry

Stereo Issues: Left Hand Side NCGC Structures, Right Hand Side "Correct Structures"

Incorrect Valence Issues

Valence Issues for "Tannic Acid Glycerite" - Screenshot from NCGC Browser

And, just to clarify…I am not saying that our own database, ChemSpider, is perfect. It’s not. But the crowds can help us improve it and curate the data online and immediately. One thing I DO like is that the developers thought ahead about getting immediate feedback as shown below. Unfortunately when I tried to use it it threw a message that there was an error so I don’t know whether the message got through. I hope to get a response at my email address.

Feedback Screen in the NPC Browser


Posted on April 28, 2011 in Data Quality, Quality and Content
