Category Archives: NPC Browser and NCGC Collection

Rabbits, Potatoes and other Vegetables in the NCGC Database

No wonder I’m so healthy…I like rabbits, eat lots of potatoes, drink “green sludge in a bottle” to get my vegetables and also have lots of other “ingredients” in my diet. According to the NPC Browser on my desktop, all of these are entries in the NCGC data collection. While I was chatting with a number of pharmaceutical scientists last week about data quality in public domain databases, the NPC Browser was used as an example of data content “to review”. If you’d like to review the contents yourself you will find many issues regarding stereochemistry, valency, charge balance and incorrect associations between structures and identifiers. You can also review the data in table format, look at the content that doesn’t have structures associated, and scratch your head at some of it. To see the errors go to tabular mode as shown below and scroll down past record 8000.

How to display the tabular format for the NCGC Data in the NPC Browser

Notice that the drugs “water Lily”, “Water Cress” and “Water Hemlock” are listed. I wonder what water cress is used for? It all becomes much more fun when you see some of the others listed below. Rabbit…now that’s a good…take two in the morning, without food, and repeat dosage for 7 days. Cures “big sharp pointy teeth”. Vegetables…ah yes, nothing specific. Just “vegetables”. Kind of a cure-all really. Take 5 portions per day, with food, obviously. And potatoes…good for a stiff neck (starch collar syndrome). If you browse through you’ll also find “ingredients” listed as a compound. Glad about that really…most drugs have ingredients. I am sure there is a reason that these are listed, though I cannot imagine what it would be. If there is no good reason, it is time to decommission this dataset until it is cleaned up in a major way. Clearly the contents are suspect at best.

A Rabbit in the NCGC Collection - I pity the rabbit during high-throughput screening

Potatoes - the drug of choice for many McDonald's visitors. Fry-style


Momma was right - eat your vegetables and you'll be healthy. They are drugs?


Confusing Search Results in the NPC Browser

For the past few days I have been in Prague at the PharmSciFair meeting. Beautiful city (with a little too much graffiti on the glorious architecture), good meeting and way too long a flight. I am traveling alone and, despite the networking, the disrupted sleep patterns have me working on various projects as I struggle to get more than 4 hours of sleep a night. The joys of sleep deprivation include productivity! After a break from the NPC Browser and my earlier commentaries, I updated the browser when I restarted it last night, as Sean’s Collabchem blog has been keeping me informed of new changes. Sean had told me in two separate posts (X, Y) about the new disclaimer, which states:

We are very much aware of a number of issues with the database and software. Please bear with us as we work meticulously to resolve them. Below are some of the specific issues we are addressing in the next update (which is likely to be version 1.1.0) of the software in a couple of weeks:

  • A new database release that incorporates a number of improvements to compound records (e.g., stereochemistry annotations, 2D layout, etc.).
  • There exists an indicator associated with each compound record to notify whether the record has or has not been manually curated. If the record has been curated, curators are acknowledged with proper attribution.
  • A simple curation mechanism is integrated with each compound record so as to facilitate error reporting at finer resolutions (e.g., on an attribute basis).
  • Access to all curated records can be done through a single filter mechanism.
  • About ~2,000 QC LC/MS spectra are available for the “NPC screening” subset, courtesy of NCGC’s Bill Leister and the Analytical Chemistry group.
  • Other minor bug fixes and enhancements.

I started to poke around looking for the new features and discuss some below. Some definite moves in the right direction! Like it. I’ll comment on them in a separate post.

As I played more with the system and tried some new searches I became very concerned about the results I received. For example, a search on Taxol gives TWO answers. One is Taxol and the other is Ixabepilone, see below. This is weird. They are NOT the same structure at all, so why would a search on one compound name bring back a separate “drug”, when access to individual drugs is really what the browser is supposed to provide? The original paper reports “the creation of a definitive, complete, and non-redundant list of all approved molecular entities”. Certainly the two compounds are non-redundant (compare Taxol, and Ixabepilone). My first thought was that the search is looking through associated information that has been attached to the compound somehow. I found the match under the therapeutic tab where it says…“Like taxol, Ixabepilone binds to the αβ-tubulin heterodimer subunit.” If the matching is that subtle it will likely give rise to some very interesting challenges (see below). It would mean that I might search for a drug and ANY mention of that drug will retrieve hits. I expect that most people would expect to retrieve the drug itself, not every mention of that drug. Maybe I’m wrong.
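To make the distinction concrete, here is a minimal sketch of the two search behaviours I am contrasting. I have no knowledge of how the NPC Browser search is actually implemented; the `Record` class, its field names and both search functions are hypothetical, and the only record text used is the therapeutic note quoted above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Record:
    """Hypothetical compound record; the field names are assumptions."""
    name: str
    synonyms: List[str] = field(default_factory=list)
    notes: str = ""  # free text, e.g. a therapeutic description


def search_any_field(records, term):
    """Return names of records where the term appears anywhere, notes included."""
    term = term.lower()
    return [
        r.name for r in records
        if term in r.name.lower()
        or any(term in s.lower() for s in r.synonyms)
        or term in r.notes.lower()
    ]


def search_names_only(records, term):
    """Return names of records whose name or a synonym equals the term."""
    term = term.lower()
    return [
        r.name for r in records
        if term == r.name.lower()
        or any(term == s.lower() for s in r.synonyms)
    ]


records = [
    Record("Taxol", synonyms=["Paclitaxel"]),
    Record(
        "Ixabepilone",
        notes="Like taxol, Ixabepilone binds to the αβ-tubulin heterodimer subunit.",
    ),
]

print(search_any_field(records, "Taxol"))   # ['Taxol', 'Ixabepilone']
print(search_names_only(records, "Taxol"))  # ['Taxol']
```

The any-field search reproduces the behaviour I saw, and the names-only search is what I would have expected by default.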


To take this to an extreme let’s search for “Manganese” and see what we get in the NPC Browser…for one we see elemental manganese as the “compound” but associated with the label for the ionic form.

but also a LOT of organic compounds….one shown below. This is not an inherently obvious result but maybe exactly as it’s expected to be.

Of the 17 compounds retrieved with a search on Manganese only 5 actually have manganese in the formula. Do you find these results confusing? I would expect a synonym-only search to retrieve just Mn++ (and maybe Mn if that is distinguished from Mn++ as a drug).
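A check like "does the hit actually contain manganese?" is easy to automate once you have a molecular formula for each hit. The sketch below is only an illustration: the function names are mine, and the three hits are made-up examples with well-known formulas, not the actual 17 results from the browser. It only tests whether an element symbol is present, so it ignores counts and brackets.

```python
import re

# One element symbol (capital letter plus optional lowercase letter) and its count.
ELEMENT_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def elements_in_formula(formula):
    """Return the set of element symbols that appear in a simple formula string."""
    return {symbol for symbol, _count in ELEMENT_TOKEN.findall(formula)}


def hits_containing(hits, element):
    """Keep only the (name, formula) hits whose formula contains the element."""
    return [name for name, formula in hits
            if element in elements_in_formula(formula)]


# Illustrative hits only; NOT the real search results.
hits = [
    ("Manganese chloride", "MnCl2"),
    ("Aspirin", "C9H8O4"),
    ("Potassium permanganate", "KMnO4"),
]

print(hits_containing(hits, "Mn"))
```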



Review of NCGC Dataset in the NPC Browser Finished

For the past couple of weeks I have been looking at the NPC Browser and the dataset contained within it. I am using it as an example of the type of data that is finding its way into the public domain for use by life scientists. I had the “opportunity” to take a couple of LONG flights to and from Europe last week, plus late nights in hotels. During the trip I finished my review of the data. This does NOT mean that I have a fully curated dataset…no chance. That would take a few weeks to assemble! However, it is enough data to insert some of the conclusions into a paper that has just returned from review as well as provide data for a paper presently being assembled. With that said I’m unlikely to report much more on the data until that paper is through review.

What I can say is that the dataset does not seem to align with a lot of the comments in the original paper listed below.

R. Huang, N. Southall, Y. Wang, A. Yasgar, P. Shinn, A. Jadhav, D.-T. Nguyen, C. P. Austin. The NCGC Pharmaceutical Collection: A Comprehensive Resource of Clinically Approved Drugs Enabling Repurposing and Chemical Genomics. Science Translational Medicine, 2011; 3 (80): 80ps16 DOI: 10.1126/scitranslmed.3001862

The data has hardly been curated and many of the suggested heuristics applied to the assembly of the dataset failed, based on what came through in the set that was issued. One of my favorite “drugs” in the screening set is shown below. I doubt Mn2+ is easily marketed as a drug, and having Mn2+ labeled as Selenium oxide, cadmium salt (1:1) seems a little strange. Having it labeled as Strontium tetraborate or barium tetraborate seems just as weird. This is one of many…many others will be discussed in a publication presently in development. Watch this space.

One of the "drugs" from the HTS Screening Set in the NPC Browser




Data Quality in the NCGC Pharmaceutical Collection Browser Part 4

I am now back in the US after a week in Europe and late at night, with a disrupted sleep pattern and looking for something to soothe me to sleep I have been wandering through the rest of the NPC Browser data set [1,2,3] looking for more patterns in the data to see if I can come up with some general advice and cautions about how to build datasets of chemistry. I already have my “conclusion” in my head as to the best advice I can give to any government organization, and others, who are trying to build chemical databases but I will save this for later in the week. For now I want to highlight some of the issues to be careful of. Tonight’s focus is “structural depictions”.

Achieving high quality algorithmic 2D structural layout across large databases is difficult. All of the cheminformatics vendors have layout tools whereby a structure can be “cleaned” so that the layout on the page is visually appealing. If you have ever used any of the cleaning tools you might have discovered that they fail dismally with certain structures and you have to perform a layout manually. You might also have discovered that when you clean the structure the stereocenters flip (something that should NEVER happen but, believe me, does!). However, there are some good tools available. OpenEye provide layout algorithms as part of their cheminformatics toolkit and it is certainly one I have experience of.

In any case, when creating a database of chemicals it makes good sense to use algorithms for layout rather than accept what is submitted. PubChem and ChemSpider, with tens of millions of structures, have to use algorithmic cleaning, but if you have a small database it won’t take long to inspect it visually and spot errors. There are many examples that SHOULD have been caught in the recent NCGC HTS screening set.

The worst one of these that I found while simply browsing was that associated with “Silidianin” as shown below.

The "0D" Structure of Silidianin in the NPC Browser

Here we see the structure of a bare hydroxyl group, but without a negative charge. However, the CAS number and various synonyms certainly don’t support the compound being a hydroxyl group. Indeed, this is a “0D structure” with all xy coordinates set to 0,0 and if the structure is cleaned in a drawing package then you see the structure shown on the left below. While it is the connection table for Silidianin it does not have the appropriate stereochemistry for Silidianin encoded into the structure with wedge and dashed-wedge bonds. The structure on the right has that shown.

Silidianin: Left Structure is Cleaned and Right Structure is with Stereocenters Added

I think you would agree it is more aesthetically pleasing and does communicate the bridged nature of the compound and carries the stereochemical information. However, there is something wrong with even this picture. Can anyone say what’s wrong?

Accurate structural representation in any database takes time, effort and, often, a skilled and careful eye to get right. Clearly the 0D structure is simply wrong and should have been caught. There are other offenders in the database that should have been caught also as shown below. There are LOTS more.

Various Examples of Structures Requiring Layout Improvements
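A 0D structure like the Silidianin record is one of the few layout problems that can be caught without a human eye. Here is a minimal sketch, assuming the structures are available as V2000 molfiles (I do not know what format the NPC Browser stores internally). The function names and the two tiny two-atom test molfiles are mine, written only to exercise the check.

```python
def atom_coordinates(molblock):
    """Read (x, y, z) for each atom from the atom block of a V2000 molfile."""
    lines = molblock.splitlines()
    n_atoms = int(lines[3][0:3])  # counts line: atom count is in columns 1-3
    coords = []
    for line in lines[4:4 + n_atoms]:
        # x, y and z occupy three fixed-width fields of 10 characters each
        x, y, z = (float(line[i:i + 10]) for i in (0, 10, 20))
        coords.append((x, y, z))
    return coords


def is_zero_d(molblock):
    """True if there are several atoms and they all share one position."""
    coords = atom_coordinates(molblock)
    return len(coords) > 1 and len(set(coords)) == 1


def two_atom_molfile(second_atom_line):
    """Build a tiny illustrative C-O molfile; only the second atom line varies."""
    return "\n".join([
        "example",
        "",
        "",
        "  2  1  0  0  0  0  0  0  0  0999 V2000",
        "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
        second_atom_line,
        "  1  2  1  0",
        "M  END",
    ])


ZERO_D = two_atom_molfile(
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0")
LAID_OUT = two_atom_molfile(
    "    1.2990    0.7500    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0")

print(is_zero_d(ZERO_D), is_zero_d(LAID_OUT))
```

Running something like this across a small database would flag the 0D records for cleaning. It would not catch the subtler layout and stereochemistry problems, which still need a careful eye.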

Building high quality chemical databases certainly is tricky…and can be very time-consuming to solve all of these issues.



Support for Common Compounds in the NPC Browser. Data Quality Part 3

Time for bed but a simple observation. There is no reason that potassium dichromate cannot be accurately represented in a database. We do it on ChemSpider with no issue. But for some reason it equates to Cr(I) hydroxide in the NPC Browser? Ummm….nope. Definitely not.


The structure for potassium dichromate is not Cr(I) hydroxide


What is a Drug? Data Quality in the NCGC Pharmaceutical Collection Browser Part 2

My initial post on the data quality in the NCGC Pharmaceutical Collection Browser (the NPC Browser) drew some interesting comments. I have continued to review the data and will post various aspects of this to be digested. Because I want to provide feedback to the hosts of the database, and am committed to helping with review of the data, what I want to do is determine whether the chemical structure provided is consistent with the descriptors associated with the name. By descriptor I mean the chemical name(s) and the CAS number(s) associated with a particular drug. In a number of cases, however, it is difficult to understand which drug a record is meant to represent. If the database is to be curated, some clarity is necessary around what a particular record is meant to be curated against. For example, if we were given the synonym aspirin then we would know that the associated chemical structure is meant to be aspirin. We could also validate the CAS numbers provided for that chemical. In the case of Picobenzide below, this is easy as I am looking for consistency, across resources, between the chemical structure as shown, the CAS number and the list of chemical names. There is certainly enough information online to suggest that Picobenzide is a drug.


However, in some cases the synonyms and associated CAS numbers are meshed together from multiple sources and it is difficult to distinguish what the chemical is meant to be. For example, see the two records below.

1) “Gulose”

The structure of a sugar and the long list of associated names and CAS Numbers

A search on Gulose brings up the record above with the list of synonyms and CAS numbers below.

Synonyms: C.m.c.; Aquaplast; Gulose; Sodium carboxymethylcellulose; Idose, (l)-isomer; Polycell; Carmellose sodium; Carboxymethylcellulose, sodiu; Polymannose; Cellulose, carboxymethy; Croscarmelosa; Carboxymethylcellulose sodium; Carmellosum; Croscarmellose sodium; Who no. 4950; Idose, 14c-labeled, (alpha-d)-isomer; Ruspol; Allose, (l)-isomer; Cethylose; Croscarmellose; Celluvisc; Allose; Mannose homopolymer; Gulose, (d)-isomer; Cellolax; Carmelosa; Idose, (alpha-d)-isomer; Carboxymethylcellulose; Gulose, (l)-isomer; Orabase; Cellulose gum; Croscarmellosum; Idose; Poly(mannose); Idopyranose; Allose, (d)-isomer; Sodium, croscarmellos; Sodium, carboxymethylcellulos
CAS Numbers : 9000-11-7; 9004-32-4; 2152-76-3; 5934-56-5; 7282-82-8; 771-89-1; 30142-85-9; 81209-86-1; 2595-97-3; 7635-11-2; 6038-51-3; 4205-23-6; 6027-89-0; 19163-87-2; 1990-29-0; 10030-80-5; 15572-79-9; 23567-25-1; 2595-98-4; 1949-88-8; 5978-95-0; 68400-63-5; 62057-26-5; 25191-16-6; 37370-41-5; 117385-93-0; 12624-09-8; 198084-97-8; 37231-14-4; 37231-15-5; 50642-44-9; 54018-17-6; 55607-96-0; 73699-63-5; 80296-93-1; 82197-79-3; 9045-95-8; 9085-26-1; 177317-30-5; 191616-54-3; 196886-89-2; 204336-41-4; 56727-45-8; 3573-62-4; 56050-40-9; 28823-03-2; 815-92-9; 2535-38-8; 68951-61-1; 42396-95-2; 3615-68-7; 4005-41-8; 10326-73-5; 14049-06-0; 19030-38-7; 22348-49-8; 68784-18-9; 33417-97-9; 68784-15-6

Following any of the CAS Numbers out to PubChem gives a record that tends to have distinct stereochemistry. For example clicking on one of the CAS numbers takes us to this record on PubChem that has names including “Polymannose, Poly(mannose),Mannose homopolymer, D-Mannose polymers” and specific stereochemistry for D-Mannose. The associated molecule is the monomer. Similarly, on ChemSpider we do not host polymers at present so I understand why this would be. This record was sourced from ChemIDPlus which also doesn’t support polymers. By following the rules of data assembly the associated stereochemistry was lost. But many of the names collide also as we can see both the L- and D-isomers listed. So, it’s hard to confirm what the chemical structure should be as I don’t know what to validate against.
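The L-/D- collision can be detected mechanically from the synonym list itself. This is a rough sketch rather than anything the NCGC uses: the regular expression is an assumption that only recognises names written in the "Name, (d)-isomer" style seen in this record, and the function name is mine. The synonyms are copied from the Gulose record above.

```python
import re

# Matches e.g. "Gulose, (d)-isomer"; deliberately ignores "(alpha-d)" variants.
ISOMER_TAG = re.compile(
    r"^(?P<base>[^,]+),.*\((?P<config>[dl])\)-isomer$", re.IGNORECASE)


def conflicting_isomers(synonyms):
    """Return base names that appear with BOTH the d and the l label."""
    configs = {}
    for synonym in synonyms:
        match = ISOMER_TAG.match(synonym.strip())
        if match:
            base = match.group("base").lower()
            configs.setdefault(base, set()).add(match.group("config").lower())
    return sorted(base for base, seen in configs.items() if seen == {"d", "l"})


# A subset of the synonyms listed for the "Gulose" record.
gulose_synonyms = [
    "Gulose", "Idose, (l)-isomer", "Idose, 14c-labeled, (alpha-d)-isomer",
    "Allose, (l)-isomer", "Allose", "Gulose, (d)-isomer",
    "Idose, (alpha-d)-isomer", "Gulose, (l)-isomer", "Allose, (d)-isomer",
]

print(conflicting_isomers(gulose_synonyms))  # ['allose', 'gulose']
```

A record that trips this check cannot be validated against a single stereochemically defined structure, which is exactly the problem here.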

This list is fairly long but is much longer for adipic acid as can be seen below. Read BELOW the list for more comments..and yes, I know it’s a long list.

2) Synonyms: Sodium hydrogen adipate-adipic acid; Camin ap; Piperazine adipinate; Piperazine adipate; Amphetamine adipate; Adipic acid; Adipate-adipic acid; Poly(propylene adipate)
CAS Numbers: 124-04-9; 142-88-1; 22322-28-7; 25666-61-9; 3385-41-9; 40975-75-5; 51137-10-1; 5683-79-4; 7486-38-6; 134886-82-1; 68258-78-6; 67905-77-5; 137315-44-7; 40959-29-3; 41366-44-3; 59518-84-2; 60865-39-6; 61630-89-5; 62271-81-2; 62548-83-8; 63410-51-5; 66787-20-0; 67859-52-3; 25101-03-5; 41366-45-4; 42767-90-8; 50981-28-7; 55157-42-1; 52496-38-5; 53123-38-9; 53989-20-1; 54335-11-4; 54688-53-8; 55012-14-1; 127195-46-4; 161865-24-3; 162281-11-0; 55398-96-4; 56551-71-4; 51178-67-7; 73817-40-0; 9046-11-1; 68411-87-0; 68368-50-3; 67989-20-2; 67953-53-1; 60608-99-3; 141490-13-3; 221695-34-7; 60961-73-1; 92680-63-2; 180324-85-0; 61256-56-2; 61680-38-4; 61840-27-5; 62118-43-8; 62197-02-8; 62942-09-0; 64365-96-4; 65970-51-6; 29534-39-2; 66525-94-8; 67784-94-5; 67939-68-8; 67989-19-9; 19147-16-1; 68937-27-9; 56509-15-0; 74350-54-2; 52627-55-1; 68894-40-6; 73018-29-8; 21697-94-9; 93203-03-3; 15511-81-6; 16031-83-7; 19628-28-5; 19628-29-6; 23311-84-4; 7486-39-7; 24938-37-2; 89468-83-7; 105866-32-8; 25103-87-1; 73816-44-1; 31699-72-6; 31699-74-8; 13425-34-8; 764-65-8; 160886-56-6; 3323-53-3; 94289-34-6; 93505-75-0; 60368-40-3; 41222-49-5; 42133-47-1; 42603-22-5; 49792-84-9; 64927-24-8; 50327-77-0; 51601-35-5; 51912-17-5; 52235-79-7; 52349-42-5; 7486-40-0; 19584-53-3; 40798-45-6; 40989-36-4; 52656-15-2; 52738-38-2; 53351-10-3; 55231-26-0; 58891-19-3; 55447-58-0; 56816-51-4; 73018-30-1; 68238-77-7; 72270-78-1; 159309-70-3; 64091-34-5; 233661-81-9; 55636-50-5; 56266-32-1; 35919-04-1; 67989-13-3; 68389-68-4; 68527-44-6; 68583-79-9; 68583-87-9; 69011-30-9; 69331-29-9; 70775-82-5; 94167-19-8; 26702-48-7; 26780-60-9; 26876-10-8; 103439-11-8; 27925-07-1; 29403-67-6; 29408-39-7; 30110-00-0; 87397-36-2; 179809-38-2; 30376-45-5; 32505-78-5; 163205-75-2; 195889-46-4; 24937-93-7; 86438-03-1; 63149-70-2; 31698-46-1; 112651-27-1; 25931-01-5; 37129-62-7; 39389-41-8; 53302-95-7; 119471-35-1; 126879-92-3; 26375-23-5; 50938-99-3; 65916-90-7; 76199-80-9; 79230-10-7; 103842-92-8; 
106097-11-4; 118817-20-2; 26570-73-0; 66167-60-0; 72993-61-4; 73070-76-5; 97621-66-4; 5423-61-0; 9011-80-7; 167856-49-7; 9017-08-7; 11116-57-7; 151486-17-8; 37324-51-9; 51281-06-2; 53241-28-4; 64927-22-6; 64972-64-1; 9019-92-5; 9052-53-3; 42610-80-0; 9019-93-6; 9019-94-7; 52213-54-4; 9063-78-9; 39277-72-0; 9068-94-4; 9068-96-6; 37277-51-3; 85138-64-3; 9044-95-5; 9080-04-0; 18621-94-8; 19090-60-9; 100359-19-1; 11139-74-5; 149984-55-4; 157971-18-1; 178252-44-3; 24993-04-2; 27030-82-6; 51248-46-5; 51555-86-3; 71119-52-3; 72506-60-6; 9049-00-7; 9049-01-8; 25212-06-0; 511272-89-2; 25212-19-5; 37280-34-5; 51329-77-2; 52932-31-7; 72246-34-5; 74504-46-4; 9036-70-8; 110737-13-8; 25214-14-6; 82785-46-4; 156014-73-2; 25214-18-0; 25748-37-2; 333388-26-4; 25950-35-0; 26140-99-8; 26141-00-4; 26523-14-8; 9036-87-7; 32732-51-7; 33338-25-9; 34012-85-6; 35164-40-0; 36089-13-1; 37310-98-8; 52504-11-7; 66593-97-3; 39281-13-5; 55231-08-8; 28132-94-7; 83890-02-2; 202974-01-4; 28209-35-0; 28301-90-8; 63623-33-6; 63623-34-7; 247906-35-0; 28407-73-0; 30525-45-2; 30580-35-9; 31587-43-6; 76649-35-9; 76649-45-1; 58481-42-8; 63549-52-0; 85169-08-0; 68298-57-7; 68212-31-7; 68140-61-4; 68133-07-3; 68855-39-0; 68954-46-1; 68956-51-4; 68937-26-8; 68989-90-2; 65916-86-1; 27083-55-2; 53526-58-2; 64873-15-0; 76199-81-0; 79921-25-8; 9087-79-0; 32238-28-1; 133544-04-4; 37208-77-8; 68212-32-8; 85646-06-6; 9017-16-7; 97649-50-8; 150747-01-6; 25053-13-8; 73561-43-0; 74083-22-0; 12619-99-7; 12688-24-3; 25191-90-6; 27082-56-0; 37228-90-3; 37275-97-1; 39470-93-4; 51258-14-1; 52228-27-0; 52350-25-1; 52932-19-1; 55777-57-6; 62253-12-7; 70213-58-0; 9038-28-2; 9048-01-5; 9050-55-9; 25464-21-5; 25950-34-9; 26282-28-0; 127004-49-3; 26777-62-8; 39385-68-7; 51822-29-8; 60605-01-8; 61673-82-3; 65357-52-0; 68859-52-9; 83712-77-0; 102561-56-8; 26936-72-1; 27417-33-0; 28430-17-3; 212271-21-1; 28472-89-1; 28477-54-5; 29295-79-2; 30662-91-0; 31048-26-7; 152103-09-8; 31075-20-4; 34313-71-8; 34557-94-3; 67892-88-0; 
35561-07-0; 38702-16-8; 38702-18-0; 40471-09-8; 74748-99-5; 50821-59-5; 112310-22-2; 51293-82-4; 85441-42-5; 51365-12-9; 83740-03-8; 52004-58-7; 52247-59-3; 139989-39-2; 52270-22-1; 53184-55-7
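With lists this long, a first-pass sanity check is the CAS check digit. The last digit of a CAS Registry Number is the sum of the other digits, each multiplied by its position counted from the right, modulo 10. The sketch below implements that rule; the function name is mine. Note that passing the check only shows the number is well formed. It says nothing about whether the number belongs on this record, which is the real problem here.

```python
import re

# Two to seven digits, two digits, then a single check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas):
    """True if the string is shaped like a CAS number and its check digit is correct."""
    match = CAS_PATTERN.match(cas.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight 1 for the digit next to the check digit, 2 for the next one, and so on.
    total = sum(int(d) * weight
                for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == check_digit


# Both of these come from the lists above; 124-04-9 is adipic acid.
for cas in ["124-04-9", "9000-11-7", "124-04-8"]:
    print(cas, is_valid_cas(cas))
```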


The “adipate” appears to be the counterion in some cases (amphetamine adipate) and, based on the description in Wikipedia, comes along for the ride in most cases as a coating. It is not a drug per se. I am not sure whether this should be removed from the list for screening, as I could see the value in screening such a common chemical, but I don’t know if it would be an ME based on the definition in the paper: “the term drug refers to a molecular entity (ME) that interacts with one or more molecular targets and effects a change in biological state.”

“Adipic acid has been incorporated into controlled-release formulation matrix tablets to obtain pH-independent release for both weakly basic and weakly acidic drugs. It has also been incorporated into the polymeric coating of hydrophilic monolithic systems to modulate the intragel pH, resulting in zero-order release of a hydrophilic drug. The disintegration at intestinal pH of the enteric polymer shellac has been reported to improve when adipic acid was used as a pore-forming agent without affecting release in the acidic media. Other controlled-release formulations have included adipic acid with the intention of obtaining a late-burst release profile”

The NPC Browser Result for Adipic Acid. Notice the label (there is no sodium in the associated compound); this is due to the challenges of aggregating synonyms.

The NCGC have taken on an enormous challenge aggregating these data and will need our feedback on the system moving forward as they intend to improve it based on crowdsourced feedback. Please don’t let them down. If you see an issue in the data use the feedback box!