Search results for ‘NPC’

Can Chicken Feathers and American Cockroaches be pharmaceutically active?

I’ve been writing a lot about the NPC Browser and NCGC data collection during the past few weeks. Today I was chatting about the software and content with a fellow advocate of online data for chemistry and I was asked for examples of “ridiculous content” that he might be able to refer to. He’d already read some of my earlier posts. It’s worth considering what the NPC Browser is supposed to deliver.

From the website the data collection is defined as:

What is the NCGC Pharmaceutical Collection (NPC)?

“The NCGC Pharmaceutical Collection (NPC) is a comprehensive, publically-accessible collection of approved and investigational drugs for high-throughput screening that provides a valuable resource for both validating new models of disease and better understanding the molecular basis of disease pathology and intervention. The NPC has already generated several useful probes for studying a diverse cross section of biology, including novel targets and pathways. NCGC provides access to its set of approved drugs and bioactives through the Therapeutics for Rare and Neglected Diseases (TRND) program and as part of the compound collection for the Tox21 initiative, a collaborative effort for toxicity screening among several government agencies including the US Environmental Protection Agency (EPA), the National Toxicology Program (NTP), the US Food and Drugs Administration (FDA), and the NCGC. Of the nearly 2750 small molecular entities (MEs) that have been approved for clinical use by US (FDA), EU (EMA), Japanese (NHI), and Canadian (HC) authorities and that are amenable to HTS screening, we currently possess 2400 as part of our screening collection.”

Some very interesting items have found their way into the database, as mentioned previously. Sean Ekins also pointed out in a comment left on a blog post that “American Cockroach” was in the list. Really? Strangely enough….yes. See below.


There are many natural products that have become drugs…not so many insects though! Pop two under the tongue….likely to cause indigestion rather than cure it.

Other things included in the database are listed below…all as part of the NCGC pharmaceutical collection…bear bile??? Agh



Reengineering Translational Science

I am reposting a blog post from Sean Ekins regarding a recent communication we submitted….

“Recently, Francis Collins published a commentary in Science Translational Medicine entitled “Reengineering translational science: the time is right”, and I would encourage people to read it and tell me what you think, for no other reason than this is one of the most accomplished scientists of our time. Antony Williams and I wanted to respond to some of Dr. Collins’ comments. The journal has the ability to post an E-letter so we took this route and so far it has not been published. For that matter it seems no one has responded. Does this mean that everyone agrees with his comments? If there is truly to be a dialog on the topic surely this journal should publish E-letters it receives.

Here are our comments, which were submitted as an E-letter but not as yet published (we are restricted to 400 words):

Reengineering Translational Science: Is the NIH the right place for this?

Sean Ekins(1) and Antony J. Williams(2)

1 Collaborations in Chemistry, 5616 Hilltop Needmore Road, Fuquay-Varina, NC27526, U.S.A.

2 Royal Society of Chemistry, 904 Tamaras Circle, Wake Forest, NC27587, U.S.A.

We read with interest Dr. Collins’ Commentary on Reengineering Translational Science: The Time is Right. We agree that there is an urgent need to revamp the way that drugs are developed, bring them to market faster and provide incentives to generate treatments for neglected and rare diseases. We question however whether the NIH as it stands can adequately pursue these goals when an entire industry is struggling with the same challenges. We wish to raise several issues that this article promotes. Many of the techniques described will not, on their own, dramatically impact the process or be any more successful than combinatorial chemistry and high throughput screening were touted to be. Why would we want the taxpayer to follow the same route as big pharma? Surely the NIH is funded to a large extent to take more exploratory directions (1, 2), and to come up with next generation discoveries and approaches, not simply apply existing technologies and hope to be more successful? It would also be more convincing if there were co-authors on the paper from outside the NIH lauding the value of repeating approaches already in use. Collaboration is important and this commentary did not demonstrate how collaborations (with academic labs or with the pharmaceutical industry) would be facilitated and data shared.

Finally, we are also concerned that an initiative described in this paper, namely the recently released NPC browser (“a comprehensive resource of clinically approved drugs to enable repurposing and chemical genomics”) from the NIH Chemical Genomics Center (NCGC) (3), may not have set the case too soundly. Within 24 hrs of release our analysis of the molecules in the database showed that fundamental errors were present, with valency issues, charge imbalances and stereochemistry (4-6), to name just a few. It took over a month for NCGC to acknowledge these errors and they will still be fixing them for the foreseeable future (7). This software application and content was released in a very raw state with extremely poor quality data (7-9). While software development may appear easy and fast to do, what is required to produce the best solution is the right ideas, the right people and the right tools. The parallel with NCATS is clear. While the NIH is staffed with many clever people there are many more willing to collaborate with fresh ideas outside (10). We define ourselves as two such willing participants.


1. S. Ekins, A.J. Williams, M.D. Krasowski, J.S. Freundlich. In silico repositioning of approved drugs for rare and neglected diseases. Drug Disc Today. 16, 298-310 (2011).

2. S. Ekins, A.J. Williams. Finding promiscuous old drugs for new uses. Pharm Res. 28, 1786-1791 (2011).

3. R. Huang, N. Southall, Y. Wang, A. Yasgar, P. Shinn, A. Jadhav, D.T. Nguyen, C.P. Austin. The NCGC Pharmaceutical Collection: A Comprehensive Resource of Clinically Approved Drugs Enabling Repurposing and Chemical Genomics. Science Translational Medicine. 3, 80ps16 (2011).

4. A.J. Williams. Reviewing Data Quality in the NCGC Pharmaceutical Collection Browser.

5. A.J. Williams. What is a Drug? Data Quality in the NCGC Pharmaceutical Collection Browser Part 2.

6. A.J. Williams. Support for Common Compounds in the NPC Browser. Data Quality Part 3.

7. A.J. Williams. Unreported results. Manuscript in preparation (2011).

8. A.J. Williams, S. Ekins. A quality alert for chemistry databases. Drug Disc Today. In Press (2011).

9. A.J. Williams. Rabbits, Potatoes and other Vegetables in the NCGC Database.

10. S. Ekins, A.J. Williams. Reaching out to collaborators: crowdsourcing for pharmaceutical research. Pharm Res. 27, 393-395 (2010).

Conflicts of Interest

AJW is employed by the Royal Society of Chemistry which owns ChemSpider and associated technologies.

SE Consults for Collaborative Drug Discovery


Posted by on July 25, 2011 in General Communications



Rabbits, Potatoes and other Vegetables in the NCGC Database

No wonder I’m so healthy…I like rabbits, eat lots of potatoes, drink “green sludge in a bottle” to get my vegetables and also have lots of other “ingredients” in my diet. According to the NPC Browser on my desktop, these are all part of the NCGC data collection. While I was chatting with a number of pharmaceutical scientists last week regarding data quality in public domain databases, the NPC Browser was used as an example of data content “to review”. If you’d like to review the contents yourself you will find many issues regarding stereochemistry, valency, charge balance and incorrect associations between structures and identifiers. You can also review the data in table format, look at the content that doesn’t have structures associated and scratch your head at some of it. To see the errors go to tabular mode as shown below and scroll down past record 8000.

How to display the tabular format for the NCGC Data in the NPC Browser

Notice that the drugs “water Lily”, “Water Cress” and “Water Hemlock” are listed. I wonder what water cress is used for? It all becomes much more fun when you see some of the others listed below. Rabbit…now that’s a good…take two in the morning, without food, and repeat dosage for 7 days. Cures “big sharp pointy teeth”. Vegetables…ah yes, nothing specific. Just “vegetables”. Kind of a cure all really. Take 5 portions per day, with food, obviously. And potatoes…good for a stiff neck (starch collar syndrome). If you browse through you’ll also find “ingredients” listed as a compound. Glad about that really…most drugs have ingredients. I am sure there is a reason that these are listed, though I cannot imagine what the reason would be. If there is no good reason it is time to decommission this dataset until it is cleaned up in a major way. Clearly the contents are suspect at best.

A Rabbit in the NCGC Collection - I pity the rabbit during high-throughput screening

Potatoes - the drug of choice for many McDonald's visitors. Fry-style


Momma was right - eat your vegetables and you'll be healthy. They are drugs?


Comments from @UntangledHealth re. Data Quality

I have recently blogged about the quality of data in the NCGC dataset that was made available with the NPC Browser. Jeff Harris, from the UntangledHealth Blog, has made some interesting comments about how this carries over to healthcare. I am posting them below as I thought they were interesting enough to be exposed to this community in case you missed them in the comments…

“I want to thank you for including those of us on the front lines as practitioners and patients in your thoughtful research. Whether you recognize it or not (I am sure you do!) the profound discovery of issues in data integrity within the life sciences translates all the way down to the level of a therapeutic outcome such as blood pressure and ultimately what could emerge as what we tend to call a therapeutic misadventure (read my post “My first experience with Computer Assisted Clinical Decision Support” on the Untangled Health Blog; it dates back to 1982).

I am sure you are aware of the current pressures (in both carrot and stick form) from our government to deploy electronic health records which include the elements of clinical and administrative data exchange between providers, clinics, patients and various registries. The Office of the National Coordinator for HIT is accountable for managing multiple advisory committees who are setting the requirements for the technology we ultimately use. The Health Information Technology Standards Panel has done an excellent job over the last several years developing use cases, standard nomenclature and message structures.

What I find alarming in your work is the fact that we continue to have issues with the credibility of the foundation of our communication; the source data.

• In your world of life science innovation these issues may sort themselves out during primary investigation phases but our data are best thought of as meta-data which are used by humans AND computers for critical decisions relating to individual patient management and population health targeting.
• For example: we utilize HL7 (Health Level 7) as a structure to embed data such as CCD (Continuity of Care Document), NCPDP standards for prescription data and HIPAA X12 standards for the electronic representation of claims between entities. Lately we have also started new standards for Quality Reporting (QRDA) and Geocoded Population Summary Exchanges (GIPSE). We have also chosen vocabulary standards including: SNOMED CT (Systematized Nomenclature of Medicine - Clinical Terms), used for clinical problems and procedures; UNII (Unique Ingredient Identifier), used for ingredient allergies; LOINC (Logical Observation Identifiers Names and Codes), used for lab tests, which facilitates the exchange and pooling of clinical results for clinical care, outcomes management, and research by providing a set of universal codes and names to identify laboratory and other clinical observations; and UCUM (Unified Code for Units of Measure), used for units of measure, a code system intended to include all units of measure in contemporary use in international science, engineering, and business. Its purpose is to facilitate unambiguous electronic communication of quantities together with their units. The focus is on electronic communication, as opposed to communication between humans. A typical application of the Unified Code for Units of Measure is in electronic data interchange (EDI) protocols.
Herein lies the rub: We still have not sorted out, from an industrial perspective, how rules that impact patient treatment, and hence safety, will be sorted by reliability of the source record. For example: an insurance company attempting to identify individuals with hypertension might use their X12 transaction sets, which include ICD codes (International Classification of Diseases) and CPT codes (procedural codes for payment), then target those individuals for disease management by a special team of patient advocates. The reliability of X12 transactions is always a debate since practitioners who are paid for their services based on the complexity of the encounter will often add every diagnostic code (ICD) that applies to the patient to maximize reimbursement. We are working on these ethical issues but they persist.
I personally received a call from my insurance company nurse after she had enrolled me in a depression management program because their Pharmacy Benefits Management Company recorded that I was taking Cymbalta as coded in their NCPDP data. Cymbalta as a single identifier for depression does not work from an algorithmic perspective since it is also used for diabetic neuropathy (the reason for my treatment). The nurse and I had a good laugh over this. Ideally a face to face claims encounter with at least two occurrences of the ICD or DSM for depression should be included in the equation prior to contacting the patient. In this case, the insurance group had made a big mistake and if I had less of a sense of humor it would have ended differently.
So, we are working on algorithms, yet I can assure you that the logical code used can be quite different between manufacturers.
To add another issue: Having worked in the industry, I have had experiences where legacy HL7 data had been customized to use an empty field assigned for one clinical parameter and replaced it with another. This was fine until the next generation of employees came along and tried to run reports on the lab data using basic HL7 standards. Kind of like discovering that your average patient has an average blood glucose value of October 31st, 2001.
What you are unveiling at the molecular level runs rampant in our industry at a time when we are forcing technology into the market that IMHO still requires a lot of validation as opposed to deploying beta product.
The Office of the National Coordinator for HIT Strategic Plan for 2011 through 2015 has the following objectives:
Goal I: Achieve Adoption and Information Exchange through Meaningful Use of Health IT
Goal II: Improve Care, Improve Population Health, and Reduce Health Care Costs through the Use of Health IT
Goal III: Inspire Confidence and Trust in Health IT
Goal IV: Empower Individuals with Health IT to Improve their Health and the Health Care System
You can predict the problems with Goal III if we do not perform stellar validation and reliability testing across all manufacturers. I doubt that this is possible given the number of players in the market place.”
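The Cymbalta anecdote in the comment above amounts to a rule: a prescription for a multi-indication drug should not trigger outreach on its own, only when at least two face-to-face claims also carry a depression diagnosis code. A minimal Python sketch of that rule follows. The record layout, the function name `patients_to_contact` and the placeholder codes are all assumptions made for illustration; they are not real NCPDP or X12 structures, and not real ICD or DSM codes.

```python
from collections import Counter

# Hypothetical placeholder codes; a real rule would use actual ICD or DSM codes.
DEPRESSION_CODES = {"DEP-A", "DEP-B"}
# Drugs that hint at depression but have other indications
# (Cymbalta is the example given in the comment).
AMBIGUOUS_DRUGS = {"cymbalta"}


def patients_to_contact(prescriptions, claims, min_encounters=2):
    """Return patients on an ambiguous drug who ALSO have enough coded encounters.

    prescriptions: list of {"patient": str, "drug": str}
    claims: list of {"patient": str, "codes": list[str], "face_to_face": bool}
    """
    on_drug = {p["patient"] for p in prescriptions
               if p["drug"].lower() in AMBIGUOUS_DRUGS}
    # Count face-to-face claims per patient that carry a depression code.
    coded = Counter(c["patient"] for c in claims
                    if c["face_to_face"] and DEPRESSION_CODES & set(c["codes"]))
    return {pid for pid in on_drug if coded[pid] >= min_encounters}


prescriptions = [{"patient": p, "drug": "Cymbalta"} for p in ("A", "B", "C")]
claims = [
    {"patient": "A", "codes": ["NEURO-X"], "face_to_face": True},  # neuropathy only
    {"patient": "B", "codes": ["DEP-A"], "face_to_face": True},
    {"patient": "B", "codes": ["DEP-B"], "face_to_face": True},
    {"patient": "C", "codes": ["DEP-A"], "face_to_face": True},
    {"patient": "C", "codes": ["DEP-A"], "face_to_face": False},
]
print(patients_to_contact(prescriptions, claims))
```

Patient A here mirrors the commenter's situation: the drug signal is present but no depression code ever appears, so no call is made.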


Posted by on May 15, 2011 in Purpose and People


Data Quality in the NCGC Pharmaceutical Collection Browser Part 4

I am now back in the US after a week in Europe and late at night, with a disrupted sleep pattern and looking for something to soothe me to sleep I have been wandering through the rest of the NPC Browser data set [1,2,3] looking for more patterns in the data to see if I can come up with some general advice and cautions about how to build datasets of chemistry. I already have my “conclusion” in my head as to the best advice I can give to any government organization, and others, who are trying to build chemical databases but I will save this for later in the week. For now I want to highlight some of the issues to be careful of. Tonight’s focus is “structural depictions”.

Achieving high quality algorithmic 2D structural layout across large databases is difficult. All of the cheminformatics vendors have layout tools whereby a structure can be “cleaned” so that the layout on the page is visually appealing. If you have ever used any of the cleaning tools you might have discovered that they fail dismally with certain structures and you have to perform a layout manually. You might have discovered that when you clean the structure the stereocenters flip (something that should NEVER happen but, believe me, does!). However, there are some good tools available. OpenEye provide layout algorithms as part of their cheminformatics toolkit and it is certainly one I have experience of.

In any case, when creating a database of chemicals it makes good sense to use algorithms for layout rather than accept what is submitted. PubChem and ChemSpider, with tens of millions of structures, have to use algorithmic cleaning, but if you have a small database it won’t take long to inspect it visually and spot errors very quickly. There are many examples that SHOULD have been caught with the recent NCGC HTS screening set.

The worst one of these that I found while simply browsing was that associated with “Silidianin” as shown below.

The "0D" Structure of Silidianin in the NPC Browser

Here we see the structure of a bare hydroxyl group, but without a negative charge. However, the CAS number and various synonyms certainly don’t support the compound being a hydroxyl group. Indeed, this is a “0D structure” with all xy coordinates set to 0,0 and if the structure is cleaned in a drawing package then you see the structure shown on the left below. While it is the connection table for Silidianin, it does not have the appropriate stereochemistry for Silidianin encoded into the structure with wedge and dashed-wedge bonds. The structure on the right has that shown.

Silidianin: Left Structure is Cleaned and Right Structure is with Stereocenters Added

I think you would agree it is more aesthetically pleasing and does communicate the bridged nature of the compound and carries the stereochemical information. However, there is something wrong with even this picture. Can anyone say what’s wrong?

Accurate structural representation in any database takes time, effort and, often, a skilled and careful eye to get right. Clearly the 0D structure is simply wrong and should have been caught. There are other offenders in the database that should have been caught also as shown below. There are LOTS more.

Various Examples of Structures Requiring Layout Improvements

Building high quality chemical databases certainly is tricky…and can be very time-consuming to solve all of these issues.
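The “0D” case described above is at least cheap to detect automatically, since every atom sits at the same point. The following is a minimal Python sketch, assuming the records are available as V2000 molfiles (the format used by the NPC Browser itself is not stated here). The function names `parse_v2000_coords`, `is_zero_d` and `make_molblock` are hypothetical, and the last one exists only to build small test inputs.

```python
def parse_v2000_coords(molblock):
    """Return (x, y, z) tuples read from the atom block of a V2000 molfile."""
    lines = molblock.splitlines()
    counts = lines[3]              # the fourth line is the counts line
    n_atoms = int(counts[0:3])     # first three columns hold the atom count
    coords = []
    for line in lines[4:4 + n_atoms]:
        # x, y and z occupy three fixed-width fields of ten columns each
        coords.append((float(line[0:10]), float(line[10:20]), float(line[20:30])))
    return coords


def is_zero_d(molblock, tol=1e-6):
    """True if the record has more than one atom and all coordinates are zero."""
    coords = parse_v2000_coords(molblock)
    return len(coords) > 1 and all(abs(c) <= tol for xyz in coords for c in xyz)


def make_molblock(atoms):
    """Build a tiny bond-free V2000 block from (symbol, x, y) tuples (test helper)."""
    header = ["example", "  sketch", ""]
    counts = f"{len(atoms):3d}{0:3d}  0  0  0  0  0  0  0  0999 V2000"
    atom_lines = [
        f"{x:10.4f}{y:10.4f}{0.0:10.4f} {sym:<3} 0  0  0  0  0  0  0  0  0  0  0  0"
        for sym, x, y in atoms
    ]
    return "\n".join(header + [counts] + atom_lines + ["M  END"])


collapsed = make_molblock([("C", 0, 0), ("O", 0, 0)])
laid_out = make_molblock([("C", 0, 0), ("O", 1.2, 0)])
print(is_zero_d(collapsed), is_zero_d(laid_out))
```

A check like this only catches the collapsed-coordinates case. Overlapping atoms, clashing rings and missing wedge bonds still need a layout tool or a careful eye.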



What is a Drug? Data Quality in the NCGC Pharmaceutical Collection Browser Part 2

My initial post on the data quality in the NCGC Pharmaceutical Collection Browser (the NPC Browser) drew some interesting comments. I have continued to review the data and will post various aspects of this to be digested. Because I want to provide feedback to the hosts of the database, and am committed to helping with review of the data, what I want to do is try to determine whether the chemical structure provided is consistent with the descriptors associated with the name. In this case by descriptor I mean the chemical name(s) and the CAS number(s) associated with a particular drug. In a number of cases, however, it is difficult to understand which drug a record is meant to represent. If the database is to be curated, some clarity around what a particular record is meant to be curated against is necessary. For example, if we were given the synonym aspirin then we would know that the associated chemical structure is meant to be aspirin. We could also validate the CAS numbers provided for that chemical. In the case of Picobenzide below, this is easy as I am looking for consistency between resources of the chemical structure as shown, the CAS number and the list of chemical names. There is certainly enough information online to suggest that Picobenzide is a drug.


However, in some cases the synonyms and associated CAS numbers are meshed together from multiple sources and it is difficult to distinguish what the chemical is meant to be. For example, see the two records below.

1) “Gulose”

The structure of a sugar and the long list of associated names and CAS Numbers

A search on Gulose brings up the record above with the list of synonyms and CAS numbers below.

Synonyms: C.m.c.; Aquaplast; Gulose; Sodium carboxymethylcellulose; Idose, (l)-isomer; Polycell; Carmellose sodium; Carboxymethylcellulose, sodiu; Polymannose; Cellulose, carboxymethy; Croscarmelosa; Carboxymethylcellulose sodium; Carmellosum; Croscarmellose sodium; Who no. 4950; Idose, 14c-labeled, (alpha-d)-isomer; Ruspol; Allose, (l)-isomer; Cethylose; Croscarmellose; Celluvisc; Allose; Mannose homopolymer; Gulose, (d)-isomer; Cellolax; Carmelosa; Idose, (alpha-d)-isomer; Carboxymethylcellulose; Gulose, (l)-isomer; Orabase; Cellulose gum; Croscarmellosum; Idose; Poly(mannose); Idopyranose; Allose, (d)-isomer; Sodium, croscarmellos; Sodium, carboxymethylcellulos
CAS Numbers : 9000-11-7; 9004-32-4; 2152-76-3; 5934-56-5; 7282-82-8; 771-89-1; 30142-85-9; 81209-86-1; 2595-97-3; 7635-11-2; 6038-51-3; 4205-23-6; 6027-89-0; 19163-87-2; 1990-29-0; 10030-80-5; 15572-79-9; 23567-25-1; 2595-98-4; 1949-88-8; 5978-95-0; 68400-63-5; 62057-26-5; 25191-16-6; 37370-41-5; 117385-93-0; 12624-09-8; 198084-97-8; 37231-14-4; 37231-15-5; 50642-44-9; 54018-17-6; 55607-96-0; 73699-63-5; 80296-93-1; 82197-79-3; 9045-95-8; 9085-26-1; 177317-30-5; 191616-54-3; 196886-89-2; 204336-41-4; 56727-45-8; 3573-62-4; 56050-40-9; 28823-03-2; 815-92-9; 2535-38-8; 68951-61-1; 42396-95-2; 3615-68-7; 4005-41-8; 10326-73-5; 14049-06-0; 19030-38-7; 22348-49-8; 68784-18-9; 33417-97-9; 68784-15-6

Following any of the CAS Numbers out to PubChem gives a record that tends to have distinct stereochemistry. For example clicking on one of the CAS numbers takes us to this record on PubChem that has names including “Polymannose, Poly(mannose),Mannose homopolymer, D-Mannose polymers” and specific stereochemistry for D-Mannose. The associated molecule is the monomer. Similarly, on ChemSpider we do not host polymers at present so I understand why this would be. This record was sourced from ChemIDPlus which also doesn’t support polymers. By following the rules of data assembly the associated stereochemistry was lost. But many of the names collide also as we can see both the L- and D-isomers listed. So, it’s hard to confirm what the chemical structure should be as I don’t know what to validate against.
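Even without a trusted reference to validate against, two mechanical checks can be run on a record like this one. A CAS Registry Number carries a check digit, so malformed numbers can be rejected, and a synonym list naming both the (d)- and (l)-isomer of the same base name can be flagged as internally inconsistent. The Python sketch below illustrates both. The function names are hypothetical, and the synonym pattern assumes the “Name, (d)-isomer” style seen in the list above.

```python
import re

CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")
ISOMER_PATTERN = re.compile(r"^(.*?),\s*\((d|l)\)-isomer$", re.IGNORECASE)


def cas_check_digit_ok(cas):
    """True if the string is a well-formed CAS number with a matching check digit."""
    match = CAS_PATTERN.match(cas.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    check = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the one next to the check digit.
    total = sum(w * int(d) for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == check


def isomer_conflicts(synonyms):
    """Return base names that are listed with both (d)- and (l)-isomer labels."""
    seen = {}
    for name in synonyms:
        m = ISOMER_PATTERN.match(name.strip())
        if m:
            seen.setdefault(m.group(1).lower(), set()).add(m.group(2).lower())
    return sorted(base for base, labels in seen.items() if labels == {"d", "l"})


synonyms = ["Gulose, (d)-isomer", "Gulose, (l)-isomer", "Allose, (l)-isomer",
            "Allose, (d)-isomer", "Idose, (l)-isomer", "Idose, (alpha-d)-isomer"]
print(cas_check_digit_ok("9000-11-7"), isomer_conflicts(synonyms))
```

A passing check digit only shows that a number is well formed. It says nothing about whether that number belongs to the structure in the record, which is the harder curation question.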

This list is fairly long but is much longer for adipic acid as can be seen below. Read BELOW the list for more comments..and yes, I know it’s a long list.

2) Synonyms: Sodium hydrogen adipate-adipic acid; Camin ap; Piperazine adipinate; Piperazine adipate; Amphetamine adipate; Adipic acid; Adipate-adipic acid; Poly(propylene adipate)
CAS Numbers: 124-04-9; 142-88-1; 22322-28-7; 25666-61-9; 3385-41-9; 40975-75-5; 51137-10-1; 5683-79-4; 7486-38-6; 134886-82-1; 68258-78-6; 67905-77-5; 137315-44-7; 40959-29-3; 41366-44-3; 59518-84-2; 60865-39-6; 61630-89-5; 62271-81-2; 62548-83-8; 63410-51-5; 66787-20-0; 67859-52-3; 25101-03-5; 41366-45-4; 42767-90-8; 50981-28-7; 55157-42-1; 52496-38-5; 53123-38-9; 53989-20-1; 54335-11-4; 54688-53-8; 55012-14-1; 127195-46-4; 161865-24-3; 162281-11-0; 55398-96-4; 56551-71-4; 51178-67-7; 73817-40-0; 9046-11-1; 68411-87-0; 68368-50-3; 67989-20-2; 67953-53-1; 60608-99-3; 141490-13-3; 221695-34-7; 60961-73-1; 92680-63-2; 180324-85-0; 61256-56-2; 61680-38-4; 61840-27-5; 62118-43-8; 62197-02-8; 62942-09-0; 64365-96-4; 65970-51-6; 29534-39-2; 66525-94-8; 67784-94-5; 67939-68-8; 67989-19-9; 19147-16-1; 68937-27-9; 56509-15-0; 74350-54-2; 52627-55-1; 68894-40-6; 73018-29-8; 21697-94-9; 93203-03-3; 15511-81-6; 16031-83-7; 19628-28-5; 19628-29-6; 23311-84-4; 7486-39-7; 24938-37-2; 89468-83-7; 105866-32-8; 25103-87-1; 73816-44-1; 31699-72-6; 31699-74-8; 13425-34-8; 764-65-8; 160886-56-6; 3323-53-3; 94289-34-6; 93505-75-0; 60368-40-3; 41222-49-5; 42133-47-1; 42603-22-5; 49792-84-9; 64927-24-8; 50327-77-0; 51601-35-5; 51912-17-5; 52235-79-7; 52349-42-5; 7486-40-0; 19584-53-3; 40798-45-6; 40989-36-4; 52656-15-2; 52738-38-2; 53351-10-3; 55231-26-0; 58891-19-3; 55447-58-0; 56816-51-4; 73018-30-1; 68238-77-7; 72270-78-1; 159309-70-3; 64091-34-5; 233661-81-9; 55636-50-5; 56266-32-1; 35919-04-1; 67989-13-3; 68389-68-4; 68527-44-6; 68583-79-9; 68583-87-9; 69011-30-9; 69331-29-9; 70775-82-5; 94167-19-8; 26702-48-7; 26780-60-9; 26876-10-8; 103439-11-8; 27925-07-1; 29403-67-6; 29408-39-7; 30110-00-0; 87397-36-2; 179809-38-2; 30376-45-5; 32505-78-5; 163205-75-2; 195889-46-4; 24937-93-7; 86438-03-1; 63149-70-2; 31698-46-1; 112651-27-1; 25931-01-5; 37129-62-7; 39389-41-8; 53302-95-7; 119471-35-1; 126879-92-3; 26375-23-5; 50938-99-3; 65916-90-7; 76199-80-9; 79230-10-7; 103842-92-8; 
106097-11-4; 118817-20-2; 26570-73-0; 66167-60-0; 72993-61-4; 73070-76-5; 97621-66-4; 5423-61-0; 9011-80-7; 167856-49-7; 9017-08-7; 11116-57-7; 151486-17-8; 37324-51-9; 51281-06-2; 53241-28-4; 64927-22-6; 64972-64-1; 9019-92-5; 9052-53-3; 42610-80-0; 9019-93-6; 9019-94-7; 52213-54-4; 9063-78-9; 39277-72-0; 9068-94-4; 9068-96-6; 37277-51-3; 85138-64-3; 9044-95-5; 9080-04-0; 18621-94-8; 19090-60-9; 100359-19-1; 11139-74-5; 149984-55-4; 157971-18-1; 178252-44-3; 24993-04-2; 27030-82-6; 51248-46-5; 51555-86-3; 71119-52-3; 72506-60-6; 9049-00-7; 9049-01-8; 25212-06-0; 511272-89-2; 25212-19-5; 37280-34-5; 51329-77-2; 52932-31-7; 72246-34-5; 74504-46-4; 9036-70-8; 110737-13-8; 25214-14-6; 82785-46-4; 156014-73-2; 25214-18-0; 25748-37-2; 333388-26-4; 25950-35-0; 26140-99-8; 26141-00-4; 26523-14-8; 9036-87-7; 32732-51-7; 33338-25-9; 34012-85-6; 35164-40-0; 36089-13-1; 37310-98-8; 52504-11-7; 66593-97-3; 39281-13-5; 55231-08-8; 28132-94-7; 83890-02-2; 202974-01-4; 28209-35-0; 28301-90-8; 63623-33-6; 63623-34-7; 247906-35-0; 28407-73-0; 30525-45-2; 30580-35-9; 31587-43-6; 76649-35-9; 76649-45-1; 58481-42-8; 63549-52-0; 85169-08-0; 68298-57-7; 68212-31-7; 68140-61-4; 68133-07-3; 68855-39-0; 68954-46-1; 68956-51-4; 68937-26-8; 68989-90-2; 65916-86-1; 27083-55-2; 53526-58-2; 64873-15-0; 76199-81-0; 79921-25-8; 9087-79-0; 32238-28-1; 133544-04-4; 37208-77-8; 68212-32-8; 85646-06-6; 9017-16-7; 97649-50-8; 150747-01-6; 25053-13-8; 73561-43-0; 74083-22-0; 12619-99-7; 12688-24-3; 25191-90-6; 27082-56-0; 37228-90-3; 37275-97-1; 39470-93-4; 51258-14-1; 52228-27-0; 52350-25-1; 52932-19-1; 55777-57-6; 62253-12-7; 70213-58-0; 9038-28-2; 9048-01-5; 9050-55-9; 25464-21-5; 25950-34-9; 26282-28-0; 127004-49-3; 26777-62-8; 39385-68-7; 51822-29-8; 60605-01-8; 61673-82-3; 65357-52-0; 68859-52-9; 83712-77-0; 102561-56-8; 26936-72-1; 27417-33-0; 28430-17-3; 212271-21-1; 28472-89-1; 28477-54-5; 29295-79-2; 30662-91-0; 31048-26-7; 152103-09-8; 31075-20-4; 34313-71-8; 34557-94-3; 67892-88-0; 
35561-07-0; 38702-16-8; 38702-18-0; 40471-09-8; 74748-99-5; 50821-59-5; 112310-22-2; 51293-82-4; 85441-42-5; 51365-12-9; 83740-03-8; 52004-58-7; 52247-59-3; 139989-39-2; 52270-22-1; 53184-55-7


The “adipate” appears to be the counterion in some cases (amphetamine adipate) and, based on the definition in Wikipedia, comes along for the ride in most cases as a coating. It is not a drug per se. I am not sure whether this should be removed from the list for screening, as I could see the value in screening such a common chemical, but I don’t know if it would be an ME based on the definition in the paper: “the term drug refers to a molecular entity (ME) that interacts with one or more molecular targets and effects a change in biological state.”

“Adipic acid has been incorporated into controlled-release formulation matrix tablets to obtain pH-independent release for both weakly basic and weakly acidic drugs. It has also been incorporated into the polymeric coating of hydrophilic monolithic systems to modulate the intragel pH, resulting in zero-order release of a hydrophilic drug. The disintegration at intestinal pH of the enteric polymer shellac has been reported to improve when adipic acid was used as a pore-forming agent without affecting release in the acidic media. Other controlled-release formulations have included adipic acid with the intention of obtaining a late-burst release profile”

The NPC Browser Result for Adipic Acid. Notice the label (there is no sodium associated compound) but this is due to the challenges of aggregating synonyms.

The NCGC have taken on an enormous challenge aggregating these data and will need our feedback on the system moving forward as they intend to improve it based on crowdsourced feedback. Please don’t let them down. If you see an issue in the data use the feedback box!



Markush Misrepresentations in ChemSpider

Following on from the many comments made about the recent post about the NPC Browser, Markus Sitzmann highlighted a “fun molecule” that he found on ChemSpider. It was here as ChemSpiderID 19053748, shown below, but it has now been deprecated…I logged in and deprecated it.

A "fun structure" on ChemSpider

Markus also commented on Sean Ekins’ blog here:

“Well, particularly ChemSpider belongs to the group of “polluters” in PubChem. Count the number of Aspirin, Benzene or Ethanol structures submitted by ChemSpider to PubChem (only linking to a “deprecated” ChemSpider record). Or make an advanced search for ChemSpider records containing also Argon, here is an example:

There are many other examples.”

Markus is CORRECT. I have commented on this publicly myself on a number of occasions and many people have noticed that there are data in PubChem that are in error and originally came from ChemSpider. There’s no point denying it as it’s there for all to see! We have had the intention for a LONG time to deprecate this data from PubChem and replace it with an updated deposition of cleaner data. The intention remains but the challenge is finding the time to do it. We will do it.

Where did the data come from? These “argon” issues are really NOT argon issues…they are the results of molfiles finding their way into ChemSpider from “patent molecules” where the -Ar is expected to represent a Markush structure where Ar means “Aryl”. This is like -Alk meaning alkyl. Similar issues arise when molecules are drawn as -X, -Y and -Z and lists of X, Y, Z substitutions are given. For example X=CH3, C2H5, Y=F, Br and Z=Br, Cl. Unfortunately Y is not only a substitution, it’s an element, Yttrium. So when a molecule is drawn with a supposed Markush bond to -Y then we have a REAL molecule with Yttrium attached. Agh.
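Records of this kind can be screened for before deposition. The following is a rough Python sketch of such a screen, assuming each structure is available as a list of atom labels plus a list of bonded index pairs. The function name, the label sets and the heuristics are illustrative assumptions, not a description of how ChemSpider curates its data.

```python
NOBLE_GASES = {"He", "Ne", "Ar", "Kr", "Xe", "Rn"}
# Real element symbols that also serve as Markush labels in patent drawings.
AMBIGUOUS_LABELS = {"Y"}
# Labels that are not element symbols at all.
NON_ELEMENT_LABELS = {"X", "Z", "R", "Alk"}


def flag_markush_suspects(symbols, bonds):
    """Return (index, label, reason) for atoms that look like unexpanded Markush labels.

    symbols: list of atom labels; bonds: list of (i, j) atom index pairs.
    """
    degree = [0] * len(symbols)
    for i, j in bonds:
        degree[i] += 1
        degree[j] += 1
    has_carbon = "C" in symbols
    flagged = []
    for idx, label in enumerate(symbols):
        if label in NON_ELEMENT_LABELS:
            flagged.append((idx, label, "not an element symbol"))
        elif label in NOBLE_GASES and degree[idx] > 0:
            flagged.append((idx, label, "bonded noble gas"))
        elif label in AMBIGUOUS_LABELS and degree[idx] == 1 and has_carbon:
            flagged.append((idx, label, "singly bonded in an organic skeleton"))
    return flagged


# An ethane skeleton with "Ar" and "Y" drawn as substituents.
suspects = flag_markush_suspects(["C", "C", "Ar", "Y"], [(0, 1), (1, 2), (0, 3)])
print(suspects)
```

The yttrium rule will produce false positives for genuine organometallics, so a flag here is a prompt for a human to look, not grounds for automatic deprecation.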

A list of examples of “interesting Ar molecules” is shown below.

At this point these have all been deprecated…takes about 30 seconds per molecule..but if they were in our original deposition to PubChem they are still there until we deprecate. Ahh…the ongoing joys of data curation.



Reviewing Data Quality in the NCGC Pharmaceutical Collection Browser

I wasn’t aware of the NCGC Pharmaceutical Collection Browser until today. The work behind the development of the database and the browser is discussed in the publication here:

R. Huang, N. Southall, Y. Wang, A. Yasgar, P. Shinn, A. Jadhav, D.-T. Nguyen, C. P. Austin. The NCGC Pharmaceutical Collection: A Comprehensive Resource of Clinically Approved Drugs Enabling Repurposing and Chemical Genomics. Science Translational Medicine, 2011; 3 (80): 80ps16 DOI: 10.1126/scitranslmed.3001862

As is usual with new databases that come online I always concern myself with data quality. In order to take a look at the data quality I looked at the HTS amenable compounds subset of data. It’s a dataset of >7600 compounds. I ran a couple of very simple filters to try and identify potential issues with the data. In particular I was looking for presence/absence/confusions in stereochemistry. The filters also checked for valency issues and charge imbalance. Based on these checks my estimates are, for the HTS amenable compounds at least, the errors in the data amount to a minimum of 5% and probably over 10%. This is an estimate of course and it would be a lot of work to clean it all up. I’ll try and take a look at the entire database shortly.
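The filters themselves are not reproduced in this post, so the following is only an illustrative Python sketch of what very simple charge-balance and valence checks can look like. It assumes each structure has already been parsed into a list of (symbol, formal charge) atoms with explicit hydrogens, and a list of (i, j, bond order) bonds. The valence table and the charge adjustment are deliberately crude assumptions and leave out elements such as S and P that are routinely hypervalent.

```python
# Typical maximum valences for neutral atoms (a deliberately small table).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1, "I": 1}
# For these elements a positive charge adds a bond and a negative charge removes one.
CHARGE_ADDS_BOND = {"N", "O", "F", "Cl", "Br", "I"}


def net_charge(atoms):
    """Sum of formal charges; anything other than 0 marks a charge-unbalanced record."""
    return sum(charge for _, charge in atoms)


def valence_violations(atoms, bonds):
    """Return (index, symbol, used, allowed) for atoms exceeding the allowed valence."""
    used = [0] * len(atoms)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    problems = []
    for idx, (symbol, charge) in enumerate(atoms):
        limit = MAX_VALENCE.get(symbol)
        if limit is None:
            continue  # element not covered by this small table
        if symbol in CHARGE_ADDS_BOND:
            allowed = limit + charge
        else:
            allowed = limit - abs(charge)
        if used[idx] > allowed:
            problems.append((idx, symbol, used[idx], allowed))
    return problems


# A carbon drawn with five hydrogens, and an ammonium ion without its counterion.
texas_carbon = [("C", 0)] + [("H", 0)] * 5
texas_bonds = [(0, k, 1) for k in range(1, 6)]
ammonium = [("N", 1)] + [("H", 0)] * 4
ammonium_bonds = [(0, k, 1) for k in range(1, 5)]
print(valence_violations(texas_carbon, texas_bonds), net_charge(ammonium))
```

The ammonium example passes the valence check but fails charge balance, while the five-coordinate carbon does the opposite, which is why the two filters are run separately.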

Some examples of the errors I saw are below…Unfortunately there are many hundreds of errors in just this subset of the database. We keep creating databases, and in this case a 90 Mbyte desktop browser solution, but WHO is curating and checking the data? How does the money that keeps getting invested in developing software compare with what is spent on building quality datasets to use in the various systems? And so it continues….

Charge Balance Issues

NCGC Browser Charge Balance Issues - Screenshots from Browser


Imperfect, Absent and Incorrect Stereochemistry

Stereo Issues: Left Hand Side NCGC Structures, Right Hand Side "Correct Structures"

Incorrect Valence Issues

Valence Issues for "Tannic Acid Glycerite" - Screenshot from NCGC Browser

And, just to clarify…I am not saying that our own database, ChemSpider, is perfect. It’s not. But the crowds can help us improve it and curate the data online and immediately. One thing I DO like is that the developers thought ahead about getting immediate feedback as shown below. Unfortunately when I tried to use it it threw a message that there was an error so I don’t know whether the message got through. I hope to get a response at my email address.

Feedback Screen in the NPC Browser



Posted by on April 28, 2011 in Data Quality, Quality and Content