Review of NCGC Dataset in the NPC Browser Finished

10 May

For the past couple of weeks I have been looking at the NPC Browser and the dataset contained within it. I am using it as an example of the type of data finding its way into the public domain for use by life scientists. I had the “opportunity” to take a couple of LONG flights to and from Europe last week, plus some late nights in hotels, and during the trip I finished my review of the data. This does NOT mean that I have a fully curated dataset…no chance. That would take a few weeks to assemble! However, it is enough data to insert some of the conclusions into a paper that has just returned from review, as well as to provide data for a paper presently being assembled. With that said, I’m unlikely to report much more on the data until that paper is through review.

What I can comment on is that the dataset does not seem to align with many of the claims made in the original paper, listed below.

R. Huang, N. Southall, Y. Wang, A. Yasgar, P. Shinn, A. Jadhav, D.-T. Nguyen, C. P. Austin. The NCGC Pharmaceutical Collection: A Comprehensive Resource of Clinically Approved Drugs Enabling Repurposing and Chemical Genomics. Science Translational Medicine, 2011; 3 (80): 80ps16 DOI: 10.1126/scitranslmed.3001862

The data has hardly been curated, and many of the heuristics suggested for the assembly of the dataset appear to have failed, judging by what came through in the released set. One of my favorite “drugs” in the screening set is shown below. I doubt Mn2+ is easily marketed as a drug, and having Mn2+ labeled as Selenium oxide, cadmium salt (1:1) seems a little strange. Having it labeled as Strontium tetraborate or barium tetraborate seems just as weird. This is one of many…many others will be discussed in a publication presently in development. Watch this space.
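The kind of mismatch described above lends itself to a simple automated flag. The sketch below is a minimal, hypothetical illustration (the record layout, function names, and element table are my own, not part of any NPC tooling): it flags records whose structure is a bare metal ion but whose assigned name never mentions that element.

```python
import re

# Hypothetical curation check: flag records whose structure is a bare
# metal ion (e.g. "[Mn+2]") but whose name does not mention that element,
# suggesting a mis-mapped name such as "Selenium oxide, cadmium salt".
BARE_ION = re.compile(r"^\[([A-Z][a-z]?)([+-]\d*)?\]$")

# Small illustrative symbol-to-name table; a real check would cover
# the whole periodic table.
ELEMENT_NAMES = {"Mn": "manganese", "Cd": "cadmium", "Se": "selenium"}

def flag_suspect(smiles: str, name: str) -> bool:
    """Return True when a single-ion structure carries a name that
    never mentions its element (a likely mis-mapped record)."""
    m = BARE_ION.match(smiles)
    if not m:
        return False  # only bare-ion structures are checked in this sketch
    element = ELEMENT_NAMES.get(m.group(1), m.group(1).lower())
    return element not in name.lower()

# Records modeled on the mismatches described in the post:
records = [
    ("[Mn+2]", "Selenium oxide, cadmium salt (1:1)"),  # flagged
    ("[Mn+2]", "Strontium tetraborate"),               # flagged
    ("[Mn+2]", "Manganese chloride"),                  # consistent
]
flags = [flag_suspect(s, n) for s, n in records]
```

A name/structure consistency pass of this sort would have caught the Mn2+ examples above before release; of course, real curation needs synonym and salt-form handling well beyond this sketch.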

One of the "drugs" from the HTS Screening Set in the NPC Browser




About tony

Antony (Tony) J. Williams received his BSc in 1985 from the University of Liverpool (UK) and PhD in 1988 from the University of London (UK). His PhD research interests were in studying the effects of high pressure on molecular motions within lubricant-related systems using Nuclear Magnetic Resonance. He moved to Ottawa, Canada to work for the National Research Council, performing fundamental research on the electron paramagnetic resonance of radicals trapped in single crystals. Following his postdoctoral position he became the NMR Facility Manager for the University of Ottawa. Tony then joined the Eastman Kodak Company in Rochester, New York as their NMR Technology Leader. He led the laboratory to develop quality control across multiple spectroscopy labs and helped establish walk-up laboratories providing NMR, LC-MS and other forms of spectroscopy to hundreds of chemists across multiple sites. This included the delivery of spectroscopic data to the desktop, automated processing, and his initial interests in computer-assisted structure elucidation (CASE) systems. He also worked with a team to develop the world's first web-based LIMS system, WIMS, capable of allowing chemical structure searching and spectral display. With his developing cheminformatics skills and passion for data management he left corporate America to join a small start-up company working out of Toronto, Canada. He joined ACD/Labs as their NMR Product Manager and held various roles, including Chief Science Officer, during his 10 years with the company. His responsibilities included managing over 50 products at one time prior to developing a product management team, as well as managing sales, marketing, technical support and technical services. ACD/Labs was one of Canada's Fast 50 Tech Companies and a Forbes Fast 500 company in 2001. His primary passions during his tenure with ACD/Labs were the continued adoption of web-based technologies and the development of automated structure verification and elucidation platforms.
While at ACD/Labs he suggested the possibility of developing a public resource for chemists attempting to integrate internet-available chemical data. He finally pursued this vision with some close friends as a hobby project in the evenings, and the result was the ChemSpider database. Even while running out of a basement on hand-built servers the website developed a large community following that eventually culminated in the acquisition of the website by the Royal Society of Chemistry (RSC), based in Cambridge, United Kingdom. Tony joined the organization, together with some of the other ChemSpider team, and became their Vice President of Strategic Development. At RSC he continued to develop cheminformatics tools, specifically ChemSpider, and was the technical lead for the chemistry aspects of the Open PHACTS project, a project focused on the delivery of open data, open source and open systems to support the pharmaceutical sciences. He was also the technical lead for the UK National Chemical Database Service and the RSC lead for the PharmaSea project, attempting to identify novel natural products from the ocean. He left RSC in 2015 to become a Computational Chemist in the National Center for Computational Toxicology at the Environmental Protection Agency, where he is bringing his skills to bear working with a team on the delivery of a new software architecture for the management and delivery of data, algorithms and visualization tools. The “Chemistry Dashboard” was released on April 1st, no fooling, and provides access to over 700,000 chemicals, experimental and predicted properties, and a developing link network to support the environmental sciences. Tony remains passionate about computer-assisted structure elucidation and verification approaches and continues to publish in this area. He is also passionate about teaching scientists to benefit from the developing array of social networking tools for scientists and is known as the ChemConnector on the networks.
Over the years he has had adjunct roles at a number of institutions and presently enjoys working with scientists at both UNC Chapel Hill and NC State University. He is widely published with over 200 papers and book chapters and was the recipient of the Jim Gray Award for eScience in 2012. In 2016 he was awarded the North Carolina ACS Distinguished Speaker Award.

5 Responses to Review of NCGC Dataset in the NPC Browser Finished

  1. Jeffrey Harris

    May 10, 2011 at 11:05 pm

I want to thank you for including those of us on the front lines, as practitioners and patients, in your thoughtful research. Whether you recognize it or not (I am sure you do!), the profound discovery of issues in data integrity within the life sciences translates all the way down to the level of a therapeutic outcome such as blood pressure, and ultimately to what we tend to call a therapeutic misadventure (read my post “My First Experience with Computer Assisted Clinical Decision Support” on the Untangled Health blog; it dates back to 1982).

    I am sure you are aware of the current pressures (in both carrot and stick form) from our government to deploy electronic health records which include the elements of clinical and administrative data exchange between providers, clinics, patients and various registries. The Office of the National Coordinator for HIT is accountable for managing multiple advisory committees who are setting the requirements for the technology we ultimately use. The Health Information Technology Standards Panel has done an excellent job over the last several years developing use cases, standard nomenclature and message structures.

    What I find alarming in your work is the fact that we continue to have issues with the credibility of the foundation of our communication: the source data.

    • In your world of life science innovation these issues may sort themselves out during primary investigation phases but our data are best thought of as meta-data which are used by humans AND computers for critical decisions relating to individual patient management and population health targeting.
    • For example: we utilize HL7 (Health Level 7) as a structure to embed data such as CCD (Continuity of Care Document), NCPDP standards for prescription data, and HIPAA X12 standards for the electronic representation of claims between entities. Lately we have also started new standards for Quality Reporting (QRDA) and Geocoded Population Summary Exchange (GIPSE). We have also chosen vocabulary standards, including: SNOMED CT (Systematized Nomenclature of Medicine - Clinical Terms), used for clinical problems and procedures; UNII (Unique Ingredient Identifier), used for ingredient allergies; LOINC (Logical Observation Identifiers Names and Codes), used for lab tests, which facilitates the exchange and pooling of clinical results for clinical care, outcomes management, and research by providing a set of universal codes and names to identify laboratory and other clinical observations; and UCUM (Unified Code for Units of Measure), used for units of measure, a code system intended to include all units of measure contemporarily used in international science, engineering, and business. Its purpose is to facilitate unambiguous electronic communication of quantities together with their units; the focus is on electronic communication, as opposed to communication between humans, and a typical application of UCUM is electronic data interchange (EDI) protocols.
    Herein lies the rub: we still have not sorted out, from an industrial perspective, how rules that impact patient treatment, and hence safety, will be sorted by the reliability of the source record. For example: an insurance company attempting to identify individuals with hypertension might use their X12 transaction sets, which include ICD codes (International Classification of Diseases) and CPT codes (procedural codes for payment), and then target those individuals for disease management by a special team of patient advocates. The reliability of X12 transactions is always a debate, since practitioners who are paid for their services based on the complexity of the encounter will often add every diagnostic code (ICD) that applies to the patient to maximize reimbursement. We are working on these ethical issues, but they persist.
    I personally received a call from my insurance company's nurse after she had enrolled me in a depression management program because their pharmacy benefits management company recorded that I was taking Cymbalta, as coded in their NCPDP data. Cymbalta as a single identifier for depression does not work from an algorithmic perspective, since it is also used for diabetic neuropathy (the reason for my treatment). The nurse and I had a good laugh over this. Ideally, at least two face-to-face claims encounters carrying an ICD or DSM code for depression should be included in the equation prior to contacting the patient. In this case the insurance group had made a big mistake, and if I had less of a sense of humor it would have ended differently.
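    The safeguard described above (confirming diagnoses before outreach, rather than inferring from a drug alone) can be sketched as a simple rule. The diagnosis codes, record layout, and function name below are illustrative assumptions of mine, not any real payer's schema:

```python
# Minimal sketch: require at least two face-to-face claims encounters
# carrying a depression diagnosis code before enrolling a member in a
# depression-management outreach program. Example ICD-9 codes shown;
# a production rule would use a maintained code set.
DEPRESSION_CODES = {"296.2", "296.3", "311"}

def eligible_for_outreach(claims) -> bool:
    """claims: iterable of (encounter_type, diagnosis_codes) tuples."""
    hits = sum(
        1
        for encounter_type, codes in claims
        if encounter_type == "face-to-face" and DEPRESSION_CODES & set(codes)
    )
    return hits >= 2

# Pharmacy data alone (Cymbalta, no confirming diagnosis) does not qualify:
pharmacy_only = [("pharmacy", [])]

# Two confirming face-to-face encounters do:
confirmed = [
    ("face-to-face", ["311"]),
    ("face-to-face", ["296.2", "250.60"]),
]
```

    Under this rule the Cymbalta-for-neuropathy case above would never have triggered a call, since no confirming diagnosis appears in the claims history.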
    So, we are working on algorithms, yet I can assure you that the logical code used can be quite different between manufacturers.
    To add another issue: having worked in the industry, I have had experiences where legacy HL7 data had been customized so that an empty field assigned for one clinical parameter was repurposed to carry another. This was fine until the next generation of employees came along and tried to run reports on the lab data using basic HL7 standards. Kind of like discovering that your average patient has an average blood glucose value of October 31st, 2001.
    What you are unveiling at the molecular level runs rampant in our industry at a time when we are forcing technology into the market that IMHO still requires a lot of validation as opposed to deploying beta product.
    The Office of the National Coordinator for HIT Strategic Plan for 2011 through 2015 has the following objectives:
    Goal I: Achieve Adoption and Information Exchange through Meaningful Use of Health IT
    Goal II: Improve Care, Improve Population Health, and Reduce Health Care Costs through the Use of Health IT
    Goal III: Inspire Confidence and Trust in Health IT
    Goal IV: Empower Individuals with Health IT to Improve their Health and the Health Care System
    You can predict the problems with Goal III if we do not perform stellar validation and reliability testing across all manufacturers. I doubt that this is possible given the number of players in the marketplace.
    I will leave you with one other note:
    As clinicians and patients we struggle to get our data from companies and agencies that tend to hold it hostage for a fee. For example: Sure Scripts holds our prescription data in RxHub, which hospital emergency rooms have to pay for as a fee embedded in their e-prescribing software.
    But then again, perhaps it should remain that way since we may not be able to rely on the organic compound structure that the patient reports he or she is taking.
    By the way, your posts give me headaches that I recall from O-Chem 35 years ago. Please keep the pictures and IUPAC stuff to a minimum. Now, if you have any reliable distribution curves for population risk I will gladly review them.
    Jeff Harris

  2. Sean Ekins

    May 11, 2011 at 9:11 am

    Did anyone else look at the other slices of data, out of interest? As you are looking at them from record one onwards, are they just as bad from the last record backwards?

  3. Noel Southall

    May 17, 2011 at 5:28 pm

    I took a look into the record on manganese. Keep in mind that we are trying to assemble a physical collection of HTS-amenable samples for screening that represent the unique molecular entities present in approved drugs. The browser provides the data that supports that physical collection. There are some grey areas in how Manganese +2 came to be decided as the molecular entity for Manganese Chloride, for instance. The details on the approval can be found in the ‘product label’ listed under the record in the browser; so this is in fact a drug.
    Active Ingredient
    0.1 mg in 1 mL
    QQE170PANO 42Z2K6ZL8P
    That bit about the selenium is an error that crept into the record through an incorrect reference of a CAS number, that then pulled in quite a lot of other bad information. In any case, metals were generally excluded from the screening collection.

    Thanks as always for the comments, and please keep sending them in – they are quite helpful in improving this public resource. Please forward your full list of flagged records at your earliest convenience.

