10 May

For the past couple of weeks I have been looking at the NPC browser and the dataset contained within it. I am using it as an example of the type of data that is finding its way into the public domain for use by life scientists. I had the “opportunity” to take a couple of LONG flights to and from Europe last week, plus some late nights in hotels, and during the trip I finished my review of the data. This does NOT mean that I have a fully curated dataset… no chance. That would take a few weeks to assemble! However, it is enough data to insert some of the conclusions into a paper that has just returned from review, as well as to provide data for a paper presently being assembled. With that said, I’m unlikely to report much more on the data until that paper is through review.

What I can say is that the dataset does not seem to align with many of the statements in the original paper listed below.

R. Huang, N. Southall, Y. Wang, A. Yasgar, P. Shinn, A. Jadhav, D.-T. Nguyen, C. P. Austin. The NCGC Pharmaceutical Collection: A Comprehensive Resource of Clinically Approved Drugs Enabling Repurposing and Chemical Genomics. Science Translational Medicine, 2011; 3 (80): 80ps16 DOI: 10.1126/scitranslmed.3001862

The data has hardly been curated, and many of the heuristics said to have been applied during assembly of the dataset appear to have failed, judging by what came through in the released set. One of my favorite “drugs” in the screening set is shown below. I doubt Mn2+ is easily marketed as a drug, and having Mn2+ labeled as Selenium oxide, cadmium salt (1:1) seems a little strange. Having it labeled as Strontium tetraborate or barium tetraborate seems just as weird. This is one of many; many others will be discussed in a publication presently in development. Watch this space.

One of the "drugs" from the HTC Screening Set in the NPC Browser




  1. Jeffrey Harris

    May 10, 2011 at 11:05 pm

    I want to thank you for including those of us on the front lines as practitioners and patients in your thoughtful research. Whether you recognize it or not (I am sure you do!), the profound discovery of issues in data integrity within the life sciences translates all the way down to the level of a therapeutic outcome such as blood pressure, and ultimately to what could emerge as what we tend to call a therapeutic misadventure (see my post “My first experience with Computer Assisted Clinical Decision Support” on the Untangled Health Blog; the story dates back to 1982).

    I am sure you are aware of the current pressures (in both carrot and stick form) from our government to deploy electronic health records which include the elements of clinical and administrative data exchange between providers, clinics, patients and various registries. The Office of the National Coordinator for HIT is accountable for managing multiple advisory committees who are setting the requirements for the technology we ultimately use. The Health Information Technology Standards Panel has done an excellent job over the last several years developing use cases, standard nomenclature and message structures.

    What I find alarming in your work is the fact that we continue to have issues with the credibility of the foundation of our communication; the source data.

    • In your world of life science innovation these issues may sort themselves out during primary investigation phases but our data are best thought of as meta-data which are used by humans AND computers for critical decisions relating to individual patient management and population health targeting.
    • For example: we utilize HL7 (Health Level 7) as a structure to embed data such as CCD (Continuity of Care Document) records, NCPDP standards for prescription data and HIPAA X12 standards for the electronic representation of claims between entities. Lately we have also started new standards for Quality Reporting (QRDA) and Geocoded Population Summary Exchanges (GIPSE). We have also chosen vocabulary standards, including: SNOMED CT (Systematized Nomenclature of Medicine — Clinical Terms), used for clinical problems and procedures; UNII (Unique Ingredient Identifier), used for ingredient allergies; LOINC (Logical Observation Identifiers Names and Codes), used for lab tests, which facilitates the exchange and pooling of clinical results for clinical care, outcomes management, and research by providing a set of universal codes and names to identify laboratory and other clinical observations; and UCUM (Unified Code for Units of Measure), used for units of measure. UCUM is a code system intended to include all units of measure in contemporary use in international science, engineering, and business. Its purpose is to facilitate unambiguous electronic communication of quantities together with their units. The focus is on electronic communication, as opposed to communication between humans. A typical application of the Unified Code for Units of Measure is in electronic data interchange (EDI) protocols.
    Herein lies the rub: from an industrial perspective, we still have not sorted out how rules that impact patient treatment, and hence safety, will be sorted by the reliability of the source record. For example: an insurance company attempting to identify individuals with hypertension might use their X12 transaction sets, which include ICD codes (International Classification of Diseases) and CPT codes (procedural codes for payment), and then target those individuals for disease management by a special team of patient advocates. The reliability of X12 transactions is always a debate, since practitioners who are paid for their services based on the complexity of the encounter will often add every diagnostic code (ICD) that applies to the patient to maximize reimbursement. We are working on these ethical issues, but they persist.
    I personally received a call from my insurance company nurse after she had enrolled me in a depression management program because their Pharmacy Benefits Management Company recorded that I was taking Cymbalta, as coded in their NCPDP data. Cymbalta as a single identifier for depression does not work from an algorithmic perspective, since it is also used for diabetic neuropathy (the reason for my treatment). The nurse and I had a good laugh over this. Ideally, a face-to-face claims encounter with at least two occurrences of the ICD or DSM code for depression should be included in the equation prior to contacting the patient. In this case, the insurance group had made a big mistake, and if I had less of a sense of humor it would have ended differently.
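
The rule suggested above can be sketched in a few lines of Python. This is a minimal illustration only, not any insurer's actual logic: the `Claim` record, the `qualifies_for_outreach` function, and the example codes are all hypothetical, and the sketch assumes the rule means "at least two face-to-face claims that each carry a depression diagnosis code". A prescription record on its own never triggers outreach.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """Hypothetical, simplified claim record (not a real X12 structure)."""
    patient_id: str
    face_to_face: bool
    diagnosis_codes: frozenset  # ICD or DSM codes recorded on the claim


def qualifies_for_outreach(claims, patient_id, depression_codes, min_encounters=2):
    """Return True only if the patient has at least `min_encounters`
    face-to-face claims that carry a depression diagnosis code."""
    hits = sum(
        1
        for c in claims
        if c.patient_id == patient_id
        and c.face_to_face
        and c.diagnosis_codes & depression_codes
    )
    return hits >= min_encounters


# Illustrative codes only; a real program would use a maintained code set.
DEPRESSION_CODES = frozenset({"296.20", "311"})

claims = [
    # Patient A: a neuropathy visit only, so a Cymbalta prescription
    # would not be enough to enroll this patient.
    Claim("A", True, frozenset({"250.60"})),
    # Patient B: two face-to-face encounters coded for depression,
    # plus one claim that was not face to face and is ignored.
    Claim("B", True, frozenset({"311"})),
    Claim("B", True, frozenset({"296.20"})),
    Claim("B", False, frozenset({"311"})),
]

print(qualifies_for_outreach(claims, "A", DEPRESSION_CODES))
print(qualifies_for_outreach(claims, "B", DEPRESSION_CODES))
```

The threshold is a parameter so that a stricter program could require more encounters without changing the logic.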
    So, we are working on algorithms, yet I can assure you that the logical code used can be quite different between manufacturers.
    To add another issue: having worked in the industry, I have had experiences where legacy HL7 data had been customized so that an empty field assigned to one clinical parameter was used to hold another. This was fine until the next generation of employees came along and tried to run reports on the lab data using basic HL7 standards. Kind of like discovering that your average patient has an average blood glucose value of October 31st, 2001.
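
A simple plausibility check on the raw field catches this kind of repurposing before it reaches a report. The sketch below is an assumption-laden illustration: `parse_glucose_mg_dl` is a hypothetical helper, the bounds are arbitrary placeholders rather than clinical reference limits, and the field is assumed to arrive as a plain string.

```python
def parse_glucose_mg_dl(raw, low=10.0, high=2000.0):
    """Return the glucose value as a float (mg/dL), or None if the field
    does not look like a glucose result.

    `low` and `high` are placeholder bounds chosen for illustration.
    """
    try:
        value = float(raw.strip())
    except ValueError:
        # Non-numeric content, such as a formatted date, is rejected.
        return None
    # Out-of-range numbers (for example a date packed as digits) are rejected.
    return value if low <= value <= high else None


print(parse_glucose_mg_dl("95"))
print(parse_glucose_mg_dl("2001-10-31"))
print(parse_glucose_mg_dl("20011031"))
```

Rejected values would then be counted and reviewed rather than silently averaged.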
    What you are unveiling at the molecular level runs rampant in our industry at a time when we are forcing technology into the market that IMHO still requires a lot of validation as opposed to deploying beta product.
    The Office of the National Coordinator for HIT Strategic Plan for 2011 through 2015 has the following objectives:
    Goal I: Achieve Adoption and Information Exchange through Meaningful Use of Health IT
    Goal II: Improve Care, Improve Population Health, and Reduce Health Care Costs through the Use of Health IT
    Goal III: Inspire Confidence and Trust in Health IT
    Goal IV: Empower Individuals with Health IT to Improve their Health and the Health Care System
    You can predict the problems with Goal III if we do not perform stellar validation and reliability testing across all manufacturers. I doubt that this is possible given the number of players in the market place.
    I will leave you with one other note:
    As clinicians and patients we struggle to get our data from companies and agencies that tend to hold it hostage for a fee. For example: Sure Scripts holds our prescription data in RxHub which hospital emergency rooms have to pay for as a fee embedded in their e-prescribing software.
    But then again, perhaps it should remain that way since we may not be able to rely on the organic compound structure that the patient reports he or she is taking.
    By the way, your posts give me headaches that I recall from O-Chem 35 years ago. Please keep the pictures and IUPAC stuff to a minimum. Now, if you have any reliable distribution curves for population risk, I will gladly review them.
    Jeff Harris

  2. Sean Ekins

    May 11, 2011 at 9:11 am

    Has anyone else looked at the other slices of data, out of interest? As you are looking at them from record one onwards, are they just as bad from the last record backwards?

  3. Noel Southall

    May 17, 2011 at 5:28 pm

    I took a look into the record on manganese. Keep in mind that we are trying to assemble a physical collection of HTS-amenable samples for screening that represent the unique molecular entities present in approved drugs. The browser provides the data that supports that physical collection. There are some grey areas in how Manganese +2 was decided to be the ME for Manganese Chloride, for instance. The details on the approval can be found in the ‘product label’ listed under the record in the browser, so this is in fact a drug.
    Active Ingredient
    0.1 mg in 1 mL
    QQE170PANO 42Z2K6ZL8P
    That bit about the selenium is an error that crept into the record through an incorrect reference of a CAS number, that then pulled in quite a lot of other bad information. In any case, metals were generally excluded from the screening collection.

    Thanks as always for the comments, and please keep sending them in – they are quite helpful in improving this public resource. Please forward your full list of flagged records at your earliest convenience.


