
What is a Drug? Data Quality in the NCGC Pharmaceutical Collection Browser Part 2

02 May

My initial post on data quality in the NCGC Pharmaceutical Collection Browser (the NPC Browser) drew some interesting comments. I have continued to review the data and will post various aspects of that review to be digested. Because I want to provide feedback to the hosts of the database, and am committed to helping with review of the data, I am trying to determine whether the chemical structure provided is consistent with the descriptors associated with the name. By descriptors I mean the chemical name(s) and the CAS number(s) associated with a particular drug. In a number of cases, however, it is difficult to understand which particular drug a record is meant to represent. If the database is to be curated, we need some clarity about what each record should be curated against. For example, if we were given the synonym aspirin then we would know that the associated chemical structure is meant to be aspirin. We could also validate the CAS numbers provided for that chemical. In the case of Picobenzide below this is easy, as I am looking for consistency across resources between the chemical structure as shown, the CAS number and the list of chemical names. There is certainly enough information online to suggest that Picobenzide is a drug.

 

However, in some cases the synonyms and associated CAS numbers are meshed together from multiple sources and it is difficult to distinguish what the chemical is meant to be. For example, see the two records below.

1) “Gulose”

The structure of a sugar and the long list of associated names and CAS Numbers

A search on Gulose brings up the record above, with the list of synonyms and CAS numbers below.

Synonyms: C.m.c.; Aquaplast; Gulose; Sodium carboxymethylcellulose; Idose, (l)-isomer; Polycell; Carmellose sodium; Carboxymethylcellulose, sodiu; Polymannose; Cellulose, carboxymethy; Croscarmelosa; Carboxymethylcellulose sodium; Carmellosum; Croscarmellose sodium; Who no. 4950; Idose, 14c-labeled, (alpha-d)-isomer; Ruspol; Allose, (l)-isomer; Cethylose; Croscarmellose; Celluvisc; Allose; Mannose homopolymer; Gulose, (d)-isomer; Cellolax; Carmelosa; Idose, (alpha-d)-isomer; Carboxymethylcellulose; Gulose, (l)-isomer; Orabase; Cellulose gum; Croscarmellosum; Idose; Poly(mannose); Idopyranose; Allose, (d)-isomer; Sodium, croscarmellos; Sodium, carboxymethylcellulos
CAS Numbers : 9000-11-7; 9004-32-4; 2152-76-3; 5934-56-5; 7282-82-8; 771-89-1; 30142-85-9; 81209-86-1; 2595-97-3; 7635-11-2; 6038-51-3; 4205-23-6; 6027-89-0; 19163-87-2; 1990-29-0; 10030-80-5; 15572-79-9; 23567-25-1; 2595-98-4; 1949-88-8; 5978-95-0; 68400-63-5; 62057-26-5; 25191-16-6; 37370-41-5; 117385-93-0; 12624-09-8; 198084-97-8; 37231-14-4; 37231-15-5; 50642-44-9; 54018-17-6; 55607-96-0; 73699-63-5; 80296-93-1; 82197-79-3; 9045-95-8; 9085-26-1; 177317-30-5; 191616-54-3; 196886-89-2; 204336-41-4; 56727-45-8; 3573-62-4; 56050-40-9; 28823-03-2; 815-92-9; 2535-38-8; 68951-61-1; 42396-95-2; 3615-68-7; 4005-41-8; 10326-73-5; 14049-06-0; 19030-38-7; 22348-49-8; 68784-18-9; 33417-97-9; 68784-15-6

Following any of the CAS Numbers out to PubChem gives a record that tends to have distinct stereochemistry. For example, clicking on one of the CAS numbers takes us to this record on PubChem that has names including “Polymannose, Poly(mannose), Mannose homopolymer, D-Mannose polymers” and specific stereochemistry for D-Mannose. The associated molecule is the monomer. Similarly, on ChemSpider we do not host polymers at present, so I understand why this would be. This record was sourced from ChemIDPlus, which also doesn’t support polymers. By following the rules of data assembly the associated stereochemistry was lost. But many of the names also collide, as we can see both the L- and D-isomers listed. So it’s hard to confirm what the chemical structure should be, as I don’t know what to validate against.
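As a rough illustration of how this kind of name collision could be flagged automatically, here is a minimal Python sketch. The function name `stereo_conflicts` and the regular expression are my own assumptions for illustration, not anything the NPC Browser is known to use, and the sample names are only a subset of the synonyms listed above.

```python
import re
from collections import defaultdict

# Matches stereo labels written as "(l)-isomer", "(d)-isomer" or "(alpha-d)-isomer".
# This naming pattern is an assumption based on the synonyms in this one record.
STEREO_PATTERN = re.compile(r"\((?:alpha-|beta-)?([dl])\)-isomer", re.IGNORECASE)


def stereo_conflicts(synonyms):
    """Return base names that appear with both a D- and an L-isomer label."""
    seen = defaultdict(set)
    for name in synonyms:
        match = STEREO_PATTERN.search(name)
        if match is None:
            continue  # no stereo label on this synonym
        # Treat the text before the first comma as the base name.
        base = name.split(",")[0].strip().lower()
        seen[base].add(match.group(1).lower())
    return sorted(base for base, labels in seen.items() if labels == {"d", "l"})


# A subset of the synonyms from the "Gulose" record.
raw = (
    "Gulose; Idose, (l)-isomer; Idose, 14c-labeled, (alpha-d)-isomer; "
    "Allose, (l)-isomer; Gulose, (d)-isomer; Idose, (alpha-d)-isomer; "
    "Gulose, (l)-isomer; Allose, (d)-isomer; Polymannose"
)
synonyms = [s.strip() for s in raw.split(";") if s.strip()]
conflicts = stereo_conflicts(synonyms)
print(conflicts)
```

A non-empty result only says that a single record carries names for both enantiomers. It does not say which one the drawn structure is supposed to be, and that is the curation question.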

This list is fairly long, but the one for adipic acid is much longer, as can be seen below. Read BELOW the list for more comments... and yes, I know it’s a long list.

2) “Adipic acid”

Synonyms: Sodium hydrogen adipate-adipic acid; Camin ap; Piperazine adipinate; Piperazine adipate; Amphetamine adipate; Adipic acid; Adipate-adipic acid; Poly(propylene adipate)
CAS Numbers: 124-04-9; 142-88-1; 22322-28-7; 25666-61-9; 3385-41-9; 40975-75-5; 51137-10-1; 5683-79-4; 7486-38-6; 134886-82-1; 68258-78-6; 67905-77-5; 137315-44-7; 40959-29-3; 41366-44-3; 59518-84-2; 60865-39-6; 61630-89-5; 62271-81-2; 62548-83-8; 63410-51-5; 66787-20-0; 67859-52-3; 25101-03-5; 41366-45-4; 42767-90-8; 50981-28-7; 55157-42-1; 52496-38-5; 53123-38-9; 53989-20-1; 54335-11-4; 54688-53-8; 55012-14-1; 127195-46-4; 161865-24-3; 162281-11-0; 55398-96-4; 56551-71-4; 51178-67-7; 73817-40-0; 9046-11-1; 68411-87-0; 68368-50-3; 67989-20-2; 67953-53-1; 60608-99-3; 141490-13-3; 221695-34-7; 60961-73-1; 92680-63-2; 180324-85-0; 61256-56-2; 61680-38-4; 61840-27-5; 62118-43-8; 62197-02-8; 62942-09-0; 64365-96-4; 65970-51-6; 29534-39-2; 66525-94-8; 67784-94-5; 67939-68-8; 67989-19-9; 19147-16-1; 68937-27-9; 56509-15-0; 74350-54-2; 52627-55-1; 68894-40-6; 73018-29-8; 21697-94-9; 93203-03-3; 15511-81-6; 16031-83-7; 19628-28-5; 19628-29-6; 23311-84-4; 7486-39-7; 24938-37-2; 89468-83-7; 105866-32-8; 25103-87-1; 73816-44-1; 31699-72-6; 31699-74-8; 13425-34-8; 764-65-8; 160886-56-6; 3323-53-3; 94289-34-6; 93505-75-0; 60368-40-3; 41222-49-5; 42133-47-1; 42603-22-5; 49792-84-9; 64927-24-8; 50327-77-0; 51601-35-5; 51912-17-5; 52235-79-7; 52349-42-5; 7486-40-0; 19584-53-3; 40798-45-6; 40989-36-4; 52656-15-2; 52738-38-2; 53351-10-3; 55231-26-0; 58891-19-3; 55447-58-0; 56816-51-4; 73018-30-1; 68238-77-7; 72270-78-1; 159309-70-3; 64091-34-5; 233661-81-9; 55636-50-5; 56266-32-1; 35919-04-1; 67989-13-3; 68389-68-4; 68527-44-6; 68583-79-9; 68583-87-9; 69011-30-9; 69331-29-9; 70775-82-5; 94167-19-8; 26702-48-7; 26780-60-9; 26876-10-8; 103439-11-8; 27925-07-1; 29403-67-6; 29408-39-7; 30110-00-0; 87397-36-2; 179809-38-2; 30376-45-5; 32505-78-5; 163205-75-2; 195889-46-4; 24937-93-7; 86438-03-1; 63149-70-2; 31698-46-1; 112651-27-1; 25931-01-5; 37129-62-7; 39389-41-8; 53302-95-7; 119471-35-1; 126879-92-3; 26375-23-5; 50938-99-3; 65916-90-7; 76199-80-9; 79230-10-7; 103842-92-8; 
106097-11-4; 118817-20-2; 26570-73-0; 66167-60-0; 72993-61-4; 73070-76-5; 97621-66-4; 5423-61-0; 9011-80-7; 167856-49-7; 9017-08-7; 11116-57-7; 151486-17-8; 37324-51-9; 51281-06-2; 53241-28-4; 64927-22-6; 64972-64-1; 9019-92-5; 9052-53-3; 42610-80-0; 9019-93-6; 9019-94-7; 52213-54-4; 9063-78-9; 39277-72-0; 9068-94-4; 9068-96-6; 37277-51-3; 85138-64-3; 9044-95-5; 9080-04-0; 18621-94-8; 19090-60-9; 100359-19-1; 11139-74-5; 149984-55-4; 157971-18-1; 178252-44-3; 24993-04-2; 27030-82-6; 51248-46-5; 51555-86-3; 71119-52-3; 72506-60-6; 9049-00-7; 9049-01-8; 25212-06-0; 511272-89-2; 25212-19-5; 37280-34-5; 51329-77-2; 52932-31-7; 72246-34-5; 74504-46-4; 9036-70-8; 110737-13-8; 25214-14-6; 82785-46-4; 156014-73-2; 25214-18-0; 25748-37-2; 333388-26-4; 25950-35-0; 26140-99-8; 26141-00-4; 26523-14-8; 9036-87-7; 32732-51-7; 33338-25-9; 34012-85-6; 35164-40-0; 36089-13-1; 37310-98-8; 52504-11-7; 66593-97-3; 39281-13-5; 55231-08-8; 28132-94-7; 83890-02-2; 202974-01-4; 28209-35-0; 28301-90-8; 63623-33-6; 63623-34-7; 247906-35-0; 28407-73-0; 30525-45-2; 30580-35-9; 31587-43-6; 76649-35-9; 76649-45-1; 58481-42-8; 63549-52-0; 85169-08-0; 68298-57-7; 68212-31-7; 68140-61-4; 68133-07-3; 68855-39-0; 68954-46-1; 68956-51-4; 68937-26-8; 68989-90-2; 65916-86-1; 27083-55-2; 53526-58-2; 64873-15-0; 76199-81-0; 79921-25-8; 9087-79-0; 32238-28-1; 133544-04-4; 37208-77-8; 68212-32-8; 85646-06-6; 9017-16-7; 97649-50-8; 150747-01-6; 25053-13-8; 73561-43-0; 74083-22-0; 12619-99-7; 12688-24-3; 25191-90-6; 27082-56-0; 37228-90-3; 37275-97-1; 39470-93-4; 51258-14-1; 52228-27-0; 52350-25-1; 52932-19-1; 55777-57-6; 62253-12-7; 70213-58-0; 9038-28-2; 9048-01-5; 9050-55-9; 25464-21-5; 25950-34-9; 26282-28-0; 127004-49-3; 26777-62-8; 39385-68-7; 51822-29-8; 60605-01-8; 61673-82-3; 65357-52-0; 68859-52-9; 83712-77-0; 102561-56-8; 26936-72-1; 27417-33-0; 28430-17-3; 212271-21-1; 28472-89-1; 28477-54-5; 29295-79-2; 30662-91-0; 31048-26-7; 152103-09-8; 31075-20-4; 34313-71-8; 34557-94-3; 67892-88-0; 
35561-07-0; 38702-16-8; 38702-18-0; 40471-09-8; 74748-99-5; 50821-59-5; 112310-22-2; 51293-82-4; 85441-42-5; 51365-12-9; 83740-03-8; 52004-58-7; 52247-59-3; 139989-39-2; 52270-22-1; 53184-55-7
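One check that can be run on a list like this without knowing what the record is meant to be is the CAS check digit. The final digit of a CAS number is the sum of the preceding digits, weighted 1, 2, 3, ... from right to left, taken modulo 10. The sketch below is a minimal Python version; the function names are my own, and the last entry in the sample is deliberately corrupted to show a failure. It only catches malformed or mistyped numbers and says nothing about whether a valid number belongs with the structure shown.

```python
import re

# Two to seven digits, two digits, then a single check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_check_digit_ok(cas: str) -> bool:
    """Return True if `cas` is well formed and its check digit matches."""
    match = CAS_PATTERN.match(cas.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    expected = int(match.group(3))
    # Weight the body digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == expected


def split_cas_list(raw: str) -> list[str]:
    """Split a semicolon-separated CAS list, dropping empty entries."""
    return [item.strip() for item in raw.split(";") if item.strip()]


# Three numbers from the records above plus one deliberately
# corrupted entry (124-04-8) that is NOT in the original list.
sample = "124-04-9; 142-88-1; 9004-32-4; 124-04-8"
for cas in split_cas_list(sample):
    print(cas, "ok" if cas_check_digit_ok(cas) else "FAILS check digit")
```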

 

The “adipate” appears to be the counterion in some cases (amphetamine adipate) and, based on the definition in Wikipedia, comes along for the ride in most cases as a coating. It is not a drug per se. I am not sure whether it should be removed from the list for screening: I can see the value in screening such a common chemical, but I don’t know whether it would be an ME based on the definition in the paper: “the term drug refers to a molecular entity (ME) that interacts with one or more molecular targets and effects a change in biological state.”

“Adipic acid has been incorporated into controlled-release formulation matrix tablets to obtain pH-independent release for both weakly basic and weakly acidic drugs. It has also been incorporated into the polymeric coating of hydrophilic monolithic systems to modulate the intragel pH, resulting in zero-order release of a hydrophilic drug. The disintegration at intestinal pH of the enteric polymer shellac has been reported to improve when adipic acid was used as a pore-forming agent without affecting release in the acidic media. Other controlled-release formulations have included adipic acid with the intention of obtaining a late-burst release profile”

The NPC Browser result for adipic acid. Notice the label (no sodium is associated with the compound shown); this is due to the challenges of aggregating synonyms.

The NCGC have taken on an enormous challenge in aggregating these data and will need our feedback on the system moving forward, as they intend to improve it based on crowdsourced feedback. Please don’t let them down. If you see an issue in the data, use the feedback box!

 

 

About tony

Antony (Tony) J. Williams received his BSc in 1985 from the University of Liverpool (UK) and PhD in 1988 from the University of London (UK). His PhD research interests were in studying the effects of high pressure on molecular motions within lubricant-related systems using Nuclear Magnetic Resonance. He moved to Ottawa, Canada to work for the National Research Council performing fundamental research on the electron paramagnetic resonance of radicals trapped in single crystals. Following his postdoctoral position he became the NMR Facility Manager for Ottawa University. Tony joined the Eastman Kodak Company in Rochester, New York as their NMR Technology Leader. He led the laboratory to develop quality control across multiple spectroscopy labs and helped establish walk-up laboratories providing NMR, LC-MS and other forms of spectroscopy to hundreds of chemists across multiple sites. This included the delivery of spectroscopic data to the desktop, automated processing and his initial interests in computer-assisted structure elucidation (CASE) systems. He also worked with a team to develop the world’s first web-based LIMS system, WIMS, capable of allowing chemical structure searching and spectral display. With his developing cheminformatic skills and passion for data management he left corporate America to join a small start-up company working out of Toronto, Canada. He joined ACD/Labs as their NMR Product Manager and held various roles, including Chief Science Officer, during his 10 years with the company. His responsibilities included managing over 50 products at one time prior to developing a product management team, managing sales, marketing, technical support and technical services. ACD/Labs was one of Canada’s Fast 50 Tech Companies, and Forbes Fast 500 companies in 2001. His primary passions during his tenure with ACD/Labs were the continued adoption of web-based technologies and developing automated structure verification and elucidation platforms.
While at ACD/Labs he suggested the possibility of developing a public resource for chemists attempting to integrate internet available chemical data. He finally pursued this vision with some close friends as a hobby project in the evenings and the result was the ChemSpider database (www.chemspider.com). Even while running out of a basement on hand built servers the website developed a large community following that eventually culminated in the acquisition of the website by the Royal Society of Chemistry (RSC) based in Cambridge, United Kingdom. Tony joined the organization, together with some of the other ChemSpider team, and became their Vice President of Strategic Development. At RSC he continued to develop cheminformatics tools, specifically ChemSpider, and was the technical lead for the chemistry aspects of the Open PHACTS project (http://www.openphacts.org), a project focused on the delivery of open data, open source and open systems to support the pharmaceutical sciences. He was also the technical lead for the UK National Chemical Database Service (http://cds.rsc.org/) and the RSC lead for the PharmaSea project (http://www.pharma-sea.eu/) attempting to identify novel natural products from the ocean. He left RSC in 2015 to become a Computational Chemist in the National Center of Computational Toxicology at the Environmental Protection Agency where he is bringing his skills to bear working with a team on the delivery of a new software architecture for the management and delivery of data, algorithms and visualization tools. The “Chemistry Dashboard” was released on April 1st, no fooling, at https://comptox.epa.gov, and provides access to over 700,000 chemicals, experimental and predicted properties and a developing link network to support the environmental sciences. Tony remains passionate about computer-assisted structure elucidation and verification approaches and continues to publish in this area. 
He is also passionate about teaching scientists to benefit from the developing array of social networking tools for scientists and is known as the ChemConnector on the networks. Over the years he has had adjunct roles at a number of institutions and presently enjoys working with scientists at both UNC Chapel Hill and NC State University. He is widely published with over 200 papers and book chapters and was the recipient of the Jim Gray Award for eScience in 2012. In 2016 he was awarded the North Carolina ACS Distinguished Speaker Award.


 