Category Archives: Data Quality

Open innovation and chemistry data management contributions from the Royal Society of Chemistry resulting from the Open PHACTS project at #ACSsanfran

This is my presentation on Thursday 14th August at the ACS Meeting in San Francisco

Open innovation and chemistry data management contributions from the Royal Society of Chemistry resulting from the Open PHACTS project

The Royal Society of Chemistry was pleased to contribute to the Open PHACTS project, a 3 year project funded by the Innovative Medicines Initiative fund from the European Union. For three years we developed our existing platforms, created new and innovative widgets and data platforms to handle chemistry data, extended existing chemistry ontologies and embraced the semantic web open standards. As a result RSC served as the centralized chemistry data hub for the project. With the conclusion of the Open PHACTS project we will report on our experiences resulting from our participation in the project and provide an overview of what tools, capabilities and data have been released into the community as a result of our participation and how this may influence future projects. This will include the Open PHACTS open chemistry data dump including the chemistry related data in chemistry and semantic web consumable formats as well as some of the resulting chemistry software released to the community. The Open PHACTS project resulted in significant contributions to the chemistry community as well as the supporting pharmaceutical companies and biomedical community.


Current Initiatives in Developing Research Data Repositories at the Royal Society of Chemistry

I presented at the Food and Drug Administration today regarding some of our efforts to develop a research data repository for the community. The abstract and the presentation from Slideshare are below.

Current Initiatives in Developing Research Data Repositories at the Royal Society of Chemistry

Access to scientific information has changed in a manner that was likely never even imagined by the early pioneers of the internet. The quantities of data, the array of tools available to search and analyze, the devices and the shift in community participation continue to expand while the pace of change does not appear to be slowing. RSC hosts a number of chemistry data resources for the community including ChemSpider, one of the community’s primary online public compound databases. Containing tens of millions of chemical compounds and their associated data, ChemSpider serves data to tens of thousands of chemists every day and it serves as the foundation for many important international projects to integrate chemistry and biology data, facilitate drug discovery efforts and help to identify new chemicals from under the ocean. This presentation will provide an overview of the expanding reach of this cheminformatics platform and the nature of the solutions that it helps to enable, including structure validation, text mining and semantic markup. ChemSpider is limited in scope as a chemical compound database and we are presently architecting the RSC Data Repository, a platform that will enable us to extend our reach to include chemical reactions, analytical data, and diverse data depositions from chemists across various domains. We will also discuss the possibilities it offers in terms of supporting data modeling and sharing. The future of scientific information and communication will be underpinned by these efforts, influenced by increasing participation from the scientific community.


The importance of standards for data exchange and interchange on the Royal Society of Chemistry eScience platforms

This is my seventh and LAST talk at the ACS Meeting in Indianapolis:

The importance of standards for data exchange and interchange on the Royal Society of Chemistry eScience platforms

The Royal Society of Chemistry provides access to a number of databases hosting chemical data, reactions, spectroscopy data and prediction services. These databases and services can be accessed via web services, with queries expressed in standard data formats such as InChI and molfiles. Data can then be downloaded in standard structure and spectral formats, allowing for reuse and repurposing. The ChemSpider database is integrated with a number of projects external to RSC, including Open PHACTS, which integrates chemical and biological data. This project utilizes semantic web data standards including RDF. This presentation will provide an overview of how structure and spectral data standards have been critical in allowing us to integrate many open source tools, in easing integration with a myriad of services, and in underpinning many of our future developments.


Digitizing documents to provide a public spectroscopy database

This is my sixth presentation at the ACS Fall Meeting in Indianapolis:

Digitizing documents to provide a public spectroscopy database

RSC hosts a number of platforms providing free access to chemistry related data. The content includes chemical compounds and associated experimental and predicted data, chemical reactions and, increasingly, spectral data. The ChemSpider database primarily contains electronic spectral data generated at the instrument, converted into standard formats such as JCAMP, then uploaded for the community to access. As a publisher RSC holds a rich source of spectral data within our scientific publications and associated electronic supplementary information. We have undertaken a project to Digitally Enable the RSC Archive (DERA) and as part of this project are converting figures of spectral data into standard spectral data formats for storage in our ChemSpider database. This presentation will report on our progress in the project and some of the challenges we have faced to date.




Accessing chemical health and safety data online using Royal Society of Chemistry resources

This is the second presentation I gave at the ACS Meeting in Indianapolis

Accessing chemical health and safety data online using Royal Society of Chemistry resources

The internet has opened up access to large amounts of chemistry related data that can be harvested and assembled into rich resources of value to chemists. The Royal Society of Chemistry’s ChemSpider database has assembled an electronic collection of over 28 million chemicals from over 400 data sources and some of the assembled data is certainly of value to those searching for chemical health and safety information. Since ChemSpider is a text and structure searchable database chemists are able to find relevant information using both of their general search approaches. This presentation will provide an overview of the types of chemical health and safety data and information made available via ChemSpider and discuss how the data are sourced, aggregated and validated. We will examine how the data can be made available via mobile devices and examine the issue of data quality and its potential impacts on such a database.



What data do we trust now in the world of high-throughput screening and public compound databases

Let’s face it, the world of experimentation is fun, rewarding, challenging and depressing. Ok, that has been MY experience of the world of lab-based experimentation. I have made many discoveries and celebrated the true joy of being a lab-rat. Love it…always did. I remain polarized to this day by the number of hours I spent around large NMR magnets. No bias, but still polarized. But lab work is also challenging…sometimes not in a good way. Hours of “experiences”…read that as wasted time because of bad preparation on my part, or on a collaborator’s part, or bad chemicals, poorly calibrated equipment, the “person who came before me” scenario etc. Then there is the truly depressing side of some of my lab experience. Repeating work that someone else in my lab had done, because without a LIMS I had no way to know that; colleagues not checking materials shipped to them at a crucial stage of a synthesis and finding out that what was ordered was not in the bottle (still their fault for not checking!); NMR solvents being really wet and causing nasty side effects on the compound; and, in my life…two magnet quenches in one day…a 500 MHz and a 300 MHz. I shrugged and went home…

Some of my lab experiences were depressing but then I moved into cheminformatics. And in the past few years I have been depressed by the sad state of our public compound databases and the quality of data online. I have given dozens of presentations on the matter of data quality and these two blog posts are representative. We’ve also published on the issues of chemical compounds in the public databases and their correctness.

A Quality Alert and Call for Improved Curation of Public Chemistry Databases, A.J. Williams and S. Ekins, Drug Discovery Today, Link

Towards a gold standard: regarding quality in public domain chemistry databases and approaches to improving the situation, A.J. Williams, S. Ekins, V. Tkachenko, Drug Discovery Today, 5, 2012, Link

This work was always focused on chemical compound structure representations and their matches with synonyms, names etc. Were they what their names said they should be? That was the common question. After a couple of years of working on this, and publishing with Sean Ekins, we wondered about the data quality of the measured experimental data, especially in the public domain assay screening databases, PubChem of course being the granddaddy of them all. While work could be done to confirm name-structure relationships in PubChem, the experimental data is what it is, as submitted. How do you check the quality of measured experimental data – reproducibility, comparison between labs etc.? Not easy.

When the opportunity came to investigate the possibilities of errors in experimental data we didn’t quite expect the results we obtained. Rather than explain the work in detail I encourage you to read the paper, Open Access on PLOS One and available here. The article, entitled “Dispensing Processes Impact Apparent Biological Activity as Determined by Computational and Statistical Analyses” can be summarized as follows:

* Results from serial dilution and dispensing using pipette tips versus acoustic dispensing with direct dilution can differ by orders of magnitude with no correlation

* The resulting computational 3D pharmacophores generated from data from both acoustic and tip-based transfer differ significantly

* Traditional dispensing processes are another important source of error in high-throughput screening that impacts computational and statistical analyses.
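To make "differ by orders of magnitude with no correlation" concrete, here is a minimal sketch of how paired measurements from the two dispensing methods could be compared. The IC50 values are hypothetical numbers invented for illustration and are NOT the data from the PLOS One paper; the function names are also just illustrative. The sketch reports the log10 ratio for each compound and a Pearson correlation on the log-transformed values.

```python
import math

# Hypothetical IC50 values (micromolar) for the same six compounds, measured
# after tip-based serial dilution and after acoustic dispensing.
# Illustrative only: these are NOT the values reported in the paper.
tip_ic50 = [0.553, 0.12, 14.4, 0.049, 1.2, 0.78]
acoustic_ic50 = [0.002, 0.064, 0.003, 0.004, 0.5, 0.011]


def log_fold_differences(a, b):
    """Return log10(a_i / b_i) for each pair of measurements."""
    return [math.log10(x / y) for x, y in zip(a, b)]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


fold = log_fold_differences(tip_ic50, acoustic_ic50)
# Correlate on a log scale, since IC50 values span several orders of magnitude.
r = pearson_r([math.log10(v) for v in tip_ic50],
              [math.log10(v) for v in acoustic_ic50])

for i, f in enumerate(fold, start=1):
    print(f"compound {i}: tip/acoustic IC50 ratio = 10^{f:.2f}")
print(f"Pearson r on log10(IC50): {r:.2f}")
```

A log ratio of 2 or more means the two methods disagree by at least a hundredfold for that compound, and an r close to zero means one set of values cannot be used to predict the other.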

Derek Lowe on the “In the Pipeline” blog made some strong comments in his post about the paper. He called it a “truly disturbing paper” and said “…people who’ve actually done a lot of biological assays may well feel a chill at the thought, because this is just the sort of you’re-kidding variable that can make a big difference.” And he’s right. There is cause for concern. First of all, we don’t know enough yet from this very small study to understand what classes of compounds are going to exhibit this pipette vs. acoustic discrepancy. Secondly, there is no metadata associated with the assay data itself (that we are aware of) that captures the distinction in the dispensing process, and this paper SHOULD encourage screeners to include this info in their data.

The differences between tip and acoustic dispensing are of course only one of many issues that can accompany data measurements for compounds. Other obvious issues include the purity of what’s being screened (is it one component or many? is an impurity showing the response?) and, in terms of modeling, whether the compound being screened matches the suggested compound that was purchased/synthesized. Classify this as analytical data required prior to screening. Then there are reproducibility and replicates, assay performance, decomposition in storage, etc. Check out the comments on Derek’s blog in response to his post: clearly the screening community understands many of the challenges and has to deal with them.

Once upon a time someone from pharma made a couple of comments that I found very interesting: 1) it likely costs more to store the screening data long term and support the informatics systems than it does to regenerate the data with new and improved assays on an ongoing basis; 2) as assay performance is understood, and assuming that materials are available, it is likely appropriate to flush any data older than three years and remeasure. Certainly with this observation of pipette vs. acoustic bias, data measured with tips may need to get flushed and remeasured with acoustic dispensing methods.

This work describes the observed differences between tips and acoustic methods and improved pharmacophore correlations. It highlights issues that likely exist in the data sitting in the assay screening databases (compounded with chemistry issues) and brings into focus the question of what can be trusted in the data. For sure not all the data is bad but how to separate good from bad and what of the models that can be derived? As Derek summarized in his blog post “How many other datasets are hosed up because of this effect? Now there’s an important question, and one that we’re not going to have an answer for any time soon.” And it’s depressing to think about how many data sets might be hosed….

There is an entire back story to this publication also…that is the challenges that we had getting the work published and the multiple rejections we had in the process. But Sean has told that story in detail here. There’s also the story about the press release …and how editorial control extended from the paper itself to the press release (described here), a situation that I found inappropriate, over-reaching and simply not right. But it happened anyways…..

So…data quality is an issue. It is confusing, hard to tease out and identify for all its complexities. But it’s science, it’s incremental learning and it’s trial by fire. And we have to wonder how many projects might have been burned simply by the dispensing processes.





Navigating scientific resources using wiki based resources

Presentation given at ACS New Orleans Spring Meeting

There is an overwhelming number of new resources for chemistry that would likely benefit both librarians and students in terms of improving access to data and information. While commercial solutions provided by an institution may be the primary resources there is now an enormous range of online tools, databases, resources, apps for mobile devices and, increasingly, wikis. This presentation will provide an overview of how wiki-based resources for scientists are developing and will introduce a number of developing wikis. These include wikis that are being used to teach chemistry to students as well as to source information about scientists, scientific databases and mobile apps.


Mining public domain data as a basis for drug repurposing #ACSPhilly

Second talk delivered today at ACS Philadelphia…

Mining public domain data as a basis for drug repurposing

Online databases containing high throughput screening and other property data continue to proliferate in number. Many pharmaceutical chemists will have used databases such as PubChem, ChemSpider, DrugBank, BindingDB and many others. This work will report on the potential value of these databases for providing data to be used to repurpose drugs using cheminformatics-based approaches (e.g. docking, ligand-based machine learning methods). This work will also discuss the potentially related applications of the Open PHACTS project, a European Union Innovative Medicines Initiative project, that is utilizing semantic web based approaches to integrate large scale chemical and biological data in new ways. We will report on how compound and data quality should be taken into account when utilizing data from online databases and how their careful curation can provide high quality data that can be used to underpin the delivery of molecular models that can in turn identify new uses for old drugs.



Continuing the investigations of Fluvastatin

When I started to investigate Fluvastatin as an example of what’s present in different databases I thought it would be quite easy. Alas…not so…

The investigation got me as far as distinguishing different stereoforms in different databases and then asking the question is fluvastatin supposed to be the ACTIVE form of the compound or the trade name of the compound? i.e. Should the name apply only to the active stereoform or to what is shipped in the bottle? The answer appears to be… “yes”. It is both.

The World Health Organization (WHO) lists in their International Nonproprietary Name for fluvastatin the form as shown in the image below.

The representation of Fluvastatin according to its INN

So according to the WHO fluvastatin is an enantiomeric pair, so the (3R,5S,6E) and (3S,5R,6E) pair. This coincides with the marketed drug form as on DailyMed here. This does not agree with what is listed on Wikipedia, which shows the “active” stereoisomer.

So, I think the answer is “yes” it is both the marketed drug form as well as the active form. That makes retrieval of EXPLICIT answers from database searches difficult. What would you expect as a user if you searched on Fluvastatin? Both answers to show up?


Posted by on July 30, 2012 in Data Quality


Will the correct structure of Fluvastatin please stand up

Eventually there will be simple answers to the question commonly asked by chemists: “What is the chemical structure of INSERT NAME?” This is going to be true for drugs as the various online databases work together to clean up, curate, qualify and declare what a chemical structure is for a particular drug. We can have the purist’s argument about structure drawings not representing reality, for example that compounds are atoms bonded together by shared clouds of electrons that at any point in time may be changing, reorganizing, tautomerizing etc. The reality is also that we need a common language for information exchange, and in the world of visual depictions for chemistry the layout in a 2D structure diagram is it. As we come together as a community to agree on preferred ways to standardize chemicals to assist in representations in databases, for example, this situation will improve. The efforts of the FDA to define structure representation standards, with the support of pharma, will contribute. For now we are left with the challenges of different representations in different databases as well as simply the quality of data being fed into these databases. These are some of the issues we are trying to resolve as we build Open PHACTS. We are trying to link data from various resources, noting and resolving conflicts when we can, and curating as necessary, with the ultimate intention that this information will flow out into the community and be picked up by the database hosts and addressed, fixed or challenged as appropriate.

I’ve been looking for a new example showing the challenges of data integration considering that in Open PHACTS at present we are integrating chemistry from three primary data sets (for now)… DrugBank, ChEBI, ChEMBL. So, let’s consider Fluvastatin. Determining the “correct” chemical structure representation for the compound is, as usual, an iterative loop, but let’s see what we can find in our datasets as we iterate. I KNOW from 4 years of looking at chemistry on Wikipedia that the data quality for chemical compound representations is very good. So, starting there we find the Wikipedia record here. The DrugBox links to a number of records in other databases.

One of these is ChemSpider and it has the SAME representation. On ChEBI the representation is inconsistent with no defined stereochemistry (except the E- double bond). Since ChEBI is manually curated and the compound carries 3 stars this should be correct. There are two records LINKED from this ChEBI record.

rel-(3R,5R)-fluvastatin (CHEBI:38566)

rel-(3R,5S)-fluvastatin (CHEBI:38561)

On DrugBank the compound has INVERTED stereochemistry from that on ChemSpider and Wikipedia… WP and ChemSpider have 3R,5S while DrugBank has 3S,5R, but it DOES say in the pharmacodynamics section “It is prepared as a racemate of two erythro enantiomers of which the 3R,5S enantiomer exerts the pharmacologic effect.” confirming that the 3R,5S form is the ACTIVE form.

ChEMBL matches Wikipedia and ChemSpider here.

So, to summarize what we get when we search for Fluvastatin

Stereo 3R,5S for Wikipedia, ChemSpider, ChEMBL

Stereo 3S,5R for DrugBank

No stereo for ChEBI
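The summary above can be turned into a small conflict check. This is only a sketch under stated assumptions: the descriptors are typed in by hand from the list above rather than pulled from any database, the function names are hypothetical, and the check relies on the simple rule that inverting every R/S label gives the enantiomer (which holds for ordinary stereocentres like the two here).

```python
# Stereo descriptors for "fluvastatin" as reported in the summary above,
# entered by hand. None means the record has no defined stereocentres.
records = {
    "Wikipedia": ("3R", "5S"),
    "ChemSpider": ("3R", "5S"),
    "ChEMBL": ("3R", "5S"),
    "DrugBank": ("3S", "5R"),
    "ChEBI": None,
}


def mirror(descriptors):
    """Invert every R/S label, giving the descriptors of the enantiomer."""
    swap = {"R": "S", "S": "R"}
    return tuple(d[:-1] + swap[d[-1]] for d in descriptors)


def classify(reference, other):
    """Describe how a record's stereo relates to the reference record."""
    if other is None:
        return "no stereo defined"
    if other == reference:
        return "same stereoisomer"
    if other == mirror(reference):
        return "enantiomer"
    return "other stereoisomer"


# Use the Wikipedia record as the reference, as in the walk-through above.
reference = records["Wikipedia"]
report = {db: classify(reference, desc) for db, desc in records.items()}

for db, verdict in report.items():
    print(f"{db:<10} {verdict}")
```

A check like this flags DrugBank as the mirror image of the Wikipedia record and ChEBI as undefined, which is the kind of conflict that then needs a human decision about which form the name should point to.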

Welcome to the complexities of name-structure relationships. These are some of the challenges we need to deal with on Open PHACTS. DailyMed describes fluvastatin sodium as “Fluvastatin sodium is sodium (±)-(3R*, 5S*, 6E)-7-[3-(p-fluorophenyl)-1-isopropylindol-2-yl]-3,5-dihydroxy-6-heptanoate” so the relative form….




Posted by on July 29, 2012 in Data Quality, Quality and Content