What data do we trust now in the world of high-throughput screening and public compound databases

03 May

Let’s face it, the world of experimentation is fun, rewarding, challenging and depressing. OK, that has been MY experience of the world of lab-based experimentation. I have made many discoveries and celebrated the true joy of being a lab-rat. Love it…always did. I remain polarized to this day by the number of hours I spent around large NMR magnets. No bias, but still polarized. But lab work is also challenging, sometimes not in a good way. Hours of “experiences”…read that as wasted time because of bad preparation on my part, or on a collaborator’s part, or bad chemicals, poorly calibrated equipment, the “person who came before me” scenario etc. Then there is the truly depressing side of some of my lab experience: repeating work that someone else in my lab had already done, because the lack of a LIMS meant I had no way of knowing; colleagues not checking materials shipped to them at a crucial stage of a synthesis and finding out that what was ordered was not in the bottle (still their fault for not checking!); NMR solvents being really wet and causing nasty side effects on the compound; and, in my life…two magnet quenches in one day…a 500 MHz and a 300 MHz. I shrugged and went home…

Some of my lab experiences were depressing but then I moved into cheminformatics. And in the past few years I have been depressed by the sad state of our public compound databases and the quality of data online. I have given dozens of presentations on the matter of data quality and these two blog posts are representative. We’ve also published on the issues of chemical compounds in the public databases and their correctness.

A Quality Alert and Call for Improved Curation of Public Chemistry Databases, A.J. Williams and S. Ekins, Drug Discovery Today

Towards a gold standard: regarding quality in public domain chemistry databases and approaches to improving the situation, A.J. Williams, S. Ekins, V. Tkachenko, Drug Discovery Today, 5, 2012

This work was always focused on chemical compound structure representations and their matches with synonyms, names etc. The common question was whether the structures were what their names said they should be. After a couple of years of working on this, and publishing with Sean Ekins, we wondered about the data quality of the measured experimental data, especially in the public domain assay screening databases, PubChem of course being the granddaddy of them all. While work could be done to confirm name-structure relationships in PubChem, the experimental data is what it is, as submitted. How do you check the quality of measured experimental data: reproducibility, comparison between labs etc.? Not easy.

When the opportunity came to investigate the possibilities of errors in experimental data we didn’t quite expect the results we obtained. Rather than explain the work in detail I encourage you to read the paper, Open Access on PLOS One and available here. The article, entitled “Dispensing Processes Impact Apparent Biological Activity as Determined by Computational and Statistical Analyses” can be summarized as follows:

* Serial dilution and dispensing using pipette tips versus acoustic dispensing with direct dilution can differ by orders of magnitude with no correlation

* The resulting computational 3D pharmacophores generated from data from both acoustic and tip-based transfer differ significantly

* Traditional dispensing processes are another important source of error in high-throughput screening that impacts computational and statistical analyses.
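To make the first point concrete, here is a minimal sketch of how one might compare paired activity values measured for the same compounds with the two dispensing methods. It is not the analysis from the paper: the function names are my own, the IC50 numbers are made up purely for illustration, and the only assumption is that you have one tip-based and one acoustic value per compound in the same units.

```python
import math


def log_fold_differences(tip_ic50, acoustic_ic50):
    """Return log10(tip / acoustic) for each paired measurement.

    A value of 2.0 means the tip-based IC50 is 100x higher than the
    acoustic one, i.e. two orders of magnitude apart.
    """
    if len(tip_ic50) != len(acoustic_ic50):
        raise ValueError("paired lists must have the same length")
    return [math.log10(t / a) for t, a in zip(tip_ic50, acoustic_ic50)]


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def compare_dispensing(tip_ic50, acoustic_ic50):
    """Summarize how far apart, and how correlated, the two series are."""
    diffs = log_fold_differences(tip_ic50, acoustic_ic50)
    log_tip = [math.log10(t) for t in tip_ic50]
    log_acoustic = [math.log10(a) for a in acoustic_ic50]
    return {
        "max_abs_log_fold": max(abs(d) for d in diffs),
        "pearson_r_log": pearson_r(log_tip, log_acoustic),
    }


# Hypothetical IC50 values in micromolar, NOT data from the paper.
tip = [0.5, 1.2, 10.0, 0.08, 3.0]
acoustic = [0.004, 0.9, 0.02, 0.05, 0.6]
summary = compare_dispensing(tip, acoustic)
print(summary)
```

A large `max_abs_log_fold` together with a `pearson_r_log` near zero is the pattern the bullet describes: big disagreements that are not a simple systematic offset you could correct for.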

Derek Lowe on the “In the Pipeline” blog made some strong comments in his post about the paper. He called it a “truly disturbing paper” and said “…people who’ve actually done a lot of biological assays may well feel a chill at the thought, because this is just the sort of you’re-kidding variable that can make a big difference.” And he’s right. There is cause for concern. First of all, we don’t know enough yet from this very small study to understand what classes of compounds are going to exhibit this pipette vs. acoustic discrepancy. Secondly, there is no metadata associated with the assay data itself (that we are aware of) that captures the distinction in the dispensing process, and this paper SHOULD encourage screeners to include this information in their data.
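As a rough illustration of what capturing the dispensing process as metadata could look like, here is a small sketch. Everything in it is hypothetical: the `AssayResult` record, its field names and the `DispensingMethod` values are my own inventions and do not correspond to any schema used by PubChem or any other database.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class DispensingMethod(Enum):
    """Hypothetical vocabulary for how the compound reached the assay plate."""
    TIP_BASED = "tip_based_serial_dilution"
    ACOUSTIC = "acoustic_direct_dilution"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class AssayResult:
    """One measured activity value plus the dispensing metadata."""
    compound_id: str
    assay_id: str
    ic50_um: float
    dispensing: DispensingMethod = DispensingMethod.UNKNOWN

    def to_json(self):
        # Store the enum as its string value so the record is plain JSON.
        record = asdict(self)
        record["dispensing"] = self.dispensing.value
        return json.dumps(record, sort_keys=True)


def directly_comparable(a, b):
    """Treat two results as comparable only if both used the same, known method."""
    return (
        a.dispensing is not DispensingMethod.UNKNOWN
        and a.dispensing == b.dispensing
    )


# Made-up identifiers and values, for illustration only.
tip_result = AssayResult("CMPD-1", "ASSAY-A", 0.5, DispensingMethod.TIP_BASED)
acoustic_result = AssayResult("CMPD-1", "ASSAY-A", 0.004, DispensingMethod.ACOUSTIC)
legacy_result = AssayResult("CMPD-1", "ASSAY-A", 0.3)

print(acoustic_result.to_json())
print(directly_comparable(tip_result, acoustic_result))
```

Defaulting to `UNKNOWN` rather than guessing is deliberate: most of the data already sitting in the databases would land there, and a model builder could then choose to exclude or down-weight it.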

The differences between tip and acoustic dispensing are of course only one of many issues that can accompany data measurements for compounds. Other obvious issues include the purity of what’s being screened. Is it one component or many? Is an impurity showing the response? And, in terms of modeling, does the compound being screened match the suggested compound that was purchased/synthesized? Classify this as analytical data required prior to screening. Then there are reproducibility and replicates, assay performance, decomposition in storage, etc. Check out the comments on Derek’s blog in response to his post; clearly the screening community understands many of the challenges and has to deal with them.

Once upon a time someone from pharma made a couple of comments that I found very interesting: 1) it likely costs more to store the screening data long term and support the informatics systems than it does to regenerate the data with new and improved assays on an ongoing basis; 2) as assay performance is understood, and assuming that materials are available, it is likely appropriate to flush any data older than three years and remeasure. Certainly, with this observation of pipette vs. acoustic bias, data measured with tips may need to get flushed and remeasured with acoustic dispensing methods.

This work describes the observed differences between tips and acoustic methods and the improved pharmacophore correlations obtained with acoustic data. It highlights issues that likely exist in the data sitting in the assay screening databases (compounded with chemistry issues) and brings into focus the question of what can be trusted in the data. For sure not all the data is bad, but how do we separate good from bad, and what of the models that can be derived? As Derek summarized in his blog post “How many other datasets are hosed up because of this effect? Now there’s an important question, and one that we’re not going to have an answer for any time soon.” And it’s depressing to think about how many data sets might be hosed….

There is an entire back story to this publication also…namely the challenges that we had getting the work published and the multiple rejections we had in the process. But Sean has told that story in detail here. There’s also the story about the press release…and how editorial control extended from the paper itself to the press release (described here), a situation that I found inappropriate, over-reaching and simply not right. But it happened anyway…

So…data quality is an issue. It is confusing, and hard to tease out and identify in all its complexities. But it’s science, it’s incremental learning and it’s trial by fire. And we have to wonder how many projects might have been burned simply by the dispensing processes.




About tony

Antony (Tony) J. Williams received his BSc in 1985 from the University of Liverpool (UK) and PhD in 1988 from the University of London (UK). His PhD research interests were in studying the effects of high pressure on molecular motions within lubricant related systems using Nuclear Magnetic Resonance. He moved to Ottawa, Canada to work for the National Research Council performing fundamental research on the electron paramagnetic resonance of radicals trapped in single crystals. Following his postdoctoral position he became the NMR Facility Manager for Ottawa University. Tony joined the Eastman Kodak Company in Rochester, New York as their NMR Technology Leader. He led the laboratory to develop quality control across multiple spectroscopy labs and helped establish walk-up laboratories providing NMR, LC-MS and other forms of spectroscopy to hundreds of chemists across multiple sites. This included the delivery of spectroscopic data to the desktop, automated processing and his initial interests in computer-assisted structure elucidation (CASE) systems. He also worked with a team to develop the world’s first web-based LIMS system, WIMS, capable of allowing chemical structure searching and spectral display. With his developing cheminformatic skills and passion for data management he left corporate America to join a small start-up company working out of Toronto, Canada. He joined ACD/Labs as their NMR Product Manager and held various roles, including Chief Science Officer, during his 10 years with the company. His responsibilities included managing over 50 products at one time prior to developing a product management team, managing sales, marketing, technical support and technical services. ACD/Labs was one of Canada’s Fast 50 Tech Companies, and Forbes Fast 500 companies in 2001. His primary passions during his tenure with ACD/Labs were the continued adoption of web-based technologies and developing automated structure verification and elucidation platforms.
While at ACD/Labs he suggested the possibility of developing a public resource for chemists attempting to integrate internet available chemical data. He finally pursued this vision with some close friends as a hobby project in the evenings and the result was the ChemSpider database. Even while running out of a basement on hand-built servers the website developed a large community following that eventually culminated in the acquisition of the website by the Royal Society of Chemistry (RSC) based in Cambridge, United Kingdom. Tony joined the organization, together with some of the other ChemSpider team, and became their Vice President of Strategic Development. At RSC he continued to develop cheminformatics tools, specifically ChemSpider, and was the technical lead for the chemistry aspects of the Open PHACTS project, a project focused on the delivery of open data, open source and open systems to support the pharmaceutical sciences. He was also the technical lead for the UK National Chemical Database Service and the RSC lead for the PharmaSea project attempting to identify novel natural products from the ocean. He left RSC in 2015 to become a Computational Chemist in the National Center of Computational Toxicology at the Environmental Protection Agency where he is bringing his skills to bear working with a team on the delivery of a new software architecture for the management and delivery of data, algorithms and visualization tools. The “Chemistry Dashboard” was released on April 1st, no fooling, and provides access to over 700,000 chemicals, experimental and predicted properties and a developing link network to support the environmental sciences. Tony remains passionate about computer-assisted structure elucidation and verification approaches and continues to publish in this area. He is also passionate about teaching scientists to benefit from the developing array of social networking tools for scientists and is known as the ChemConnector on the networks.
Over the years he has had adjunct roles at a number of institutions and presently enjoys working with scientists at both UNC Chapel Hill and NC State University. He is widely published with over 200 papers and book chapters and was the recipient of the Jim Gray Award for eScience in 2012. In 2016 he was awarded the North Carolina ACS Distinguished Speaker Award.
