The needs for chemistry standards, database tools and data curation at the chemical-biology interface
This presentation was given at the Society for Laboratory Automation and Screening in San Diego, California on January 25th 2016.
This presentation will highlight known challenges with the production of high quality chemical databases and outline recent efforts made to address these challenges. Specific examples will be provided illustrating these challenges within the U.S. Environmental Protection Agency (EPA) Computational Toxicology Program. This includes consolidating EPA’s ACToR and DSSTox databases, augmenting computed properties and list search features, and introducing quality metrics to assess confidence in chemical structure assignments across hundreds of thousands of chemical substance records. The past decade has seen enormous investments in the generation and release of data from studies of chemicals and their toxicological effects. There is, however, commonly little concern given to provenance and, more generally, to the quality of the data. The presentation will emphasize the importance of rigorous data review procedures, progress in web-based public access to accurate chemical data sets for use in predictive modeling, and the benefits that these efforts will deliver to toxicologists as they embrace the “Big Data” era.
This abstract does not necessarily represent the views of the U.S. Environmental Protection Agency.
The presentation is available from the EPA’s Science Inventory site as a PDF file here.
Scientists from EPA, NTP and NCATS have used high-throughput screening (HTS) assays to evaluate the potential health effects of thousands of chemicals. The Transform Tox Testing Challenge: Innovating for Metabolism is calling on innovative thinkers to find new ways to incorporate physiological levels of chemical metabolism into HTS assays. Since current HTS assays do not fully incorporate chemical metabolism, they may miss chemicals that are metabolized to a more toxic form. Adding metabolic competence to HTS assays will help researchers more accurately assess chemical effects and better protect human health.
A new paper hit Nature Chemistry today “Reversible Bergman cyclization by atomic manipulation” (The paper will be featured on the cover of the March Issue). I have so much appreciation for what these scientists are doing. Selfishly I want to continue to applaud them for the breakthrough science that they continue to produce. I have never met the “IBM molecular microscopy” team (my chosen label) but I have had a chance to work with them on two separate occasions. One high profile one was on Olympicene, a fun story reminisced here: Olympicene From Concept to Completion. It was a lot of fun to work with scientists who found the work interesting and in reality it is NOT just a marketing story for RSC as some people mocked at the time, including some of my own colleagues! In fact, if you look at the number of articles that I have now linked (and continue to add to) on my Kudos page you will see a LOT of publications came out of the work (Kudos’ed Olympicene article plus linked articles) so not just “fun science”. In reality science is fun and real utility and understanding can come out of researching fun science, clearly.
The other chance I had to work with the team was on one of my personal interests: Structure Elucidation by NMR and the applications of Computer-Assisted Structure Elucidation (CASE) software/algorithms. The work “A Combined Atomic Force Microscopy and Computational Approach for the Structural Elucidation of Breitfussin A and B: Highly Modified Halogenated Dipeptides from Thuiaria breitfussi” combined CASE-based approaches with single molecule microscopy to elucidate new structures.
Now the team demonstrates a reversible Bergman cyclization for the first time using atomic manipulation, with verification of the products by non-contact atomic force microscopy with atomic resolution. I will let the movie below tell the story and refer you to the original paper. FASCINATING WORK. Congrats to all. How many reactions will now come under the scrutiny and validation of the team? We will see…
My blog has been fairly inactive for the past few months, driven primarily by my move from working on cheminformatics at the Royal Society of Chemistry to working at the National Center for Computational Toxicology at the Environmental Protection Agency. While I stopped working on ChemSpider about 18 months before I left RSC (to focus on the developing RSC Data Repository), my focus on data quality and my long-standing interest in “accuracy in chemical structure representations” have never dwindled. At the EPA-NCCT we are very focused on working to produce high quality chemical structure databases, following on from the work of my colleague Ann Richard who initiated work on DSSTox over a decade ago.
It was therefore with great interest that I became aware of the confusion in regards to the chemical structure of BIA-10-2474, a drug that has attracted a lot of interest because of a clinical trial with negative outcomes. I am entering the story late compared to my many-time collaborators and friends Sean Ekins, Chris Southan and Alex Clark, but more about their work later. The news to date is best summarized at Derek’s In the Pipeline blog and on David Kroll’s post on Forbes.
Based on my previous history and work with helping to curate chemical structures on Wikipedia (starting one Christmas in 2008) my experience would be that Wikipedia is a GOOD PLACE to source high quality structures, especially after the work invested in curating chemical data over the years. The first structure for BIA-10-2474 that was reported on Wikipedia is shown below.
On January 16th Chris performed his usual thorough examination of structure integrity and links to public sources (he is a master in this domain!) but commented specifically “The molecular identity of BIA-10-2474 can only be formally verified directly by BIAL or indirectly from regulatory documentation they may have submitted” as the chemical structure itself was inferred from the name.
Nevertheless my friends Sean Ekins and Alex Clark were already investigating what OPEN MODELS may be able to predict about the chemical: See here, here and here. You should be impressed regarding what is possible when running a molecular structure through several Bayesian models in Alex’s mobile app called PolyPharma!
By January 21st Chris was commenting that the structure had changed and highlighted the extract from what was exposed by Figaro and listing the chemical name: 3-(1-(cyclohexyl(methyl)carbamoyl)-1H-imidazol-4-yl)pyridine 1-oxide. Want to know what that name means as a structure? Take the name “3-(1-(cyclohexyl(methyl)carbamoyl)-1H-imidazol-4-yl)pyridine 1-oxide” and paste it into the free online service OPSIN. The results are shown below.
That structure has now found its way to Wikipedia (updated on the 21st January – check out the edits between the two forms of the article here).
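The name-to-structure lookup described above can also be scripted rather than pasted into a web page. What follows is a minimal sketch, not the method anyone in this story actually used. It assumes that OPSIN exposes a web service at the base URL shown and that a ".smi" suffix requests SMILES output. Both are assumptions to check against the current OPSIN documentation, and the function names are my own.

```python
from urllib.parse import quote
from urllib.request import urlopen

# ASSUMPTION: base URL of an OPSIN-style name-to-structure web service.
# Verify against the current OPSIN documentation before relying on it.
OPSIN_BASE = "https://www.ebi.ac.uk/opsin/ws/"

# The systematic name quoted in the post.
NAME = "3-(1-(cyclohexyl(methyl)carbamoyl)-1H-imidazol-4-yl)pyridine 1-oxide"


def opsin_url(name, fmt="smi", base=OPSIN_BASE):
    """Build a request URL for a chemical name (hypothetical helper).

    The name is percent-encoded so that spaces and parentheses survive
    being placed in a URL path. The ".<fmt>" suffix is an assumed
    convention for choosing the output format.
    """
    return base + quote(name, safe="") + "." + fmt


def fetch_structure(name, fmt="smi"):
    """Fetch the service response as text. Requires network access."""
    with urlopen(opsin_url(name, fmt), timeout=30) as response:
        return response.read().decode("utf-8").strip()
```

Calling fetch_structure(NAME) would, if the assumed service is reachable, return a machine-readable form of the structure that can then be compared against whatever a database or Wikipedia page shows for the same compound.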
Sean Ekins has maintained a running series of blog posts here. Using a stack of openly accessible algorithms and websites Sean has now produced a whole series of predictions for the “final molecule”. Chris Southan has also continued to expand his work and I direct you to his latest blogpost for more information. Nice stuff Chris.
It took days after news of the drug trial results started to appear before the chemical structure was actually identified (i.e. the structure was blinded). How much work and how much confusion was created by having the drug structure blinded? We have to imagine that the authorities had faster access to the details!
It is understandable that companies keep their chemical structures hidden. Patents are intentionally obfuscating (with a compound going into a trial commonly hidden among hundreds if not tens of thousands of chemicals that could be enumerated from a Markush structure). Until that changes, Chris Southan will continue to educate the world about how competitive intelligence investigations work.
This is a talk I gave at the 5th Brazilian Conference on Natural Products as part of my “spare time” activities and to remain engaged with my passion for NMR, structure elucidation and computational spectroscopy applications.
Integrating Cheminformatics and Spectroscopy to Elucidate the Structures of Natural Products
The elucidation of natural product structures from analytical data, specifically NMR and MS, remains a major challenge. With an enormous palette of NMR experiments to choose from, and supported by breakthrough technologies in hardware, the generation of high quality data to enable even the most complex of natural product structures to be determined is no longer the major hurdle. The challenge is in the analysis of the data. We are in a new era in terms of approaches to structure elucidation: one where computers, databases, and a synergy between scientists and algorithms can offer an accelerated path forward. Software tools are capable of digesting spectroscopic data to elucidate extremely complex natural products. Scientists can now elucidate chemical structures utilizing multinuclear chemical shift data and correlation data from an array of 2D NMR experiments, and can utilize existing data sets for the purpose of dereplication and computer-assisted structure elucidation. With the explosion of online data, especially in public databases such as PubChem and ChemSpider, many tens of millions of chemical structures are available to seed fragment databases to include in the elucidation process. This presentation will provide an overview of how cheminformatics and chemical databases have been brought together to assist in the identification of natural products. It will include an examination of the state-of-the-art developments in Computer-Assisted Structure Elucidation.
This is a presentation I gave at North Carolina State University hosted by Denis Fourches.
Data integration and building a profile for yourself as an online scientist
Many of us nowadays invest significant amounts of time in sharing our activities and opinions with friends and family via social networking tools. However, despite the availability of many platforms for scientists to connect and share with their peers in the scientific community, the majority do not make use of these tools, for all their promise and potential impact and influence on our future careers. We are being indexed and exposed on the internet via our publications, presentations and data. We also have many more ways to contribute to science, to annotate and curate data, to “publish” in new ways, and many of these activities are part of a growing crowdsourcing network. This presentation will provide an overview of the various types of networking and collaborative sites available to scientists and ways to expose your scientific activities online. Many of these can ultimately contribute to the developing measures of you as a scientist as identified in the new world of alternative metrics. Participating offers a great opportunity to develop a scientific profile within the community and may ultimately be very beneficial, especially to scientists early in their career.
My talk at ACS Boston: Value of the mediawiki platform for providing content to the chemistry community
At this time, and in a culture where online access is now an imperative, Wikipedia has become the definitive encyclopedia. In terms of its support for chemistry it is rich in many encyclopedic pages including named reactions, chemical and drug pages, articles about chemists, and many other forms of chemistry related information. Wikipedia is hosted on Mediawiki, an open source platform that can be utilized by anybody as the basis of their own hosted content collection. Mediawiki has been used as a collaborative environment by a number of chemists to create resources that have become very popular with the chemistry community. These include VIPEr to support inorganic chemistry, ChemWiki as an online textbook and other educational resources and a Chemical Information Wikibook. Mediawiki has also been used by the author to host open source collections of data including scientists, scientific databases and mobile apps for science: the ScientistsDB, SciDBs and SciMobileApps wikis. This presentation will provide an overview of some of the chemistry resources that presently exist and celebrate the major contributions that Wikipedia and Mediawiki have made to the collaborative dissemination of chemistry.
ACS Boston: The driving needs for analytical data exchange standards and the potential impacts on the chemical sciences
This presentation was given at the ACS Boston meeting with the following abstract:
Analytical science underpins so many different types of chemistry that it is clearly indispensable. Nuclear Magnetic Resonance and infrared spectroscopy, mass spectrometry and chromatography, and a myriad of other forms of analytical science are easily available to scientists today, commonly in open access walk up labs. While instrumentation is now compact and highly flexible, and the controlling software is both powerful and easy to use, significant challenges remain in terms of the management and integration of various forms of analytical data and, more importantly, the exchange of data between scientists. In general the reporting of data in peer-reviewed journals is limited to electronic supplementary information in the form of PDF files or, occasionally, in the form of webpages. Much of the strength of analytical data resides in the ability to database diverse data types and interrogate them later, performing searches based on metadata, spectral features and related chemical structure information. The need for file format export and conversions from binary file formats associated with the majority of analytical instrumentation remains a major objective in the field. While file formats such as JCAMP and NetCDF have enabled data exchange for a number of years the requirement for more advanced formats (such as AnIML and mzML) has continued. This presentation will review existing activities in the development of exchangeable formats and progress in utilizing existing formats for the delivery of reusable analytical data to the community.
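To make the point about searching on metadata concrete, here is a small sketch of pulling the labeled header records out of a JCAMP-DX style text file so they could be loaded into a database. It relies only on the JCAMP-DX convention that records start with "##LABEL=value". The sample spectrum is made up for illustration and the function name is my own. Data tables, comments and multi-line records are deliberately ignored, so treat it as a simplification rather than a complete reader.

```python
# Made-up example content in JCAMP-DX style, for illustration only.
SAMPLE = """##TITLE=example spectrum (made-up values)
##JCAMP-DX=4.24
##DATA TYPE=INFRARED SPECTRUM
##XUNITS=1/CM
##YUNITS=TRANSMITTANCE
##XYDATA=(X++(Y..Y))
4000 0.98 0.97 0.95
##END=
"""


def parse_jcamp_header(text):
    """Collect '##LABEL=value' records into a dict (hypothetical helper).

    Labels are upper-cased and stripped of spaces, dashes, slashes and
    underscores so that 'DATA TYPE' and 'DATATYPE' land on the same key.
    Lines that are not labeled records (the numeric data) are skipped.
    """
    records = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("##") or "=" not in line:
            continue
        label, value = line[2:].split("=", 1)
        key = label.upper()
        for ch in " -/_":
            key = key.replace(ch, "")
        records[key] = value.strip()
    return records


metadata = parse_jcamp_header(SAMPLE)
```

Once the header is a dictionary, a query such as "all infrared spectra recorded in transmittance" becomes a simple filter on the DATATYPE and YUNITS keys across a collection of files.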
Today is my last day of employment at the Royal Society of Chemistry. It will be almost six years since I joined RSC when ChemSpider was acquired. While ChemSpider was initially a “hobby project” and an attempt to create a disruption in terms of access to chemistry data, crowdsourced contribution and data validation, it has gone from strength to strength and now serves ca. 40,000 unique users a day from around the world. It won three awards in the first few months after we joined RSC and was catalytic in RSC winning three grants to allow us to participate in the Open PHACTS project, the PharmaSea project and become the host of the UK National Chemical Database Service. Based on the feedback I have received over the years ChemSpider is much-loved and appreciated as a contribution to the scientific community and is recognized as one of the key players in the free chemistry resources arena. I am proud to have been associated with it.
We also got to set up the ChemSpider SyntheticPages micropublishing site and tried to get the community sharing syntheses that would likely not make it into mainstream papers but were still of value to science.
During my six years at RSC I have been involved with many discussions regarding the following areas of work, study and research and how they would benefit publishing, the society and, of course, the chemistry community at large. The list includes, in particularly random order:
- Chemistry databases – both commercial and free – and how to best mesh, commercialize and license data
- Data quality in publications and databases and development of tools for data validation
- Open Data, Open Access and Open Notebook Science
- Text-mining of the RSC archive to extract & mark up compounds, reactions, property data and analytical data.
- The potential of semantic web applications to scientific publishing
- Encouraging the use of Open Identifiers – especially ORCID and InChI
- The future of Micropublishing in the chemical sciences
- Analytical data and building an open spectral database for the community
- Social networking approaches to build online profiles – especially for young scientists
There are many, many more things of course but these are the big ones and, for me, bring clarity to what my interests are – chemistry data and making it available to the appropriate communities. It is with this in mind that I am excited to join the Environmental Protection Agency next week in the National Center for Computational Toxicology.
With every move forward into a new job we leave behind our old one. And I leave RSC with some sadness at leaving and excitement for the new opportunities. I have had the chance to work with so many good people at RSC, to engage with collaborators such as ACD/Labs, Mestre, NextMove, EBI, ChemAxon, Accelrys (as they were then), iChemLabs, Dotmatics and on and on. Apologies if you are not named but the list is very long. Thanks to everyone for your support, encouragement and opportunities to engage. It has been a blast.
And for everyone at RSC who catered to my strange diet of potatoes only…so long, and thanks for all the spuds.
Beyond the Paper CV (or how to build an online profile as a scientist)
This presentation was given at the UKICRS meeting (http://www.ukicrs.org/2015-symposium.html) on April 16th 2015 at the University of Nottingham. This presentation was in a workshop and focused on trying to inform attendees in the postgraduate phases of their careers how to use online tools to start building a reputation and profile in their field. It was good to get positive feedback from some of the attendees. Generally the comments were in regards to the number of different online tools they could use that I highlighted as well as them getting an understanding that they must take responsibility for their reputation and do it soon…there are benefits to starting early!