I will be delivering five presentations and a poster (twice) at the ACS Meeting in Philadelphia this week. These presentations will introduce the latest version of our CompTox Dashboard, renamed from the iCSS Chemistry Dashboard because now we are offering way more than just a large set of chemical structures! I look forward to introducing attendees to the latest and greatest.
DAY & TIME OF PRESENTATION: Sunday, August 21, 2016 from 1:10 PM – 1:35 PM
ROOM & LOCATION: Room 105A – Pennsylvania Convention Center
Title: Structure Identification Using High Resolution Mass Spectrometry Data and the EPA’s Chemistry Dashboard
The iCSS Chemistry Dashboard is a publicly accessible dashboard provided by the National Center for Computational Toxicology at the US-EPA. It serves a number of purposes, including providing a chemistry database underpinning many of our public-facing projects (e.g. ToxCast and ExpoCast). The available data and searches provide a valuable path to structure identification using mass spectrometry as the source data. With an underlying database of over 720,000 chemicals, the dashboard has already been used to assist in identifying chemicals present in house dust. However, it can also be applied to many other purposes, e.g., the identification of agrochemicals in waste streams. This presentation will provide a review of the EPA’s platform and underlying algorithms used for the purpose of compound identification using high-resolution mass spectrometry data. We will also discuss progress towards a high-throughput non-targeted analysis platform for use by the mass spectrometry community. This abstract does not reflect U.S. EPA policy.
DAY & TIME OF PRESENTATION: Sunday, August 21, 2016 from 4:10 PM – 4:30 PM
ROOM & LOCATION: Room 112B – Pennsylvania Convention Center
Title: Investigating Impact Metrics for Performance for the US-EPA National Center for Computational Toxicology
The U.S. Environmental Protection Agency (EPA) Computational Toxicology Program integrates advances in biology, chemistry, and computer science to help prioritize chemicals for further research based on potential human health risks. This work involves computational and data driven approaches that integrate chemistry, exposure and biological data. We have delivered public access to terabytes of open data, as well as to a large number of publicly accessible databases and applications, to support the research efforts for a large community of scientists. Many of our contributions to science are summarily described in research papers but to date we have not optimized our contributions to inform altmetrics statistics associated with our work. Critically missing from altmetrics is access to our numerous software applications and web service accesses, as well as the growing importance of our experimental data and models (e.g. ToxCast, ExpoCast, DSSTox and others) to the scientific and regulatory communities. This presentation will provide an overview of our efforts to more fully understand, and quantify, our impact on the environmental sciences using a combination of our measurement approaches and available altmetrics tools. This abstract does not reflect U.S. EPA policy.
DAY & TIME OF PRESENTATION: Wednesday, August 24, 2016 from 9:40 AM – 10:00 AM
ROOM & LOCATION: Juniper’s Ballroom – Philadelphia Downtown Courtyard by Marriott
Title: Delivering The Benefits of Chemical-Biological Integration in Computational Toxicology at the EPA
Abstract: Researchers at the EPA’s National Center for Computational Toxicology integrate advances in biology, chemistry, and computer science to examine the toxicity of chemicals and help prioritize chemicals for further research based on potential human health risks. The intention of this research program is to quickly evaluate thousands of chemicals for potential risk but with much reduced cost relative to historical approaches. This work involves computational and data driven approaches including high-throughput screening, modeling, text-mining and the integration of chemistry, exposure and biological data. We have developed a number of databases and applications that are delivering on the vision of developing a deeper understanding of chemicals and their effects on exposure and biological processes that are supporting a large community of scientists in their research efforts. This presentation will provide an overview of our work to bring together diverse large scale data from the chemical and biological domains, our approaches to integrate and disseminate these data, and the delivery of models supporting computational toxicology. This abstract does not reflect U.S. EPA policy.
DAY & TIME OF PRESENTATION: Wednesday, August 24, 2016 from 11:10 AM – 11:40 AM
ROOM & LOCATION: Ormandy East – DoubleTree by Hilton Hotel Philadelphia Center City
Title: Data Aggregation, Curation and Modeling Approaches to Deliver Prediction Models to Support Computational Toxicology at the EPA
The U.S. Environmental Protection Agency (EPA) Computational Toxicology Program develops and utilizes QSAR modeling approaches across a broad range of applications. In terms of physical chemistry we have a particular interest in the prediction of basic physicochemical parameters such as logP, aqueous solubility, vapor pressure and other parameters to invoke in our exposure models or for the purpose of modeling environmental toxicity. We are also interested in the development of models related to environmental fate. As a result of our efforts we have assembled and curated data sets for various physicochemical properties and, utilizing modern machine-learning modeling approaches, have developed a number of high performing models that we are now delivering to the public. Our website, the iCSS Chemistry Dashboard, provides access to data predicted for over 700,000 chemical compounds. The original training data are available for review and the details of prediction for each endpoint include the domain of applicability as well as a measure of performance accuracy. This presentation will provide an overview of the existing aggregated data, our approaches to data curation and our progress towards an interactive environment for prediction of physicochemical and environmental fate parameters. The utilization of these parameters to support read-across approaches will also be discussed. This abstract does not reflect U.S. EPA policy.
DAY & TIME OF PRESENTATION: Thursday, August 25, 2016 from 3:00 PM – 3:20 PM
ROOM & LOCATION: Room 104A – Pennsylvania Convention Center
Title: The EPA iCSS Chemistry Dashboard to Support Compound Identification Using High Resolution Mass Spectrometry Data
There is a growing need for rapid chemical screening and prioritization to inform regulatory decision-making on thousands of chemicals in the environment. We have previously used high-resolution mass spectrometry to examine household vacuum dust samples using liquid chromatography time-of-flight mass spectrometry (LC-TOF/MS). Using a combination of exact mass, isotope distribution, and isotope spacing, molecular features were matched with a list of chemical formulas from the EPA’s Distributed Structure-Searchable Toxicity (DSSTox) database. This has further developed our understanding of how openly available chemical databases, together with the appropriate searches, could be used for the purpose of compound identification. We report here on the utility of the EPA’s iCSS Chemistry Dashboard for the purpose of compound identification using searches against a database of over 720,000 chemicals. We also examine the benefits of QSAR prediction for the purpose of retention time prediction to allow for alignment of both chromatographic and mass spectral properties. This abstract does not reflect U.S. EPA policy.
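The exact-mass part of that matching step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the workflow used in the dust study or the Dashboard's own search: the function names, the small element table, and the 5 ppm default tolerance are all hypothetical choices, the observed value is assumed to be a neutral monoisotopic mass (adduct already removed), and isotope distribution and spacing are not considered.

```python
import re

# Monoisotopic masses (in u) of the most abundant isotope of a few common
# elements. The table is deliberately small; extend it as needed.
MONOISOTOPIC_MASS = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.0030740,
    "O": 15.9949146,
    "F": 18.9984032,
    "P": 30.9737620,
    "S": 31.9720707,
    "Cl": 34.9688527,
    "Br": 78.9183376,
}


def monoisotopic_mass(formula):
    """Monoisotopic mass of a simple formula such as 'C8H10N4O2'.

    Brackets, charges and isotope labels are not supported (assumption).
    """
    total = 0.0
    matched_chars = 0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC_MASS[symbol] * (int(count) if count else 1)
        matched_chars += len(symbol) + len(count)
    if matched_chars != len(formula):
        raise ValueError(f"unsupported formula syntax: {formula!r}")
    return total


def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def match_candidates(observed_mass, formulas, tolerance_ppm=5.0):
    """Return (formula, ppm error) pairs within tolerance, best match first."""
    hits = []
    for formula in formulas:
        error = ppm_error(observed_mass, monoisotopic_mass(formula))
        if abs(error) <= tolerance_ppm:
            hits.append((formula, error))
    return sorted(hits, key=lambda hit: abs(hit[1]))
```

Calling `match_candidates(194.0804, ["C8H10N4O2", "C9H10N2O3"])` keeps only the first formula, since the second differs by tens of ppm. In practice the candidate list would come from a database export rather than a hand-written list.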
SESSION TIME: August 22, 2016 from 8:00 PM to 10:00 PM
SESSION TIME: Wednesday, August 24, 2016, 6:00 PM – 8:00 PM
ROOM & LOCATION: Hall D – Pennsylvania Convention Center
Poster Title: The EPA Online Prediction Physicochemical Prediction Platform to Support Environmental Scientists
As part of our efforts to develop a public platform to provide access to predictive models we have attempted to disentangle the influence of the quality versus quantity of data available to develop and validate QSAR models. Using a thorough manual review of the data underlying the well-known EPI Suite software, we developed automated processes for the validation of the data using a KNIME workflow. This includes: approaches to validate different chemical structure representations (e.g. molfile and SMILES), identifiers (chemical names and registry numbers), and methods to standardize the data into QSAR-consumable formats for modeling. Our efforts to quantify and segregate data into various quality categories have allowed us to thoroughly investigate the resulting models developed from these data slices, as well as allowing us to examine whether or not efforts into the development of large high-quality datasets have the expected pay-off in terms of prediction performance. Machine-learning approaches have been applied to create a series of models that have been used to generate predicted physicochemical and environmental parameters for over 700,000 chemicals. These data are available online via the EPA’s iCSS Chemistry Dashboard. This abstract does not reflect U.S. EPA policy.
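One of the simplest automated identifier checks of this kind is the CAS Registry Number check digit. The sketch below is a stand-alone Python illustration, not a node from the KNIME workflow described above; the function name `is_valid_cas` is hypothetical. It only confirms that a number is internally consistent, not that it is assigned to the right chemical.

```python
import re

# CAS Registry Numbers have the form NNNNNNN-NN-N: two to seven digits,
# two digits, and a single check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas_number):
    """Return True if the string is well formed and its check digit matches."""
    match = CAS_PATTERN.match(cas_number.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost digit of
    # the body; the check digit is the weighted sum modulo 10.
    weighted_sum = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return weighted_sum % 10 == check_digit
```

For example, `is_valid_cas("7732-18-5")` (water) passes, while a single-digit transcription error such as `"7732-18-4"` is rejected.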
I have been interested in the Zika Virus ever since I heard about it while visiting Brazil last year to give a talk at the Brazilian Natural Products conference. What I did not expect was the incredible surge in worldwide attention that Zika would attract. I am grateful to have been included in the work led by Sean Ekins (@collabchem) in the perspective “Open Drug Discovery for the Zika Virus” recently published on F1000Research. Up until last week the hypothesis was that Zika was a mosquito-borne disease but now the suggestion is that the disease may be related to a larvicide.
The chemical in question that is being named as the offending agent is Pyriproxyfen. I had never even heard of this chemical until a couple of days ago. At that time there was nothing on Wikipedia but, of course, it has since been updated with this:
“In 2014, pyriproxifen was put into Brazilian water supplies to fight the proliferation of mosquito larvae. Some Brazilian doctors have hypothesized that pyriproxyfen, not the Zika virus, is the cause of the 2015-2016 microcephaly epidemic in Brazil. 
Consequently, in 2016, the Brazilian state of Rio Grande do Sul suspended pyriproxyfen’s use. The Health Minister of Brazil, Marcelo Castro, criticized this step, noting that the claim is “a rumor lacking logic and sense. It has no basis.” They also noted that the insecticide is approved by the National Sanitary Monitoring Agency and “all regulatory agencies in the whole world”. The manufacturer of the insecticide, Sumitomo Chemical, stated “there is no scientific basis for such a claim” and also referred to the approval of pyriproxyfen by the World Health Organization since 2004 and the United States Environmental Protection Agency since 2001.
Noted skeptic David Gorski discussed the claim and pointed out that anti-vaccine proponents had also claimed that the Tdap vaccine was the cause of the microcephaly epidemic, due to its introduction in 2014, along with adding “One can’t help but wonder what else the Brazilian Ministry of Health did in 2014 that cranks can blame microcephaly on.” Gorski also pointed out the extensive physiochemical understanding of pyriproxyfen that the WHO has, which concluded in a past evaluation that the insecticide is not genotoxic, and that the doctor organization making the claim has been advocating against all pesticides since 2010, complicating their reliability.”
Because we live in a time of Open Data, and at a time when there is soooooo much information available on open databases, I thought I would go after any evidence-based identification of the chemical as a potential contributor to the explosion in Microcephaly.
PubChem exposes a LOT of useful data under the Safety and Hazards tab. The long-term exposure points to issues with blood and liver. FIFRA requirements are listed on PubChem and toxicity data is also available here. Reproductive toxicity is limited to reports in animals that reports
I am a fan of Altmetrics. At least in concept. But I am starting to get very concerned with both the tools used to measure them and what the “numbers” are expected to indicate. We would expect that a high “number” in an Altmetric.com “donut” would be indicative, in some way, of the relative importance or “impact” of that article. One would hope it at least points to how well read the article is, whether the readers like the science and the potential for the article to, for example, move forward understanding or proliferate data into further usage. I am not sure this is true…at least for some of the articles I am involved with.
Let’s take for example the recent Zika Virus article that Sean Ekins led. The F1000 site gives us some stats in regards to Views and Downloads and the Metrics shows the Altmetric stats. I would assume that at least some of the 48 DOWNLOADers read the article. Some of the VIEWers are likely to have read it and maybe printed it. For the Altmetric stats the 33 tweets are likely people pointing to the article and, because of the way I use Twitter, I am going to suggest that Tweets are less indicative of the number of readers of the article. There is a definition on the Altmetric site regarding how Twitter stats are compiled.
If we use the Altmetric Bookmarklet we can navigate to the page with a score
The score of “41” is essentially the sum of bloggers, tweets, Facebook posts etc. summarized below (1+1+1+33+1+3+1 for being on Altmetric.com???)
When I asked F1000Research via Twitter why they don’t show the “number” I appreciated their answer. I AGREE with their sentiment.
Yesterday I received an email about our Journal of Cheminformatics article “Ambiguity of non-systematic chemical identifiers within and between small-molecule databases“, part of which is shown below.
On the actual Journal of Cheminformatics page it says there have been 1444 accesses (not 2216 as cited in the email).
Also the Altmetric score is 8. So somewhere between 1400-2200 accesses (and it is safe to assume some proportion actually read it!), but a low Altmetric score of 8. This is versus an Altmetric score of >40 for the Zika Virus paper, which has far fewer accesses, and a lot of the altmetrics for that article probably don’t indicate reads of the article, as they are Tweets, many of them from the authors out to the world.
Using PlumX I am extremely disappointed regarding what it reflects about the JChemInf article! Only 10 HTML Views versus the 1400-2200 accesses reported above, and only 7 readers and 1 save! UGH. But 13 Tweets are noted, so I would expect an Altmetric.com score of at least 13 or 14, instead of the 8 marked on the article?
I also tried to sign into ImpactStory to check stats but got a “Uh oh, looks like we’ve got a system error…feel free to let us know, and we’ll fix it.” message so will report back on that.
Altmetrics should be maturing now to a point where the metrics of reads, accesses, downloads should be fed into some overall metric. I think that reads/accesses/downloads should carry more weight than a Tweet in terms of impact of an article? At least if someone read it, whether they agree with it or not they are MORE aware of the content than if someone simply shared the link to an article that then didn’t get read? The platforms themselves are so desync’ed in terms of the various numbers that we must wonder how things are so badly broken. I would imagine that stats gathered in some way through CrossRef or ORCID will ultimately help this to mature but until then treat them all with a level of suspicion. I believe that AltMetrics will be an important part of helping to define impact for an article. But there is still a long way to go I’m afraid….
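To make the weighting idea concrete, here is a toy Python sketch using the counts quoted above (48 downloads and 33 tweets for the Zika article; 1444 accesses and 13 tweets for the JChemInf article). The weights are entirely made up for illustration, the function name is hypothetical, treating downloads and accesses as equivalent "reads" is an assumption, and none of this reflects how Altmetric.com or PlumX actually compute their numbers.

```python
# Hypothetical weights: a read/access/download counts four times as much
# as a tweet. These values are illustrative only.
WEIGHTS = {"reads": 1.0, "tweets": 0.25}


def weighted_impact(counts, weights=WEIGHTS):
    """Weighted sum of event counts; event kinds without a weight count as 0."""
    return sum(weights.get(kind, 0.0) * n for kind, n in counts.items())


# Counts taken from the numbers quoted in this post.
zika = {"reads": 48, "tweets": 33}
jcheminf = {"reads": 1444, "tweets": 13}

print("Zika article:    ", weighted_impact(zika))
print("JChemInf article:", weighted_impact(jcheminf))
```

Under any weighting of this shape the ordering of the two articles flips relative to their Altmetric scores, which is the discrepancy I am complaining about.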
The needs for chemistry standards, database tools and data curation at the chemical-biology interface
This presentation was given at the Society of Laboratory Automation and Screening in San Diego, California on January 25th 2016.
This presentation will highlight known challenges with the production of high quality chemical databases and outline recent efforts made to address these challenges. Specific examples will be provided illustrating these challenges within the U.S. Environmental Protection Agency (EPA) Computational Toxicology Program. This includes consolidating EPA’s ACToR and DSSTox databases, augmenting computed properties and list search features, and introducing quality metrics to assess confidence in chemical structure assignments across hundreds of thousands of chemical substance records. The past decade has seen enormous investments in the generation and release of data from studies of chemicals and their toxicological effects. There is, however, commonly little concern given to provenance and, more generally, to the quality of the data. The presentation will emphasize the importance of rigorous data review procedures, progress in web-based public access to accurate chemical data sets for use in predictive modeling, and the benefits that these efforts will deliver to toxicologists to embrace the “Big Data” era.
This abstract does not necessarily represent the views of the U.S. Environmental Protection Agency
The presentation is available from the EPA’s Science Inventory site as a PDF file here.
Scientists from EPA, NTP and NCATS have used high-throughput screening (HTS) assays to evaluate the potential health effects of thousands of chemicals. The Transform Tox Testing Challenge: Innovating for Metabolism is calling on innovative thinkers to find new ways to incorporate physiological levels of chemical metabolism into HTS assays. Since current HTS assays do not fully incorporate chemical metabolism, they may miss chemicals that are metabolized to a more toxic form. Adding metabolic competence to HTS assays will help researchers more accurately assess chemical effects and better protect human health.
A new paper hit Nature Chemistry today “Reversible Bergman cyclization by atomic manipulation” (The paper will be featured on the cover of the March Issue). I have so much appreciation for what these scientists are doing. Selfishly I want to continue to applaud them for the breakthrough science that they continue to produce. I have never met the “IBM molecular microscopy” team (my chosen label) but I have had a chance to work with them on two separate occasions. One high profile one was on Olympicene, a fun story reminisced here: Olympicene From Concept to Completion. It was a lot of fun to work with scientists who found the work interesting and in reality it is NOT just a marketing story for RSC as some people mocked at the time, including some of my own colleagues! In fact, if you look at the number of articles that I have now linked (and continue to add to) on my Kudos page you will see a LOT of publications came out of the work (Kudos’ed Olympicene article plus linked articles) so not just “fun science”. In reality science is fun and real utility and understanding can come out of researching fun science, clearly.
The other chance I had to work with the team was on one of my personal interests: Structure Elucidation by NMR and the applications of Computer-Assisted Structure Elucidation (CASE) software/algorithms. The work “A Combined Atomic Force Microscopy and Computational Approach for the Structural Elucidation of Breitfussin A and B: Highly Modified Halogenated Dipeptides from Thuiaria breitfussi” combined CASE-based approaches with single molecule microscopy to elucidate new structures.
Now the team demonstrates a reversible Bergman cyclization for the first time using atomic manipulation, with verification of the products by non-contact atomic force microscopy with atomic resolution. I will let the movie below tell the story and reference you to the original paper. FASCINATING WORK. Congrats to all. How many reactions will come under the scrutiny and validation of the team now? We will see…
My blog has been fairly inactive for the past few months, driven primarily by my move from working on cheminformatics at the Royal Society of Chemistry to working at the National Center for Computational Toxicology at the Environmental Protection Agency. While I stopped working on ChemSpider about 18 months before I left RSC (to focus on the developing RSC Data Repository) my interest and focus on data quality and a long-standing interest in “accuracy in chemical structure representations” has never dwindled. At the EPA-NCCT we are very focused on working to produce high quality chemical structure databases, following on from the work of my colleague Ann Richard who initiated work on DSSTox over a decade ago.
It was therefore with great interest that I became aware of the confusion in regards to the chemical structure of BIA-10-2474, a drug that has attracted a lot of interest because of a clinical trial with negative outcomes. I am entering the story late compared to my many-time collaborators and friends Sean Ekins, Chris Southan and Alex Clark, but more about their work later. The news to date is best summarized at Derek’s In the Pipeline blog and on David Kroll’s post on Forbes.
Based on my previous history and work with helping to curate chemical structures on Wikipedia (starting one Christmas in 2008) my experience would be that Wikipedia is a GOOD PLACE to source high quality structures, especially after the work invested in curating chemical data over the years. The first structure for BIA-10-2474 that was reported on Wikipedia is shown below.
On January 16th Chris performed his usual thorough examination of structure integrity and links to public sources (he is a master in this domain!) but commented specifically “The molecular identity of BIA-10-2474 can only be formally verified directly by BIAL or indirectly from regulatory documentation they may have submitted” as the chemical structure itself was inferred from the name.
Nevertheless my friends Sean Ekins and Alex Clark were already investigating what OPEN MODELS may be able to predict about the chemical: See here, here and here. You should be impressed regarding what is possible when running a molecular structure through several Bayesian models in Alex’s mobile app called PolyPharma!
By January 21st Chris was commenting that the structure had changed, highlighting the extract exposed by Figaro and listing the chemical name: 3-(1-(cyclohexyl(methyl)carbamoyl)-1H-imidazol-4-yl)pyridine 1-oxide. Want to know what that name means as a structure? Take the name “3-(1-(cyclohexyl(methyl)carbamoyl)-1H-imidazol-4-yl)pyridine 1-oxide” and paste it into the free online service OPSIN. The results are shown below.
That structure has now found its way to Wikipedia (updated on the 21st January – check out the edits between the two forms of the article here).
Sean Ekins has maintained a running series of blog posts here. Using a stack of openly accessible algorithms and websites Sean has now produced a whole series of predictions for the “final molecule”. Chris Southan has also continued to expand his work and I direct you to his latest blogpost for more information. Nice stuff Chris.
It took days after news of the drug trial results started to appear before the chemical structure was actually identified (i.e. the structure was blinded). How much work, and how much confusion, was created by having the drug structure blinded? We have to imagine that the authorities had faster access to the details!
It is understandable that companies keep their chemical structures hidden. Patents are intentionally obfuscating (with a compound going into a trial commonly hidden among hundreds if not tens of thousands of chemicals that could be enumerated from a Markush structure). Until then Chris Southan will continue to educate the world about how competitive intelligence investigations are conducted.
This is a talk I gave at the 5th Brazilian Conference on Natural Products as part of my “spare time” activities and to remain engaged with my passion of NMR, structure elucidation and computational spectroscopy applications
Integrating Cheminformatics and Spectroscopy to Elucidate the Structures of Natural Products
The structure elucidation of natural products from analytical data, specifically NMR and MS, remains a major challenge. With an enormous palette of NMR experiments to choose from, and supported by breakthrough technologies in hardware, the generation of high quality data to enable even the most complex of natural product structures to be determined is no longer the major hurdle. The challenge is in the analysis of the data. We are in a new era in terms of approaches to structure elucidation: one where computers, databases, and a synergy between scientists and algorithms can offer an accelerated path forward. Software tools are capable of digesting spectroscopic data to elucidate extremely complex natural products. Scientists can now elucidate chemical structures utilizing multinuclear chemical shift data and correlation data from an array of 2D NMR experiments, and utilize existing data sets for the purpose of dereplication and computer-assisted structure elucidation. With the explosion of online data, especially in public databases such as PubChem and ChemSpider, many tens of millions of chemical structures are available to seed fragment databases to include in the elucidation process. This presentation will provide an overview of how cheminformatics and chemical databases have been brought together to assist in the identification of natural products. It will include an examination of the state-of-the-art developments in Computer-Assisted Structure Elucidation.
This is a presentation I gave at North Carolina State University hosted by Denis Fourches.
Data integration and building a profile for yourself as an online scientist
Many of us nowadays invest significant amounts of time in sharing our activities and opinions with friends and family via social networking tools. However, despite the availability of many platforms for scientists to connect and share with their peers in the scientific community the majority do not make use of these tools, despite their promise and potential impact and influence on our future careers. We are being indexed and exposed on the internet via our publications, presentations and data. We also have many more ways to contribute to science, to annotate and curate data, to “publish” in new ways, and many of these activities are as part of a growing crowdsourcing network. This presentation will provide an overview of the various types of networking and collaborative sites available to scientists and ways to expose your scientific activities online. Many of these can ultimately contribute to the developing measures of you as a scientist as identified in the new world of alternative metrics. Participating offers a great opportunity to develop a scientific profile within the community and may ultimately be very beneficial, especially to scientists early in their career.
My talk at ACS Boston: Value of the mediawiki platform for providing content to the chemistry community
At this time, and in a culture where online access is now an imperative, Wikipedia has become the definitive encyclopedia. In terms of its support for chemistry it is rich in many encyclopedic pages including named reactions, chemical and drug pages, articles about chemists, and many other forms of chemistry related information. Wikipedia is hosted on Mediawiki, an open source platform that can be utilized by anybody as the basis of their own hosted content collection. As a general contribution to the community, Mediawiki has been used by a number of chemists as a collaborative environment to create a number of resources that have become very popular with the chemistry community. These include VIPEr to support inorganic chemistry, ChemWiki as an online textbook and other educational resources and a Chemical Information Wikibook. Mediawiki has also been used by the author to host open source collections of data including scientists, scientific databases and mobile apps for science: the ScientistsDB, SciDBs and SciMobileApps wikis. This presentation will provide an overview of some of the chemistry resources that presently exist and celebrate the major contributions that Wikipedia and Mediawiki have made to the collaborative dissemination of chemistry.