
Open Source Data, Testing Quality and Returning Value – Interactions with NMRSHIFTDB and the Blue Obelisk Community

03 May

I give a thumbs up to the quality of the NMRSHIFTDB. We’ve validated it. Why would I care? I’m an NMR jock at heart. I also work for a commercial software company that develops NMR prediction software and compiles NMR databases as the basis of that work. Does this mean that commercial software vendors and the Open Source/Access communities can coexist and have mutual admiration? I believe so!

After more than 18 months of work (the after-hours kind…much like blogging) I finally signed off on one of those infamous copyright transfers for Elsevier, now the publishers of Progress in NMR. A 360 page review article is finally submitted – “Computer-Assisted Structure Verification and Elucidation Tools In NMR-Based Structure Elucidation”. Proofs will arrive before the end of the month. It’s the culmination of over ten years of our own work as well as that of many contributors in the domain of CASE (Computer Assisted Structure Elucidation) systems. The complexity of structures that can be solved by computer algorithms is impressive…see examples here. Recently the StrucEluc CASE system solved the structure of an antibiotic of Mw>1150. Three NMR spectroscopists couldn’t solve it…a symbiotic relationship with software is VERY enabling!

One very active player in CASE is Christoph Steinbeck, a member of Blue Obelisk, one of the more active blogging groups on the net today.

Christoph’s group hosts NMRSHIFTDB. Recently ChemSpider linked to NMRSHIFTDB. In parallel I took an interest in Wolfgang Robien’s recent critique of the quality of its data, especially since during my “day job” I am directly involved with NMR prediction, structure verification using NMR and, of course, CASE systems.

What was interesting about Robien’s post was that it focused on the application of Neural Networks to prediction. With the availability of a public dataset we were able to repeat the analysis using our own Neural Networks as well as our classical approaches. The results will be reported elsewhere. What I want to address here is PMR’s post regarding the quality of data. Peter commented, relative to our own efforts at ChemSpider: “There is little point in collecting 10 million structures if you cannot rely on any of them. It actually detracts from the hard work of people like Stefan, Christoph and others on NMRShiftDB as the general user of the database will judge all entries by the lowest common denominator.” After analyzing the data, over 200,000 individual chemical shifts, I can say DON’T judge by the lowest common denominator. There is some junk in there, as seen by Wolfgang Robien. But our estimate after our analysis is that likely fewer than 250 data points are in error, which is under 0.125% of the shifts. These are truly excellent statistics if you consider that this is an open access system where people are depositing data, and that these data are free to download and utilize, even for the development of derivative algorithms. They show that such systems can work. The addition or improvement of rigorous checking algorithms in NMRSHIFTDB is the next natural step: flagging suspect data to the submitter will prompt them to check and validate the quality of their input. This will catch many errors during the submission process.
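To make the idea of flagging data at submission concrete, here is a minimal sketch in Python. It is not NMRSHIFTDB’s checking code and not our own; it simply assumes that each deposited shift can be compared with a predicted value and flagged when the two disagree by more than some tolerance. The `ShiftRecord` type, the `flag_suspect_shifts` function, the 20 ppm tolerance and the three example shifts are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ShiftRecord:
    """One deposited chemical shift plus a predicted value (hypothetical structure)."""
    atom_id: str
    experimental_ppm: float
    predicted_ppm: float


def flag_suspect_shifts(records, tolerance_ppm=20.0):
    """Return the records whose deposited shift differs from the prediction
    by more than tolerance_ppm. The tolerance is an assumed, illustrative value."""
    return [
        r for r in records
        if abs(r.experimental_ppm - r.predicted_ppm) > tolerance_ppm
    ]


def error_rate(n_errors, n_total):
    """Fraction of data points in error."""
    return n_errors / n_total


# Made-up example data, not taken from NMRSHIFTDB.
records = [
    ShiftRecord("C1", 128.5, 127.9),
    ShiftRecord("C2", 17.0, 170.2),  # looks like a misplaced decimal point
    ShiftRecord("C3", 55.1, 56.0),
]

suspects = flag_suspect_shifts(records)
print([r.atom_id for r in suspects])  # atoms to send back to the submitter

# The estimate quoted above: fewer than 250 errors in over 200,000 shifts.
print(f"{error_rate(250, 200_000):.3%}")
```

A flagged shift is not necessarily wrong; the point is only that the submitter gets asked to look again before the value enters the database.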

So, my compliments to Christoph and the team. The quality is excellent; there are “large errors”, but they are few in number. I’ve already sent him a report to help cleanse the database, though I didn’t compare it with that of Robien…likely we saw the same things since they were very obvious. These errors should not detract from the effort…with >200,000 data points it is obvious that there would be some. For ChemSpider we have the same problem…with >10 million structures there are errors…lots of them. But it’s very useful all the same!

 

About tony

Antony (Tony) J. Williams received his BSc in 1985 from the University of Liverpool (UK) and PhD in 1988 from the University of London (UK). His PhD research interests were in studying the effects of high pressure on molecular motions within lubricant related systems using Nuclear Magnetic Resonance. He moved to Ottawa, Canada to work for the National Research Council performing fundamental research on the electron paramagnetic resonance of radicals trapped in single crystals. Following his postdoctoral position he became the NMR Facility Manager for Ottawa University. Tony joined the Eastman Kodak Company in Rochester, New York as their NMR Technology Leader. He led the laboratory to develop quality control across multiple spectroscopy labs and helped establish walk-up laboratories providing NMR, LC-MS and other forms of spectroscopy to hundreds of chemists across multiple sites. This included the delivery of spectroscopic data to the desktop, automated processing and his initial interests in computer-assisted structure elucidation (CASE) systems. He also worked with a team to develop the world’s first web-based LIMS system, WIMS, capable of allowing chemical structure searching and spectral display. With his developing cheminformatic skills and passion for data management he left corporate America to join a small start-up company working out of Toronto, Canada. He joined ACD/Labs as their NMR Product Manager and held various roles, including Chief Science Officer, during his 10 years with the company. His responsibilities included managing over 50 products at one time prior to developing a product management team, managing sales, marketing, technical support and technical services. ACD/Labs was one of Canada’s Fast 50 Tech Companies, and Forbes Fast 500 companies in 2001. His primary passions during his tenure with ACD/Labs were the continued adoption of web-based technologies and developing automated structure verification and elucidation platforms. 
While at ACD/Labs he suggested the possibility of developing a public resource for chemists attempting to integrate internet available chemical data. He finally pursued this vision with some close friends as a hobby project in the evenings and the result was the ChemSpider database (www.chemspider.com). Even while running out of a basement on hand built servers the website developed a large community following that eventually culminated in the acquisition of the website by the Royal Society of Chemistry (RSC) based in Cambridge, United Kingdom. Tony joined the organization, together with some of the other ChemSpider team, and became their Vice President of Strategic Development. At RSC he continued to develop cheminformatics tools, specifically ChemSpider, and was the technical lead for the chemistry aspects of the Open PHACTS project (http://www.openphacts.org), a project focused on the delivery of open data, open source and open systems to support the pharmaceutical sciences. He was also the technical lead for the UK National Chemical Database Service (http://cds.rsc.org/) and the RSC lead for the PharmaSea project (http://www.pharma-sea.eu/) attempting to identify novel natural products from the ocean. He left RSC in 2015 to become a Computational Chemist in the National Center of Computational Toxicology at the Environmental Protection Agency where he is bringing his skills to bear working with a team on the delivery of a new software architecture for the management and delivery of data, algorithms and visualization tools. The “Chemistry Dashboard” was released on April 1st, no fooling, at https://comptox.epa.gov, and provides access to over 700,000 chemicals, experimental and predicted properties and a developing link network to support the environmental sciences. Tony remains passionate about computer-assisted structure elucidation and verification approaches and continues to publish in this area. 
He is also passionate about teaching scientists to benefit from the developing array of social networking tools for scientists and is known as the ChemConnector on the networks. Over the years he has had adjunct roles at a number of institutions and presently enjoys working with scientists at both UNC Chapel Hill and NC State University. He is widely published with over 200 papers and book chapters and was the recipient of the Jim Gray Award for eScience in 2012. In 2016 he was awarded the North Carolina ACS Distinguished Speaker Award.

Posted on May 3, 2007 in Quality and Content

 


 