
Depositing data at ChemSpider – what gets deposited?

06 May

I was in an exchange with a friend this weekend about his interest in depositing data onto ChemSpider. Due to our travel schedules and family commitments we rarely talk by phone. This gentleman is a retired chemist, though still highly active. He is an expert in nomenclature, has an incredible eye for quality and is a master curator of chemistry databases.

So, he is very interested in ChemSpider and the potential of exposing his databases. However, his expressed concern is that he will lose the benefit of all the effort he has invested in developing the databases. Again, these are manually curated with an expert's eye and, based on my experience working with him, are of the highest quality. They amount to tens if not hundreds of hours of work and are a source of revenue for this gentleman.

With this in mind, and based on other blog posts I have seen, it appears that we have not clearly defined the intention of ChemSpider. What we are NOT doing is aggregating all data from all publicly available data sources or even supplied databases. Our intention for the immediate future is to form a structure-centric environment linking out to the initial data source providers via the chemical structure. The individual providers continue to provide their content and retain their value proposition.

For example, the NIST Webbook is a container for a lot of information, including spectral data. As discussed in another post about the sodium chloride dimer, ChemSpider will provide the link to the Webbook to display relevant data for this gas phase species. A search for diazepam will provide links out to all original data sources, as shown here, and they include ChemBank, ChEBI, NIST Webbook and many others.

ChemSpider is an aggregator of chemical structures and associated identifiers (enabling connectivity to other sites). We are NOT duplicating all content available at other sites. This removes the burden of updating associated data across multiple data sources, as individual providers curate and update their own sources. It also keeps ChemSpider focused on the task of linking together multiple sources of data via chemical structures rather than grabbing the work of other groups and reposting it.

So, back to my friend who is worried about depositing data on ChemSpider. All we will be taking delivery of are the structures, the structure IDs (if available) and a link to information about the database. In this way we are directing ChemSpider users to rich sources of information to pursue as they see fit. Just as many depositions into the public online databases are from chemical vendors intending to potentially sell their materials, the same model applies to database providers. After all, if information content is of value it is up to the user to choose to pay for the right to access it.
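To make the scope of a deposition concrete, here is a minimal sketch of what such a record and the resulting link-out lookup might look like. It is an illustration only, not ChemSpider's actual schema or code: the class name, field names, database names and example.org URLs are all hypothetical, and structures are compared as plain SMILES strings purely for simplicity (a real system would need proper structure normalization).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DepositRecord:
    """One deposited entry: a structure, the provider's own ID, and a link."""
    structure: str    # structure as a string, here SMILES (assumed encoding)
    source_name: str  # name of the depositing database (hypothetical)
    source_id: str    # provider's structure ID; empty if none was supplied
    source_url: str   # link to information about the database


def build_link_index(records):
    """Group deposited records by structure so each structure links to every source."""
    index = defaultdict(list)
    for record in records:
        index[record.structure].append(record)
    return dict(index)


def link_outs(index, structure):
    """Return (source name, URL) pairs for a structure, or an empty list if unknown."""
    return [(r.source_name, r.source_url) for r in index.get(structure, [])]


# Illustrative deposits only; none of these sources or IDs are real.
records = [
    DepositRecord("CCO", "CuratedDB", "CD-0001", "https://example.org/curateddb"),
    DepositRecord("CCO", "VendorCatalog", "", "https://example.org/vendor"),
    DepositRecord("[Na+].[Cl-]", "CuratedDB", "CD-0002", "https://example.org/curateddb"),
]
index = build_link_index(records)
```

Note that the record carries no properties, spectra or literature references. Everything of value stays with the provider, and the index only answers the question "which sources know about this structure?"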

Taking this one step further, one has to consider the following question. For the large database providers (Beilstein – now MDL, Derwent, CAS, Cambridge Crystallographic Databases, DiscoveryGate and others), why not put their structure collections into the public domain for the purpose of searching and connecting back to the actual content of value? The structures themselves, as far as I know, are in all cases in the public domain since they are published (I might be wrong here but I cannot find statements to the contrary). The value comes from the information associated with the structures – one or more publications, reaction details, experimental or predicted properties, connection to a patent, and other such associated content.

What’s at risk in providing public access to the structure database(s) for searching and charging the appropriate fees to access the information once identified? There is little value in simply knowing that a structure exists in a database, is there? Isn’t it the associated information that has value? If this wasn’t true then that would suggest that a large database of algorithmically generated structures created with something like MolGen or the structure generator in Structure Elucidator would have value. In fact it does: see the work of Reymond et al in their “Virtual Exploration of the Chemical Universe up to 11 Atoms of C, N, O, F”. The value, however, comes not from the computer-generated structure itself but rather from the virtual screening response.

I judge there are two challenges – a decision at the management level to expose the large structural repositories, and the enormous hurdles in migrating certain classes of chemical structures to SDF format to be hosted by general services – specifically polymers, organometallic complexes and inorganics (also all challenges for ChemSpider!). I think the primary challenge is the decision to expose the data. I judge it’s the right decision to make with the increasing availability of Open Access databases such as ChemSpider. It’s a BIG decision …

 

About tony

Antony (Tony) J. Williams received his BSc in 1985 from the University of Liverpool (UK) and PhD in 1988 from the University of London (UK). His PhD research interests were in studying the effects of high pressure on molecular motions within lubricant related systems using Nuclear Magnetic Resonance. He moved to Ottawa, Canada to work for the National Research Council performing fundamental research on the electron paramagnetic resonance of radicals trapped in single crystals. Following his postdoctoral position he became the NMR Facility Manager for Ottawa University. Tony joined the Eastman Kodak Company in Rochester, New York as their NMR Technology Leader. He led the laboratory to develop quality control across multiple spectroscopy labs and helped establish walk-up laboratories providing NMR, LC-MS and other forms of spectroscopy to hundreds of chemists across multiple sites. This included the delivery of spectroscopic data to the desktop, automated processing and his initial interests in computer-assisted structure elucidation (CASE) systems. He also worked with a team to develop the world’s first web-based LIMS system, WIMS, capable of allowing chemical structure searching and spectral display. With his developing cheminformatic skills and passion for data management he left corporate America to join a small start-up company working out of Toronto, Canada. He joined ACD/Labs as their NMR Product Manager and held various roles, including Chief Science Officer, during his 10 years with the company. His responsibilities included managing over 50 products at one time prior to developing a product management team, managing sales, marketing, technical support and technical services. ACD/Labs was one of Canada’s Fast 50 Tech Companies, and Forbes Fast 500 companies in 2001. His primary passions during his tenure with ACD/Labs were the continued adoption of web-based technologies and developing automated structure verification and elucidation platforms.
While at ACD/Labs he suggested the possibility of developing a public resource for chemists attempting to integrate internet available chemical data. He finally pursued this vision with some close friends as a hobby project in the evenings and the result was the ChemSpider database (www.chemspider.com). Even while running out of a basement on hand built servers the website developed a large community following that eventually culminated in the acquisition of the website by the Royal Society of Chemistry (RSC) based in Cambridge, United Kingdom. Tony joined the organization, together with some of the other ChemSpider team, and became their Vice President of Strategic Development. At RSC he continued to develop cheminformatics tools, specifically ChemSpider, and was the technical lead for the chemistry aspects of the Open PHACTS project (http://www.openphacts.org), a project focused on the delivery of open data, open source and open systems to support the pharmaceutical sciences. He was also the technical lead for the UK National Chemical Database Service (http://cds.rsc.org/) and the RSC lead for the PharmaSea project (http://www.pharma-sea.eu/) attempting to identify novel natural products from the ocean. He left RSC in 2015 to become a Computational Chemist in the National Center of Computational Toxicology at the Environmental Protection Agency where he is bringing his skills to bear working with a team on the delivery of a new software architecture for the management and delivery of data, algorithms and visualization tools. The “Chemistry Dashboard” was released on April 1st, no fooling, at https://comptox.epa.gov, and provides access to over 700,000 chemicals, experimental and predicted properties and a developing link network to support the environmental sciences. Tony remains passionate about computer-assisted structure elucidation and verification approaches and continues to publish in this area. 
He is also passionate about teaching scientists to benefit from the developing array of social networking tools for scientists and is known as the ChemConnector on the networks. Over the years he has had adjunct roles at a number of institutions and presently enjoys working with scientists at both UNC Chapel Hill and NC State University. He is widely published with over 200 papers and book chapters and was the recipient of the Jim Gray Award for eScience in 2012. In 2016 he was awarded the North Carolina ACS Distinguished Speaker Award.

Posted by on May 6, 2007 in How ChemSpider Runs

 
