
Curating data on ChemSpider…should it be supported by the community?

05 May

ChemSpider has been online since March 24th 2007, about 6 weeks. We opened the ability to curate the data one month later.
Is there a need to curate the data? ChemSpider is built up of a series of databases. The list of contributors continues to increase, and there will be some very exciting announcements made in the next few days about new contributors. One of the largest components is the PubChem database. Peter Murray-Rust recently blogged about the quality of the name-structure pairs inside the PubChem database. He used methane as an example… I point you to the original blog for his comments. For my purposes I will use water. Here is the list of names, synonyms and registry numbers posted for water at PubChem. Certainly a number of these have carried over to ChemSpider. Out of interest it is worth comparing the results of the searches for the word “water” at both PubChem and ChemSpider. Search PubChem for water here and ChemSpider for water here. 228 hits at PubChem versus 1 at ChemSpider. Looking at ChemSpider we get the following list of names, synonyms and registry numbers. The hyperlinks below are the links to Wikipedia.

“water; Water vapor; Dihydrogen oxide; Distilled water; Purified water; Water, purified; hydrogen oxide; Deionized water; Oxygen atom; dihydridooxygen; ether; ethers; hydroxide; oxidane; Monooxygen; Photooxygen; Wasser; Singlet oxygen; Atomic oxygen; Deuterated water; Dihydrogen Monoxide; Oxygen, atomic; Water, mineral; Water, deionized; Water, distilled; Water, heavy; Water-t; DHMO; See Remark 8; HYDROXY GROUP; Water for injection; BOUND OXYGEN; BOUND WATER; Oxygen(sup 3P); 3H-Water; OXO GROUP; UNKNOWN; Water-18O; Sterile purified water; Tritiated water, mono-; Tritiated water (HTO); DISORDERED SOLVENT; Water, purified (JAN); Purified water (JP15); Water (JP15/USP); Type 2 Copper Site Water; Type 2 Copper Site Waters; CCRIS 6115; Oxygen O8 Of 8-Oxoadenine; GLUCOSE 4-O4 GROUP; Oxygen Of Oxidized Methionine; Water for injection (JP15); Oxygen Bound To Cys 83 Sg; Oxygen Bond To Sg Cys A 67; Sterile purified water (JP15); CHEBI:15377; CHEBI:25698; [OH2]; H(2)O; Disordered Solvent – See Remark 8; EINECS 231-791-2; Oxygen Bound To +a B 17 At C8; Disordered Solvent – See Remark 10; Disordered Solvent – See Remark 11; Disordered Solvent – See Remark 12; NSC147337; NSC 147337; The Oxygen Is Linked To The Haem Iron; Hydroxy Group Bond To Sg Cys B 67.; Oxygen Bound To Cys 25 Sg – Remark 4; The Oxo Group Is Linked To The Haem Iron; C00001; D00001; 7732-18-5; Water, distilled, conductivity or of similar purity; H2O; HOH; Ice; DIS; GTE; H20; HYD; MTO; OX; OXO; UNK; 13670-17-2; 14314-42-2; 17778-80-2; 558440-22-5; DOD; DUM; glc; O2; OH; OX1; TIP; UNL”

NOT what I would call a quality set of names. These will be curated: some by appropriate robots and some manually.
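To give a feel for what a first-pass curation robot might look like, here is a minimal Python sketch. It is not the robot ChemSpider runs; the function names and the keyword list are hypothetical, picked simply by reading the water list above. It sets aside strings shaped like CAS registry numbers and flags names that look like crystallography annotations for a human to review.

```python
import re

# Shape of a CAS registry number: 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"^\d{2,7}-\d{2}-\d$")

# Hypothetical keyword list, chosen by eye from the water synonyms above.
ANNOTATION_WORDS = ("remark", "disordered", "bound", "linked", "unknown")


def triage_name(name):
    """Return 'registry-number', 'review' or 'keep' for one synonym."""
    text = name.strip()
    if CAS_PATTERN.match(text):
        return "registry-number"
    lowered = text.lower()
    if any(word in lowered for word in ANNOTATION_WORDS):
        return "review"
    return "keep"


def triage(names):
    """Group a list of synonyms into buckets by triage result."""
    buckets = {"keep": [], "review": [], "registry-number": []}
    for name in names:
        buckets[triage_name(name)].append(name)
    return buckets


sample = ["water", "Dihydrogen oxide", "See Remark 8", "BOUND WATER",
          "The Oxygen Is Linked To The Haem Iron", "7732-18-5"]
for bucket, members in triage(sample).items():
    print(bucket, members)
```

A robot like this only sorts names into piles. Anything in the "review" pile still needs a curator to decide whether the name stays or goes, and short codes such as "DIS" or "GTE" would need their own rule.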
This is an extreme. Let’s look at other examples already identified by curators. Below is an example of curation in process.

Some examples of curated data

Returning to Peter’s blog… an excerpt states: “Pubchem faithfully reflects the broken nature of chemical information. It cannot mend it – there are only ca. 20 people – and anyway the commercial chemical information world prefers to work with a broken system. But could social computing change it? Like Wikipedia has? [..] I think chemistry is different. And I think we could do it almost effortlessly – rather like the Internet Movie Database. Here every participant can vote for popularity or tomatoes. A greasemonkey-like system could allow us to flag “unuseful names” or to vote for the preferred names and structures. And this doesn’t have to be done on PubChem – it could be a standoff site [..].”

I happen to agree. I believe social computing can change it. That is the purpose of the curating process on ChemSpider. When we set up the system we were not sure that people would care or help in curating the data. Why? Here’s why people might NOT want to help us curate the data:

  1. ChemSpider is not PubChem. The data cannot be downloaded.
  2. ChemSpider is a business…why should people help a business increase the quality of the data they host?
  3. ChemSpider is new. Who says that the efforts made to curate data will be of value to others? How long will ChemSpider be around to allow people’s work to benefit others?

All valid questions. And they likely ARE deterrents to people helping improve the quality of data on ChemSpider. So, what are the answers to these questions, and are they enough to convince ChemSpider users to assist in curating the database? Our responses to the questions above are as follows:

  1. We do not have permission from all depositors to ChemSpider to allow their data to be downloaded, only viewed. However, we WILL redeposit all curated data originally sourced from PubChem back to PubChem. In an email exchange this past week, Steve Bryant from PubChem commented that they would willingly accept curated data back into their database. We will also make available a downloadable database of all curated data originally sourced from public sources. We will also provide feedback to other depositors when we find errors.
  2. I have done my utmost to explain this in a previous post here.
  3. ChemSpider has traction. It is getting lots of use. Based on interest we believe that our initial efforts have already provided enough response to have us continue this work. We have challenges as discussed previously but we are busily addressing these now. We believe that every effort made to improve the quality of data on the ChemSpider database will benefit all users and the community in general, because we will give the curated data back to PubChem and other database providers.

I have outlined only a small number of possible concerns above. There may be more. I welcome any other questions you may have about our intentions.

 

About tony

Antony (Tony) J. Williams received his BSc in 1985 from the University of Liverpool (UK) and PhD in 1988 from the University of London (UK). His PhD research interests were in studying the effects of high pressure on molecular motions within lubricant related systems using Nuclear Magnetic Resonance. He moved to Ottawa, Canada to work for the National Research Council performing fundamental research on the electron paramagnetic resonance of radicals trapped in single crystals. Following his postdoctoral position he became the NMR Facility Manager for Ottawa University. Tony joined the Eastman Kodak Company in Rochester, New York as their NMR Technology Leader. He led the laboratory to develop quality control across multiple spectroscopy labs and helped establish walk-up laboratories providing NMR, LC-MS and other forms of spectroscopy to hundreds of chemists across multiple sites. This included the delivery of spectroscopic data to the desktop, automated processing and his initial interests in computer-assisted structure elucidation (CASE) systems. He also worked with a team to develop the world’s first web-based LIMS system, WIMS, capable of allowing chemical structure searching and spectral display. With his developing cheminformatic skills and passion for data management he left corporate America to join a small start-up company working out of Toronto, Canada. He joined ACD/Labs as their NMR Product Manager and held various roles, including Chief Science Officer, during his 10 years with the company. His responsibilities included managing over 50 products at one time prior to developing a product management team, managing sales, marketing, technical support and technical services. ACD/Labs was one of Canada’s Fast 50 Tech Companies, and Forbes Fast 500 companies in 2001. His primary passions during his tenure with ACD/Labs were the continued adoption of web-based technologies and developing automated structure verification and elucidation platforms.
While at ACD/Labs he suggested the possibility of developing a public resource for chemists attempting to integrate internet available chemical data. He finally pursued this vision with some close friends as a hobby project in the evenings and the result was the ChemSpider database (www.chemspider.com). Even while running out of a basement on hand built servers the website developed a large community following that eventually culminated in the acquisition of the website by the Royal Society of Chemistry (RSC) based in Cambridge, United Kingdom. Tony joined the organization, together with some of the other ChemSpider team, and became their Vice President of Strategic Development. At RSC he continued to develop cheminformatics tools, specifically ChemSpider, and was the technical lead for the chemistry aspects of the Open PHACTS project (http://www.openphacts.org), a project focused on the delivery of open data, open source and open systems to support the pharmaceutical sciences. He was also the technical lead for the UK National Chemical Database Service (http://cds.rsc.org/) and the RSC lead for the PharmaSea project (http://www.pharma-sea.eu/) attempting to identify novel natural products from the ocean. He left RSC in 2015 to become a Computational Chemist in the National Center of Computational Toxicology at the Environmental Protection Agency where he is bringing his skills to bear working with a team on the delivery of a new software architecture for the management and delivery of data, algorithms and visualization tools. The “Chemistry Dashboard” was released on April 1st, no fooling, at https://comptox.epa.gov, and provides access to over 700,000 chemicals, experimental and predicted properties and a developing link network to support the environmental sciences. Tony remains passionate about computer-assisted structure elucidation and verification approaches and continues to publish in this area. 
He is also passionate about teaching scientists to benefit from the developing array of social networking tools for scientists and is known as the ChemConnector on the networks. Over the years he has had adjunct roles at a number of institutions and presently enjoys working with scientists at both UNC Chapel Hill and NC State University. He is widely published with over 200 papers and book chapters and was the recipient of the Jim Gray Award for eScience in 2012. In 2016 he was awarded the North Carolina ACS Distinguished Speaker Award.

Posted by on May 5, 2007 in Quality and Content

 
 