
The Curation of Almost 5000 Structures on Wikipedia

08 Mar

I recently commented on the statement made by Eric Shively of CAS about the CAS Validation Project going on at Wikipedia. The basic premise of the work is that CAS numbers need to be validated, so that the CAS number listed in a chemical box is associated with the structure shown in that same box. So, if the structure has stereochemistry, make sure that the CAS number is for the form of the structure with that stereochemistry. If the CAS Number is for a neutral compound then the structure displayed should not be the salt. And so on, and so on. There are many sources of CAS Numbers online, and many places to search for them to confirm a number. Type in “CAS Number search” online and you’ll find a lot of hits, though admittedly not all of them are related to Chemical Abstracts Service.
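As an aside, one purely mechanical check can be run before any of the structure matching described above: a CAS Registry Number carries its own check digit. The sketch below is only illustrative and the function name `cas_check_digit_ok` is my own invention. It catches typos and malformed numbers; it says nothing about whether the number belongs to the structure shown, which is the real point of the validation work.

```python
import re

# CAS Registry Numbers look like NNNNNNN-NN-N: 2 to 7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_check_digit_ok(cas_number: str) -> bool:
    """Return True if the string is well formed and its check digit is consistent.

    Hypothetical helper for illustration only. It does NOT confirm that the
    number is assigned, or that it matches any particular structure.
    """
    match = CAS_PATTERN.match(cas_number.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)  # every digit except the check digit
    check = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost digit of the body.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check


if __name__ == "__main__":
    for candidate in ("108-88-3", "7732-18-5", "108-88-4", "toluene"):
        print(candidate, cas_check_digit_ok(candidate))
```

For toluene (108-88-3) the weighted sum is 8×1 + 8×2 + 8×3 + 0×4 + 1×5 = 53, and 53 mod 10 gives the check digit 3.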

Some examples of “online CAS number searches” are excellent. In the order that I see them in my search:

The NIST Webbook – much loved by many scientists and very useful.

ChemIndustry – An excellent resource for chemists, and one that I believe is gaining a good following in the market.

ChemFinder – CambridgeSoft’s online search system

A Buyers Guide – A German Chemical Buyer’s Guide.

Pennsylvania Department of Environmental Protection

California Department of Pesticide Regulation

And on and on. There are likely legal reasons for a number of these databases to carry CAS Numbers. As I continued to peruse the list I was more than impressed by the number of databases serving up CAS numbers online. I believe a number of them contain over 10,000 numbers which, as I have commented before, is rather a magic number. Should Wikipedia be concerned about the 10,000 CAS number issue, along with some of the other issues being discussed now?

PMR recently commented on my blogpost here. He said “PMR: Wikipedia has between 1000 and 2000 chemical substances (ca 0.01% of the total number of substances in CAS).”

The number of chemical substances in Wikipedia is actually MUCH higher than that…I know since I’ve been looking at them in detail, as described here. To clarify, I am building an SDF file from the chemicals on Wikipedia so that it can be deposited on ChemSpider and linked back to Wikipedia. This was done earlier by linking up chemical names, but that was far from perfect, so we are doing it in this more “curated” manner. The outcome of the work, thanks to multiple other sets of eyes from WP:CHEM, will be a curated SDF file. I will return the SDF file with the following fields generated: SMILES string, Systematic Name, InChI string, InChIKey. These can then be used to homogenize the available fields in the Chemical Boxes etc.
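For anyone who hasn’t looked inside an SDF file, here is a rough sketch of what one record with those four fields could look like. It assumes the structure is already available as a molfile block, and the helper `add_sdf_fields`, the tag names and the methane example are all my own placeholders, not the actual fields or tooling used for the curated file. Generating the SMILES, InChI and InChIKey themselves needs a cheminformatics toolkit and is not shown.

```python
def add_sdf_fields(molblock: str, fields: dict) -> str:
    """Append SDF data items to a molfile block and close the record.

    Hypothetical helper: each field becomes a "> <TAG>" header line, the value,
    and a blank line; the record ends with the "$$$$" delimiter.
    """
    lines = [molblock.rstrip("\n")]
    for name, value in fields.items():
        lines.append(f">  <{name}>")
        lines.append(str(value))
        lines.append("")  # a blank line terminates each data item
    lines.append("$$$$")
    return "\n".join(lines) + "\n"


# Minimal V2000 molfile for methane (a single carbon atom), for illustration only.
METHANE_MOLBLOCK = "\n".join([
    "methane",
    "  sketch",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
])

# Tag names are assumptions; the values are the well-known identifiers for methane.
METHANE_FIELDS = {
    "SMILES": "C",
    "SystematicName": "methane",
    "InChI": "InChI=1S/CH4/h1H4",
    "InChIKey": "VNWKTOKETHGBQD-UHFFFAOYSA-N",
}

if __name__ == "__main__":
    print(add_sdf_fields(METHANE_MOLBLOCK, METHANE_FIELDS), end="")
```

A full file is just these records written one after another, which is why a consistent set of tags across all records makes the chemical boxes easy to homogenize.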

In doing the work (I have already worked through the whole alphabet) I have over 4900 compounds already curated at a first level. I have disregarded the majority of inorganics and organometallics for this pass; ca. 5000 organics manually curated is ENOUGH of a challenge. I estimate the number of chemical compounds on Wikipedia to be about 6500-7000, and it’s growing. So, it’s about a factor of 3-4 bigger than even the upper end of PMR’s estimate. The vast majority do have CAS numbers. While it hasn’t hit 10,000 yet… it’s coming.

 

Posted on March 8, 2008 in Uncategorized

 


4 Responses to The Curation of Almost 5000 Structures on Wikipedia

  1. Chris Singleton

    March 17, 2008 at 9:12 pm

    Any plans on linking a search on chemspider with its Wikipedia counterpart? E.g. I look for toluene on chemspider, and then there will be a link that will load wikipedia and perform the search for toluene on wikipedia, without the user having to do it explicitly. Or are we going to have to manually put in each Wikipedia address?

     
  2. tony

    March 27, 2008 at 8:17 pm

    Chris….ChemSpider is already linked to Wikipedia. Go to http://www.chemspider.com/q/toluene and look for the name with [Wiki] in the brackets. Click on the word Wiki and you will go directly to Wikipedia.

    We did this about 8 months ago. The links will be much “purer” when the Wikipedia data is curated.

     
