09 Jan

I’ll confess that despite the lure of Christmas candy, repeats of oldie-but-goodie movies and the urge to go hack down a Xmas tree, I found it difficult to stay away from my computer over the holidays. While I stayed silent in the blogosphere, I probably spent more nighttime hours with my laptop than I have in the past few months. I had a conversation with Walkerma from the Wikipedia Chemistry group in December and confessed my interest in curating Wikipedia chemical structures. Those of you who read the ChemSpider blog will know I have rather a passion for curation. I’ve done a significant amount of it on ChemSpider and also, of late, on Wikipedia (see the taxol and diazonamide stories).

We have recently announced our intention to roll out WiChempedia over on the ChemSpider blog. Now, before we go grab the chemistry content I wanted to make sure that we could grab “clean” data. In keeping with the structure-centric nature of the system we want to build, my first charge was to check/validate/curate the structure-name pairs on Wikipedia. Using some CSV files provided to me by Martin Walker I went to work. First of all, those CSV files were dirty…the word Ethanol shows up in some obscure places. With the assistance of a good action movie, a glass of wine, some basic text queries and removals, and some delete-delete-delete keystrokes, I had removed the majority of “no way it could be a chemical” text strings. Then I imported the list of chemical names into a desktop chemical structure databasing tool (more on the tools in a separate post) and went to work. There were a few little tricks to make the whole process easier but those will be detailed elsewhere. In general I could check a structure in about 2 minutes. In some cases I had to redraw structures (some took a LONG time). I wandered between PubChem and ChemSpider, Chemrefer and Google looking for confirmation of structures and registry numbers.
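For anyone wanting to script that first clean-up pass rather than rely on delete-delete-delete, here is a minimal sketch in Python. It is not what I actually ran: the marker list, the length cut-off, the function names and the assumption that the name sits in the first CSV column are all hypothetical choices made for illustration.

```python
import csv
import io

# Hypothetical markers of "no way it could be a chemical" text (wiki markup, URLs).
# These are illustrative assumptions, not the queries used in the real clean-up.
NON_CHEMICAL_MARKERS = ("http", "[[", "]]", "{{", "}}", "|")


def looks_like_chemical_name(text):
    """Crude filter: reject empty, very long or markup-laden strings."""
    text = text.strip()
    if not text or len(text) > 200:
        return False
    if any(marker in text for marker in NON_CHEMICAL_MARKERS):
        return False
    # A name needs at least one letter.
    return any(ch.isalpha() for ch in text)


def clean_names(csv_text, name_column=0):
    """Return plausible chemical names from CSV text, dropping repeats.

    name_column is an assumption about the file layout.
    """
    seen = set()
    kept = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) <= name_column:
            continue  # blank or short row
        name = row[name_column].strip()
        key = name.lower()  # treat "Ethanol" and "ethanol" as the same entry
        if looks_like_chemical_name(name) and key not in seen:
            seen.add(key)
            kept.append(name)
    return kept
```

A filter like this only removes the obvious junk; every surviving name still needs a human to check its structure.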

I’ve made many edits to the Wikipedia entries already…you can see my contributions since Dec 15th online. I recently started to keep a more detailed report of the mistakes/suggestions/comments I have made on Wikipedia structures (as a result of a conversation with Walkerma). The latest report is here. Walkerma is posting a version of this online for members of WP:CHEM to comment on.

My overall conclusions so far…my estimate is that about 2-3% of the structure records online have errors. What’s an error?

1) The structure does not match what it “should be” based on review of many other sources.

2) Systematic nomenclature can be poor…if the name displayed on Wikipedia is converted to a structure, the result is sometimes inconsistent with the actual structure displayed.

3) Sometimes the formula or mass displayed in the ChemBox is inconsistent with the actual mass or formula of the structure displayed.

4) The SMILES or InChI string associated with the structure can produce a structure different from the one displayed when converted.

5) The registry number matches either a different structure or a different “form” of the structure. For example, the structure shown is a neutral form of the compound but the registry number is for the salt.

There are other issues but the ones above are the most common.
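Some of these checks lend themselves to automation. As an illustration of error type 3, here is a minimal Python sketch that computes a molar mass from a molecular formula and compares it with a listed value. The atomic weight table is deliberately short, the 0.05 tolerance is an arbitrary assumption, and the function names are mine; it handles only flat formulae such as C2H6O (no parentheses, charges or hydrates) and is not the process I used for the manual review.

```python
import re

# Rounded standard atomic weights for a few common elements; extend as needed.
ATOMIC_WEIGHTS = {
    "H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
    "F": 18.998, "Na": 22.990, "P": 30.974, "S": 32.06,
    "Cl": 35.45, "Br": 79.904, "I": 126.904,
}


def parse_formula(formula):
    """Return {element: count} for a flat formula such as 'C2H6O'."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    counts = {}
    for symbol, digits in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol not in ATOMIC_WEIGHTS:
            raise ValueError(f"element not in table: {symbol}")
        # A missing count means one atom of that element.
        counts[symbol] = counts.get(symbol, 0) + (int(digits) if digits else 1)
    return counts


def molar_mass(formula):
    """Sum atomic weights over the parsed formula."""
    return sum(ATOMIC_WEIGHTS[el] * n for el, n in parse_formula(formula).items())


def mass_matches(formula, listed_mass, tolerance=0.05):
    """True if the listed mass agrees with the formula within the tolerance."""
    return abs(molar_mass(formula) - listed_mass) <= tolerance
```

For ethanol, `mass_matches("C2H6O", 46.07)` passes, while a listed mass of 60.10 would be flagged for a closer look. A flagged record is only a candidate error; deciding whether the formula, the mass or the structure is wrong still takes a person.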

It turns out that Peter Murray-Rust and his group have been doing similar work, according to his post here. I appreciate his comment: “We are very grateful for this work. We are also doing similar things and we’d be delighted to coordinate”.

While this is not exactly Open Notebook Science, as I curate Wikipedia records I am keeping notes and putting them up online for others to check and comment on, so this is Collaborative Science through curation. This IS actually having an impact on the Wikipedia records every 24 hours at present. Not only am “I” making edits to records as I find errors, but when I open the conversation to others for their comments they make decisions and appropriate edits. You can see WP users making edits according to my comments (see here for example). I’m interested to see the similar contributions from Peter’s team.

There is expected to be an IRC chat with the WP:CHEM team in the near future and hopefully a chance to compare notes, processes and the path forward for curation. I’m looking forward to the opportunity to hear about Peter’s team’s approach to curating the data and to identify how we differ and how we can mesh our efforts. It would be good if PMR’s group could adopt an Open Notebook Science approach to Wikipedia analysis as he did recently with the NMR analysis. In that way we’ll be able to jointly track our efforts as we work together to help the Wikipedia team. (Peter, if you are reading, can you share your experiences of curating Wikipedia and what your team is observing? Can we do Collaborative Science on this project together?)


