Archive for category Community Building
I get interviewed quite regularly regarding ChemSpider, my views on Open Data and data quality on the internet, as well as general comments about the chemistry data explosion online. So, when I was interviewed recently for the online article “Chemistry’s web of data expands” I was more than happy to give my thoughts regarding patent data coming online, data quality and the need for standards for handling chemistry data.
One of the parts of the conversation was regarding the work put in to clean up chemistry data on Wikipedia. What seems like an eternity ago I did “Dedicate Christmas Time to the Cause of Curating Chemistry on Wikipedia” and initiated a project to check every chemical compound on Wikipedia, bond by bond, atom by atom. However, I very quickly connected with Walkerma who then introduced me to a number of other Wikipedia Chemistry people. I started participating in IRC Chats with this group and we started exchanging comments about how we could move the project along. It was a pleasure to work with the team and while I did continue to participate it was nowhere near the level that I had contributed in the early days of the project. The project was a collaborative effort for sure, one of the best I have been involved with over the past few years.
When the original article on Nature.Com was published it stated “In fact, notes Williams, Wikipedia proved the most reliable source of structure information in that experiment – largely because he had led an effort to clean up the site’s 13,000 structures.” I definitely didn’t want that statement in the article and had specifically requested that I be represented as being part of a collaborative effort. I did not lead the project…I was only a part of it. So, after a couple of email exchanges with the author of the article, Richard van Noorden, the language was changed to “In fact, notes Williams, Wikipedia proved the most reliable online source of structure information in that experiment – largely because of an effort to clean up the site’s 13,000 pages about drugs and chemicals”. It’s a subtle edit but I definitely did not want to carry the responsibility for leading a project that was an ideal representation of crowdsourcing, collaboration and caring for chemistry on Wikipedia. And, to clarify…I know for a fact that not all pages are fully curated and validated yet…it’s a long process!!!
This is one of my presentations at the ACS meeting today in San Diego regarding how to use social networking tools to expose yourself as a scientist
Social networking tools as public representations of a scientist
The web has revolutionized the manner by which we can represent ourselves online by providing us the ability to expose our data, experiences and skills online via blogs, wikis and other crowdsourcing venues. As a result it is possible to contribute to the community while developing a social profile as a scientist. At present many scientists are still measured by their contributions using the classical method of citation statistics and a number of freely available online tools are now available for scientists to manage their profile. This presentation will provide an overview of tools including Google Scholar Citations and Microsoft Academic Search and will discuss how these and other tools, when integrated with the ORCID identifier, may more fully recognize the collective contributions to science. I will also discuss how an increasingly public view of us as scientists online will likely contribute to our reputation above and beyond citations.
It’s going to be a busy meeting at the ACS Spring Meeting in San Diego. I am presenting five of my own talks and am co-author on 5 more. It’s going to be fun to get them all done! Read that as a challenge…and unfortunately despite my best intentions I NEVER get them written before I leave and am writing/tweaking them the night before. Such it is….
If you happen to be coming to the ACS and are interested in ChemSpider and how RSC informatics contributes to the world of chemistry please do find time to come and visit the RSC booth and, if you have time, let’s sit over a computer and a coffee and chat!
PAPER ID: 15442; PAPER TITLE: “ChemSpider as a chemical term resolver” (final paper number: 131); DAY & TIME OF PRESENTATION: March 29, 2012 from 10:00 am to 10:20 am; LOCATION: San Diego Convention Center, Room 27A
PAPER ID: 10915; PAPER TITLE: “Great promise of navigating the internet using InChIs” (final paper number: 101); DAY & TIME OF PRESENTATION: March 28, 2012 from 9:05 am to 9:35 am; LOCATION: San Diego Convention Center, Room 27A
PAPER ID: 10902; PAPER TITLE: “Chemistry made mobile – the expanding world of chemistry in the hand” (final paper number: 68); DAY & TIME OF PRESENTATION: March 26, 2012 from 2:45 pm to 3:20 pm; LOCATION: San Diego Convention Center, Room 25C
PAPER ID: 11299; PAPER TITLE: “Social networking tools as public representations of a scientist” (final paper number: 123); DAY & TIME OF PRESENTATION: March 28, 2012 from 2:25 pm to 2:50 pm; LOCATION: San Diego Convention Center, Room 25C
PAPER ID: 10893; PAPER TITLE: “Teaching NMR spectroscopy using online resources from the Royal Society of Chemistry” (final paper number: 61); DAY & TIME OF PRESENTATION: March 25, 2012 from 2:50 pm to 3:10 pm; LOCATION: Westin San Diego, Room Diamond II
There are many social networking tools for scientists that can be used to share information, engage the social network and move information about activities across the web. This presentation provides an overview of some of the tools available and how they can be used by scientists to expose their activities, manage their profile publicly and participate in the network.
I’m a BIG Wikipedia fan. It is one of my favorite sites, our 9-year-old twins have spent many hours on the site with me, and I have personally spent a lot of time, including Christmas, curating chemistry on Wikipedia. I like what Wikipedia has achieved, have willingly contributed articles, but also enjoy a good laugh at Wikipedia’s expense when appropriate. In the past 24 hours I’ve giggled at the latest XKCD cartoon as well as this blog post about Jimmy Wales.
Despite my affection for Wikipedia this week I am annoyed about what’s going on for me on Wikipedia. I’ve read The Wikipedia Revolution and understand the editorial activities and I’ve had many discussions about how authors of Wikipedia articles have been “beaten up” in a friendly way. I’ve been warned about Conflict of Interest policies and yet, because I think it’s important, have tried to navigate the complexities of contributing articles. At present however my contributions on Wikipedia regarding scientists and projects I know about have all been flagged, either for deletion or for “notability”.
I’ve written the bulk of these articles: Gerhard Ecker, Sean Ekins and Gary Martin. Some of the flags on the articles include “It may have been edited by a contributor who has a close connection with its subject. Tagged since November 2011.”
Gary Martin and Sean Ekins are personal friends so YES, I have close connections with the subject. And I believe I can objectively write a good article about them. Just like I wrote about the village I grew up in…Afonwen. I only spent 12 years of my life there….so have a close connection with that too. I have known Gerhard Ecker for about three years, and know about his work from reading his articles and hearing him speak, and feel it’s valid to contribute an article as I JUDGE he’s a notable scientist. Gary Martin has almost 300 publications, and an h-index of 27. In the domain of NMR anyone who is doing small molecule structure elucidation is almost certainly using technology he has contributed to. He is notable. Sean Ekins is also notable, in my opinion. And surely Wikipedia is about collective opinions.
I have tried to follow notability guidelines for academics but have clearly failed so encourage anyone reading this post to help clean up the articles. If any of you out there happen to know Gerhard, Gary or Sean DON’T contribute though…you might get flagged as being a contributor who has a close connection. It’s much better to write about people you don’t know. Clearly I understand the possible bias …
If I look at the number of chemists on Wikipedia I find the following list of about 480 chemists. That article is a list of world-famous chemists. There is also a smaller list of Russian Chemists. The end of the list looks like this:
These are likely all NOTABLE chemists as I couldn’t find a single article in the list with a challenge on it…but I confess to not looking at each one individually. But that’s what we have for chemists….a list of world-famous chemists, biochemists and Russian chemists.
Many of us have heard about how “open” Wikipedia is, including many of the exchanges regarding pornography on Wikipedia. In many cases I have to simply caution “welcome to the internet”. We all know it’s out there…how could we not. There is material on Wikipedia that is shocking, but at the same time educational. But where I take issue, just for comparison purposes, is that top-notch scientists, in my opinion (and, I judge, in the opinion of many others), can be flagged as not notable, yet pages like those listed below for pornstars can exist without question, without flagging but, I have to assume, are both encyclopedic and notable.
Similar to the list of chemists a search on pornstars gives a full article here but then these incredibly long lists!
- Category:Pornographic film actors
- Category:Lists of pornographic film actors
- List of British pornographic actors
- List of Asian pornographic actors
- List of African-American pornographic actors
- List of pornographic actors who appeared in mainstream films
- List of pornographic actresses by decade
The last one is quite a list! I guess it’s appropriate to list pornstars by decade but scientists tend to perform better over the longer term and can have 40-50 year careers whereas I don’t even want to imagine that for the other career! I struggle to see why the list of references for Ron Jeremy is any more notable/appropriate than the list of references for Gary Martin.
What’s ridiculous is that there is even an article about pornstar pets. What??? This has more of a place on Wikipedia than some of our world’s most published scientists? Is there something wrong with this picture?
While I may not fully understand what is deemed to be appropriate in terms of notability for a scientist, and I do understand the judgment that I might be too close to the scientists to be objective (but I challenge that!) I definitely challenge the status that pornstars deserve more exposure, pardon the pun, than the world’s chemists.
Despite my rants I understand the challenges that will likely show up as comments on this blogpost. I understand that I will be pointed to WP:COI and WP:Notability. I do not get to set the rules, I need to follow them as I am a small part of a very important community of crowdsourced improvement. But, overall, I remain surprised at how there appears to be so much diligence looking at the articles of scientists rather than those of pornstars. I think scientists are generally involved in very notable activities that generally distinguish them from the bulk of the population. I think pornstars are involved in activities that are not particularly notable as the bulk of the population will do them at some point in their life….well, not ALL activities that pornstars do I’m sure…..
I believe we need a change in policy. I believe that scientists deserve more notability than pornstars and that diligence, while appropriate, should be used in a more tempered manner.
There is an alternative solution…
I spend a lot of my night time hours browsing the internet looking for new chemistry resources that may be of value to the community. During my travels across the web I continue to stumble across new resources that I have never heard of before. Some of these databases focus on minerals, on drugs (pharmaceuticals and street varieties), on polymers or organometallics, crystal structures or ligands and targets. The number and diversity of databases out there on the internet touching just chemistry is incredibly large. I judge there are many tens of databases potentially of interest to chemistry (hundreds if we include “SDF files” from chemical vendors). It would not be possible to support all of these resources in ChemSpider, which has, presently, a limitation of supporting small organic molecules that can be represented by an InChI. ChemSpider has assumed the role of being the central hub for linking chemistry on the internet but when chemistry database resources cannot be indexed into the system there is still a value to the community, I believe, to provide a central resource for chemists to source information. This could be expanded to include other types of Scientific Databases including Biology, Physics etc. Welcome SciDBs….a wiki for Scientific Databases.
There are already three databases represented on the Wiki: Zinc, ChemSpider and DrugBank. We are hoping that the hosts of the databases, BOTH commercial and freely available, will add their own databases online. Else, with time, we will do it….
@mattoddchem has posted “Raw Data in Organic Chemistry Papers/Open Science” regarding his wanting to “share data in my field I have to think about how to best share those data”. I recommend you read his whole post before continuing.
Mat has made a number of comments (my annotations added as AJW>)
What are the downsides of posting raw data in organic chemistry, either in papers or to lab book posts:
1) You have to save the data and then upload them. Well, this was a problem in 1995, but not now.
AJW> The problem is definitely not in the technology of uploading to a website. It can be easy to do BUT there is still an activation barrier: wanting to share, can I share (do I have the rights or will my boss/supervisor be upset?), will there be negative repercussions for sharing.
2) The data files are large. Not really. A 1H NMR spectrum is ca. 200KB.
AJW> Depending on the magnetic field strength and chosen resolution, file sizes can be less than half this size or twice this size. It also depends a lot on the format. A binary file of about 100 KB can blow up to about 0.5 MB as JCAMP (assuming none of the compressed formats is used).
3) It’s a pain. Yes, a little. But we must suffer for things we love.
AJW> Yes indeed….
4) People might find mistakes in my spectra/assignments. Yes. You’re a scientist. This is a Good Thing.
AJW> If we were afraid of the reexamination of our data we shouldn’t get into science, we shouldn’t publish and we shouldn’t share. How many scientists are like that???
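To put rough numbers on the file-size point in item 2, here is a minimal sketch comparing a packed binary representation with an uncompressed text table. The point count (32,768), the 4-byte integer storage and the table layout are all assumptions chosen for illustration; real vendor formats and real JCAMP files differ in their details.

```python
import random
import struct

def binary_size(points):
    """Bytes needed to store each intensity as a 4-byte signed integer."""
    return len(struct.pack(f"<{len(points)}i", *points))

def ascii_size(points, per_line=8):
    """Bytes needed for an uncompressed, JCAMP-like text table.

    Each line starts with an index standing in for the X value, followed
    by `per_line` Y values written as plain decimal text.
    """
    lines = []
    for start in range(0, len(points), per_line):
        ys = " ".join(str(y) for y in points[start:start + per_line])
        lines.append(f"{start} {ys}")
    return len(("\n".join(lines) + "\n").encode("ascii"))

random.seed(0)  # fixed seed so the sketch is repeatable
# Hypothetical spectrum: random intensities, not real NMR data
spectrum = [random.randint(-2_000_000, 2_000_000) for _ in range(32768)]

print("binary bytes:", binary_size(spectrum))
print("text bytes:  ", ascii_size(spectrum))
```

The binary figure is simply 32,768 points at 4 bytes each (131,072 bytes). The text figure grows with the number of digits per value, which is why the compressed JCAMP encodings exist.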
Mat went on to ask about what types of raw data should be posted?
“If we can establish that we should be posting raw data, then what kinds of data should we share, and how? This post is meant to outline an answer, and ask for feedback from anyone who’s already thought about this.
1) X-ray crystallography. This is the exception. Data are routinely deposited raw, and may be downloaded. Not always the case, but XRD blazes a trail here.
AJW> I’m not a crystallographer…I’m an NMR jock by training. So the term “raw data” to me means what comes off the instrument. I think that most crystallography data is processed to extract the CIF file so it’s not raw per se. In NMR raw would be the binary file format from the vendor as a FID file, the time domain data. A frequency domain, phased, baseline corrected and referenced spectrum would not be raw data. The FID is not the most useful form of the spectrum for sure, and real time processing of the FID with all necessary corrections (referencing and phasing etc) is not going to succeed for all spectra.
2) NMR spectroscopy. The big one. IUPAC recommends the JCAMP-DX file format. Jean-Claude Bradley has been a proponent of this format, and has demonstrated how it can be used in all kinds of applications.
AJW> JCAMP is the de facto standard for data exchange in NMR. Most of the vendors (probably all) can autosave their data in JCAMP or at least export in JCAMP. The third-party processing applications import and export JCAMP, some in all of their various flavors….for example, the ACD/NMR processor package has the ability to export all of the various JCAMP flavors….including real only, real and imaginary, with extensions for integrals, peaks, assignments etc. The five JCAMP forms are listed. There are of course no guarantees that all of these various JCAMP flavors can be interpreted by other programs. JCAMP has that type of complexity unfortunately.
AJW> I downloaded a number of these and read them easily into my NMR processing package that I use on my PC. I submitted the first structure in the series 2a (N-(3-Azidopropyl)-1-methylpyrrole-2-carboxamide) to ChemSpider (it wasn’t on the database already) and then added the H1 and C13 spectra with a few clicks (see here how). The spectra are flagged as Open Data and available for everyone to download and reuse. I am assuming this is okay since the PLoS article is Open Access, though the data itself isn’t flagged as Open. The data are on ChemSpider here and shown below.
My opinion of this file format (both generating it, and reading it) has not been great. There are two formats, I understand, and we found that if we saved the data in the wrong format, we couldn’t read the data with certain programs, but could with others. i.e. we had to get the generation of the file just right.
AJW> There are way more flavors than two! See above in the image. Over 10 years at ACD/Labs we had to change the JCAMP reader 10s of times to support all of the different implementations of the “standard” JCAMP. A standard set is not a standard followed…
That kind of trickiness, though small, just inevitably means people won’t bother to generate or use the files on a mass scale (unless the journals decide to back it). PDF’s popularity is based on the ubiquity of the reader. JCAMP-DX works well with Jspecview, a free, open source NMR data reader. We’ve not enjoyed our experiences with this, either, though it’s a wonderful endeavour. This led us to look at whether there was a need for saving the data in a particular format, or whether we could just save the raw data, and process those data with a free piece of software. After looking at this with our resident NMR guru, Ian Luck, we found that saving raw data is easy (it’s just a copy and paste of what’s produced by the machine) and that the raw data can be read by free software such as Spinworks or ACDLabs, obviously in addition to our in-house software. This seems ideal? Does anyone have the reason IUPAC prefers a derived data format over the raw data, other than JCAMP-DX is a single file? Aren’t raw data likely to be the most generically useful long-term?
AJW> Assuming that raw data means binary file formats from a spectrometer vendor I can comment that binary files have big issues in terms of longevity. Binary files are that way for a reason…one reason being that the vendors have proprietary information regarding their data acquisition and corrections in the format that are then efficiently dealt with by the vendor’s software. To try and process Varian, Bruker or Jeol files from many years ago, written by their archaic software, is likely not easy even in their own modern software. Depending on the vendors to provide their binary file formats to code against generally requires permission AND legal contracts. When I joined a place of work about 20 years ago we had a room of old Winchester drives in OLD Bruker and Varian formats. Not only could we NOT read the files, we couldn’t mount the Winchester drives. What about 5.25″ drives, 3.5″ drives, optical disks etc.? Lots of that data is almost certainly rotted at this point. JCAMP does offer longevity. Even when a spectrum cannot be read, generally it is not difficult to hack the reader. Fixes are easy.
I don’t know if people have experience of this. I was in touch with one of the ACS journals recently, who indicated that their view was that the journal is not a data repository, and that posting of raw data (which was in their view to some extent desirable) should be posted elsewhere, e.g. to an institutional repository. This is an option. I think it’s less convenient. PLoS seem happy to host the data.
AJW> PLoS are hosting the data but in order to view the data you have to have a JCAMP viewer or processing package that will read the JCAMP spectrum. What a reader might like to be able to do is view the spectrum on the Supp Info page. This is of course possible as can be done using the Embed Functionality of ChemSpider’s spectra. The spectra could be posted to ChemSpider and embedded in the PLoS pages if the journal allowed it. There are multiple advantages to this including the spectra being made available to the Spectral Game.
3) IR data. Don’t know if there is a standard. If the file is small, saving raw data could be encouraged. Would allow easy comparisons of fingerprint regions.
AJW> JCAMP is also a standard for IR. In fact I think JCAMP was developed for IR first? See here for IR spectra on ChemSpider.
4) Mass spectrometry. It’s not clear to me there is a huge advantage here to sharing raw data, for a typical low res experiment?
AJW> Most mass spec for chemists is done using soft ionization for mass only. Fragmentation obviously is of more value. If it’s Electron Impact then JCAMP will suffice. See here. In general NetCDF is the preferred format but we don’t support that on ChemSpider at present.
5) HPLC data. Again, the outputs are fairly simple, and I’m not clear about the advantage of raw data (which I’m assuming would be absorbance vs. time table). Would (perhaps) permit verification that traces have not been cropped to remove pesky impurities.
AJW> Adding HPLC support would not be difficult if the data are a simple HPLC curve. If it was 2D data of LC and UV that would be problematic.
6) Anything else?
AJW> Raman, Electron Spin Resonance, 2DNMR, Near infrared, Far infrared, UV-Vis etc. All of these, except ESR, are supported in ChemSpider. Check out this example of H1, C13 and THREE 2D NMR spectra!”
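The remark above that a JCAMP reader is "not difficult to hack" comes down to the format being plain text: labeled records of the form `##LABEL= value`, `$$` comments, and a data table. The following is only a minimal sketch, assuming an uncompressed table with whitespace-separated values; it does not handle the compressed encodings or the many vendor flavors discussed above, and the sample file content is made up.

```python
def read_jcamp_header(text):
    """Collect ##LABEL= value records into a dict (labels upper-cased)."""
    header = {}
    for line in text.splitlines():
        line = line.split("$$", 1)[0].strip()  # drop $$ comments
        if line.startswith("##") and "=" in line:
            label, value = line[2:].split("=", 1)
            header[label.strip().upper()] = value.strip()
    return header

def read_xydata(text):
    """Return the Y values from an uncompressed XYDATA table.

    Assumes each data line is one X value followed by Y values,
    separated by whitespace (no compressed encodings).
    """
    ys = []
    in_table = False
    for line in text.splitlines():
        line = line.split("$$", 1)[0].strip()
        if line.startswith("##"):
            # Any new labeled record ends the table unless it starts one
            in_table = line.upper().startswith("##XYDATA")
            continue
        if in_table and line:
            fields = line.split()
            ys.extend(float(v) for v in fields[1:])  # fields[0] is X
    return ys

# Made-up file content for illustration only
sample = """##TITLE= hypothetical 1H spectrum
##JCAMP-DX= 5.01
##DATA TYPE= NMR SPECTRUM
##YFACTOR= 0.5
##XYDATA= (X++(Y..Y))
10.0 2 4 6
7.0 8 10 12
##END=
"""

header = read_jcamp_header(sample)
# Stored Y values are multiplied by YFACTOR to recover intensities
scaled = [y * float(header.get("YFACTOR", 1)) for y in read_xydata(sample)]
print(header["TITLE"], scaled)
```

When a particular vendor's file fails to load, the fix is usually a small change to a reader like this one, which is much harder to do for an undocumented binary format.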
An alternative format of interest might be AnIML, the Analytical Information Markup Language. This format was started a number of years ago and the site does look like it is static but you can follow AnIML on Twitter and it is still alive. I am a little pessimistic that it will provide much value in the near future as the standard will first need to be released, then accepted, then tools will need to be developed/extended to support the format. The challenge then is either that some organization will need to write all the format converters for all of the vendor instrument binary file formats (and support them for tech support etc) OR the vendors themselves will need to export AnIML – with similar issues to those that arose with JCAMP where the vendors developed their flavor. Clearly a validation suite for checking AnIML files would be appropriate. Overall AnIML has promise but it has taken years to get here already and I would say it will have limited impact in the foreseeable future.
For Mat’s spectral data it would be great to deposit it en masse to ChemSpider rather than for us to deal with it one spectrum and one compound at a time. That is why we already have developed mass deposition tools. If we receive the JDX files in individual directories with the associated molfile then we can do mass deposition. The spectra are then available on ChemSpider immediately, are available to the Spectral Game and also will be fed to our NEW implementation of SpectraSchool that is presently under development and will use spectra under ChemSpider as the foundation data. I’d say the solution for Mat is mostly in place…
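The "one directory per compound, JDX files plus the associated molfile" layout can be checked before anything is sent. The sketch below is not ChemSpider's deposition tool. It is a hypothetical pre-flight check; the function name, the `.mol` and `.jdx` extensions and the example folder names are assumptions.

```python
import tempfile
from pathlib import Path

def find_deposition_sets(root):
    """Pair each compound directory's molfile with the spectra beside it.

    Returns (ready, skipped): `ready` holds directories with exactly one
    .mol file and at least one .jdx file; `skipped` lists the others.
    """
    ready, skipped = [], []
    for folder in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        molfiles = sorted(folder.glob("*.mol"))
        spectra = sorted(folder.glob("*.jdx"))
        if len(molfiles) == 1 and spectra:
            ready.append({
                "compound": folder.name,
                "molfile": molfiles[0].name,
                "spectra": [s.name for s in spectra],
            })
        else:
            skipped.append(folder.name)
    return ready, skipped

# Demo on a throwaway tree: "2a" is complete, "2b" lacks its molfile
with tempfile.TemporaryDirectory() as tmp:
    layout = {
        "2a": ["2a.mol", "2a_1H.jdx", "2a_13C.jdx"],
        "2b": ["2b_1H.jdx"],
    }
    for name, files in layout.items():
        folder = Path(tmp) / name
        folder.mkdir()
        for filename in files:
            (folder / filename).write_text("")
    demo_ready, demo_skipped = find_deposition_sets(tmp)

print(demo_ready)
print("skipped:", demo_skipped)
```

Listing the skipped directories up front makes it easy to spot a compound whose molfile was forgotten before a batch is handed over.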
Tonight I was trying to add a couple of publications to my Google Scholar Citations account. The interface available to me is shown below.
It would make a lot more sense to provide the ability to input a DOI and use the Crossref Resolver to retrieve all of the associated information. There is a full DOI for this purpose (we use it on ChemSpider already) so if I wanted to put our new Melting Point Nature Precedings article into Google Scholar Citations then I would simply put 10.1038/npre.2011.6229.1 into the resolver and it would link me to the article or, for Google Scholar Citations, it would fill in the appropriate fields. Seems like a good marriage?
I have blogged recently about my experiences with Google Scholar Citations. (1,2). It has been useful in highlighting what science I have published that people might find of interest as well as trends in citation patterns. It has also highlighted some potential issues in the data.
I must admit I was quite surprised to see that the top cited paper was one from Eastman Kodak company where we looked at interactions between sodium dodecyl sulfate and gelatin, followed by work I did at the University of Ottawa. This work was in 1994 and 1991 respectively. This work was almost 20 years ago so it does make sense that the aggregation of citations over the years might have reached those levels. However, I would have expected that my work in the areas of NMR prediction, Computer-Assisted Structure Elucidation (CASE) and Indirect Covariance would have garnered a lot more citations, but that work did come about 10 years later. It is good to see that the more recent papers, for example that from 2008 on internet-based tools for communication and collaboration in chemistry, have garnered a following.
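As a rough illustration of the DOI idea, the sketch below builds the resolver link for a DOI and maps a Crossref-style metadata record onto the sort of fields a citation form asks for. It assumes the DOI is registered with Crossref and uses the public Crossref REST API as the metadata source, which is one possible choice and not necessarily the resolver meant here. The form field names and the example record are placeholders, not the real metadata for this article.

```python
import json
import urllib.parse
import urllib.request

def doi_link(doi):
    """Resolver URL that redirects a browser to the article."""
    return "https://doi.org/" + urllib.parse.quote(doi, safe="/")

def crossref_url(doi):
    """Crossref REST API URL returning JSON metadata for a DOI."""
    return "https://api.crossref.org/works/" + urllib.parse.quote(doi, safe="/")

def fetch_crossref_record(doi):
    """Download the metadata record (needs network access; not called here)."""
    with urllib.request.urlopen(crossref_url(doi)) as response:
        return json.load(response)["message"]

def to_form_fields(record):
    """Map a Crossref-style record onto hypothetical citation form fields."""
    authors = [
        f"{a.get('given', '')} {a.get('family', '')}".strip()
        for a in record.get("author", [])
    ]
    date_parts = record.get("issued", {}).get("date-parts", [[None]])
    return {
        "title": (record.get("title") or [""])[0],
        "authors": ", ".join(authors),
        "journal": (record.get("container-title") or [""])[0],
        "year": date_parts[0][0],
        "doi": record.get("DOI", ""),
    }

# Placeholder record shaped like a Crossref response; values are made up
example_record = {
    "DOI": "10.1038/npre.2011.6229.1",
    "title": ["Placeholder title"],
    "author": [{"given": "A.", "family": "Author"}],
    "container-title": ["Placeholder Journal"],
    "issued": {"date-parts": [[2011]]},
}

fields = to_form_fields(example_record)
print(doi_link("10.1038/npre.2011.6229.1"))
print(fields)
```

With network access, `to_form_fields(fetch_crossref_record(doi))` would do the lookup and the mapping in one step, so the only thing typed by hand is the DOI.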
Above is shown a list of my papers from as far back as 1990 that appear to not have any citations. There are also a lot of recent papers listed that I KNOW are cited, multiple times, as they have been referred to in some of my own publications. For example, the second one in the list, from 2009, entitled “Computer-assisted methods for molecular structure elucidation: realizing a spectroscopist’s dream” is an Open Access article and according to the journal statistics it is the top read article of all time on the Journal of Cheminformatics as shown here with, as of this writing, 10770 accesses. In fact, if I search the article directly on Google Scholar I find it IS cited 7 times as shown below.
I don’t know why it shows up as cited in Google Scholar but not in Google Scholar Citations. However, the same issue exists for the paper on the Spectral Game. See below. It shows no citations on Google Scholar Citations but shows 7 on Google Scholar.
Notice that in BOTH cases the article is listed as the Journal of Cheminformatics, not as the title of the paper. Maybe THIS is the reason the citations are missed. Maybe the publisher for the Journal of Cheminformatics is not exposed in a manner that has the publications indexed properly? Maybe….
On August 25/26 I will be attending the 5th Meeting on U.S. Government Chemical Databases and Open Chemistry. I will have the opportunity to spend time with people I appreciate for the contributions they are making to chemistry: Martin Walker, JC Bradley, Andy Lang, Markus Sitzmann, Ann Richard, Frank Switzer, Evan Bolton, Marc Zimmermann, Wolf Ihlenfeldt, Steve Heller, John Overington, Noel O’Boyle, and many others. It is surely going to be an excellent meeting. The agenda is given here.
Some of the people listed above are associated with “Washington-based databases”. Databases that are developed in or around Washington by government-funded organizations – the FDA, NIH, NCBI/NLM, NCI, NIST. There are also other government funded databases, non-Washington-based, represented – EPA and CDC. If you are not sure what all those three letter acronyms are then here you go.
FDA – Food and Drug Administration
NIH – National Institutes of Health
NCBI/NLM – National Center for Biotechnology Information/National Library of Medicine
NCI – National Cancer Institute
EPA – Environmental Protection Agency
CDC – Centers for Disease Control and Prevention
NIST – National Institute of Standards and Technology
One organization with a chemistry database conspicuous by its absence is the NCGC data collection contained in the NPC Browser. I’ve blogged a lot about this one on this blog.
NCGC – NIH Chemical Genomics Center
I am hoping to get to talk to some members of the team if they attend the meeting though.
There will be a LOT of government databases represented at this meeting. I have experience with many of the databases provided by these institutions. The DSSTox database is one of the most highly curated databases based on my review of the data. The NCI resolver is an excellent resource with good quality data in terms of the accuracy of name-structure relationships.
The various databases are developed independently of each other. True, some of the databases contain contents from some of the other databases but, as far as I can tell, there is not much collaboration in terms of coordinated curation of data. What would it be like if each of these organizations participated in a roundtable discussion to agree to a process by which to collaboratively validate and curate the data, once and for all? Maybe this meeting can catalyze such a discussion. I would encourage the organizations to take advantage of other data sources that can share their data – ChEBI/ChEMBL is one example! If these various groups coordinate their work then the result could be a massively improved quality dataset to share across the databases and across the community. If this work was done then the group that assembled the NPC Browser would likely have a lot less work to do in terms of assembling the data. The various database providers should certainly have provided clean, curated data for many of the top known drugs. While working on a manuscript reviewing the quality of public domain chemistry databases I assembled a table of 25 of the top selling drugs in the US and checked the data quality in the NPC Browser relative to a gold standard set. The assembly of the data will be discussed in its entirety in a later publication.
The errors listed in the table are:
1 Correct skeleton, No stereochemistry
2 Correct skeleton, Missing stereochemistry
3 Correct skeleton, Incorrect stereochemistry
4 Single component of multicomponent structure
5 Multiple components for single component structure
6 No structure returned based on Name Search
7 Incorrect skeleton
8 Multiple structures based on name search
Clearly there are a lot of errors in the structures associated with 25 of the best selling drugs on the US market. These should be the easy ones to get right as they are so well known!!! Collaboration between the domain’s top database providers would have helped, almost certainly. This would not necessarily be an issue of meshing technologies but agreeing on a common goal to have the highest quality data available. Since the government puts so much money into the development of these databases it would be appropriate to have some oversight and push for aligning efforts. Collaboration is essential!
With that in mind…a shameless pointer to how Sean Ekins, Maggie Hupcey and I BELIEVE in the need for collaboration…our book. If we can encourage others in the government chemistry databases to adopt active collaborative approaches wonderful things could happen.