Category Archives: General Communications

An InChIkey Collision is Discovered and NOT Based on Stereochemistry

InChI Strings and InChIKeys are very much the backbone of ChemSpider and have quickly become a way by which online databases are being connected. The InChIKey is a hash of the InChI String, and when the hash was adopted it was suggested that the likelihood of a collision was very small, the estimate being, as quoted from the official InChI site:

“An example of InChI with its InChIKey equivalent is shown below. There is a finite, but very small probability of finding two structures with the same InChIKey. For duplication of only the first block of 14 characters this is 1.3% in 10^9, equivalent to a single collision in one of 75 databases of 10^9 compounds each.”
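The quoted figure can be roughly reproduced with the standard birthday approximation. The sketch below assumes that the first block of 14 characters encodes 65 bits of the hash and that hash values are uniformly distributed; neither assumption is stated in the quote itself, and the function name is purely illustrative.

```python
import math

def collision_probability(n_items, hash_bits):
    """Birthday approximation: chance that at least two of n_items share a hash value."""
    space = 2 ** hash_bits
    exponent = n_items * (n_items - 1) / (2 * space)
    # -expm1(-x) equals 1 - exp(-x) but keeps precision when x is small
    return -math.expm1(-exponent)

# Assumption: the 14-character first block carries 65 bits
p = collision_probability(10**9, 65)
databases_per_collision = 1 / p

print(f"Chance of a first-block collision among 10^9 structures: {p:.2%}")
print(f"Roughly one collision per {databases_per_collision:.0f} such databases")
```

Under these assumptions the result lands close to the quoted "1.3% in 10^9" and "one of 75 databases", which suggests the official estimate is a birthday-style bound of this kind.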

At a previous ACS Meeting Prof Jonathan Goodman from the University of Cambridge announced that he had identified a collision. The collision was for two isomers of spongistatin, a rather complex chemical structure with many stereocenters.

Jonathan has “done it again”…what a troublemaker he is (in a supremely gentlemanly way!). I was fortunate enough to receive the news about this collision from him just as I was getting on the flight from ACS Denver to home tonight and asked his permission to blog it as it is both exciting and, I believe, quite surprising news. Why? In this case the collision is for two distinctly different chemicals with totally different formulae and with NO stereochemistry! Very surprising!

As you can see in the figure below the two chemical compounds are simply long branched alkyl chains, one an alcohol and one a ketone.

In case Jonathan’s software tool that he was using to connect to the InChI generation software was doing something untoward with the molfile I confirmed the observation myself by drawing the structures in ACD/ChemSketch and generating the InChIKeys there. And, sure enough…I see exactly the same Standard InChIKeys for both molecules as shown in the movie below. VERY interesting!
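As a rough illustration of how a database might flag this kind of collision, the sketch below groups records by InChIKey and reports any key that maps to more than one distinct InChI string. The function name and record layout are hypothetical, and the sample records are ordinary compounds (ethanol and water), not the colliding pair from this post, so no collision is expected in the sample.

```python
from collections import defaultdict

def find_collisions(records):
    """Return {inchikey: sorted InChI strings} for keys shared by more than one distinct InChI."""
    by_key = defaultdict(set)
    for inchi, key in records:
        by_key[key].add(inchi)
    return {key: sorted(inchis) for key, inchis in by_key.items() if len(inchis) > 1}

# (InChI, Standard InChIKey) pairs; illustrative sample data only
records = [
    ("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"),  # ethanol
    ("InChI=1S/H2O/h1H2", "XLYOFNOQVPJJNP-UHFFFAOYSA-N"),                  # water
    ("InChI=1S/H2O/h1H2", "XLYOFNOQVPJJNP-UHFFFAOYSA-N"),                  # repeat deposit, not a collision
]

collisions = find_collisions(records)
print(collisions if collisions else "No collisions in this sample")
```

Comparing on the full InChI string rather than on the key is the point of the check: a repeated deposit of the same structure is harmless, whereas two different InChI strings under one key is a true collision.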



Posted on September 1, 2011 in General Communications, InChI



A Response to Raw Data in Organic Chemistry Papers and Open Science

@mattoddchem has posted “Raw Data in Organic Chemistry Papers/Open Science”, in which he notes that to “share data in my field I have to think about how to best share those data”. I recommend you read his whole post before continuing.

Mat has made a number of comments (my annotations added as AJW>)

What are the downsides of posting raw data in organic chemistry, either in papers or to lab book posts:

1) You have to save the data and then upload them. Well, this was a problem in 1995, but not now.

AJW> The problem is definitely not in the technology of uploading to a website. It can be easy to do BUT there is still an activation barrier: do I want to share, can I share (do I have the rights or will my boss/supervisor be upset?), and will there be negative repercussions for sharing?

2) The data files are large. Not really. A 1H NMR spectrum is ca. 200KB.

AJW> Depending on the magnetic field strength and chosen resolution, file sizes can be less than half this size or twice this size. It depends a lot on format also. A binary file of 100 KB can blow up to about half a megabyte as JCAMP (assuming none of the compressed formats are used).

3) It’s a pain. Yes, a little. But we must suffer for things we love.

AJW> Yes indeed….

4) People might find mistakes in my spectra/assignments. Yes. You’re a scientist. This is a Good Thing.

AJW> If we were afraid of the reexamination of our data we shouldn’t get into science, we shouldn’t publish and we shouldn’t share. How many scientists are like that???

Mat went on to ask what types of raw data should be posted:

“If we can establish that we should be posting raw data, then what kinds of data should we share, and how? This post is meant to outline an answer, and ask for feedback from anyone who’s already thought about this.

1) X-ray crystallography. This is the exception. Data are routinely deposited raw, and may be downloaded. Not always the case, but XRD blazes a trail here.

AJW> I’m not a crystallographer…I’m an NMR jock by training. So the term “raw data” to me means what comes off the instrument. I think that most crystallography data is processed to extract the CIF file so it’s not raw per se. In NMR raw would be the binary file format from the vendor as a FID file, the time domain data. A frequency domain, phased, baseline corrected and referenced spectrum would not be raw data. The FID is not the most useful form of the spectrum for sure, and real time processing of the FID with all necessary corrections (referencing and phasing etc) is not going to succeed for all spectra.

2) NMR spectroscopy. The big one. IUPAC recommends the JCAMP-DX file format. Jean-Claude Bradley has been a proponent of this format, and has demonstrated how it can be used in all kinds of applications.

AJW> JCAMP is the de facto standard for data exchange in NMR. Most of the vendors (probably all) can autosave their data in JCAMP or at least export in JCAMP. The third-party processing applications import and export JCAMP, some in all of its various flavors…for example, the ACD/NMR Processor package can export all of the various JCAMP flavors, including real only, real and imaginary, and with extensions for integrals, peaks, assignments etc. The five JCAMP forms are listed. There are of course no guarantees that all of these various JCAMP flavors can be interpreted by other programs. JCAMP has that type of complexity unfortunately.

ACD/NMR Processor Export to JCAMP


We’ve played with it, and in one of our recent papers we deposited all the NMR data in this format in the SI. We’ve been posting JCAMP-DX files in our online electronic lab notebooks, e.g. here.

AJW> I downloaded a number of these and read them easily into my NMR processing package that I use on my PC. I submitted the first structure in the series 2a (N-(3-Azidopropyl)-1-methylpyrrole-2-carboxamide) to ChemSpider (it wasn’t on the database already) and then added the H1 and C13 spectra with a few clicks (see here how). The spectra are flagged as Open Data and available for everyone to download and reuse. I am assuming this is okay since the PLoS article is Open Access, though the data itself isn’t flagged as Open. The data are on ChemSpider here and shown below.

Spectra in ChemSpider


My opinion of this file format (both generating it, and reading it) has not been great. There are two formats, I understand, and we found that if we saved the data in the wrong format, we couldn’t read the data with certain programs, but could with others. i.e. we had to get the generation of the file just right.

AJW> There are way more flavors than two! See above in the image. Over 10 years at ACD/Labs we had to change the JCAMP reader tens of times to support all of the different implementations of the “standard” JCAMP. A standard set is not a standard followed…
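To make the flavor problem concrete, here is a minimal sketch of a reader that handles only one form: plain, uncompressed numbers in an `XYDATA=(X++(Y..Y))` table. It is not the ACD/Labs reader or any vendor's implementation; the function name and the sample file are hypothetical. Anything written in a compressed form makes it fail with a `ValueError`, which is exactly the "works in one program, not in another" experience described above.

```python
import re

def parse_jcamp_uncompressed(text):
    """Parse header records and an uncompressed XYDATA=(X++(Y..Y)) table from JCAMP-DX text."""
    header = {}
    raw_y = []
    in_xydata = False
    for raw_line in text.splitlines():
        line = raw_line.split("$$")[0].strip()  # "$$" starts a comment
        if not line:
            continue
        if line.startswith("##"):
            label, _, value = line[2:].partition("=")
            # Normalize labels: ignore case, spaces, and common separators
            label = re.sub(r"[\s\-/_]", "", label).upper()
            in_xydata = label == "XYDATA"
            if label == "END":
                break
            header[label] = value.strip()
        elif in_xydata:
            # float() raises ValueError on compressed-form characters
            numbers = [float(token) for token in re.split(r"[\s,]+", line) if token]
            raw_y.extend(numbers[1:])  # first number on each line is an X check value
    y_factor = float(header.get("YFACTOR", 1.0))
    first_x = float(header["FIRSTX"])
    last_x = float(header["LASTX"])
    n = len(raw_y)
    step = (last_x - first_x) / (n - 1) if n > 1 else 0.0
    xs = [first_x + i * step for i in range(n)]
    ys = [y * y_factor for y in raw_y]
    return header, xs, ys

# Hypothetical six-point file, invented for illustration
SAMPLE = """##TITLE=hypothetical example
##JCAMP-DX=4.24
##DATA TYPE=NMR SPECTRUM
##FIRSTX=0
##LASTX=5
##YFACTOR=0.5
##NPOINTS=6
##XYDATA=(X++(Y..Y))
0 2 4 6
3 8 10 12
##END=
"""

header, xs, ys = parse_jcamp_uncompressed(SAMPLE)
print(header["TITLE"], xs, ys)
```

Supporting the remaining flavors means adding a decoder per compressed form plus handling for each vendor's interpretation of the labels, which is where the repeated reader changes come from.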

That kind of trickiness, though small, just inevitably means people won’t bother to generate or use the files on a mass scale (unless the journals decide to back it). PDF’s popularity is based on the ubiquity of the reader. JCAMP-DX works well with Jspecview, a free, open source NMR data reader. We’ve not enjoyed our experiences with this, either, though it’s a wonderful endeavour. This led us to look at whether there was a need for saving the data in a particular format, or whether we could just save the raw data, and process those data with a free piece of software. After looking at this with our resident NMR guru, Ian Luck, we found that saving raw data is easy (it’s just a copy and paste of what’s produced by the machine) and that the raw data can be read by free software such as Spinworks or ACDLabs, obviously in addition to our in-house software. This seems ideal? Does anyone have the reason IUPAC prefers a derived data format over the raw data, other than JCAMP-DX is a single file? Aren’t raw data likely to be the most generically useful long-term?

AJW> Assuming that raw data means binary file formats from a spectrometer vendor, I can comment that binary files have big issues in terms of longevity. Binary files are that way for a reason…one reason being that the vendors have proprietary information regarding their data acquisition and corrections in the format, which is then handled efficiently by the vendor’s software. Processing Varian, Bruker or JEOL files written many years ago by archaic software is unlikely to be easy, even in the vendors’ own modern software. Depending on the vendors to provide their binary file formats to code against generally requires permission AND legal contracts. When I joined a place of work about 20 years ago we had a room of old Winchester drives in OLD Bruker and Varian formats. Not only could we NOT read the files, we couldn’t mount the Winchester drives. What about 5.25″ drives, 3.5″ drives, optical disks etc.? Lots of that data has almost certainly rotted at this point. JCAMP does offer longevity. Even when a spectrum cannot be read, it is generally not difficult to hack the reader. Fixes are easy.

I don’t know if people have experience of this. I was in touch with one of the ACS journals recently, who indicated that their view was that the journal is not a data repository, and that posting of raw data (which was in their view to some extent desirable) should be posted elsewhere, e.g. to an institutional repository. This is an option. I think it’s less convenient. PLoS seem happy to host the data.

AJW> PLoS are hosting the data, but in order to view the data you have to have a JCAMP viewer or a processing package that will read the JCAMP spectrum. What a reader might like to be able to do is view the spectrum on the Supp Info page. This is of course possible using the Embed functionality of ChemSpider’s spectra. The spectra could be posted to ChemSpider and embedded in the PLoS pages if the journal allowed it. There are multiple advantages to this, including the spectra being made available to the Spectral Game.

3) IR data. Don’t know if there is a standard. If the file is small, saving raw data could be encouraged. Would allow easy comparisons of fingerprint regions.

AJW> JCAMP is also a standard for IR. In fact I think JCAMP was developed for IR first? See here for IR spectra on ChemSpider.

4) Mass spectrometry. It’s not clear to me there is a huge advantage here to sharing raw data, for a typical low res experiment?

AJW> Most mass spec for chemists is done using soft ionization for mass only. Fragmentation obviously is of more value. If it’s Electron Impact then JCAMP will suffice. See here. In general NetCDF is the preferred format but we don’t support that on ChemSpider at present.

5) HPLC data. Again, the outputs are fairly simple, and I’m not clear about the advantage of raw data (which I’m assuming would be absorbance vs. time table). Would (perhaps) permit verification that traces have not been cropped to remove pesky impurities.

AJW> Adding HPLC support would not be difficult if the data are a simple HPLC curve. If it was 2D data of LC and UV, that would be problematic.

6) Anything else?

AJW> Raman, Electron Spin Resonance, 2DNMR, Near infrared, Far infrared, UV-Vis etc. All of these, except ESR, are supported in ChemSpider. Check out this example of H1, C13 and THREE 2D NMR spectra!”

An alternative format of interest might be AnIML, the Analytical Information Markup Language. This format was started a number of years ago and the site does look like it is static but you can follow AnIML on Twitter and it is still alive. I am a little pessimistic that it will provide much value in the near future as the standard will first need to be released, then accepted, then tools will need to be developed/extended to support the format.  The challenge then is either that some organization will need to write all the format converters for all of the vendor instrument binary file formats (and support them for tech support etc) OR the vendors themselves will need to export AnIML – with similar issues to those that arose with JCAMP where the vendors developed their flavor. Clearly a validation suite for checking AnIML files would be appropriate. Overall AnIML has promise but it has taken years to get here already and I would say it will have limited impact in the foreseeable future.

For Mat’s spectral data it would be great to deposit it en masse to ChemSpider rather than for us to deal with it one spectrum and one compound at a time. That is why we already have developed mass deposition tools. If we receive the JDX files in individual directories with the associated molfile then we can do mass deposition. The spectra are then available on ChemSpider immediately, are available to the Spectral Game and also will be fed to our NEW implementation of SpectraSchool that is presently under development and will use spectra under ChemSpider as the foundation data. I’d say the solution for Mat is mostly in place…
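The directory layout described above (one folder per compound holding the JDX files and the associated molfile) is easy to check before sending anything. The following is only a sketch of such a pre-flight check, not the ChemSpider deposition tool itself; the function name, folder names and file names are all hypothetical.

```python
from pathlib import Path
import tempfile

SPECTRUM_SUFFIXES = {".jdx", ".dx"}

def collect_deposits(root):
    """Pair each compound folder's single molfile with its spectra; report folders that do not fit."""
    deposits, skipped = [], []
    for folder in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        molfiles = sorted(folder.glob("*.mol"))
        spectra = sorted(
            f for f in folder.iterdir()
            if f.is_file() and f.suffix.lower() in SPECTRUM_SUFFIXES
        )
        if len(molfiles) == 1 and spectra:
            deposits.append((molfiles[0], spectra))
        else:
            skipped.append(folder)  # missing molfile, several molfiles, or no spectra
    return deposits, skipped

# Build a throwaway example tree: one complete folder, one missing its molfile
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "compound_2a").mkdir()
    (root / "compound_2a" / "2a.mol").write_text("")
    (root / "compound_2a" / "2a_1H.jdx").write_text("")
    (root / "compound_2a" / "2a_13C.jdx").write_text("")
    (root / "compound_2b").mkdir()
    (root / "compound_2b" / "2b_1H.jdx").write_text("")

    deposits, skipped = collect_deposits(root)
    summary = [(mol.name, [s.name for s in specs]) for mol, specs in deposits]
    skipped_names = [folder.name for folder in skipped]

print(summary)
print("Needs attention:", skipped_names)
```

Requiring exactly one molfile per folder keeps the structure-to-spectrum association unambiguous, which is the whole reason for asking for individual directories.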



Our Book on Collaborative Computational Technologies is Shipping

It’s taken me about a month to get around to this post…our book on Collaborative Computational Technologies is shipping. The book is now available on Amazon here. The book is described in the movie posted on Slideshare here and embedded below. Sean has given the story about how the book came about on his blog.


The book is edited by Sean Ekins (a photo of a proud Sean is here), Maggie Hupcey and myself but has turned out to be a great volume because of the contributions by all of the chapter authors. You can get example chapters at the Wiley website.

The first review is already up on Amazon from @untangledhealth, Jeff Harris. His review is posted on his blog also. We’d welcome any more reviews!!!



Integrating Google Scholar Citations to CrossRef for DOI Lookup

Tonight I was trying to add a couple of publications to my Google Scholar Citations account. The interface available to me is shown below.

Google Scholar Citations

It would make a lot more sense to provide the ability to input a DOI and use the Crossref Resolver to retrieve all of the associated information. There is a full API for this purpose (we use it on ChemSpider already), so if I wanted to put our new Melting Point Nature Precedings article into Google Scholar Citations then I would simply put 10.1038/npre.2011.6229.1 into the resolver and it would link me to the article or, for Google Scholar Citations, fill in the appropriate fields. Seems like a good marriage?
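A minimal sketch of what such a lookup could involve is below. It builds the resolver link and a Crossref REST API address for a DOI; the function names are hypothetical, the fetch function needs network access and is not called here, and the post does not confirm whether this particular DOI has a Crossref record or how Google would wire this in.

```python
import json
import urllib.request
from urllib.parse import quote

def doi_urls(doi):
    """Build the resolver link and the Crossref REST API address for a DOI."""
    encoded = quote(doi, safe="/")
    return {
        "resolver": f"https://doi.org/{encoded}",
        "crossref_api": f"https://api.crossref.org/works/{encoded}",
    }

def fetch_crossref_record(doi, timeout=10):
    """Fetch the metadata record for a DOI (requires network access; not called in this sketch)."""
    with urllib.request.urlopen(doi_urls(doi)["crossref_api"], timeout=timeout) as response:
        return json.load(response)["message"]

urls = doi_urls("10.1038/npre.2011.6229.1")
print(urls["resolver"])
print(urls["crossref_api"])
```

A form such as the one in Google Scholar Citations could then map fields from the returned record (title, authors, journal, year) onto its own inputs instead of asking for them to be typed in.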



The Open Notebook Science Melting Point Data Book

Over the past few weeks I have been collaborating with JC Bradley and Andy Lang to develop a curated source of melting point data. JC, as usual, has eloquently detailed the story to this point over on his blog. The findings to date have been fascinating and I won’t retell them on this blog. I recommend these posts from JC Bradley read in order…they tell the bulk of the story spread over a number of months. Fascinating reading in terms of quality, cross validation of data and errors in the data sources.

Chemical Information Validation Results from Fall 2010

Alfa Aesar melting point data now openly available

Validating Melting Point Data from Alfa Aesar, EPI and MDPI

Open Modeling of Melting Point Data

More Open Melting Points from EPI and other sources: on the path to ultimate curation

The quest to determine the melting point of 4-benzyltoluene

More on 4-benzyltoluene and the impact of melting point data curation and transparency

The 4-benzyltoluene melting point twist

Rapid analysis of melting point trends and models using Google Apps Scripts


My role has been to help out with the processing of data, curating data using some of the procedures developed while reviewing data over the past few years, and helping to source data. It’s been a great collaboration but JC and Andy have done the heavy lifting…and done it well!

The work has culminated in the release of a book to both Nature Precedings and available via Lulu as JC has described here. As he said “Now that the book has been accepted by Nature Precedings, it provides a convenient mechanism for citation via DOI, a formal author list, version control, etc. The book is also now available from either as a free PDF download or a physical copy. Because the book runs 699 pages (it covers 2706 unique compounds) the lowest price we could get is $30.96, which just covers printing and shipping.” It will be interesting to see whether people buy the book or simply go electronic. Time will tell.

Our Melting Point Data Book



Google Scholar Alerts to my Inbox

I already like Google Scholar and Google Scholar Citations (but am INCREDIBLY impressed by the technical support service I am receiving via Twitter from Microsoft Academic Search!) And today this appeared in my Inbox. Nice!

Google Alerts to the Inbox


Posted on August 11, 2011 in General Communications


Ongoing Comparisons between Microsoft Academic Search and Google Scholar Citations

I have been blogging on Google Scholar Citations in recent days and noticing some interesting details (1,2,3). I have been in exchanges with the Microsoft Academic Search support team on Twitter trying to collapse multiple accounts. They are helping.

I have since continued my comparison to look for differences in the two platforms. There are some very obvious differences. One GLARING example…on Google Scholar my top cited paper has 50 citations.  On Microsoft Academic Search it has 3. BIG difference!

Citations on Google Scholar


Citations on Microsoft Academic Search


Posted on August 7, 2011 in Data Quality, General Communications



How Accurate was Google Scholar Citations in Detecting my Publications?

I blogged earlier this week about Google’s Brilliance with their new Google Scholar Citations. I was interested to know whether they found all of my papers so have spent a couple of hours checking. The answer? No…they missed 11 of the papers. They are listed below.

1) R.C. Hynes, J.R. Morton, J.A. Hriljac, Y. LePage, K.F. Preston, A.J. Williams, F. Evans, M.C. Grossel and L.H. Sutcliffe,  Isolated Free Radical Pairs in Rb+TCNQ- 18-crown-6 Single Crystals, J.Chem. Soc.,Chem. Commun., 5, 439 (1990)
2) R. Hynes, K.F. Preston, J.J. Springs, J. Tse and A.J. Williams, EPR Studies of M(CO)5-  Radicals (M = Cr, Mo, W) Trapped in Single Crystals of PPh4+ HM(CO)5- , J. Chem. Soc. Faraday Trans., 87(19), 3121 (1991)
3) R. Hynes, K.F. Preston, J.J. Springs, and A.J. Williams, X-Ray Crystallographic, Single-Crystal EPR, and Theoretical Study of Metal-Centred Radicals of the Type {C5R5Cr(CO)2L}
4) R. Duchateau, A.J. Williams, S. Gambarotta and M.Y.Chiang, Carbon-Carbon Double-Bond Formation in the Intermolecular Acetonitrile Reductive Coupling Promoted by a Mononuclear Titanium (II) Compound. Preparation and Characterization of Two Titanium (IV) Imido Derivatives, Inorg. Chem. 30, 4863 (1991)
5) B. Antalek, A.J. Williams, E. Garcia and J. Texter, NMR Analysis of Interfacial Structure Transitions Accompanying Electron Transfer Threshold Transitions in Reverse Microemulsions, Langmuir, 10, 4459, (1994)
6) R.Lok, R. Leone and A.J. Williams, Facile Rearrangements of Alkynylamino Heterocycles with Noble Metal Cations, Journal of Organic Chemistry 61(10), 3289 (1996)
7) D.E. Brown, A.J. Williams and D. McLaughlin, WIMS – A Web-based Information Management System, Trends in Analytical Chemistry, 16, 370 (1997)
8) A.J. Williams, Combining Sample, Structural, and Spectral Information in an Information Management System, Sci. Comput. Auto. 15, 60 (1998)
9) M.E. Elyashberg, K.A. Blinov and A.J. Williams, Computer-aided Molecular Structure Elucidation on the Basis of 1D and 2D NMR Spectra, Applied Magnetic Resonance, (May 2000)
10) G. M. Rishton, K. LaBonte, A. J. Williams, K. Kassam and E. Kolovanov.  Computational approaches to the prediction of blood-brain barrier permeability: a comparative analysis of central nervous system drugs versus secretase inhibitors for Alzheimer’s disease Current Opinion in Drug Discovery & Development, 9, 303 (2006)
11) A. J. Williams, V. Tkachenko, C. Lipinski, A. Tropsha and S. Ekins, Free Online Resources Enabling Crowdsourced Drug Discovery, Drug Discovery World Winter 2009/10, 33-39

Fortunately it is easy to add them in…and that is in process. Simply do this:

“* To add one article at a time, select the “Add” option from the Actions menu. Then, type in the title, the authors, etc., and click “Save”. Keep in mind that citations to the article you’ve just added may not appear in your profile for a few days.

* To add a group of related articles, select the “Import” option from the Actions menu. Search for your article using its title, keywords, or your name. Click “These are mine” next to the group you wish to add. If you have written articles under different names, with multiple groups of colleagues, or in different journals, you may need to select multiple groups. Your citation metrics will update right away to account for the group(s) you’ve just added.

* When you add a group of articles, we’ll also keep track of changes to this group as our search robots index the web. You can choose to have these changes automatically applied to your profile (recommended) or emailed to you for review. Select “Profile updates” under the Actions menu to configure the updates.”


What’s MORE brilliant though is Google Scholar Citations found papers, book chapters and posters that I didn’t have in my CV. They are now. I remain impressed.


Posted on August 5, 2011 in Computing, General Communications


ChemSpider Training at ACS Denver

We will be hosting a training session for ChemSpider at the ACS meeting in Denver. Please register early.

An Introduction to ChemSpider – A Combination Platform of Free Chemistry Database, Free Prediction Engines and Wiki Environment

Where: Colorado Convention Center
Room: 503
When: Wednesday, August 31, 8:30 AM – 11:00 AM

ChemSpider has become one of the premier free online chemistry resources used by many thousands of chemists around the world every day. Hosting over 26 million unique chemical entities, sourced from over 400 separate data sources, ChemSpider provides access to experimental and predicted data, links to patents and publications and uniquely offers chemists the ability to deposit and share their own data online. With the intention of integrating and curating public chemistry resources for the community, ChemSpider encourages participation from chemists around the world. Integrated with Wikipedia, Google Patents, Google Books, Google Scholar and PubMed, as well as with the RSC Publishing platform, ChemSpider provides access to chemistry contained in millions of articles. This training session will provide an overview of searching ChemSpider and will discuss how to deposit data and participate in curating the existing information. We will also provide an overview of ChemSpider SyntheticPages, our venture into providing a community-based resource of semantically enriched synthetic procedures and allowing community peer review. This will be an interactive session and you are encouraged to bring your laptops to work along and ask questions regarding present and future capabilities. ChemSpider is built for the community and we welcome your comments about how to make it better for your needs.


Posted on August 4, 2011 in General Communications


Google’s Brilliance Shows Again with Google Scholar Citations

My colleague David Sharpe pointed me to an interesting blog today concerning Google Scholar Citations. I’d always imagined it would come but didn’t know when. So what a happy lunchtime it was when I sat down to read the blog and register for a citations account here. When I registered on Microsoft Academic Search I was initially impressed.

ONE of my personas on Microsoft Academic Search

Since then I have been collapsing a number of different “authors called Antony Williams”. I’ve been working on it for a few weeks and despite numerous attempts to collapse them, including email requests…I still exist as

If anyone from Microsoft can possibly help me get these collapsed I’d appreciate it! I’ve tried using the approach below and failed.


It’s a shame…I really want to take advantage of a lot of the wonderful tools that Microsoft Academic Search offers. An example is below.

Again…if anyone can help me collapse the various forms of me I’d appreciate it!
Now to Google Scholar Citations. I registered, I searched on my name and accepted it. Done. The result was, as far as I can tell, a complete capture of my papers. Caveat…I have NOT yet sat and compared with my CV. What impressed me is that I am one person under Google Scholar Citations…no complex “collapsing process”. It also took me 10 mins…it was done with a few button clicks and it looks like this.
Google…impressed again. NICELY DONE!





