@mattoddchem has posted “Raw Data in Organic Chemistry Papers/Open Science”, explaining that to “share data in my field I have to think about how to best share those data”. I recommend you read his whole post before continuing.
Mat has made a number of comments (my annotations are added as AJW>):
What are the downsides of posting raw data in organic chemistry, either in papers or to lab book posts:
1) You have to save the data and then upload them. Well, this was a problem in 1995, but not now.
AJW> The problem is definitely not in the technology of uploading to a website. It can be easy to do BUT there is still an activation barrier: do I want to share, can I share (do I have the rights or will my boss/supervisor be upset?), and will there be negative repercussions for sharing?
2) The data files are large. Not really. A 1H NMR spectrum is ca. 200KB.
AJW> Depending on the magnetic field strength and chosen resolution, file sizes can be less than half this size or twice this size. It also depends a lot on format. Binary file formats can give you 100 KB, which blows up to about 0.5 MB as JCAMP (assuming none of the compressed formats is used).
3) It’s a pain. Yes, a little. But we must suffer for things we love.
AJW> Yes indeed….
4) People might find mistakes in my spectra/assignments. Yes. You’re a scientist. This is a Good Thing.
AJW> If we were afraid of the reexamination of our data we shouldn’t get into science, we shouldn’t publish and we shouldn’t share. How many scientists are like that???
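On point 2, a rough back-of-the-envelope estimate shows why the same spectrum can be small as a binary file and several times larger as uncompressed JCAMP. This is only a sketch under stated assumptions, not a description of any vendor's format: it assumes 4 bytes per point in binary and roughly 10 ASCII characters per value in an uncompressed text table, and the function names are hypothetical.

```python
def binary_size_bytes(n_points, bytes_per_point=4):
    """Size of the data block if every point is a fixed-width binary number.

    bytes_per_point=4 is an assumption (e.g. 32-bit values).
    """
    return n_points * bytes_per_point


def ascii_size_bytes(n_points, chars_per_value=10):
    """Size of the same points written out as text.

    chars_per_value=10 is an assumption covering digits, sign and separator.
    """
    return n_points * chars_per_value


def to_kb(n_bytes):
    """Convert bytes to kilobytes (1 KB = 1024 bytes)."""
    return n_bytes / 1024


# Compare a few plausible point counts; headers are ignored in both cases.
for n_points in (16384, 32768, 65536):
    print(n_points, to_kb(binary_size_bytes(n_points)), to_kb(ascii_size_bytes(n_points)))
```

Under these assumptions the text form is 2.5 times the size of the binary form, whatever the number of points. The real ratio depends on the number of digits written and on whether one of the compressed JCAMP forms is used.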
Mat went on to ask what types of raw data should be posted:
“If we can establish that we should be posting raw data, then what kinds of data should we share, and how? This post is meant to outline an answer, and ask for feedback from anyone who’s already thought about this.
1) X-ray crystallography. This is the exception. Data are routinely deposited raw, and may be downloaded. Not always the case, but XRD blazes a trail here.
AJW> I’m not a crystallographer…I’m an NMR jock by training. So the term “raw data” to me means what comes off the instrument. I think that most crystallography data is processed to extract the CIF file so it’s not raw per se. In NMR raw would be the binary file format from the vendor as a FID file, the time domain data. A frequency domain, phased, baseline corrected and referenced spectrum would not be raw data. The FID is not the most useful form of the spectrum for sure, and real time processing of the FID with all necessary corrections (referencing and phasing etc) is not going to succeed for all spectra.
2) NMR spectroscopy. The big one. IUPAC recommends the JCAMP-DX file format. Jean-Claude Bradley has been a proponent of this format, and has demonstrated how it can be used in all kinds of applications.
AJW> JCAMP is the de facto standard for data exchange in NMR. Most of the vendors (probably all) can autosave their data in JCAMP or at least export in JCAMP. The third-party processing applications import and export JCAMP, some in all of their various flavors….for example, the ACD/NMR processor package has the ability to export all of the various JCAMP flavors….including real only, real and imaginary, with extensions for integrals, peaks, assignments etc. The five JCAMP forms are listed. There are of course no guarantees that all of these various JCAMP flavors can be interpreted by other programs. JCAMP has that type of complexity unfortunately.
AJW> I downloaded a number of these and read them easily into my NMR processing package that I use on my PC. I submitted the first structure in the series 2a (N-(3-Azidopropyl)-1-methylpyrrole-2-carboxamide) to ChemSpider (it wasn’t on the database already) and then added the H1 and C13 spectra with a few clicks (see here how). The spectra are flagged as Open Data and available for everyone to download and reuse. I am assuming this is okay since the PLoS article is Open Access, though the data itself isn’t flagged as Open. The data are on ChemSpider here and shown below.
My opinion of this file format (both generating it, and reading it) has not been great. There are two formats, I understand, and we found that if we saved the data in the wrong format, we couldn’t read the data with certain programs, but could with others. i.e. we had to get the generation of the file just right.
AJW> There are way more flavors than two! See above in the image. Over 10 years at ACD/Labs we had to change the JCAMP reader tens of times to support all of the different implementations of the “standard” JCAMP. A standard set is not a standard followed…
That kind of trickiness, though small, just inevitably means people won’t bother to generate or use the files on a mass scale (unless the journals decide to back it). PDF’s popularity is based on the ubiquity of the reader. JCAMP-DX works well with Jspecview, a free, open source NMR data reader. We’ve not enjoyed our experiences with this, either, though it’s a wonderful endeavour. This led us to look at whether there was a need for saving the data in a particular format, or whether we could just save the raw data, and process those data with a free piece of software. After looking at this with our resident NMR guru, Ian Luck, we found that saving raw data is easy (it’s just a copy and paste of what’s produced by the machine) and that the raw data can be read by free software such as Spinworks or ACDLabs, obviously in addition to our in-house software. This seems ideal? Does anyone have the reason IUPAC prefers a derived data format over the raw data, other than JCAMP-DX is a single file? Aren’t raw data likely to be the most generically useful long-term?
AJW> Assuming that raw data means binary file formats from a spectrometer vendor, I can comment that binary files have big issues in terms of longevity. Binary files are that way for a reason… one reason being that the vendors have proprietary information regarding their data acquisition and corrections in the format, which is then efficiently dealt with by the vendor’s software. Processing Varian, Bruker or Jeol files from many years ago, written by their archaic software, is likely not easy even in their own modern software. Depending on the vendors to provide their binary file formats to code against generally requires permission AND legal contracts. When I joined a place of work about 20 years ago we had a room of old Winchester drives in OLD Bruker and Varian formats. Not only could we NOT read the files, we couldn’t mount the Winchester drives. What about 5.25″ drives, 3.5″ drives, optical disks etc.? Lots of that data has almost certainly rotted at this point. JCAMP does offer longevity. Even when a spectrum cannot be read, it is generally not difficult to hack the reader. Fixes are easy.
I don’t know if people have experience of this. I was in touch with one of the ACS journals recently, who indicated that their view was that the journal is not a data repository, and that posting of raw data (which was in their view to some extent desirable) should be posted elsewhere, e.g. to an institutional repository. This is an option. I think it’s less convenient. PLoS seem happy to host the data.
AJW> PLoS are hosting the data but in order to view the data you have to have a JCAMP viewer or processing package that will read the JCAMP spectrum. What a reader might like to be able to do is view the spectrum on the Supp Info page. This is of course possible, for example using the embed functionality of ChemSpider’s spectra. The spectra could be posted to ChemSpider and embedded in the PLoS pages if the journal allowed it. There are multiple advantages to this including the spectra being made available to the Spectral Game.
3) IR data. Don’t know if there is a standard. If the file is small, saving raw data could be encouraged. Would allow easy comparisons of fingerprint regions.
AJW> JCAMP is also a standard for IR. In fact I think JCAMP was developed for IR first? See here for IR spectra on ChemSpider.
4) Mass spectrometry. It’s not clear to me there is a huge advantage here to sharing raw data, for a typical low res experiment?
AJW> Most mass spec for chemists is done using soft ionization for mass only. Fragmentation obviously is of more value. If it’s Electron Impact then JCAMP will suffice. See here. In general NetCDF is the preferred format but we don’t support that on ChemSpider at present.
5) HPLC data. Again, the outputs are fairly simple, and I’m not clear about the advantage of raw data (which I’m assuming would be absorbance vs. time table). Would (perhaps) permit verification that traces have not been cropped to remove pesky impurities.
AJW> Adding HPLC support would not be difficult if the data are a simple HPLC curve (absorbance vs. time). If it was 2D data of LC and UV that would be problematic.
6) Anything else?
AJW> Raman, Electron Spin Resonance, 2D NMR, Near infrared, Far infrared, UV-Vis etc. All of these, except ESR, are supported in ChemSpider. Check out this example of H1, C13 and THREE 2D NMR spectra!”
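Since so much of the discussion above turns on which JCAMP flavor a file is in, here is a minimal sketch of how a reader might inspect a JCAMP-DX file before trying to load it. It relies only on the basic layout of the format: records of the form ##LABEL=value, comments introduced by $$, and labels compared without regard to case, spaces, hyphens, slashes or underscores. The function names and the sample text are hypothetical, and this is not how ChemSpider or any vendor package reads JCAMP. A real reader also has to handle the compressed data forms and multi-line values.

```python
import re

# Label separators that are ignored when comparing JCAMP-DX labels.
_SEPARATORS = re.compile(r"[\s\-/_]")

# Labels that introduce a data table; which ones appear hints at the flavor.
DATA_LABELS = ("XYDATA", "XYPOINTS", "PEAKTABLE", "NTUPLES")


def normalize_label(label):
    """Uppercase a label and drop separators, e.g. 'Peak Table' -> 'PEAKTABLE'."""
    return _SEPARATORS.sub("", label).upper()


def read_records(text):
    """Return (normalized label, first-line value) for each '##LABEL=value' line."""
    records = []
    for line in text.splitlines():
        line = line.split("$$", 1)[0].strip()  # '$$' starts a comment
        if line.startswith("##") and "=" in line:
            label, value = line[2:].split("=", 1)
            records.append((normalize_label(label), value.strip()))
    return records


def describe_flavor(text):
    """Summarize the version, data type and data-table forms found in the text."""
    records = read_records(text)
    first = {}
    for label, value in records:
        first.setdefault(label, value)  # keep the first occurrence of each label
    return {
        "version": first.get("JCAMPDX"),
        "data_type": first.get("DATATYPE"),
        "data_forms": [(l, v) for l, v in records if l in DATA_LABELS],
    }


# Hypothetical, heavily shortened file used only to exercise the functions.
SAMPLE = """##TITLE= example spectrum
##JCAMP-DX= 5.01 $$ hypothetical writer
##DATA TYPE= NMR SPECTRUM
##XYDATA= (X++(Y..Y))
10.0 120 135 150
##END=
"""

print(describe_flavor(SAMPLE))
```

A check like this lets a program report "this file uses a form I do not support" instead of failing partway through. It also shows why fixes to a JCAMP reader tend to be easy: the file is readable text.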
An alternative format of interest might be AnIML, the Analytical Information Markup Language. This format was started a number of years ago and the site does look like it is static but you can follow AnIML on Twitter and it is still alive. I am a little pessimistic that it will provide much value in the near future as the standard will first need to be released, then accepted, then tools will need to be developed/extended to support the format. The challenge then is either that some organization will need to write all the format converters for all of the vendor instrument binary file formats (and support them for tech support etc) OR the vendors themselves will need to export AnIML – with similar issues to those that arose with JCAMP where the vendors developed their flavor. Clearly a validation suite for checking AnIML files would be appropriate. Overall AnIML has promise but it has taken years to get here already and I would say it will have limited impact in the foreseeable future.
For Mat’s spectral data it would be great to deposit it en masse to ChemSpider rather than for us to deal with it one spectrum and one compound at a time. That is why we have already developed mass deposition tools. If we receive the JDX files in individual directories with the associated molfile then we can do mass deposition. The spectra are then available on ChemSpider immediately, are available to the Spectral Game and also will be fed to our NEW implementation of SpectraSchool that is presently under development and will use spectra on ChemSpider as the foundation data. I’d say the solution for Mat is mostly in place…
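For anyone preparing such a batch, a quick check before sending it over can save a round trip. The sketch below assumes a layout of one subdirectory per compound, each holding exactly one molfile (.mol) and one or more spectra (.jdx or .dx). That layout, the extensions and the function name are my assumptions for illustration, not a specification of the ChemSpider deposition tools.

```python
from pathlib import Path

SPECTRUM_SUFFIXES = {".jdx", ".dx"}  # assumed extensions for JCAMP-DX files


def collect_deposits(root):
    """Pair each compound directory's molfile with its spectra.

    Returns (ready, problems): ready is a list of (molfile, [spectra]) tuples,
    problems is a list of directory names that do not match the assumed layout.
    """
    ready, problems = [], []
    for compound_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        molfiles = sorted(compound_dir.glob("*.mol"))
        spectra = sorted(
            p for p in compound_dir.iterdir()
            if p.is_file() and p.suffix.lower() in SPECTRUM_SUFFIXES
        )
        if len(molfiles) == 1 and spectra:
            ready.append((molfiles[0], spectra))
        else:
            # Missing or ambiguous structure, or no spectra to attach to it.
            problems.append(compound_dir.name)
    return ready, problems
```

Calling collect_deposits on the top-level folder gives the list of structure/spectra pairs that are ready to go, plus the names of any directories that need fixing first.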