Archive for category General Communications
There has been a new capability on LinkedIn for a while…the ability to add your judgments about people you are linked to in your network. What this looks like is shown below. It’s been interesting to see what I have been “endorsed” for on my profile…
I would agree…I am a Chemist first, then an NMR Spectroscopist but I would put cheminformatics and analytical chemistry above Drug Discovery. I DO like this approach for “tagging” skillsets though and I can see it has a natural role in other ways…a fun project for the New Year. Watch this space.
I continue to be impressed with Google Scholar Citations. I receive regular emails, similar to the one below, telling me when papers are referencing articles I have authored/co-authored. In this case the article referred to a paper that I co-authored in 1996 while I was at Kodak…regarding silver-catalyzed cyclizations. I would not have expected a paper about photography-based organic chemistry to show up in a Toxicology journal. But thanks to Google Scholar Citations now I know…
This is one of my presentations at the ACS meeting today in San Diego regarding how to use social networking tools to expose yourself as a scientist
Social networking tools as public representations of a scientist
The web has revolutionized the manner by which we can represent ourselves online by providing us the ability to expose our data, experiences and skills online via blogs, wikis and other crowdsourcing venues. As a result it is possible to contribute to the community while developing a social profile as a scientist. At present many scientists are still measured by their contributions using the classical method of citation statistics and a number of freely available online tools are now available for scientists to manage their profile. This presentation will provide an overview of tools including Google Scholar Citations and Microsoft Academic Search and will discuss how these and other tools, when integrated with the ORCID identifier, may more fully recognize the collective contributions to science. I will also discuss how an increasingly public view of us as scientists online will likely contribute to our reputation above and beyond citations.
As discussed over on the ChemSpider blog we will soon be depositing data from the SORD database (Selected Organic Reactions Database) onto ChemSpider. This will be done as two separate but related datasets under the SORD data source: Reactants and Products. If you don’t know what SORD is then who better to explain than Dick Wife, the “host” of the SORD database. Dick wrote the article below to provide an overview of what SORD is…ENJOY!
The Selected Organic Reactions (SOR) Database: capturing “Lost Chemistry”
A new database is capturing the 80% of Lost Chemistry from theses and dissertations which doesn’t make it into publications and chemists who contribute their data get access to the entire database for free.
SORD, an independent Dutch company, is carefully selecting the synthetic chemistry focused on Life Science research and making this chemistry available in their Selected Organic Reactions (SOR) Database. For the theses/dissertations which they select, all of the reactions in the Experimental section are excerpted. This means there will still be a small overlap of data with full publications. There will also be a larger overlap with publications such as Notes, Letters or Communications but these do not contain the experimental details. The SOR Database brings all this chemistry to the desktop, every last detail written by the author.
Some time back, SORD looked at around 300k interesting drug-like compounds in the literature and which countries they had come from, and the native language. The English-speaking countries accounted for only 37% of the total. German/Swiss dissertations are often written in English but this is new. The theses and dissertations in the other languages represent more than half of the total. SORD routinely translates German and French experimental texts into English. They are about to start on Chinese and Japanese translations and, if anyone can give them access to Russian theses, they will translate these as well!
A thesis or dissertation is the result of several years of hard work by a research student under the constant supervision of the research leader whose reputation is at stake if the work described is wrong or inaccurate. It is also examined by a committee who decide on awarding the degree, or not. They scrutinize closely the Results & Discussion as well as the Experimental sections. The chemistry is reliable.
Advanced Chemistry Development, Inc (ACD/Labs) is partnering SORD in developing this Database. The SOR Database is available for in-house use with ChemFolder Enterprise or on the Internet with ACD/Web Librarian™. This is a screen-shot of a typical SOR Database record in Web Librarian.
The Reaction Scheme shows every atom (there are no abbreviations). The Experimental text is edited to ASCII format and the key parameters (Reagent(s), Solvent(s), yield(s), MP(s) and Optical Rotation(s)) are displayed in separate Fields, as are the full bibliographic data, making data-mining possible. There is also a link which enables the user to bring up the PDF of each reaction, containing all of the spectral and other physical data which SORD does not excerpt. The PDF link is a powerful and unique feature of the SOR Database.
Now some explanation about SORD’s excerption rules. What they call the Reaction Scheme (A + B → C, etc.) contains only the reacting and product compound structures. A Reagent is an essential reaction component of which no part ends up in the product – if it does, it becomes a Reactant! When several reactions are performed before the product is isolated (and characterized) the Reagents and Solvents are listed in Steps. Failed reactions are not excerpted but reactions with poor yields are.
The SOR Database currently contains 170k reactions; the target is one million at the end of 2013. Even this number is a lot smaller than what you find today in the major commercial reaction databases. Back in the nineties, SORD researchers looked at one such large commercial database which then contained 9 million compounds. Sifting through the content for drug-like compounds resulted in just 450k or 5% of the records. Size is one database metric; quality is much more important! In the SOR Database, you will only find characterized products – and no polymers, or compounds with no molecular structure.
Users of the SOR Database also have access to the separate databases which contain the Reagents (ca. 3,000) and Solvents (ca. 450) which have been encountered so far. Often a Reagent is a catalyst (organic/organometallic) but they can also be simple entities like bases, acids, ammonium salts, etc. or complex chiral ligands. Authors give Reagents many different names and so each Reagent (and Solvent) in the SOR Database has been assigned a unique name. This enables rapid searches using the assigned names, again a novel feature of the database. Such searches can bring you to really nice chemistry.
As an example, the second generation Grubbs olefin metathesis catalyst has been given the name Grubbs 2 catalyst. In the current SOR Database, there are more than 500 reactions where it has been used. Some of these are straightforward; some are not and generate novel ring systems like this one from the Martin group at North Carolina at Chapel Hill:
Searches in the Reaction Scheme, or using Reagent/Solvent names and hit refinement, bring you to new chemistry which until now was only found on a dusty shelf in a library. The “Lost Chemistry” is now getting smaller as SORD carefully selects and excerpts the reactions which deserve a new life. The SOR Database is essential for novelty searches and it is a powerful supplement for the other commercial reaction databases.
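The idea of assigning each Reagent one unique name so that searches become fast can be illustrated with a small sketch. This is not SORD's implementation: the synonym table, the assigned name "triethylamine", the record layout and the reaction ids are all hypothetical placeholders, and only "Grubbs 2 catalyst" is taken from the article.

```python
from collections import defaultdict

# Hypothetical synonym table mapping author spellings (lower-cased) to one
# assigned name. Only "Grubbs 2 catalyst" comes from the article itself.
SYNONYMS = {
    "grubbs ii": "Grubbs 2 catalyst",
    "grubbs 2nd generation catalyst": "Grubbs 2 catalyst",
    "grubbs second generation catalyst": "Grubbs 2 catalyst",
    "et3n": "triethylamine",
    "tea": "triethylamine",
}


def canonical_name(raw: str) -> str:
    """Return the assigned name for a reagent, or the cleaned input if unknown."""
    key = " ".join(raw.lower().split())  # ignore case and repeated spaces
    return SYNONYMS.get(key, raw.strip())


def build_index(records):
    """Map each assigned reagent name to the ids of the reactions that use it."""
    index = defaultdict(list)
    for rec in records:
        for reagent in rec["reagents"]:
            index[canonical_name(reagent)].append(rec["id"])
    return index


# Made-up records, only to exercise the lookup.
records = [
    {"id": "rxn-1", "reagents": ["Grubbs II"]},
    {"id": "rxn-2", "reagents": ["Grubbs second generation catalyst", "Et3N"]},
    {"id": "rxn-3", "reagents": ["TEA"]},
]
index = build_index(records)
print(index["Grubbs 2 catalyst"])
```

Once every spelling resolves to one name, a reagent search is a single dictionary lookup rather than a scan over free text.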
Finally some more good news for academic research chemists; your data will be readily accessible to the whole chemical world who will cite your work in their publications. The chemistry which you never published may be just what others are looking for. Routinely SORD excerpts the complete collection of theses and dissertations from research supervisors; they will be more than happy to see your work appear in the next SOR Database!
 de Laet, A.; Hehenkamp, J. J.; Wife, R. L. Finding Drug Candidates in Lost/Emerging Chemistry. J. Heterocycl. Chem. 2000, 37, 669–674.
One of the highlights of the past year has been my continued collaborations with Sean Ekins on the issues of data quality, modeling of data and the applications of mobile technologies. Recently our commentary on the long term cost of inferior database quality was published in Drug Discovery Today and is available online here.
I have become SOOOOOOOOO popular as a journal editor for Open Access journals. In the past week I have been invited to be a journal editor for three separate Open Access journals. These are simply emails with a “sign up here, we are a popular publisher of Open Access journals” and an “editors are encouraged to submit articles” message. My favorite invitation this week is below. Don’t forget I’m a PhD CHEMIST, NMR spectroscopist and cheminformatician…
I chose NOT to respond…
Subject: Invitation to Join the Editorial/Review Board of Journal of Communication Technology and Human Behaviors
I am writing to introduce Journal of Communication Technology and Human Behaviors to you. Journal of Communication Technology and Human Behaviors is a new journal launched recently by the Columbia International Publishing (CIP) team. CIP is committed to rapidly delivering high-quality research findings and results to the world. We aim to make all CIP journals top publications in their fields.
Based on your outstanding scientific contribution to your field, the CIP team would like to invite you to join the Editorial/Review Board of Journal of Communication Technology and Human Behaviors
Print ISSN: 2163-128X
Online ISSN: 2163-1298
Journal link: http://uscip.org/JournalsDetail.aspx?journalID=19
Acceptance of submissions to Journal of Communication Technology and Human Behaviors is based solely on decisions of the Editorial Board Committee and peer reviewers. If you are interested in serving on the Editorial Board committee, please send your CV to email@example.com and indicate the position (Editor-in-Chief, Associate Editor, Regular Editorial Board Committee member, or Reviewer) you are interested in. CIP will make a selection based on the competition. To accept an Editorial/Review board position, you are required to agree to the terms and conditions given at the end of this invitation letter. The names of Editorial Board Members will be listed online and in print copies.
Only with contributions from Editorial Board Committee members and peer reviewers can we make Journal of Communication Technology and Human Behaviors a top journal. If you have any questions or suggestions, please do not hesitate to contact us. We are keenly looking forward to hearing from you.
Columbia International Publishing LLC
3610 Buttonwood Drive Suite 200
Columbia, MO 65201, USA
Terms and Conditions of Editorial/Review Board Committee positions
1. All Editorial/Review Board members should try to promote the journal as a top publication in the field.
2. This is a voluntary and honorary position. No payment from Columbia International Publishing is associated with this position.
3. The term is typically two years. It is renewable with approval by the Administration Department of Columbia International Publishing.
4. The Editor-in-Chief should send manuscripts to at least two experts in the field for review.
Editorial/Review Board members should provide timely, fair, objective, and professional comments on the manuscripts assigned by the Editor-in-Chief.
5. Editorial/Review Board members should never disclose information pertaining to any manuscript under their review.
6. Columbia International Publishing reserves the exclusive right to change any rules and terms and conditions without prior notice.
InChI Strings and InChIKeys are very much the backbone of ChemSpider and have quickly become a way by which online databases are being connected. The InChIKey is a hash of the InChI String and when the hash was adopted it was suggested that the likelihood of a collision was very small, the estimate being, as quoted from the official InChI site:
“An example of InChI with its InChKey equivalent is shown below. There is a finite, but very small probability of finding two structures with the same InChIKey. For duplication of only the first block of 14 characters this is 1.3% in 10^9, equivalent to a single collision in one of 75 databases of 10^9 compounds each.”
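The quoted estimate (about 1.3% for 10^9 structures, or one collision in roughly 75 such databases) can be sanity-checked with the standard birthday-problem approximation. This is a sketch under two assumptions that are not stated in the quote: that the 14-letter first block carries 65 bits of hash, and that the hash values are uniformly distributed.

```python
import math


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday-problem estimate of at least one collision among n_items
    values drawn uniformly from a space of 2**hash_bits possibilities."""
    space = 2.0 ** hash_bits
    # -expm1(-x) evaluates 1 - exp(-x) without losing precision for small x
    return -math.expm1(-n_items * (n_items - 1) / (2.0 * space))


# Assumption: the first InChIKey block encodes 65 bits of the hash.
FIRST_BLOCK_BITS = 65

p = collision_probability(10**9, FIRST_BLOCK_BITS)
print(f"P(at least one collision in 10^9 structures) = {p:.2%}")
print(f"Roughly one collision per {1 / p:.0f} databases of that size")
```

Under these assumptions the result lands close to the quoted figure, which suggests the official number is a birthday bound on the first block. It says nothing about how likely a collision is between two specific, chemically similar structures.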
At a previous ACS Meeting Prof Jonathan Goodman from University of Cambridge announced that he had identified a collision. The collision was for two isomers of spongistatin, a rather complex chemical structure with many stereocenters.
Jonathan has “done it again”…what a troublemaker he is (in a supremely gentlemanly way!). I was fortunate enough to receive the news about this collision from him just as I was getting on the flight from ACS Denver to home tonight and asked his permission to blog it as it is both exciting and, I believe, quite surprising news. Why? In this case the collision is for two distinctly different chemicals with totally different formulae and with NO stereochemistry! Very surprising!
As you can see in the figure below the two chemical compounds are simply long branched alkyl chains, one an alcohol and one a ketone.
In case the software tool Jonathan was using to connect to the InChI generation software was doing something untoward with the molfile, I confirmed the observation myself by drawing the structures in ACD/ChemSketch and generating the InChIKeys there. And, sure enough…I see exactly the same Standard InChIKeys for both molecules as shown in the movie below. VERY interesting!
@mattoddchem has posted “Raw Data in Organic Chemistry Papers/Open Science” regarding his wanting to “share data in my field I have to think about how to best share those data”. I recommend you read his whole post before continuing.
Mat has made a number of comments (my annotations added as AJW>)
What are the downsides of posting raw data in organic chemistry, either in papers or to lab book posts:
1) You have to save the data and then upload them. Well, this was a problem in 1995, but not now.
AJW> The problem is definitely not in the technology of uploading to a website. It can be easy to do BUT there is still an activation barrier: wanting to share, can I share (do I have the rights or will my boss/supervisor be upset?), will there be negative repercussions for sharing.
2) The data files are large. Not really. A 1H NMR spectrum is ca. 200KB.
AJW> Depending on the magnetic field strength and chosen resolution, file sizes can be less than half this size or twice this size. It depends a lot on format also. Binary file formats can give you 100 KB files which blow up to half a megabyte as JCAMP (assuming none of the compressed formats are used).
3) It’s a pain. Yes, a little. But we must suffer for things we love.
AJW> Yes indeed….
4) People might find mistakes in my spectra/assignments. Yes. You’re a scientist. This is a Good Thing.
AJW> If we were afraid of the reexamination of our data we shouldn’t get into science, we shouldn’t publish and we shouldn’t share. How many scientists are like that???
Mat went on to ask what types of raw data should be posted:
“If we can establish that we should be posting raw data, then what kinds of data should we share, and how? This post is meant to outline an answer, and ask for feedback from anyone who’s already thought about this.
1) X-ray crystallography. This is the exception. Data are routinely deposited raw, and may be downloaded. Not always the case, but XRD blazes a trail here.
AJW> I’m not a crystallographer…I’m an NMR jock by training. So the term “raw data” to me means what comes off the instrument. I think that most crystallography data is processed to extract the CIF file so it’s not raw per se. In NMR raw would be the binary file format from the vendor as a FID file, the time domain data. A frequency domain, phased, baseline corrected and referenced spectrum would not be raw data. The FID is not the most useful form of the spectrum for sure, and real time processing of the FID with all necessary corrections (referencing and phasing etc) is not going to succeed for all spectra.
2) NMR spectroscopy. The big one. IUPAC recommends the JCAMP-DX file format. Jean-Claude Bradley has been a proponent of this format, and has demonstrated how it can be used in all kinds of applications.
AJW> JCAMP is the de facto standard for data exchange in NMR. Most of the vendors (probably all) can autosave their data in JCAMP or at least export in JCAMP. The third-party processing applications import and export JCAMP, some in all of its various flavors…for example, the ACD/NMR processor package has the ability to export all of the various JCAMP flavors…including real only, real and imaginary, with extensions for integrals, peaks, assignments etc. The five JCAMP forms are listed. There are of course no guarantees that all of these various JCAMP flavors can be interpreted by other programs. JCAMP has that type of complexity unfortunately.
AJW> I downloaded a number of these and read them easily into my NMR processing package that I use on my PC. I submitted the first structure in the series 2a (N-(3-Azidopropyl)-1-methylpyrrole-2-carboxamide) to ChemSpider (it wasn’t on the database already) and then added the H1 and C13 spectra with a few clicks (see here how). The spectra are flagged as Open Data and available for everyone to download and reuse. I am assuming this is okay since the PLoS article is Open Access, though the data itself isn’t flagged as Open. The data are on ChemSpider here and shown below.
My opinion of this file format (both generating it, and reading it) has not been great. There are two formats, I understand, and we found that if we saved the data in the wrong format, we couldn’t read the data with certain programs, but could with others. i.e. we had to get the generation of the file just right.
AJW> There are way more flavors than two! See above in the image. Over 10 years at ACD/Labs we had to change the JCAMP reader 10s of times to support all of the different implementations of the “standard” JCAMP. A standard set is not a standard followed…
That kind of trickiness, though small, just inevitably means people won’t bother to generate or use the files on a mass scale (unless the journals decide to back it). PDF’s popularity is based on the ubiquity of the reader. JCAMP-DX works well with Jspecview, a free, open source NMR data reader. We’ve not enjoyed our experiences with this, either, though it’s a wonderful endeavour. This led us to look at whether there was a need for saving the data in a particular format, or whether we could just save the raw data, and process those data with a free piece of software. After looking at this with our resident NMR guru, Ian Luck, we found that saving raw data is easy (it’s just a copy and paste of what’s produced by the machine) and that the raw data can be read by free software such as Spinworks or ACDLabs, obviously in addition to our in-house software. This seems ideal? Does anyone have the reason IUPAC prefers a derived data format over the raw data, other than JCAMP-DX is a single file? Aren’t raw data likely to be the most generically useful long-term?
AJW> Assuming that raw data means binary file formats from a spectrometer vendor I can comment that binary files have big issues in terms of longevity. Binary files are that way for a reason…one reason being that the vendors have proprietary information regarding their data acquisition and corrections in the format that are then efficiently dealt with by the vendor’s software. Trying to process Varian, Bruker or JEOL files from many years ago, written by their archaic software, is likely not easy even in their own modern software. Depending on the vendors to provide their binary file formats to code against generally requires permission AND legal contracts. When I joined a place of work about 20 years ago we had a room of old Winchester drives in OLD Bruker and Varian formats. Not only could we NOT read the files, we couldn’t mount the Winchester drives. What about 5.25″ drives, 3.5″ drives, optical disks etc.? Lots of that data is almost certainly rotted at this point. JCAMP does offer longevity. Even when a spectrum cannot be read, it is generally not difficult to hack the reader. Fixes are easy.
I don’t know if people have experience of this. I was in touch with one of the ACS journals recently, who indicated that their view was that the journal is not a data repository, and that posting of raw data (which was in their view to some extent desirable) should be posted elsewhere, e.g. to an institutional repository. This is an option. I think it’s less convenient. PLoS seem happy to host the data.
AJW> PLoS are hosting the data but in order to view the data you have to have a JCAMP viewer or processing package that will read the JCAMP spectrum. What a reader might like to be able to do is view the spectrum on the Supp Info page. This is of course possible using the Embed Functionality of ChemSpider’s spectra. The spectra could be posted to ChemSpider and embedded in the PLoS pages if the journal allowed it. There are multiple advantages to this including the spectra being made available to the Spectral Game.
3) IR data. Don’t know if there is a standard. If the file is small, saving raw data could be encouraged. Would allow easy comparisons of fingerprint regions.
AJW> JCAMP is also a standard for IR. In fact I think JCAMP was developed for IR first? See here for IR spectra on ChemSpider.
4) Mass spectrometry. It’s not clear to me there is a huge advantage here to sharing raw data, for a typical low res experiment?
AJW> Most mass spec for chemists is done using soft ionization for mass only. Fragmentation obviously is of more value. If it’s Electron Impact then JCAMP will suffice. See here. In general NetCDF is the preferred format but we don’t support that on ChemSpider at present.
5) HPLC data. Again, the outputs are fairly simple, and I’m not clear about the advantage of raw data (which I’m assuming would be absorbance vs. time table). Would (perhaps) permit verification that traces have not been cropped to remove pesky impurities.
AJW> Adding HPLC support would not be difficult as an HPLC curve. If it was 2D data of LC and UV that would be problematic.
6) Anything else?
AJW> Raman, Electron Spin Resonance, 2DNMR, Near infrared, Far infrared, UV-Vis etc. All of these, except ESR, are supported in ChemSpider. Check out this example of H1, C13 and THREE 2D NMR spectra!”
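As a rough illustration of why the JCAMP “flavors” discussed under the NMR point cause trouble for readers, here is a minimal sketch that collects the labelled data records of a JCAMP-DX file (lines of the form `##LABEL=value`) and reports which kind of data block the file declares. It is not the reader used by ChemSpider or ACD/Labs. The label clean-up rule (ignore case, spaces, `-`, `/` and `_`) and the reading of `NTUPLES` as the multi-page real/imaginary form are my understanding of the format and should be checked against the specification. The sample header is made up.

```python
import re


def read_jcamp_header(text: str) -> dict:
    """Collect labelled data records (##LABEL=value) from JCAMP-DX text."""
    header = {}
    for line in text.splitlines():
        line = line.split("$$", 1)[0].strip()  # drop trailing $$ comments
        if not line.startswith("##") or "=" not in line:
            continue  # data rows and continuation lines are skipped here
        label, value = line[2:].split("=", 1)
        # Normalise the label so "DATA TYPE" and "datatype" compare equal
        label = re.sub(r"[\s\-/_]", "", label).upper()
        header[label] = value.strip()
    return header


def describe_flavor(header: dict) -> str:
    """Name the data block form a file declares (a heuristic, not a validator)."""
    if "NTUPLES" in header:
        return "NTUPLES (multi-page, e.g. real and imaginary)"
    for key in ("XYDATA", "XYPOINTS", "PEAKTABLE"):
        if key in header:
            return f"{key} {header[key]}"
    return "unknown"


# Made-up header, for illustration only.
sample = """##TITLE= example spectrum
##JCAMP-DX= 5.01
##DATA TYPE= NMR SPECTRUM
##XYDATA= (X++(Y..Y))
##END=
"""
header = read_jcamp_header(sample)
print(header["DATATYPE"], "|", describe_flavor(header))
```

Even this toy reader has to make decisions about label spelling and block type, which is where real implementations of the “standard” tend to diverge.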
An alternative format of interest might be AnIML, the Analytical Information Markup Language. This format was started a number of years ago and the site does look like it is static but you can follow AnIML on Twitter and it is still alive. I am a little pessimistic that it will provide much value in the near future as the standard will first need to be released, then accepted, then tools will need to be developed/extended to support the format. The challenge then is either that some organization will need to write all the format converters for all of the vendor instrument binary file formats (and support them for tech support etc) OR the vendors themselves will need to export AnIML – with similar issues to those that arose with JCAMP where the vendors developed their flavor. Clearly a validation suite for checking AnIML files would be appropriate. Overall AnIML has promise but it has taken years to get here already and I would say it will have limited impact in the foreseeable future.
For Mat’s spectral data it would be great to deposit it en masse to ChemSpider rather than for us to deal with it one spectrum and one compound at a time. That is why we already have developed mass deposition tools. If we receive the JDX files in individual directories with the associated molfile then we can do mass deposition. The spectra are then available on ChemSpider immediately, are available to the Spectral Game and also will be fed to our NEW implementation of SpectraSchool that is presently under development and will use spectra under ChemSpider as the foundation data. I’d say the solution for Mat is mostly in place…
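The directory layout described above (one folder per compound holding the molfile and its JDX files) is easy to check before sending anything. The following is a minimal sketch of such a pre-flight check, not the ChemSpider deposition tool: the function name, the accepted file extensions and the rule of exactly one molfile per folder are assumptions for illustration.

```python
from pathlib import Path

SPECTRUM_SUFFIXES = {".jdx", ".dx"}  # assumed extensions for JCAMP-DX files


def find_deposition_units(root):
    """Pair each compound folder's molfile with its spectra.

    Returns (units, problems): units is a list of (molfile, [spectra]) for
    folders holding exactly one molfile and at least one spectrum; problems
    lists the folders that do not fit that pattern.
    """
    units, problems = [], []
    folders = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for folder in folders:
        files = sorted(f for f in folder.iterdir() if f.is_file())
        mols = [f for f in files if f.suffix.lower() == ".mol"]
        spectra = [f for f in files if f.suffix.lower() in SPECTRUM_SUFFIXES]
        if len(mols) == 1 and spectra:
            units.append((mols[0], spectra))
        else:
            problems.append(folder)
    return units, problems
```

Calling `find_deposition_units("spectra_to_deposit")` on a hypothetical folder of that name would return the pairs ready for upload plus a list of folders that need fixing first.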
It’s taken me about a month to get around to this post…our book on Collaborative Computational Technologies is shipping. The book is now available on Amazon here. The book is described in the movie posted on Slideshare here and embedded below. Sean has given the story about how the book came about on his blog.
The book is edited by Sean Ekins (a photo of a proud Sean is here), Maggie Hupcey and myself but has turned out to be a great volume because of the contributions by all of the chapter authors. You can get example chapters at the Wiley website.
The first review is already up on Amazon from @untangledhealth, Jeff Harris. His review is posted on his blog also. We’d welcome any more reviews!!!
Tonight I was trying to add a couple of publications to my Google Scholar Citations account. The interface available to me is shown below.
It would make a lot more sense to provide the ability to input a DOI and use the Crossref Resolver to retrieve all of the associated information. There is a full DOI for this purpose (we use it on ChemSpider already) so if I wanted to put our new Melting Point Nature Precedings article into Google Scholar Citations then I would simply put 10.1038/npre.2011.6229.1 into the resolver and it would link me to the article or, for Google Scholar Citations, fill in the appropriate fields. Seems like a good marriage?
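A minimal sketch of what “type in a DOI, get the fields filled in” could look like. This is not how ChemSpider or Google Scholar Citations does it. It assumes the doi.org resolver returns CSL-JSON metadata when asked via the `Accept` header (content negotiation), and the output field names are hypothetical stand-ins for whatever the entry form needs.

```python
import json
import urllib.request
from urllib.parse import quote

CSL_JSON = "application/vnd.citationstyles.csl+json"


def doi_request(doi: str) -> urllib.request.Request:
    """Build (but do not send) a metadata request for a DOI."""
    return urllib.request.Request(
        f"https://doi.org/{quote(doi, safe='/')}",
        headers={"Accept": CSL_JSON},
    )


def _first(value):
    """Some metadata services return lists where others return strings."""
    if isinstance(value, list):
        return value[0] if value else ""
    return value or ""


def citation_fields(csl: dict) -> dict:
    """Map CSL-JSON metadata onto hypothetical citation form fields."""
    authors = [
        " ".join(part for part in (a.get("given"), a.get("family")) if part)
        for a in csl.get("author", [])
    ]
    parts = csl.get("issued", {}).get("date-parts") or [[]]
    return {
        "title": _first(csl.get("title")),
        "authors": authors,
        "journal": _first(csl.get("container-title")),
        "year": parts[0][0] if parts[0] else None,
        "doi": csl.get("DOI", ""),
    }


def fetch_citation(doi: str) -> dict:
    """Resolve a DOI over the network and return the form fields."""
    with urllib.request.urlopen(doi_request(doi), timeout=10) as resp:
        return citation_fields(json.load(resp))
```

With network access, `fetch_citation("10.1038/npre.2011.6229.1")` would be the one call needed. Whether a given DOI returns complete metadata depends on what its registration agency holds.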