I’ve been involved in a number of conversations recently around how monoisotopic masses can be used and the chance of “elucidating a structure” from a molecular formula. There are some shockingly naive views of this possibility. With the availability of accurate mass determinations by mass spectrometry, and the possibility to extract a molecular formula from the data, there are some who believe it is possible to “elucidate structures” using a monoisotopic mass. Let’s clear this naivety up…
Recently I gave a presentation at a local university regarding informatics. During the presentation I asked the students how many structures could be generated “within the rules of basic organic chemistry” for some very short elemental formulae. General rules means no inappropriate valences but no limitations on the nature of the rings (except none based on 2 carbons 🙂 ) etc. EVERYONE underestimated by many factors.
While working on a structure elucidation software program, the issue of how many structures could be generated from some fairly nominal formulae became very clear. Below are some example formulae, the “correct” structure associated with the data under analysis, and the number of chemical structures that can be generated from each formula. Notice those numbers… numbers like 138,136,211,624 structures from a formula of C15H22O2!
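To make the point concrete, here is a minimal sketch of how a monoisotopic mass is computed from a formula. The function name `monoisotopic_mass` and the small element table are my own illustrative choices, not part of any particular software package. The thing to notice is that the calculation only ever sees element counts, so every one of the structures sharing a formula gets exactly the same number.

```python
import re

# Monoisotopic masses (most abundant isotope) for a few common elements.
# The table is deliberately small; extend it as needed.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
    "S": 31.97207100,
}

def monoisotopic_mass(formula):
    """Sum the monoisotopic masses for a simple formula such as 'C15H22O2'.

    Handles only flat formulae (no brackets, charges or isotope labels).
    """
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * int(count or 1)
    return total

# The mass depends on the formula alone, not on how the atoms are connected.
print(round(monoisotopic_mass("C15H22O2"), 4))
```

Nothing about connectivity, rings or stereochemistry enters the calculation, which is why going backwards from mass to structure cannot work on its own.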
Therefore, the story that a monoisotopic mass, which can give a single molecular formula, can give you an unambiguous chemical structure needs to stop. Now, that said, since we have close to 20 million structures online at present, “What is the distribution of molecular formulae across ChemSpider?” was an interesting question. So, we ran a query to determine the highest-frequency formulae. The formula C18H20N2O3 occurred 5110 times in the database, 4804 times when looking at single components only. Some representative structures are shown:
I imported the data into Excel (Office 2003) with its roughly 65,000-row limit. While there are formulae with only a single compound at the end of the full file (viewed in WordPad), at the 65,000th row the frequency was still 45 entries in the database. It’s a long tail…
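The frequency count itself is simple to reproduce on any list of formulae. The sketch below is not the query we actually ran against the database; it is a hypothetical stand-in using Python's `collections.Counter`, and the `records` list is a made-up toy input purely to show the shape of the output.

```python
from collections import Counter

def formula_frequencies(formulae):
    """Return (formula, count) pairs sorted from most to least frequent."""
    return Counter(formulae).most_common()

# Toy input for illustration only; a real run would read one formula
# per database record.
records = ["C6H12O6", "C6H12O6", "C6H12O6", "C6H8O7", "C6H8O7", "C2H4O2"]

ranked = formula_frequencies(records)
for rank, (formula, count) in enumerate(ranked, start=1):
    print(rank, formula, count)
```

Plotting count against rank for a real dataset is what exposes the long tail, and it avoids the spreadsheet row limit entirely.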
Now, many people are using monoisotopic masses to examine metabonomics data, so it may be more appropriate to do the analysis on a more restricted dataset. For example, databases of interest to metabonomics people include KEGG and HMDB. Isolating the search to such databases shows that while there is a much shorter list of unique formulae (8590), a similar distribution persists. The most common formula is C6H12O6 with 71 hits. Searching this in the database shows a number of linear and cyclic carbohydrates, some with stereo, some without, as shown below. If you are confused about “linear versus cyclic” see this Wikipedia article.
Monoisotopic mass isn’t going to provide the stereo information anyway, and all you will get is a lot of similar structures… but of course there are MANY carbohydrates with that formula. I’ve listed some of the other top formulae here and leave it to you to investigate!
C12H22O11 = 55 hits
C6H8O7 = 52 hits
C5H10O5 = 46 hits
C20H32O5 = 46 hits
C8H8O3 = 40 hits
C20H32O3 = 39 hits
C20H32O4 = 38 hits
C2H4O2 = 38 hits
C24H40O4 = 37 hits
CH4O3S = 36 hits
Bottom line… even after removing stereo issues and isolating to a small number of databases, it is still a problem to declare that a structure is elucidated just from a mass. Some form of prior knowledge or additional information, such as elution order or retention time, is necessary.
Now, this observation may not be surprising to many people. The response may be that tandem mass spectrometry would give an unambiguous structure. Unfortunately this is not true either, and in general even tandem MS (MS^n) cannot give a conclusive structure. Certainly, if stereochemistry is involved (as with many carbohydrate molecules) you are still stuck. While library look-ups using monoisotopic mass ARE valuable, and tandem MS adds more criteria for structure identification, neither is unambiguous.
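As a rough illustration of what a mass-based library look-up returns, here is a small sketch. The `LIBRARY` dictionary, the `lookup` function and the 5 ppm default tolerance are all assumptions of mine for the example; a real search would run against a full database rather than five hand-picked compounds.

```python
import re

# Monoisotopic masses for the elements used in this tiny example.
MONO = {"C": 12.0, "H": 1.00782503207, "O": 15.99491461956}

def monoisotopic_mass(formula):
    """Monoisotopic mass of a flat formula such as 'C6H12O6'."""
    return sum(MONO[el] * int(n or 1)
               for el, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula))

# A hypothetical five-entry library; names map to molecular formulae.
LIBRARY = {
    "glucose": "C6H12O6",
    "fructose": "C6H12O6",
    "galactose": "C6H12O6",
    "citric acid": "C6H8O7",
    "sucrose": "C12H22O11",
}

def lookup(measured_mass, ppm=5.0):
    """Return every library entry whose mass is within `ppm` of the measurement."""
    hits = []
    for name, formula in LIBRARY.items():
        mass = monoisotopic_mass(formula)
        if abs(mass - measured_mass) / mass * 1e6 <= ppm:
            hits.append(name)
    return sorted(hits)

# A single accurate mass returns several candidates, not one structure.
print(lookup(180.0634))
```

Even in this tiny library the measured mass matches three sugars, and tightening the tolerance does not help because their masses are identical.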