Archive for category ChemSpider Chemistry
Last week I was in the United Kingdom for numerous meetings and at the end of the week struggled to drive north to Macclesfield to the AstraZeneca site there to give a presentation on ChemSpider for an old colleague of mine from the Eastman Kodak company. I had not seen Tony Bristow in well over a decade but we reminisced about the good old days at Kodak (Tony worked in Harrow, UK and I worked in Rochester, NY. Tony is a Mass Spectrometrist and I am an NMR Spectroscopist by training). We also discussed how scientists are increasingly tapping into the ChemSpider resource to aid in the identification of chemical compounds using, especially, Mass Spectrometry. We have numerous examples now of when people are solving their structure ID issues directly by searching ChemSpider and are building up a portfolio of success stories.
The presentation I gave is below and loaded on SlideShare in case you want to download it.
As discussed over on the ChemSpider blog we will soon be depositing data from the SORD databases (Selected Organic Reactions Database) onto ChemSpider. This will be done as two separate but related datasets under the SORD data source: Reactants and Products. If you don’t know what SORD is then who better to explain than Dick Wife, the “host” of the SORD database. Dick wrote the article below to provide an overview of what SORD is…ENJOY!
The Selected Organic Reactions (SOR) Database: capturing “Lost Chemistry”
A new database is capturing the 80% of Lost Chemistry from theses and dissertations which doesn’t make it into publications and chemists who contribute their data get access to the entire database for free.
SORD, an independent Dutch company, is carefully selecting the synthetic chemistry focused on Life Science research and making this chemistry available in their Selected Organic Reactions (SOR) Database. For the theses/dissertations which they select, all of the reactions in the Experimental section are excerpted. This means there will still be a small overlap of data with full publications. There will also be a larger overlap with publications such as Notes, Letters or Communications but these do not contain the experimental details. The SOR Database brings all this chemistry to the desktop, every last detail written by the author.
Some time back, SORD looked at around 300k interesting drug-like compounds in the literature and which countries they had come from, and the native language. The English-speaking countries accounted for only 37% of the total. German/Swiss dissertations are often written in English but this is new. The theses and dissertations in the other languages represent more than half of the total. SORD routinely translates German and French experimental texts into English. They are about to start on Chinese and Japanese translations and, if anyone can give them access to Russian theses, they will translate these as well!
A thesis or dissertation is the result of several years of hard work by a research student under the constant supervision of the research leader whose reputation is at stake if the work described is wrong or inaccurate. It is also examined by a committee who decide on awarding the degree, or not. They scrutinize closely the Results & Discussion as well as the Experimental sections. The chemistry is reliable.
Advanced Chemistry Development, Inc (ACD/Labs) is partnering SORD in developing this Database. The SOR Database is available for in-house use with ChemFolder Enterprise or on the Internet with ACD/Web Librarian™. This is a screen-shot of a typical SOR Database record in Web Librarian.
The Reaction Scheme shows every atom (there are no abbreviations). The Experimental text is edited to ASCII format and the key parameters (Reagent(s), Solvent(s), yield(s), MP(s) and Optical Rotation(s)) are displayed in separate Fields, as are the full bibliographic data, making data-mining possible. There is also a link which enables the user to bring up the PDF of each reaction, containing all of the spectral and other physical data which SORD does not excerpt. The PDF link is a powerful and unique feature of the SOR Database.
Now some explanation about SORD’s excerption rules. What they call the Reaction Scheme (A + B → C, etc.) contains only the reacting and product compound structures. A Reagent is an essential reaction component of which no part ends up in the product – if it does, it becomes a Reactant! When several reactions are performed before the product is isolated (and characterized) the Reagents and Solvents are listed in Steps. Failed reactions are not excerpted but reactions with poor yields are.
The SOR Database currently contains 170k reactions; the target is one million at the end of 2013. Even this number is a lot smaller than what you find today in the major commercial reaction databases. Back in the nineties, SORD researchers looked at one such large commercial database which then contained 9 million compounds. Sifting through the content for drug-like compounds resulted in just 450k or 5% of the records. Size is one database metric; quality is much more important! In the SOR Database, you will only find characterized products – and no polymers, or compounds with no molecular structure.
Users of the SOR Database also have access to the separate databases which contain the Reagents (ca. 3,000) and Solvents (ca. 450) which have been encountered so far. Often a Reagent is a catalyst (organic/organometallic) but they can also be simple entities like bases, acids, ammonium salts, etc. or complex chiral ligands. Authors give Reagents many different names and so each Reagent (and Solvent) in the SOR Database has been assigned a unique name. This enables rapid searches using the assigned names, again a novel feature of the database. Such searches can bring you to really nice chemistry.
As an Example, the second generation Grubbs olefin metathesis catalyst has been given the name Grubbs 2 catalyst. In the current SOR Database, there are more than 500 reactions where it has been used. Some of these are straightforward; some are not and generate novel ring systems like this one from the Martin group at North Carolina at Chapel Hill:
Searches in the Reaction Scheme, or using Reagent/Solvent names and hit refinement, bring you to new chemistry which until now was only found on a dusty shelf in a library. The “Lost Chemistry” is now getting smaller as SORD carefully selects and excerpts the reactions which deserve a new life. The SOR Database is essential for novelty searches and it is a powerful supplement for the other commercial reaction databases.
Finally some more good news for academic research chemists; your data will be readily accessible to the whole chemical world who will cite your work in their publications. The chemistry which you never published may be just what others are looking for. Routinely SORD excerpts the complete collection of theses and dissertations from research supervisors; they will be more than happy to see your work appear in the next SOR Database!
 de Laet, A.; Hehenkamp, J. J.; Wife, R. L. Finding Drug Candidates in Lost/Emerging Chemistry. J. Heterocycl. Chem. 2000, 37, 669–674.
An early view screencast of the functionality of ChemSpider Mobile is now available. New movies showing the details of the app will follow in the near future but this is an early view for interested parties.
Following on from the many comments made about the recent post about the NPC Browser, Markus Sitzmann highlighted a “fun molecule” that he found on ChemSpider. It was here as ChemSpiderID 19053748 shown below but it has now been deprecated…I logged in and deprecated it.
Markus also commented on Sean Ekins’ blog here:
“Well, particularly ChemSpider belongs to the group of “polluters” in PubChem. Count the number of Aspirin, Benzene or Ethanol structures submitted by ChemSpider to PubChem (only linking to a “deprecated” ChemSpider record). Or make an advanced search for ChemSpider records containing also Argon, here is an example:
There are many other examples.”
Markus is CORRECT. I have commented on this publicly myself on a number of occasions and many people have noticed that there are data in PubChem that are in error and originally came from ChemSpider. There’s no point denying it as it’s there for all to see! We have had the intention for a LONG time to deprecate this data from PubChem and replace it with an updated deposition of cleaner data. The intention remains but the challenge is finding the time to do it. We will do it.
Where did the data come from? These “argon” issues are really NOT argon issues…they are the results of molfiles finding their way into ChemSpider from “patent molecules” where the -Ar is expected to represent a Markush structure where Ar means “Aryl”. This is like -Alk meaning alkyl. Similar issues arise when molecules are drawn as -X, -Y and -Z and lists of X, Y, Z substitutions are given. For example X = CH3, C2H5; Y = F, Br; and Z = Br, Cl. Unfortunately Y is not only a substitution, it’s an element, Yttrium. So when a molecule is drawn with a supposed Markush bond to -Y then we have a REAL molecule with Yttrium attached. Agh.
A list of examples of “interesting Ar molecules” is shown below.
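The placeholder collision described above can be sketched in a few lines of Python. This is only an illustration and not how ChemSpider curates records: the function name, the `(symbol, bond_count)` input shape and the lookup table are all hypothetical. The Ar and Y entries come from the examples in this post; Ac and Pr are extra illustrative entries added on the assumption that they can collide in the same way.

```python
from typing import List, Tuple

# Labels that are valid element symbols but are also drawn as placeholders
# or abbreviations. Ar and Y are the cases discussed in the post; Ac and Pr
# are additional, assumed examples. The table is illustrative, not exhaustive.
SUSPECT_LABELS = {
    "Ar": "aryl placeholder read as argon",
    "Y": "variable substituent read as yttrium",
    "Ac": "acetyl abbreviation read as actinium",
    "Pr": "propyl abbreviation read as praseodymium",
}


def flag_suspect_atoms(atoms: List[Tuple[str, int]]) -> List[Tuple[int, str, str]]:
    """Return (index, symbol, reason) for bonded atoms whose symbol is suspect.

    `atoms` is a hypothetical, simplified view of a structure record:
    one (element symbol, number of bonds) pair per atom.
    """
    flagged = []
    for index, (symbol, bond_count) in enumerate(atoms):
        # An isolated atom is left alone; a bonded one looks like a
        # placeholder that was converted into a real element.
        if symbol in SUSPECT_LABELS and bond_count >= 1:
            flagged.append((index, symbol, SUSPECT_LABELS[symbol]))
    return flagged


# Toy record, invented for illustration only.
record = [("C", 4), ("C", 3), ("Ar", 1), ("O", 2), ("Y", 1)]
for index, symbol, reason in flag_suspect_atoms(record):
    print(f"atom {index}: {symbol} -> {reason}")
```

A flag like this only nominates records for a human to look at; a genuine organoyttrium compound would be flagged too.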
At this point these have all been deprecated…takes about 30 seconds per molecule..but if they were in our original deposition to PubChem they are still there until we deprecate. Ahh…the ongoing joys of data curation.
I’ve been involved in a number of conversations recently around how monoisotopic masses can be used and the chance of “elucidating a structure” from a molecular formula. There are some shockingly naive views of this possibility. With the availability of accurate mass determinations by mass spectrometry, and the possibility to extract a molecular formula from the data, there are some who believe it is possible to “elucidate structures” using a monoisotopic mass. Let’s clear this naivety up…
Recently I gave a presentation at a local university regarding informatics. During the presentation I asked the students how many structures could be generated “within the rules of basic organic chemistry” for some very short elemental formulae. The general rules mean no inappropriate valences but no limitations on the nature of the rings (except none based on 2 carbons), etc. EVERYONE underestimated by many factors.
While working on a structure elucidation software program the issue of how many structures could be generated from some fairly nominal formulae became very clear. Below are some example formulae, the “correct” structure associated with the data under analysis and the number of chemical structures that can be generated from this formula. Notice those numbers….numbers like: 138,136,211,624 structures from a formula of C15H22O2 !
Therefore, the story that monoisotopic mass, which can give a single molecular formula, can give you an unambiguous chemical structure needs to stop. Now, that said, since we have close to 20 million structures online at present the question “What is the distribution of molecular formulae across ChemSpider?” was an interesting one. So, we ran a query to determine the highest frequency of formulae. The formula C18H20N2O3 occurred 5110 times in the database, 4804 times when looking at single components only. Some representative structures are shown:
I imported the data into Excel (Office 2003) with a 65000 row limit. While there are single molecular formula compounds in the list at the end of the file (viewed in wordpad) at the 65000′th row the frequency was still 45 entries in the database. It’s a long tail..
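For anyone who wants to repeat this kind of tally on their own export, a minimal Python sketch follows. It assumes you already have one molecular formula string per record; the toy list is invented for illustration and is not ChemSpider data, and the helper names are hypothetical.

```python
from collections import Counter


def formula_frequencies(formulae):
    """Return (formula, count) pairs sorted from most to least frequent."""
    return Counter(formulae).most_common()


def formulae_with_at_least(ranked, threshold):
    """Count how many distinct formulae are shared by `threshold` or more records."""
    return sum(1 for _, count in ranked if count >= threshold)


# Toy input, invented for illustration only.
toy_formulae = ["C6H12O6", "C2H4O2", "C6H12O6", "C6H12O6", "C2H4O2", "CH4O3S"]

ranked = formula_frequencies(toy_formulae)
for formula, count in ranked:
    print(formula, count)

# How long is the tail of formulae shared by two or more records?
print(formulae_with_at_least(ranked, 2))
```

Counting in a script rather than a spreadsheet also avoids the row limit mentioned above.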
Now, many people are using monoisotopic masses to examine metabonomics data so it may be more appropriate to do the analysis on a more restricted dataset. For example, databases of interest to metabonomics people include KEGG and HMDB. Isolating the search to such databases shows that while there is a much shorter list of unique formulae (8590) a similar distribution persists. The most common formula is C6H12O6 with 71 hits. Searching this in the database shows a number of linear and cyclic carbohydrates, some with stereo, some without as shown below. If you are confused about “linear versus cyclic” see this Wikipedia article.
Monoisotopic mass isn’t going to provide the stereo information anyway and all you will get is a lot of similar structures…but of course there are MANY carbohydrates with that formula. I’ve listed some of the top formulae here and leave it to you to investigate!
C12H22O11 = 55 hits
C6H8O7 = 52 hits
C5H10O5 = 46 hits
C20H32O5 = 46 hits
C8H8O3 = 40 hits
C20H32O3 = 39 hits
C20H32O4 = 38 hits
C2H4O2 = 38 hits
C24H40O4 = 37 hits
CH4O3S = 36 hits
Bottom line…even after removing stereo issues and restricting the search to a small number of databases, it is still a problem to declare that a structure is elucidated from a mass alone. Some form of prior knowledge or additional information, such as elution order or time, is necessary.
Now, this observation may not be surprising to many people. The response may be that tandem Mass Spectrometry would give an unambiguous structure. Unfortunately this is also not true: in general even tandem MS (MS^n) cannot give a conclusive structure. Certainly, if stereochemistry is involved (as with many carbohydrate molecules) you are still stuck. While library look-ups using monoisotopic mass ARE valuable, and tandem MS adds more criteria for structure identification, neither is unambiguous.
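To make the formula-to-mass step concrete, here is a small Python sketch that computes a monoisotopic mass from a formula string. It is a simplified illustration under stated assumptions: the parser only handles flat formulae such as those listed above (no brackets, charges or isotope labels), the mass table covers only the elements that appear in this post, rounded to six decimals, and the function names are hypothetical.

```python
import re

# Mass of the most abundant isotope of each element, in unified atomic
# mass units, rounded to six decimals. Only elements used in this post.
MONOISOTOPIC_MASS = {
    "H": 1.007825,
    "C": 12.000000,
    "N": 14.003074,
    "O": 15.994915,
    "S": 31.972071,
}


def parse_formula(formula):
    """Turn a flat formula such as 'C6H12O6' into {'C': 6, 'H': 12, 'O': 6}."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        # A missing number means a single atom of that element.
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def monoisotopic_mass(formula):
    """Sum the monoisotopic masses of all atoms in a flat formula."""
    return sum(MONOISOTOPIC_MASS[element] * count
               for element, count in parse_formula(formula).items())


# Glucose and fructose are different compounds with the same formula,
# so the calculation cannot tell them apart.
compounds = {"glucose": "C6H12O6", "fructose": "C6H12O6"}
for name, formula in compounds.items():
    print(name, formula, round(monoisotopic_mass(formula), 5))
```

The mapping runs one way only: every structure has one formula and one monoisotopic mass, but going back from the mass you recover at best the formula, never the connectivity or the stereochemistry.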
I love Wikipedia. I use it at least half a dozen times a week…probably more of late. That said I have previously questioned the level of curation of the data on Wikipedia. (2,3) I DO believe that contributors to Wikipedia are making valiant efforts to ensure the quality of the data but I also believe that tools must be developed soon, or processes developed to ensure the quality of the data. Here’s why…
This is the chemical structure of Mupirocin on Wikipedia. Now, if you bothered to redraw that chemical structure in a drawing package that shows the molecular mass (like I did) then you would see that it is NOT what is listed in the DrugBox.
The structure, molecular formula and molecular mass are shown below taken directly from Free ChemSketch but of course all the drawing packages can do this!
Looking on ChemSpider I found three structures (two are identical but not yet deduplicated – this is presently going on in the background). Two are shown below…
Structure 16739332, the top structure, is the correct one while the bottom one is in error. The structure comes from one data source only – DrugBank. Previously for Taxol, DrugBank contained the correct version of the structure. The problem is that ALL of our systems, including ChemSpider, have issues like this….we all have errors and they need curation. Wikipedia is great…the changes were made by me tonight…see here. I added an IUPAC Name, removed the link to DrugBank and updated the molecular mass.
I am committed to assisting in the curating of Wikipedia…many of us are. However, I think there must be a better way and will continue my discussions with the Wikipedia Chemistry Team to get access to all of the chemical compounds on Wikipedia if possible and validate the data in a batch using ChemSpider and associated tools.
I was approached today with a question regarding the contents of the ChemSpider database. I have commented previously about the fact that there are quality issues based on some of the depositions but that these are being cleaned up fairly quickly because of the efforts of our curation processes, both robotic and manual. The question was regarding the fact that there were two structures on ChemSpider with the registry number 34090-76-1. This is not uncommon. There are occasions when a registry number is appropriate for a particular salt form while the associated structure is the neutral compound. So, the registry number will be on the database for both the neutral compound and the salt. However, this situation was different…it was down to the position of the double bond. The person was out to confirm the position of that double bond. It was not easy for me to confirm.
What was MORE confusing was that the person had already extracted information from an STN Registry Search. That search provided the following information:
CAS Name: 1,3-Isobenzofurandione, tetrahydro-5-methyl- (CA INDEX NAME)
Other listed names:
Cyclohexene-1,2-dicarboxylic anhydride, 4-methyl- (8CI)
4-Methyltetrahydrophthalic acid anhydride
And the following structure:
Compare this structure with the other two off of ChemSpider shown below in the array of three.
Every single name from STN describes a “tetrahydro” compound so there needs to be a double bond in the molecule by default. If there isn’t then the compound is a “hexahydro” compound.
Obviously one of the alternative names for the compound was derived from phthalic acid anhydride and this suggests that the “missing double bond” should be at the ring junction as shown.
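The tetrahydro versus hexahydro argument can be checked with a quick count of double bond equivalents (rings plus pi bonds). The Python sketch below is an illustration only. The two molecular formulae are my own derivation from the names (phthalic anhydride C8H4O3, plus four or six hydrogens, plus a methyl group) and are not taken from the STN record, and the function name is hypothetical.

```python
def double_bond_equivalents(carbons, hydrogens, nitrogens=0, halogens=0):
    """Rings plus pi bonds for a neutral C/H/N/O/S/halogen formula.

    Divalent atoms such as oxygen and sulfur do not change the count,
    so they are not parameters.
    """
    return (2 * carbons + 2 + nitrogens - hydrogens - halogens) / 2


# Formulae derived from the names; an assumption, not STN data.
tetrahydro = double_bond_equivalents(carbons=9, hydrogens=10)  # C9H10O3
hexahydro = double_bond_equivalents(carbons=9, hydrogens=12)   # C9H12O3

# The bicyclic anhydride skeleton itself accounts for two rings
# and two C=O groups.
SKELETON_EQUIVALENTS = 4

print("tetrahydro C=C bonds:", tetrahydro - SKELETON_EQUIVALENTS)
print("hexahydro C=C bonds:", hexahydro - SKELETON_EQUIVALENTS)
```

Under these assumptions the tetrahydro formula has exactly one unit of unsaturation beyond the skeleton, which is the C=C bond the names require, and the hexahydro formula has none. The count says nothing about where in the ring that double bond sits, which is the open question here.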
Included in the STN record is the tag “IDS” in the “CI” or “chemical Indexing” field. The term IDS stands for “Incompletely Defined Substance”. So, this is an example of a registry number being allocated to a compound that, in this case, is known to have an additional double bond. The double bond is not shown on the chemical structure displayed in the STN search results, but the IDS tag declares the substance as being “incompletely defined”. Some might say that the fact that ChemSpider has two structures associated with the registry number, each with the double bond in a different position, is appropriate. But likely those specific compounds have their OWN registry numbers. So, what should we do?
1) Remove the registry number 34090-76-1 associated with both structures?
2) Leave as is?
3) Add a new term IDS for such records and submit the new incompletely defined substance as a new form of structure?
4) Add NEW registry numbers associated with the individual structures (which someone will need to source since I don’t have them)
5) Something else?
I welcome any or all input. Based on input I will simply login to ChemSpider, make the edit and the information is changed (for addition or removal of identifiers). By working together like this there is an iterative improvement in the quality of structure-name pairs for the benefit of chemists, just as shown with the recent Wikipedia examination of Taxol.
This is a declaration of intent: ChemSpider will start hosting so-called “Focused Libraries” in the very near future. The focused libraries will contain a set of compounds with in silico predicted affinities for specific protein targets. The availability of focused libraries can dramatically reduce how many compounds might require experimental examination for activity. The first sets of Focused Libraries have been supplied to us by Otava Chemicals.
This is part of our path to offering new services via ChemSpider. Discussions are underway to integrate with online docking, to explore the possibility of offering synthetic feasibility analysis and to expand the growing list of services integrated with ChemSpider due to the kindness and support of our collaborators.
Attention synthetic organic chemists. Most scientists have skeletons in the closet: problems they cannot solve and observations they cannot explain. A couple of years ago I was involved in a project to solve the structure of a compound. It had remained unresolved by manual interrogation of the NMR data for over a decade. The application of a computer-assisted structure elucidation system helped resolve the structure. It is described in detail in this publication.
Now, we THINK we have it elucidated correctly. However, we would like to confirm it. Synthesis of the molecule in question, further NMR data generation and a crystal structure would help finish this work fully. This is a call to organic chemists to participate in a hobby project. Anybody want to help? We guarantee a publication etc. The structure is shown below. Contact me at antonyDOTwilliamsATChemspiderDOTcom. Thanks!