Second talk delivered today at ACS Philadelphia…
Mining public domain data as a basis for drug repurposing
Online databases containing high throughput screening and other property data continue to proliferate in number. Many pharmaceutical chemists will have used databases such as PubChem, ChemSpider, DrugBank, BindingDB and many others. This work will report on the potential value of these databases for providing data to be used to repurpose drugs using cheminformatics-based approaches (e.g. docking, ligand-based machine learning methods). This work will also discuss the potentially related applications of the Open PHACTS project, a European Union Innovative Medicines Initiative project, that is utilizing semantic web based approaches to integrate large scale chemical and biological data in new ways. We will report on how compound and data quality should be taken into account when utilizing data from online databases and how their careful curation can provide high quality data that can be used to underpin the delivery of molecular models that can in turn identify new uses for old drugs.
I had the privilege of co-chairing a session on Mobile Chemistry today with Harry Pence. The session was titled “Mobile devices, augmented reality, and the mobile classroom”.
Putting chemistry into the hands of students – chemistry made mobile using resources from the Royal Society of Chemistry
The increasing prevalence of mobile devices offers the opportunity to provide chemistry students with easy access to a multitude of resources. As a publisher the RSC provides a myriad of content to chemists including an online database of over 26 million chemical compounds, tools for learning spectroscopy and access to scientific literature and other educational materials. This presentation will provide an overview of our efforts to make RSC content more mobile, and therefore increasingly available to chemists. In particular it will discuss our efforts to provide access to chemistry related data of high value to students in the laboratory. It will include an overview of spectroscopy tools for the review and analysis of various forms of spectroscopy data.
Chemistry is complex. Anybody who has been involved with the creation of electronic datafiles containing thousands of chemical compounds and associated data (chemical names, properties etc) will tell you that errors creep in. ChemSpider has >28 million unique chemical entities and these have been sourced from many different places/groups/individuals. Some of these have been deprecated as we have determined, both manually and algorithmically, that the data are in error. Over the years we have learned a lot about data quality and ways in which algorithms can be applied to data prior to deposition on ChemSpider.
Some obvious structure-based errors that can be checked for would include: hypervalency (e.g. pentavalent carbons), charge imbalance (a compound has no neutralizing counterion for example), absence of stereochemistry (e.g. a compound with 12 possible stereocenters only has one assigned). There are many other such errors that can be detected algorithmically. It’s the old adage of why apply a human to what a computer can fix. With this in mind we have been working on a system called the ChemSpider Validation and Standardization Platform (CVSP for short). This system will serve multiple purposes. It will be one of the foundation blocks for checking structure-based data for our publications (i.e. catch bad chemistry before it is published!), it will be used for validating chemistry for our databases (Natural Product Updates, Methods in Organic Synthesis and Catalysts and Catalyzed Reactions), it will be used to check and validate depositions going into ChemSpider, it will serve data related to the Open PHACTS project and it will serve the community by providing an online website where you can upload your own SDF files (and other file formats in future) to validate the structures.
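To make the three checks named above concrete, here is a minimal sketch in Python. It is not the CVSP implementation: the `Atom` record, the `MAX_BONDS` table and the `validate` function are hypothetical names invented for illustration, and the valence table covers only a handful of neutral atoms.

```python
from dataclasses import dataclass

# Illustrative only: maximum bond-order sums for a few neutral atoms.
MAX_BONDS = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1}


@dataclass
class Atom:
    element: str
    bond_order_sum: int            # total bond order, implicit hydrogens included
    charge: int = 0                # formal charge
    is_stereocenter: bool = False  # could this atom be a stereocenter?
    stereo_assigned: bool = False  # has its configuration been specified?


def validate(atoms):
    """Return a list of human-readable issues found in one structure."""
    issues = []

    # 1. Hypervalency (e.g. pentavalent carbon); charged atoms are skipped here.
    for i, atom in enumerate(atoms):
        limit = MAX_BONDS.get(atom.element)
        if limit is not None and atom.charge == 0 and atom.bond_order_sum > limit:
            issues.append(
                f"hypervalent {atom.element} at atom {i}: {atom.bond_order_sum} bonds"
            )

    # 2. Charge imbalance: formal charges should sum to zero for the record.
    net = sum(atom.charge for atom in atoms)
    if net != 0:
        issues.append(f"charge imbalance: net charge {net:+d}")

    # 3. Missing stereochemistry: possible stereocenters left unassigned.
    centers = [atom for atom in atoms if atom.is_stereocenter]
    assigned = sum(atom.stereo_assigned for atom in centers)
    if centers and assigned < len(centers):
        issues.append(
            f"incomplete stereochemistry: {assigned} of {len(centers)} stereocenters assigned"
        )
    return issues


if __name__ == "__main__":
    suspect = [Atom("C", 5), Atom("N", 4, charge=1), Atom("C", 4, is_stereocenter=True)]
    for issue in validate(suspect):
        print(issue)
```

A real validator would read the atoms and bonds from an SDF file with a cheminformatics toolkit and would treat each issue as a flag for review rather than as proof of an error.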
I won’t go into detail here about all of the functionality and capability of the system as we will discuss this in further detail on this blog. However, we will be unveiling the system in its present form at the ACS meeting in Philadelphia. Come along and meet some of the team involved in building CVSP and give us your feedback!
I will be presenting on “Teaching NMR spectroscopy using online resources from the Royal Society of Chemistry” later this afternoon at the ACS meeting here in San Diego. I have just finished the acknowledgment slide and it is always fun to see just how collaborative some of our ventures are. In order to provide community resources such as ChemSpider (and its integrated projects such as the Learn Chemistry Wiki and SpectralGame) we depend on willing, creative and collaborative minds. And we work with some of the best!
The acknowledgments slide includes:
Alexey Pshenichnov, University of Leicester and Richard Oakley – SpectraSchool
Aileen Day and Martin Walker – Learn Chemistry Wiki
Gary Allred and Chi Wang – Synthonix Data
Ryan Sasaki, Sergey Golotvin, Pranas Japertas (ACD/Labs) – Bulk data processing and Spectral Display Widget
Depositors of data – there are many!
We thank all of these individuals and their companies for supporting our efforts!
I will post the presentation to my SlideShare account later today and then embed it in the blog too…but, if you are in San Diego come and hear the talk…
PAPER ID: 10893
PAPER TITLE: “Teaching NMR spectroscopy using online resources from the Royal Society of Chemistry” (final paper number: 61)
DAY & TIME OF PRESENTATION: March 25, 2012 from 2:50 pm to 3:10 pm
LOCATION: Westin San Diego, Room: Diamond II
I just gave a presentation at the NFAIS conference in Philadelphia with the conference focus being “Born of Disruption: An Emerging New Normal for the Information Landscape”. I was on a panel with Lee Dirks from Microsoft Research and Kristen Fisher-Ratan from PLoS. Both gave very interesting talks and it was a pleasure to be on the panel with them.
My talk was entitled “Crowdsourcing Chemistry for the Community – 5 Years of Experiences” with the abstract below.
“ChemSpider is one of the internet’s primary resources for chemists. ChemSpider is a structure-centric platform and hosts over 26 million unique chemical entities sourced from over 400 different data sources and delivers information including commercial availability, associated publications, patents, analytical data, experimental and predicted properties. ChemSpider serves a rather unique role to the community in that any chemist has the ability to deposit, curate and annotate data. In this manner they can contribute their skills, and data, to any chemist using the system. A number of parallel projects have been developed from the initial platform including ChemSpider SyntheticPages, a community generated database of reaction syntheses, and the Learn Chemistry wiki, an educational wiki for secondary school students.
This presentation will provide an overview of the project in terms of our success in engaging scientists to contribute to crowdsourcing chemistry. We will also discuss some of our plans to encourage future participation and engagement in this and related projects.”
The talk is embedded below. I thank the organizers for the ability to ask questions during the talk and get responses using a clicker feedback system (I didn’t realise ahead of time that the questions would consume a few seconds and ran over on my talk..agh). I will get the answers to the questions and post them in a separate post. Interesting answers…
Last week I was in the United Kingdom for numerous meetings and at the end of the week struggled to drive north to Macclesfield to the AstraZeneca site there to give a presentation on ChemSpider for an old colleague of mine from the Eastman Kodak Company. I had not seen Tony Bristow in well over a decade but we reminisced about the good old days at Kodak (Tony worked in Harrow, UK and I worked in Rochester, NY. Tony is a Mass Spectrometrist and I am an NMR Spectroscopist by training). We also discussed how scientists are increasingly tapping into the ChemSpider resource to aid in the identification of chemical compounds, especially using Mass Spectrometry. We now have numerous examples of people solving their structure ID issues directly by searching ChemSpider, and we are building up a portfolio of success stories.
The presentation I gave is below and loaded on SlideShare in case you want to download it.
As discussed over on the ChemSpider blog we will soon be depositing data from the SORD database (Selected Organic Reactions Database) onto ChemSpider. This will be done as two separate but related datasets under the SORD data source: Reactants and Products. If you don’t know what SORD is then who better to explain than Dick Wife, the “host” of the SORD database. Dick wrote the article below to provide an overview of what SORD is…ENJOY!
The Selected Organic Reactions (SOR) Database: capturing “Lost Chemistry”
A new database is capturing the 80% of Lost Chemistry from theses and dissertations which doesn’t make it into publications and chemists who contribute their data get access to the entire database for free.
SORD, an independent Dutch company, is carefully selecting the synthetic chemistry focused on Life Science research and making this chemistry available in their Selected Organic Reactions (SOR) Database. For the theses/dissertations which they select, SORD excerpts all of the reactions in the Experimental section. This means there will still be a small overlap of data with full publications. There will also be a larger overlap with publications such as Notes, Letters or Communications but these do not contain the experimental details. The SOR Database brings all this chemistry to the desktop, every last detail written by the author.
Some time back, SORD looked at around 300k interesting drug-like compounds in the literature and which countries they had come from, and the native language. The English-speaking countries accounted for only 37% of the total. German/Swiss dissertations are often written in English but this is new. The theses and dissertations in the other languages represent more than half of the total. SORD routinely translates German and French experimental texts into English. They are about to start on Chinese and Japanese translations and, if anyone can give them access to Russian theses, they will translate these as well!
A thesis or dissertation is the result of several years of hard work by a research student under the constant supervision of the research leader whose reputation is at stake if the work described is wrong or inaccurate. It is also examined by a committee who decide on awarding the degree, or not. They scrutinize closely the Results & Discussion as well as the Experimental sections. The chemistry is reliable.
Advanced Chemistry Development, Inc (ACD/Labs) is partnering SORD in developing this Database. The SOR Database is available for in-house use with ChemFolder Enterprise or on the Internet with ACD/Web Librarian™. This is a screen-shot of a typical SOR Database record in Web Librarian.
The Reaction Scheme shows every atom (there are no abbreviations). The Experimental text is edited to ASCII format and the key parameters (Reagent(s), Solvent(s), yield(s), MP(s) and Optical Rotation(s)) are displayed in separate Fields, as are the full bibliographic data, making data-mining possible. There is also a link which enables the user to bring up the PDF of each reaction, containing all of the spectral and other physical data which SORD does not excerpt. The PDF link is a powerful and unique feature of the SOR Database.
Now some explanation about SORD’s excerption rules. What they call the Reaction Scheme (A + B → C, etc.) contains only the reacting and product compound structures. A Reagent is an essential reaction component of which no part ends up in the product – if it does, it becomes a Reactant! When several reactions are performed before the product is isolated (and characterized) the Reagents and Solvents are listed in Steps. Failed reactions are not excerpted but reactions with poor yields are.
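The Reagent/Reactant rule above can be written as a one-line test. The sketch below is only an illustration of that rule, not SORD's actual tooling: the `classify_component` function and the atom labels (`c1`, `o3` and so on) are hypothetical, standing in for whatever atom mapping an excerptor would use.

```python
def classify_component(component_atoms, product_atoms):
    """Reactant if any atom of the component ends up in the product, else Reagent."""
    return "Reactant" if set(component_atoms) & set(product_atoms) else "Reagent"


# Hypothetical atom labels for an acid-catalysed esterification:
# acetic acid + ethanol -> ethyl acetate, with sulfuric acid as catalyst.
product = {"c1", "c2", "o1", "o3", "c3", "c4"}  # heavy atoms of ethyl acetate

components = {
    "acetic acid": {"c1", "c2", "o1", "o2"},          # o2 leaves as water
    "ethanol": {"o3", "c3", "c4"},
    "sulfuric acid": {"s1", "o4", "o5", "o6", "o7"},  # contributes no atoms
}

roles = {name: classify_component(atoms, product) for name, atoms in components.items()}

if __name__ == "__main__":
    for name, role in roles.items():
        print(f"{name}: {role}")
```

Here the acid and the alcohol are both Reactants because some of their atoms appear in the ester, while the catalyst is a Reagent because none of its atoms do.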
The SOR Database currently contains 170k reactions; the target is one million at the end of 2013. Even this number is a lot smaller than what you find today in the major commercial reaction databases. Back in the nineties, SORD researchers looked at one such large commercial database which then contained 9 million compounds. Sifting through the content for drug-like compounds resulted in just 450k or 5% of the records. Size is one database metric; quality is much more important! In the SOR Database, you will only find characterized products – and no polymers, or compounds with no molecular structure.
Users of the SOR Database also have access to the separate databases which contain the Reagents (ca. 3,000) and Solvents (ca. 450) which have been encountered so far. Often a Reagent is a catalyst (organic/organometallic) but they can also be simple entities like bases, acids, ammonium salts, etc. or complex chiral ligands. Authors give Reagents many different names and so each Reagent (and Solvent) in the SOR Database has been assigned a unique name. This enables rapid searches using the assigned names, again a novel feature of the database. Such searches can bring you to really nice chemistry.
As an example, the second generation Grubbs olefin metathesis catalyst has been given the name Grubbs 2 catalyst. In the current SOR Database, there are more than 500 reactions where it has been used. Some of these are straightforward; some are not and generate novel ring systems like this one from the Martin group at North Carolina at Chapel Hill:
Searches in the Reaction Scheme, or using Reagent/Solvent names, and hit refinement bring you to new chemistry which until now was only found on a dusty shelf in a library. The “Lost Chemistry” is now getting smaller as SORD carefully selects and excerpts the reactions which deserve a new life. The SOR Database is essential for novelty searches and it is a powerful supplement for the other commercial reaction databases.
Finally some more good news for academic research chemists; your data will be readily accessible to the whole chemical world who will cite your work in their publications. The chemistry which you never published may be just what others are looking for. Routinely SORD excerpts the complete collection of theses and dissertations from research supervisors; they will be more than happy to see your work appear in the next SOR Database!
 de Laet, A.; Hehenkamp, J. J.; Wife, R. L. Finding Drug Candidates in Lost/Emerging Chemistry. J. Heterocycl. Chem. 2000, 37, 669–674.
An early view screencast of the functionality of ChemSpider Mobile is now available. New movies showing the details of the app will follow in the near future but this is an early view for interested parties.
Following on from the many comments made about the recent post about the NPC Browser, Markus Sitzmann highlighted a “fun molecule” that he found on ChemSpider. It was here as ChemSpiderID 19053748 shown below but it has now been deprecated…I logged in and deprecated it.
Markus also commented on Sean Ekins’ blog here:
“Well, particularly ChemSpider belongs to the group of “polluters” in PubChem. Count the number of Aspirin, Benzene or Ethanol structures submitted by ChemSpider to PubChem (only linking to a “deprecated” ChemSpider record). Or make an advanced search for ChemSpider records containing also Argon, here is an example:
There are many other examples.”
Markus is CORRECT. I have commented on this publicly myself on a number of occasions and many people have noticed that there are data in PubChem that are in error and originally came from ChemSpider. There’s no point denying it as it’s there for all to see! We have had the intention for a LONG time to deprecate this data from PubChem and replace it with an updated deposition of cleaner data. The intention remains but the challenge is finding the time to do it. We will do it.
Where did the data come from? These “argon” issues are really NOT argon issues…they are the results of molfiles finding their way into ChemSpider from “patent molecules” where the -Ar is expected to represent a Markush structure where Ar means “Aryl”. This is like -Alk meaning alkyl. Similar issues arise when molecules are drawn as -X, -Y and -Z and lists of X, Y, Z substitutions are given. For example X=CH3, C2H5, Y=F, Br and Z=Br, Cl. Unfortunately Y is not only a substituent label, it’s also an element, Yttrium. So when a molecule is drawn with a supposed Markush bond to -Y then we have a REAL molecule with Yttrium attached. Agh.
A list of examples of “interesting Ar molecules” is shown below.
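A check for this kind of collision is easy to sketch. The Python below is a hypothetical illustration, not how ChemSpider screens depositions: the `AMBIGUOUS_LABELS` table and the `flag_markush_suspects` function are invented names, and the input is assumed to be a simple list of (symbol, number of bonds) pairs already parsed from a molfile.

```python
# Element symbols that double as Markush placeholders in patent drawings.
# Only the two cases discussed above are listed; a real list would be longer.
AMBIGUOUS_LABELS = {
    "Ar": "may mean aryl rather than argon",
    "Y": "may mean a generic substituent rather than yttrium",
}


def flag_markush_suspects(atoms):
    """Flag bonded atoms whose symbol is also a common Markush label.

    atoms: list of (symbol, number_of_bonds) pairs for one structure.
    Returns a list of (atom_index, symbol, reason) tuples for manual review.
    """
    suspects = []
    for index, (symbol, n_bonds) in enumerate(atoms):
        # A lone argon atom is plausible; an argon with a bond almost never is.
        if symbol in AMBIGUOUS_LABELS and n_bonds > 0:
            suspects.append((index, symbol, AMBIGUOUS_LABELS[symbol]))
    return suspects


if __name__ == "__main__":
    record = [("C", 4), ("Ar", 1), ("Y", 1), ("O", 2)]
    for index, symbol, reason in flag_markush_suspects(record):
        print(f"atom {index} ({symbol}): {reason}")
```

Genuine organoyttrium compounds do exist, so a hit on Y is a prompt for a curator to look at the record, not grounds for automatic deprecation.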
At this point these have all been deprecated…takes about 30 seconds per molecule..but if they were in our original deposition to PubChem they are still there until we deprecate. Ahh…the ongoing joys of data curation.