Tag Archives: ChemSpider

Presentation at the 5th Meeting on U.S. Government Chemical Databases and Open Chemistry

Yesterday I gave a talk at the 5th Meeting on U.S. Government Chemical Databases and Open Chemistry hosted by Mark Nicklaus. It was a great meeting, with a lot of like-minded people and some great work going on to provide access to chemical databases. I’ll blog more when I get back from the ACS Denver meeting this coming week. For now I am simply putting up a copy of the talk I gave.

“ChemSpider is a structure centric database hosted by the Royal Society of Chemistry and integrating over 25 million chemical compounds to over 400 internet-based resources including many public domain databases, Wikipedia, chemical vendors, patents, publications and other web-based services. The intention is for ChemSpider to become one of the primary online hubs for chemists to source chemistry related data. During the development of the ChemSpider database we have utilized numerous approaches to standardizing, curating and validating the data supplied to us for hosting and integration. This presentation will provide an overview of our initial development of the ChemSpider database and provide an overview of our present processes and procedures for handling incoming data depositions. We will also discuss how crowdsourcing can help to expand, curate and validate the data on the ChemSpider database.”



Posted on August 27, 2011 in Publications and Presentations



Accurate Mass Measurements: Identifying “Known Unknowns” using ChemSpider

Over the past few weeks the ChemSpider team has been working hard with James Little from the Eastman Chemical Company. We have been adding new capabilities to support Mass Spectrometry searches. I will detail these capabilities in a later blog post but for now I am pointing to the POSTER that Jim presented at ASMS. It was a real pleasure working with Jim. I met him many years ago when I worked at Eastman Kodak Company, before Kodak divested Eastman Chemical (among many other things). Jim gave us great feedback, was exacting in his testing and was a gracious collaborator even as we let deadlines slip because of many other distractions.


Accurate Mass Measurements: Identifying “Known Unknowns” using ChemSpider

In very many cases, an unknown to an investigator is actually known in the chemical literature. We refer to these types of compounds as “known unknowns.” ChemSpider is a particularly good collection of “known unknowns” for the identification of compounds in commercial products, environmental matrices, etc. However, several modifications were necessary to refine the sorting of the initial search results with orthogonal filters such as the number of associated patents and references. Previously we described a similar approach using the CAS registry with either SciFinder or STN Express, but ChemSpider is a viable alternative and it is freely accessible to the public.

Accurate mass GC-MS and LC-MS measurements were performed on mixtures using, respectively, either Waters GCT or LCT (LockSpray) instrumentation. MassLynx (Waters) elemental software was used to determine molecular formulae, which were further refined by i-FIT for ranking to theoretical isotope distributions. Candidate structures were obtained by searching either molecular formulae or monoisotopic molecular weights with ChemSpider. Further data such as EI or MS/MS fragmentation, number of exchangeable protons, or sample history were used to identify the “known unknown.”

The ChemSpider database of >25 million chemicals was searched by either molecular formulae or monoisotopic molecular weights to identify “known unknowns.” The latter is an attractive approach since no subjective restrictions on the elements, the range of elements, and the double bond equivalents are required prior to the ChemSpider search to limit candidate compounds. Changes were made in ChemSpider to refine the initial candidate list by number of associated references or patents. This tended to bring more promising candidates to the top of the list. The success of these approaches was evaluated with a group of 90 compounds from literature sources, internet sites, and American Society for Mass Spectrometry Conference presentations. Furthermore, the results were compared to similar methods employed searching the Chemical Abstracts Service databases.
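
The monoisotopic mass search and reference-count sorting described in the abstract can be illustrated with a minimal Python sketch. This is not ChemSpider's implementation: the candidate records, the `refs` field, the 10 ppm tolerance and the function names are all assumptions made for illustration, and the reference counts are invented placeholders rather than real data.

```python
import re

# Monoisotopic masses (u) of the most abundant isotope of a few common elements.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
    "S": 31.97207100,
}


def monoisotopic_mass(formula):
    """Sum monoisotopic masses for a simple formula such as 'C8H10N4O2'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


def rank_candidates(measured_mass, candidates, ppm=10.0):
    """Keep candidates within a ppm window, most-referenced first."""
    tolerance = measured_mass * ppm / 1e6
    hits = [
        c for c in candidates
        if abs(monoisotopic_mass(c["formula"]) - measured_mass) <= tolerance
    ]
    # Hypothetical 'refs' field: number of associated references or patents.
    return sorted(hits, key=lambda c: c["refs"], reverse=True)


# Hypothetical candidate records; names and reference counts are placeholders.
candidates = [
    {"name": "candidate_A", "formula": "C8H10N4O2", "refs": 3},
    {"name": "candidate_B", "formula": "C8H10N4O2", "refs": 5000},
    {"name": "candidate_C", "formula": "C10H14N2O2", "refs": 40},
]

ranked = rank_candidates(194.0804, candidates)
print([c["name"] for c in ranked])
```

Searching by mass rather than by formula means no element restrictions have to be chosen up front; the reference count then acts as the orthogonal filter that pushes the better-documented isomer to the top.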



Community Views and Trust in Public Domain Chemistry Resources

Over the past 4 weeks I have been involved with some new and old friends in the world of chemistry to initiate an analysis of “quality” in public chemistry resources. This is work in progress and involves 3 separate groups (let’s call them labs) looking at various resources. Here’s a short description of the project.

The questions we are attempting to answer are:

Core question: what is the quality of data online in public chemistry databases? How accurate and unambiguous is the representation of chemical structures and their measured properties in public chemistry databases?

How capable are the present cheminformatics tools of handling the complexities of structure representations – limited to “small” organic molecules?
How hard is it to generate a reference set of highly curated, “gold standard” data (chemical structure and activity) for a database of “known drugs”?

We’ve started with the top 200 selling drugs. The three labs had to come to an agreement about which of the top 200 drugs were small molecules (many of the top 200 are monoclonal antibodies or polymers for example). We then had to decide whether we would deal with mixtures and combination drugs. Just identifying the list of NAMES of drugs we wanted to deal with was an iterative negotiation.

Then we decided that each lab would work independently, and that each lab would have at least two members working on the same problem independently. We would have both intra-lab and inter-lab comparisons. We decided to start on a set of 10 drug names, using the GENERIC name as the name to work from. I started my part of the work just before I had to give a presentation at the EBI last week and was able to gather a lot of the data before the talk.

Starting with a chemical name, how do you determine what the “correct” structure for that drug is? Think it’s easy? Try it! Where would you look? How would you confirm? What would the iterative loop look like in order for YOU to assert the chemical structure(s) for the drug “Vytorin”?

For me the process looks something like this. Use a level of “trust and experience” with previously used resources as a starting point and declare “This is the structure of X based on searching on the drug name for X”. Now, cross-reference, iterate, reiterate, find consistencies and collisions in order to come up with a final assertion, a list of consistent structures and the associated sources, and a set of other resources with inconsistent structures and a list of why they differ. Where possible, and if necessary, make edits to change the information (e.g. ChemSpider and Wikipedia). You can see an example of this for Vitamin K in the talk I gave at EBI here. For ten structures I came up with a number of observations for a number of drugs. The screenshot below summarizes some of the results (Click on the image to see the detail).

Represented in the table is the following information, which was true at the time of the search and may already be out of date:

1) A search for thalidomide in ChEBI gives no hits

2) The structures of Zocor and Crestor on Drugbank are incorrect

3) There are no hits for Voglibose and Crestor on Common Chemistry

4) There were 3 incorrect structures on ChemSpider (now edited of course)

5) For most searches on a drug name on PubChem there are MULTIPLE hits and, for the set examined, the correct structure is in the results set. For example, there are 44 structures of Taxol retrieved with the search and the one I assert to be correct is there.

6) There were two incorrect structures on Wikipedia and one drug without an associated structure.
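
The cross-referencing loop described above (collect one structure per source, then look for consistencies and collisions) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source names and structure identifiers are hypothetical placeholders rather than results from this study, and treating the most common identifier as the consensus is a simplification of the manual judgment involved.

```python
from collections import defaultdict


def compare_sources(structures_by_source):
    """Group sources by the structure identifier each one reports.

    Returns (consensus, agreeing_sources, collisions). 'collisions' maps
    every other identifier to the sources reporting it; sources with no
    hit are recorded under the key None.
    """
    groups = defaultdict(list)
    for source, identifier in structures_by_source.items():
        groups[identifier].append(source)

    # Only sources that actually returned a structure can vote.
    reported = {k: v for k, v in groups.items() if k is not None}
    if not reported:
        return None, [], dict(groups)

    # Simplification: the most frequently reported identifier wins.
    # On a tie, the identifier seen first is chosen.
    consensus = max(reported, key=lambda k: len(reported[k]))
    collisions = {k: v for k, v in groups.items() if k != consensus}
    return consensus, sorted(groups[consensus]), collisions


# Hypothetical lookups for one drug name; identifiers stand in for
# something like a canonical structure key.
lookup = {
    "source_1": "ID-A",
    "source_2": "ID-A",
    "source_3": "ID-B",
    "source_4": None,  # no hit for this name
    "source_5": "ID-A",
}

consensus, agreeing, collisions = compare_sources(lookup)
print(consensus, agreeing, collisions)
```

A majority vote is only a starting point: as the observations above show, the most common answer still needs checking against a trusted reference before it is asserted as correct.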

When I started the work I had a “trust” level for a number of the databases. My basic position at that time was as follows. I could rarely find the correct structure for a drug based on a text-based search of PubChem. I would generally find a set of hits and it was a lot of work to determine what was correct. Common Chemistry was excellent…but limited. DailyMed was generally good but structure representations could be abysmal. ChEBI, DrugBank, ChemIDPlus and Wikipedia were generally VERY good. Of all of the sources I used, despite the rich data on PubChem, I struggled most with this resource to find the correct structure. The results started to show that my trust perceptions were being challenged.

In parallel with the work to prepare this small dataset for the presentation at EBI I decided that it was appropriate to ask the community for their views on some of the databases I was looking at in this work…specifically asking for how much they “trusted” a resource. Trust means different things to different people. The word, and the question I was asking in the survey, would be interpreted in different ways. But that’s the way we work…so why fight it? The survey is online here…and if you haven’t filled it in PLEASE DO!

The answers to date, from the 46 responders, are below (Click on the image to see the detail):

There are some very interesting results in here…and, I willingly admit, some I find VERY surprising. 1 person has no experience with Wikipedia? Wow. The majority of people have not heard of PDSP, ChemIDPlus, DailyMed or DrugBank…without knowing who the people are that are providing feedback of course I should not be shocked. Most of these resources are not for chemists per se but for Life Scientists. The number of votes for “Always Trust” for ChemSpider and PubChem is very high and, one might say, is a compliment. The results are clearly ChemSpider-biased since I asked the question to my social network. The difference between the people who Always Trust PubChem and Commonly Trust PubChem is one person only. This is wildly different from my own views. I have heard people say that PubChem is the equivalent in quality to CAS except it’s free. Sorry folks…afraid not! (I have since heard at the EBI meeting from one of the people from PubChem that it is possible to do searches in certain ways to limit hits. It should be noted that this does not guarantee that the correct structure is retrieved.) On the flip side the number of people rarely trusting PubChem is also quite high so someone has had some interesting experiences!

There are a small number of people who NEVER Trust the resources, and early on one person declared they trusted none of them. I trusted myself to tell a colleague…that will be “Egon Willighagen” and this was later confirmed in his blog “Trust has No Place in Science”. That may be true, and the topic of a separate post, but my judgment is pretty good!

How would I fill in the questionnaire? I would NEVER flag “Always Trust” for any of the databases. I would be able to rank order the databases in terms of my perceptions/experience and extracted trust for the quality of results I would find. The answers WOULD be different before I had conducted the work on the first 10 drugs compared with now, after the pre-work.

As the host of ChemSpider I would prefer that no one “Always Trusts” the resource as that will stop people from taking care with the data and thereby remove the possible value of them curating the data. However I am more than happy to have it Commonly Trusted and we have been working hard to gain the community’s trust in this area.

This work has triggered a number of responses….I’ll make my own comments on their positions separately… but their opinions are worth reading:

Egon Willighagen: Trust has no place in Science

Egon Willighagen: Trust has no place in Science Part 2

Christina Pikas: The role of trust in science. Christina has a comment: “I think that Anthony (sp.) could have chosen a better word than trust in his survey. ‘which of these have you evaluated and decided you could use? which of these would you prefer to use based on your evaluations of their merit?’” Christina is right… I could have chosen a different word but I judge (chosen carefully!) that the responses would not have differed much.

There is also a healthy exchange happening on Friendfeed.

This work has only just started. An examination of >150 “small molecule drugs” by three labs is going to provide a lot of data. The work isn’t over and we have much to do. We’re learning a lot in the process about assertional loops, iterative process, collaboration and agreement. It’s a great adventure.


Posted on December 11, 2010 in Community Building, Data Quality



Finding the Structure of Vitamin K1 Online

You would think that finding the correct structure of Vitamin K1 online in public domain resources would be an easy exercise. But not so fast. Using the assertion that the chemical structure is correct in the Merck Index, and then wandering through CAS’s Common Chemistry to validate this assumption, this short movie takes us through Wikipedia, Wolfram Alpha, KEGG, DrugBank, PubChem and other online resources to show how complex and impure the public domain databases are in terms of resourcing good quality name-structure associations for chemicals. Vitamin K1 is actually a rather simple chemical structure. Finding the correct chemical structure online…not so simple.



Presentation at FACCS2010 in Raleigh

Today I gave a presentation at FACCS 2010 here in Raleigh, NC. The abstract and embedded SlideShare presentation are listed below.

Building a Community Resource of Open Spectral Data

ChemSpider is an online database of almost 25 million chemical compounds sourced from over 300 different sources including government laboratories, chemical vendors, public resources and publications. Developed with the intention of building community for chemists, ChemSpider allows its users to deposit data including structures, properties, links to external resources and various forms of spectral data. Over the past three years ChemSpider has aggregated almost 3000 spectra including Infrared and Raman data and continues to expand as the community deposits additional data. The majority of spectral data is licensed as Open Data, allowing it to be downloaded and reused in presentations, lesson plans and for teaching purposes. This presentation will provide an overview of our efforts to build a structure-indexed online database of spectral data, initiate a call to action to the community to participate in improving this resource for the community at large and discuss how such a resource could be used as the basis of a spectral game to teach students spectral interpretation.



Chemicalize From ChemAxon

If you are a chemist and looking for some useful internet tools to assist you in your work I recommend Chemicalize from ChemAxon. It’s a great addition to the suite of tools that chemists can bring to bear on their problems. The fastest way to learn about Chemicalize is to watch the YouTube video here, and embedded below.

This is a great way to mark up compounds in web pages and then move over to the data pages for predicted properties. The predicted property capabilities are a great offering to the community. The predictions are licensed under Creative Commons “Attribution-NonCommercial-ShareAlike 3.0 Unported”. The site has a few teething troubles, especially in terms of layout on IE8, but this should not detract from the value of the predictions. I am not aware of any other site that will provide free access to pKa predictions as shown below. This will really commoditize the market at this point and shake up the other vendors in this domain. ChemSpider has recently integrated Chemicalize as discussed on the ChemSpider blog.


Posted on August 20, 2010 in Software



My presentation on Mobile Chemistry and does Slideshare work for marketing?

I have been using Slideshare to post my presentations for a couple of years now. It’s easy to use, has high traffic, has great utilities like embedding (used below) and is a “safe place” to store my presentations (assuming it stays alive!). The presentation I gave at the Special Libraries Association in New Orleans this week received over 300 views in 24 hours. It’s at 735 views as I write this, and that’s in 3 days. I have received emails, it’s been embedded in other websites and I’ve received some very positive feedback. Now I need to find time to do a voiceover version and put it on YouTube!

Slideshare is a super way to share my presentations, store my documents/papers for public exposure and get the message out. I recommend it to everyone!


Posted on June 18, 2010 in Computing



A Family Emergency, The UK National Health Service and a Call to President Obama

Those who frequent the ChemConnector or ChemSpider blogs, or have them plugged into their readers, will have noticed a sudden silence from me in the New Year. It was one of those “phone calls you never want” calls. My mother was rushed into hospital with a subarachnoid hemorrhage. That is NOT a bleeding spider deep in her waters (get it? A sub arachnid hemorrhage…) but a bleed in the brain. It is a form of stroke and as Wikipedia so nicely states (scary) “Up to half of all cases of SAH are fatal and 10–15% die before reaching a hospital”. My mother made it to hospital thanks to the valiant efforts of my sister, who had experience of exactly this medical emergency since her friend had SAH just over a year ago. By the time I got home three days later my mother was safely at the Walton Center, supposedly one of the best neurology hospitals in the UK. When I walked onto the ward for the first time after a red eye flight from the US and no sleep for about 30 hours, my mother was awake, but tired. She was bruised from all the lines running into her and had a drain running from her head to a bag draining fluid from the brain to prevent hydrocephalus. They must have been collecting over a half litre of fluid every day or so. Whenever I was there the bag was full of bloody fluid and seemed to get drained regularly. It’s very concerning and emotional for any child to see their parent in such a state….

In the first couple of days she was talkative but sleepy. With SAH it’s the few days following the event that are particularly telling and dangerous. No different in this case. All hell broke loose one day as she headed down from the normal ward to the High Dependency Unit (one nurse per two patients). We received a call and when we arrived she was unconscious and non-responsive. Within a few days she had declined and had moved to the critical care ward (1 nurse, 1 patient it seemed) as a result of the drain from her head blocking and a build up of fluid, heart arrhythmia, low blood pressure and an infection. They made a 6 inch cut across the scalp, drilled a hole into the skull and ran a fresh drain into a ventricle of the brain. The next three days were emotionally and physically draining (3 hours a day of driving and not knowing whether she would be able to talk that day or not, or even know who we were). By the time she got back to High Dependency (who would have thought that would seem like a happy day…but after critical care it is!) she was on seven drugs, had mainlines running into her femoral artery and later the carotid artery. She was bruised and bandaged, cabled, wired and clearly in distress. At one point her eyes communicated “Enough…I can’t do this anymore” and it was one of the hardest moments of my life…but a singularly defining moment in the nature of my relationship with my sister and my mother…and how closely connected we are.

During that period the doctors performed endovascular surgery to insert a coil as described in detail here. My mother now has platinum in her brain and without it would likely not survive. The stress on her system would not have been conducive to her surviving a more invasive surgery. When I left the UK, after almost 3 weeks and multiple changes to my flights (and lots of charges from United Airlines!), my mother was off all drugs, sitting up, and had just drunk her first glass of water in 7 days (she was on a nose feed for food for a long time and was receiving intravenous fluids the entire time). I’ve been home almost a week and she is now eating soups, drinking hot drinks, can get out of bed and is learning to walk again…after three weeks in bed there is a lot of muscle atrophy.

And so to the National Health Service of the UK. I have heard MANY nightmare stories and experienced some myself when I lived in the UK. However, I’ve lived in Canada and now live in the US. I have nightmare stories and experiences in both countries. Those stories are for another time… What I can say is that the treatment my mother received was outstanding. Her nurses and doctors were phenomenal. They were not only skilled at their jobs but sympathetic to us at a human level, listened to us when we were concerned and educated us when we asked. The coiling procedure is not available in every hospital and is state of the art surgery. Bottom line is my mother nearly died the moment the hemorrhage happened (50% of people do!) and, in my opinion, she went to the edge and back a number of times in 3 weeks. The medical staff clearly saved her life and I and my family are indebted to them for the treatment and the experience. One concern we didn’t have to deal with is “cost”. Even for the most mundane procedures in the US there is a cost concern. Having visited friends and family members in hospital I am conscious of the “how much per pill” mentality that persists here. Based on what I saw happen to my mother, and the 4 weeks of hospital stay to come and months of rehabilitation to follow, my mother’s treatment and recovery in this country would cost well over a hundred thousand dollars…probably more (maybe someone can give me an informed guess?). In the UK the National Health Service assumes those costs. There is no bill to come that we need to worry about. The focus can be on the patient, their rehabilitation and care. In this country I have sensed, and discussed with some close friends, the mentality of “what is a life worth?”. What child wants to be put in that situation?!

And so my plea to President Obama. “Please stay on task with your intentions to provide affordable health care for all families. Rich or poor none of us want to be faced with the challenging questions associated with the mentality of “What is a life worth” that will prevail unless health care costs are brought under control in this country. We have research investments in this country which have delivered incredible technologies to preserve life as we are threatened. We have drugs to support and enhance life when burdened by sickness and slowed by age. Yet, for many, basic healthcare remains out of reach. It is past the time for change. The majority of the populace, whether they voted for you or not, will lend their support to you to make the necessary changes. The world is watching and you can lead the change in healthcare. You have my support.”

My best friend is right in the middle of the challenges of “commercialized health care” in the United States. Jeff is a wonderful man and one of my life mentors. He is at once incredibly intelligent, thoughtful, caring, challenging and motivating. He is presently struggling with a health issue of his own and is about to enter into the challenges of dealing with the costs of excellent care, some of the (in)adequacies of the system, and going under the knife for a very scary yet incredible surgery. He has the blog American Citizen and is about to start posting videos about the challenges he is going through. Knowing Jeff they will be witty, amusing and straight to the point. Check out his blog and watch out for the movies.


Posted on January 27, 2009 in Uncategorized, Wikipedia Chemistry



An Invitation to Bio-IT in Boston in 2009

I’m on the agenda to speak at the Bio-IT meeting in Boston next year (April 27-29 2009) to present on “Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry.” In case you are assembling your calendar for next year here’s the announcement…

“Join the life sciences community in Boston, MA next April 27-29, 2009 for the 7th Annual Bio-IT World Conference & Expo. Since its debut in 2002, Bio-IT has established itself as a premier event showcasing the myriad applications of IT and informatics to biomedical research and the drug discovery enterprise. The 2009 program will feature best practice case studies and joint partner presentations relevant to the technologies, research, and regulatory issues of life science, pharmaceutical, clinical, health, and IT professionals.”


Posted on December 18, 2008 in General Communications



Three Presentations to give at ACS Spring, Salt Lake City

I’ll be giving three papers at the ACS meeting in Salt Lake City in Spring of next year. It seems way in the distance but as usual that time will come way too quickly. I’ve accepted invitations to write four papers before the end of January so it will be the usual crunch. See you in Salt Lake City!?

PAPER ID: 1212659
PAPER TITLE: “Going a mile InChI by InChI: Enabling online chemistry at ChemSpider”

DIVISION: Division of Chemical Information
SESSION: The Adoption and Use of the IUPAC InChI/InChIKey
SESSION START TIME: Sunday, March 22, 2009, 9:00 AM

DAY & TIME OF PRESENTATION: Sunday, March 22, 2009 from 9:35 AM to 10:05 AM

PAPER ID: 1238487
PAPER TITLE: “Text mining for chemistry and building a public platform for document markup”

DIVISION: Division of Chemical Information
SESSION: General Papers
SESSION START TIME: Wednesday, March 25, 2009, 2:00 PM

DAY & TIME OF PRESENTATION: Wednesday, March 25, 2009 from 2:05 PM to 2:30 PM

PAPER ID: 1243060
PAPER TITLE: “Cleaning up chemistry for the pharma industry: Delivering a flexible platform for interrogating the FDA DailyMed website”

DIVISION: Division of Chemical Information
SESSION: General Papers
SESSION START TIME: Wednesday, March 25, 2009, 2:00 PM

DAY & TIME OF PRESENTATION: Wednesday, March 25, 2009 from 3:55 PM to 4:20 PM


Posted on December 16, 2008 in General Communications

