Hamburger PDFs and Making Them Structure Searchable

There have been numerous conversations about “Hamburger PDFs” over the months and the most recent exchange is that between Chris Rusbridge and Peter Murray-Rust. Another conversation that I have seen go on has been about making Word documents structure searchable (I cannot track down the appropriate blog postings at present).

This is just an FYI for the community really, since there is a general assumption that Word documents and PDFs cannot be made structure-searchable. The truth is that both can be made structure-searchable. How? Well, you need to write the correct information into the file to enable it, but it’s possible. There are a number of solutions out there allowing structure-based searching of Word document files. I believe the first one was originally from Oxford Molecular before being acquired by Accelrys. I think there are now multiple, including, I believe, Cambridgesoft, ACD/Labs and probably others.

The only PDF structure searching capability I am aware of is that created by ACD/Labs a few years ago. Their website states “Our Search for Structure system allows you to seek out chemical structures in various file formats throughout your computer’s file systems. These formats include: SK2, MOL, SDF, SKC, CHM, CDX, RXN, and PDF (Adobe Acrobat); DOC (Microsoft Word), XLS (Microsoft Excel), and PPT (Microsoft PowerPoint), and ACD/Labs databases: CUD, HUD, CFD, NDB, ND5, and INT.”

For PDF it was required that structure files were “tagged” appropriately when written to PDF by an embedded PDF generation capability. Since the PDF format can be extended, ACD/Labs did so. If we wanted to make the majority of PDF files structure searchable then it seems as if the appropriate thing to do would be to extend the general PDF format for Life Sciences, talk to Adobe about including the capabilities in their tools and get the publishers to support it. OK, there are details…but why isn’t anyone talking about extending PDF to support structures in this way? It was already proven years ago.

Next thing will be that structures will be getting embedded into Word documents and made searchable as if it is something novel. It’s been done many times already. The ACD/Labs website states “Microsoft Word documents with structures created in ChemDraw or MDL ISIS can also be retrieved. Not only can you perform exact structure searches, but you can also search by substructure. Added options allow you to preview search results, open search result documents in ChemSketch as well as in other applications, and store search results for later access.” There are other products doing this too.

Strangely people don’t seem to know about these capabilities. They will…as we move forward to index the web for structures we hope to build the capabilities to search structures inside Word documents directly.

Spaces, Dashes and Issues with Nomenclature Conversion

I’ve been involved with Nomenclature in one way or another for well over a decade. While I’m an NMR spectroscopist by training (as evidenced by the >100 publications in this area), during my decade-long tenure at ACD/Labs I learned a lot about PhysChem parameters and their prediction, systematic nomenclature, structure drawing and databasing, chemometrics, LC-MS data analysis and so on. As the product manager for many of these products I was dropped in the deep end. Nomenclature was something I really enjoyed. While I am not a nomenclature specialist at the “generate a perfect systematic name for Taxol” level, I have a decade of experience working with nomenclature software for both the generation of names from structures and the generation of structures from names. Having worked with 100s of customers and their needs I’ve dealt with a lot of beliefs around nomenclature and perceptions of how to use the tools.

Having just spent the week at Bio-IT and having been engaged with a number of conversations about Name to Structure conversion, it became clear that one of the prevailing beliefs for users of name to structure conversion packages is that spaces in systematic names can be disregarded. It appears that members of the text-mining for chemistry community are using one or more of the commercial name to structure software programs to convert chemical names to structures and, prior to feeding the algorithms, they are removing all white spaces from the names. They are also doing the same, in some cases, with dashes. How well is that going to work? Is it safe to remove spaces from chemical names and assume this has no effect? Is consideration being given more to the accuracy of the text-mining than to the nature of systematic nomenclature?

Let’s look at some examples of the result of removing spaces from chemical names. Consider the different results just from moving a space.

The impact of spaces on naming

Single structure to separate components based on a space.

Another example of multiple to single component structure.

Another example of space-collapsing structure searching

Clearly there is an impact of removing spaces from systematic names. The same is true of random removal and insertion of dashes. The generation of systematic names by chemists is far from ideal as discussed by Gernot Eller here. The mishandling of correct names when reverting back to structures is one more problem layer. There are many of us using text mining and name to structure conversion to link between documents and structures. It is far from a minor undertaking.
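To make the space problem concrete in text form, here is a minimal sketch in Python. The names are my own illustrative picks (they are not the structures from the figures above, and locants are omitted for brevity), and the helper names `strip_spaces` and `find_collisions` are hypothetical. It shows how stripping whitespace before name-to-structure conversion makes chemically different names indistinguishable: phenyl acetate is an ester of acetic acid, phenylacetate is the anion of phenylacetic acid, and moving the space in ethyl phenylacetate gives a different ester.

```python
from collections import defaultdict


def strip_spaces(name: str) -> str:
    """Remove all whitespace from a chemical name (the risky pre-processing step)."""
    return "".join(name.split())


def find_collisions(names):
    """Group names that become identical once whitespace is removed.

    Returns a dict mapping the collapsed string to the sorted list of
    original names, keeping only groups with more than one distinct name.
    """
    groups = defaultdict(set)
    for name in names:
        groups[strip_spaces(name).lower()].add(name)
    return {
        collapsed: sorted(variants)
        for collapsed, variants in groups.items()
        if len(variants) > 1
    }


# Illustrative names only (an assumption, not data from the post).
EXAMPLE_NAMES = [
    "ethyl phenylacetate",   # ethyl ester of phenylacetic acid
    "ethylphenyl acetate",   # acetate ester of an ethylphenol
    "phenyl acetate",        # phenyl ester of acetic acid
    "phenylacetate",         # anion of phenylacetic acid
    "benzene",               # no spaces, so unaffected
]

collisions = find_collisions(EXAMPLE_NAMES)
for collapsed, variants in collisions.items():
    print(f"{collapsed!r} <- {variants}")
```

Any name that appears in such a collision group cannot be recovered once the spaces are gone, so a text-mining pipeline would have to keep the original string alongside whatever normalized form it feeds to the converter.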

XOBNI - Rev Up Microsoft Outlook to EXCELLENT EFFECT

I watch the Microsoft Chemical Team Blog and was very interested in this post about Xobni. With a little work I managed to get the beta sent to me overnight and have been playing with it. I probably haven’t found all of the benefits yet but ohmigod! I love this app. I must have saved myself at least an hour today just in searching through Outlook and searching for old information.

Xobni is showing me how much email I am getting, from whom, their ranking, easy access to attachments, an email’s “network”…it’s very social networking in that way. I’m not going to belabor its value. All I can say is if you are an Outlook user this is a must-have for you. I am allowed to invite 6 people to receive a beta test of Xobni so let me know if you want one.

FPGAs, GPUs and now the Cell Processor - A Call for Comments

I have received a couple of emails off blog about my post yesterday about the Cell processor and its application to scientific computing.

The basic premise is one of scepticism. The hot area of interest up to a couple of years ago was Field-Programmable Gate Arrays (FPGAs). Nowadays a lot of discussions focus on the advantages of GPUs. However, the majority of chemists have not even heard of these processors and they remain of interest to programmers, hardware hobbyists and experts. For the chemists we spoke to at Bio-IT the terms FPGAs and GPUs went over their heads. Not true for the IT people. When we mentioned the Cell processor it went over the heads of MOST people. So, the Cell is pretty much an unknown entity to most.

People have been programming onto FPGAs for years but none have gone mainstream in the scientific computing world that I am aware of. A number of researchers are now working with GPUs but have any gone mainstream and will they? The Cell might just be different.

So, a question for the readership. What are your thoughts, comments, opinions on FPGAs vs GPUs vs the Cell processor. Where does each have strengths over the others? What do people think about the future of GPUs in terms of scientific computing? What are your thoughts about the Cell processor?

A Green Solution for Virtual Screening Using the IBM Cell Broadband Processor

I spent a few days this week in Boston at the Bio-IT conference. I was there for two reasons - to support one of the companies I have been consulting with of late and to present on ChemSpider.

The ChemSpider presentation seemed to be well received and I’m grateful for the opportunity to expose the ongoing work we are doing on ChemSpider.

I have spent the past few months supporting the efforts of SimBioSys to bring a potentially revolutionary platform to market. The intention was to deliver the platform at Bio-IT in Boston and, under the leadership of Zsolt Zsoldos, the Chief Technology Officer for SimBioSys, the deadline was met. Zsolt’s already blogged about the WOW FACTOR at Bio-IT. Check it out…it was a true phenomenon.

I’ve blogged previously about the possibility of deriving processing power from a gaming system (1,2). By the time that eHITS Lightning was unveiled at Bio-IT the Cell Processor had managed to deliver up to 120X performance improvements for certain examples. A White Paper about the technology has been online for a while now. The technology delivered was enough to garner a position as a finalist for the Best in Show award. It was enough to have some of the other domain players question how much work it was to port eHITS to the Cell. It was essentially a full rewrite and over two man-years of effort to deal with coding for the special nature of the processor.

Zsolt’s presentation to the Best in Show judges is online here. You MUST look at slides 8 and 9 to really “get it”. The cost savings associated with the electricity, cooling, space and, in theory, network administration, are enormous. A 100 CPU Intel cluster could be replaced with three PlayStations for the same eHITS throughput. It’s definitely a “Green Solution”.

With so much attention being given to coding on Graphics Processing Units it’s quite surprising that no one is yet talking about the advantages of the Cell processor. Well, maybe they will now!

The Curation of Almost 5000 Structures on Wikipedia

I recently commented on the statement made by Eric Shively of CAS about the CAS Validation Project going on at Wikipedia. The basic premise of the work is the need to validate CAS numbers to ensure that the CAS numbers listed in a chemical box are associated with the appropriate structure shown in the chemical box. So, if the structure has stereochemistry, make sure that the CAS number is for the form of the structure with stereochemistry. If the CAS Number is for a neutral compound then the structure displayed should not be the salt. And so on, and so on. There are many sources of CAS Numbers online. In fact there are many places to search for them to confirm. Type in “CAS Number search” online and you’ll find a lot of hits, though admittedly not all of them related to Chemical Abstracts Service.
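As an aside, the cheapest check that can be run before any of the structure-level validation described above is the CAS number’s own check digit. The sketch below is only that: it confirms a number is well formed, not that it belongs to the structure shown in the chemical box, and it is not part of the CAS Validation Project as far as I know. The function name `cas_check_digit_ok` is hypothetical. The rule itself is the standard one: the last digit equals the sum of the other digits, each multiplied by its position counted from the right, modulo 10.

```python
import re

# Two to seven digits, a hyphen, two digits, a hyphen, one check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_check_digit_ok(cas_number: str) -> bool:
    """Return True if the string is a well-formed CAS number with a valid check digit.

    This is a format check only; it says nothing about whether the number
    is assigned to the structure it is displayed next to.
    """
    match = CAS_PATTERN.match(cas_number.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost body digit.
    weighted_sum = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return weighted_sum % 10 == check_digit


for candidate in ["7732-18-5", "58-08-2", "7732-18-4"]:
    print(candidate, cas_check_digit_ok(candidate))
```

For water, 7732-18-5, the weighted sum is 8×1 + 1×2 + 2×3 + 3×4 + 7×5 + 7×6 = 105, and 105 mod 10 = 5, which matches the final digit.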

Some examples of “online CAS number searches” are excellent. In the order that I see them in my search:

The NIST Webbook - much loved by many scientists and very useful.

ChemIndustry - An excellent resource for chemists and gaining a good following in the market I believe

ChemFinder - Cambridgesoft’s online search system

A Buyers Guide - A German Chemical Buyer’s Guide.

Pennsylvania Department of Environmental Protection

California Department of Pesticide Regulation

And on and on. There are likely legal reasons for a number of these databases to have CAS Numbers. As I continued to peruse the list I was more than impressed by the number of databases serving up CAS numbers online, and, I believe, a number of them containing over 10,000 numbers which, as I have commented before, is rather a magic number. Should Wikipedia be concerned about the 10,000 CAS number issue with some of the other issues being discussed now?

PMR recently commented on my blogpost here. He said “PMR: Wikipedia has between 1000 and 2000 chemical substances (ca 0.01% of the total number of substances in CAS).”

The number of chemical substances in Wikipedia is actually MUCH higher than that…I know since I’ve been looking at them in detail, as described here. To clarify, I am building an SDF file from the chemicals on Wikipedia so that it can be deposited on ChemSpider and linked back to Wikipedia. This was done earlier by linking up chemical names but it was far from perfect so we are doing it in this more “curated” manner. The outcome from the work, and thanks to multiple other sets of eyes from WP:CHEM, will be a curated SDF file. I will return the SDF file with the following fields generated: SMILES string, Systematic Name, InChIString, InChIKey. These can then be used to homogenize the available fields in the Chemical Boxes etc.
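For readers who have not looked inside an SDF file, here is a rough sketch of what the generated fields could look like for a single record. In an SD file each record is a molfile block followed by named data items and a `$$$$` terminator; the helper below (`sd_data_items`, a hypothetical name) writes only the data-item part. The tag names are my assumption of how the four fields might be labelled, not the actual file, and water is used purely as a small example.

```python
def sd_data_items(fields):
    """Format named data items as they appear after the molfile block in an SD record.

    Each item is a header line of the form '>  <TAG>', the value on the
    next line, then a blank line. The record is closed with '$$$$'.
    """
    lines = []
    for tag, value in fields.items():
        lines.append(f">  <{tag}>")
        lines.append(str(value))
        lines.append("")  # blank line terminates the data item
    lines.append("$$$$")  # end-of-record marker
    return "\n".join(lines) + "\n"


# Hypothetical tag names; values are for water.
example = sd_data_items({
    "SMILES": "O",
    "Systematic_Name": "oxidane",
    "InChIString": "InChI=1S/H2O/h1H2",
    "InChIKey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N",
})
print(example)
```

With the fields stored this way, each Wikipedia chemical box can be compared against its record tag by tag.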

In doing the work (I have already worked through the whole alphabet) I have over 4900 compounds already curated at a first level. I have disregarded the majority of inorganics and organometallics for this pass. Manually curating ca. 5000 organics is ENOUGH of a challenge. I estimate the number of chemical compounds to be about 6500-7000, and it’s growing. So, it’s about a factor of 3-4 times bigger than the upper end of PMR’s estimate. The vast majority do have CAS numbers. While it hasn’t hit 10,000 yet…it’s coming.

An Excellent Review of Protein Docking

As a result of work we are doing over on ChemSpider regarding LASSO I have become increasingly interested in the world of protein docking. A great review article was just released. I highly recommend it if this is an area of interest for you.

Protein-ligand Docking: A Review of Recent Advances and Future Perspectives

Current Pharmaceutical Analysis, Volume 4, Number 1, February 2008, ISSN: 1573-4129

Montserrat Vaqué, Anna Ardévol, Cinta Bladé, M. Josepa Salvadó, Mayte Blay, Juan Fernández-Larrea, Lluís Arola and Gerard Pujadas

Understanding the interactions between proteins and ligands is crucial for the pharmaceutical and functional food industries. The experimental structures of these protein/ligand complexes are usually obtained, under highly expert control, by time-consuming techniques such as X-ray crystallography or NMR. These techniques are therefore not suitable for routinely screening the possible interaction between one receptor and thousands of ligands. To overcome this limitation, computational algorithms (i.e. docking algorithms) have been developed that use the individual structures of the receptor and ligand to predict the structure of their complex. The present review, then, summarizes: (a) the fundamentals of the algorithms of the most commonly used docking programmes (with particular emphasis on their strengths and limitations); (b) how the results from different docking algorithms compare (i.e. which software gives the best predictions); and (c) the future perspectives and challenges for docking techniques.

The Full NMR Assignment of Hexacyclinol using CASE now published

The hexacyclinol debacle has been highlighted on a number of blogs and has caused a furor within the organic chemistry community. I have discussed this previously on the ChemSpider blog. Well, I am happy to say that the article describing our work is finally available as an ASAP article online. This was certainly an interesting piece of work to be involved with; it was a detective story from start to finish and brought together a very skilled team to work on this issue. In my opinion this study truly shows the capabilities of Computer-Assisted Structure Elucidation.

Taking a Break From Wikipedia Curation

I blogged previously about curating Wikipedia chemistry pages…specifically a focus on chemical structures and the quality of systematic names, trade names, structure images and outlinks to other sites. This project has moved along quite well…a LOT of eyeballing into the early hours. I am taking a break to catch up with some other work for the next couple of weeks (at least). As it is I have made my first pass to the letter P (having already done X, Y and Z). There are 1100 links left for me to review - links to pages that I need to click on, open up, see if it is a structure page and then curate and validate.

I think what’s been done to date has been of value to the WP:CHEM team and to the overall quality of what’s on there. I had questioned in my own head how important and valuable the effort was. Thanks to Walkerma, who pointed out this facility today, it is clear that the chemistry pages are getting a lot of visits…over 100 per day in many cases. A report on my progress is posted online here.

It’s been a work of passion to this point. Now, the reality is it is just work. I am tired of looking at Wikipedia pages (no insult to WP but I have tired eyes). It will get finished, and I hope by the end of the month…I won’t be rushing it since that would impact the quality, but I will be glad when it’s over :-)

My friend “An American Citizen”

My friend “Halbstein” has started his own blog - American Citizen. Recently he and I sat for lunch and talked about the politics of health care in the United States and we reviewed a very interesting article together. He has commented about this on his blog and I recommend people interested in the costs of health care in the USA to browse it. Very revealing …go to his blog for info.

In response to his post I waxed lyrical about the movie Field of Dreams, Burt Lancaster and my doctor when I was growing up. Halbstein took it one step further…and does it in a way that might stimulate you all to remember what medicine used to be like. While technology and medicine have advanced, I have to ask whether patient care and doctor-patient relationships have balanced that by going the other way. Read about Dr Lipmann.