Petaflops and Cell Processors

An interesting article about the world’s fastest computer appeared in the copy of C&E News that hit my desktop today. At a time when there is so much focus on High Performance Computing it was interesting to read about the Roadrunner system delivered by IBM to the Los Alamos National Laboratory.

Quoting Wikipedia “Roadrunner differs from many contemporary supercomputers in that it is a hybrid system, using two different processor architectures for the heavy lifting. Usually supercomputers only use one, since it would be easier to design and program for. To tap the full potential of Roadrunner, all software will have to be written specially for this hybrid architecture which is uniquely complex. The hybrid design consists of dual-core Opteron server processors manufactured by AMD utilizing the standard x86 architecture. Attached to each Opteron core is a Cell processor manufactured by IBM using Power Architecture technology. ”

I have blogged previously about Cell Processors being applied for Virtual Screening/Docking (1,2) and have been helping SimBioSys in their marketing and business development of their eHITS Lightning software as discussed here.

With the fastest computer in the world using the Cell processor as part of its architecture, and with the processor now proving itself for docking, the question is whether we will see this processor become even more mainstream in the foreseeable future. It’s NOT easy to port…but it can be done.


New Shower Curtains and Our Health

Following on from my recent posting about the Autoimmune Epidemic comes a report that new shower curtains can be bad for our health.

Now, while it’s true that chemistry causes emotional responses in the public when such reports are released, what is interesting to read is that the work was done by the Virginia-based Center for Health, Environment & Justice, and that “The Center for Health, Environment & Justice sent a letter to 19 major retailers Thursday informing them of the new report and encouraging them to stop selling PVC products.”

Invited Symposium Speaker at a Fortune 500 Company

I’m excited to speak next week at a “by invitation only” symposium at one of the top Fortune 500 Companies. The focus of the gathering for the 350 attendees will be “Networks” and I will be speaking about “Crowd-sourcing to Build A Structure-centric Community for Chemists”. I will of course talk about ChemSpider but also about my experiences with Wikipedia Chemistry and other general and scientific networks I have become involved with over the years. I will be speaking alongside invited speakers from organizations such as Yahoo, MIT, General Electric, Brookhaven, Harvard University etc., so I am quite humbled not only by the invitation but also by the chance to network (appropriate for a gathering about “networks”) with such a diverse group of people. I’m not sure what the situation is regarding releasing the presentation publicly after the gathering but will do so following discussions with the organizers. I’m sure it will be acceptable.

Books I am reading - The Autoimmune Epidemic

I seem to be surrounded by people who have developed “autoimmune diseases” (ID) over the past few years. These are commonly people around the age of 40 and are therefore my peer group. It is hard to watch my friends, and over the past few years members of my immediate family, be severely debilitated by some form of ID, whether it’s gastrointestinal in nature, related to thyroid function or some form of multiple chemical sensitivity.

A close personal friend of mine recently gifted me with a copy of a book called “The Autoimmune Epidemic: Bodies Gone Haywire in a World Out of Balance–and the Cutting-Edge Science that Promises Hope” and I am close to finishing it. I think the title speaks for itself. With an increasing number of “westerners” being diagnosed with autoimmune diseases, and numbers far exceeding those with cancer, the book makes for interesting, and I would say for me personally, quite shocking reading. As a father of young children I am now concerned about the challenges their bodies will encounter moving forward. A recommended read for everyone…not just scientists.

Hamburger PDFs and Making Them Structure Searchable

There have been numerous conversations about “Hamburger PDFs” over the months and the most recent exchange is that between Chris Rusbridge and Peter Murray-Rust. Another conversation that I have seen go on has been about making Word documents structure searchable (I cannot track down the appropriate blog postings at present).

This is really just an FYI comment for the community, since there is a general assumption that Word documents and PDFs cannot be made structure searchable. The truth is that both can be. How? Well, you need to write the correct information into the file to enable it, but it’s possible. There are a number of solutions out there allowing structure-based searching of Word document files. I believe the first one was originally from Oxford Molecular before being acquired by Accelrys. I think there are now multiple solutions including, I believe, those from Cambridgesoft, ACD/Labs and probably others.

The only PDF structure searching capability I am aware of is that created by ACD/Labs a few years ago. Their website states “Our Search for Structure system allows you to seek out chemical structures in various file formats throughout your computer’s file systems. These formats include: SK2, MOL, SDF, SKC, CHM, CDX, RXN, and PDF (Adobe Acrobat); DOC (Microsoft Word), XLS (Microsoft Excel), and PPT (Microsoft PowerPoint), and ACD/Labs databases: CUD, HUD, CFD, NDB, ND5, and INT.”

For PDF it was required that structure files were “tagged” appropriately when written to PDF by an embedded PDF generation capability. Since the PDF format can be extended, ACD/Labs did so. If we wanted to make the majority of PDF files structure searchable then it seems as if the appropriate thing to do would be to extend the general PDF format for Life Sciences, talk to Adobe about including the capabilities in their tools and get the publishers to support it. Ok, there are details…but why isn’t anyone talking about extending PDF to support structures in this way? It was already proven, years ago.

Next thing will be that structures will be getting embedded into Word documents and made searchable as if it is something novel. It’s been done many times already. The ACD/Labs website states “Microsoft Word documents with structures created in ChemDraw or MDL ISIS can also be retrieved. Not only can you perform exact structure searches, but you can also search by substructure. Added options allow you to preview search results, open search result documents in ChemSketch as well as in other applications, and store search results for later access.” There are other products doing this too.

Strangely people don’t seem to know about these capabilities. They will…as we move forward to index the web for structures we hope to build the capabilities to search structures inside Word documents directly.

Spaces, Dashes and Issues with Nomenclature Conversion

I’ve been involved with Nomenclature in one way or another for well over a decade. While I’m an NMR spectroscopist by training (as evidenced by the >100 publications in this area), during my decade-long tenure at ACD/Labs I learned a lot about: PhysChem parameters and their prediction, systematic nomenclature, structure drawing and databasing, chemometrics, LC-MS data analysis and so on. As the product manager for many of these products I was dropped in the deep end. Nomenclature was something I really enjoyed. While I am not a nomenclature specialist at the “generate a perfect systematic name for Taxol” level, I have a decade of experience working with nomenclature software for both generation of names from structures and the generation of structures from names. Having worked with 100s of customers and their needs I’ve dealt with a lot of beliefs around nomenclature and perceptions of how to use the tools.

Having just spent the week at Bio-IT and having been engaged with a number of conversations about Name to Structure conversion, it became clear that one of the prevailing beliefs for users of name to structure conversion packages is that spaces in systematic names can be disregarded. It appears that members of the text-mining for chemistry community are using one or more of the commercial name to structure software programs to convert chemical names to structures and, prior to feeding the algorithms, they are removing all white spaces from the names. They are also doing the same, in some cases, with dashes. How well is that going to work? Is it safe to remove spaces from chemical names and assume this has no effect? Is consideration being given more to the accuracy of the text-mining than to the nature of systematic nomenclature?

Let’s look at some examples of the result of removing spaces from chemical names. Consider the different results just from moving a space.

The impact of spaces on naming

Single structure to separate components based on a space.

Another example of multiple to single component structure.

Another example of space-collapsing structure searching

Clearly there is an impact of removing spaces from systematic names. The same is true of random removal and insertion of dashes. The generation of systematic names by chemists is far from ideal as discussed by Gernot Eller here. The mishandling of correct names when reverting back to structures is one more problem layer. There are many of us using text mining and name to structure conversion to link between documents and structures. It is far from a minor undertaking.
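To make the point concrete without any name to structure software, here is a minimal Python sketch of the preprocessing step itself. The function names are hypothetical, and the example names are my own illustrations rather than the ones in the figures above: methyl phenylacetate (the methyl ester of phenylacetic acid) versus methylphenyl acetate (the acetate ester of a methylphenol), and phenyl acetate versus phenylacetate. It simply shows that distinct names become identical strings once the spaces are stripped.

```python
from collections import defaultdict


def strip_separators(name, drop_dashes=False):
    """Mimic the lossy preprocessing: delete whitespace (and optionally dashes)."""
    collapsed = "".join(name.split())
    if drop_dashes:
        collapsed = collapsed.replace("-", "")
    return collapsed


def find_collisions(names, drop_dashes=False):
    """Return groups of distinct names that become identical after stripping."""
    groups = defaultdict(set)
    for name in names:
        groups[strip_separators(name, drop_dashes)].add(name)
    return {key: sorted(members) for key, members in groups.items() if len(members) > 1}


# Illustrative names only (assumption: chosen for this sketch, not taken from the post).
names = [
    "methyl phenylacetate",  # methyl ester of phenylacetic acid
    "methylphenyl acetate",  # acetate ester of a methylphenol
    "phenyl acetate",        # ester of phenol and acetic acid
    "phenylacetate",         # anion (or ester stem) of phenylacetic acid
    "benzene",
]

collisions = find_collisions(names)
for collapsed, originals in collisions.items():
    print(collapsed, "<-", originals)
```

Once two names collide like this, whatever structure the converter returns for the collapsed string can be right for at most one of them.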

XOBNI - Rev Up Microsoft Outlook to EXCELLENT EFFECT

I watch the Microsoft Chemical Team Blog and was very interested in this post about Xobni. With a little work I managed to get the beta sent to me overnight and have been playing with it. I probably haven’t found all of the benefits yet but ohmigod! I love this app. I must have saved myself at least an hour today just in searching through Outlook and searching for old information.

Xobni is showing me how much email I am getting, from whom, their ranking, easy access to attachments, an email’s “network”…it’s very social networking in that way. I’m not going to belabor its value. All I can say is if you are an Outlook user this is a must-have for you. I am allowed to invite 6 people to receive a beta test of Xobni so let me know if you want one.

FPGAs, GPUs and now the Cell Processor - A Call for Comments

I have received a couple of emails off blog about my post yesterday about the Cell processor and its application to scientific computing.

The basic premise is one of scepticism. The hot area of interest up to a couple of years ago was Field-Programmable Gate Arrays. Nowadays a lot of discussions focus on the advantages of GPUs. However, the majority of chemists have not even heard of these processors and they remain of interest to programmers and hardware hobbyists and experts. For the chemists we spoke to at Bio-IT the terms FPGAs and GPUs went over their heads. Not true for the IT people. When we mentioned the Cell processor it went over the heads of MOST people. So, the Cell is pretty much an unknown entity to most.

People have been programming onto FPGAs for years but none have gone mainstream in the scientific computing world that I am aware of. A number of researchers are now working with GPUs but have any gone mainstream and will they? The Cell might just be different.

So, a question for the readership. What are your thoughts, comments and opinions on FPGAs vs GPUs vs the Cell processor? Where does each have strengths over the others? What do people think about the future of GPUs in terms of scientific computing? What are your thoughts about the Cell processor?

A Green Solution for Virtual Screening Using the IBM Cell Broadband Processor

I spent a few days this week in Boston at the Bio-IT conference. I was there for two reasons - to support one of the companies I have been consulting with of late and to present on ChemSpider.

The ChemSpider presentation seemed to be well received and I’m grateful for the opportunity to expose the ongoing work we are doing on ChemSpider.

I have spent the past few months supporting the efforts of SimBioSys to bring a potentially revolutionary platform to market. The intention was to deliver the platform at Bio-IT in Boston and, under the leadership of Zsolt Zsoldos, the Chief Technology Officer for SimBioSys, the deadline was met. Zsolt’s already blogged about the WOW FACTOR at Bio-IT. Check it out…it was a true phenomenon.

I’ve blogged previously about the possibility of deriving processing power from a gaming system (1,2). By the time that eHITS Lightning was unveiled at Bio-IT the Cell Processor had managed to deliver up to 120X performance improvements for certain examples. A White Paper about the technology has been online for a while now. The technology delivered was enough to garner a position as a finalist for the Best in Show award. It was enough to have some of the other domain players question how much work it was to port eHITS to the Cell. It was essentially a full rewrite and over two man years of effort to deal with coding for the special nature of the processor.

Zsolt’s presentation to the Best in Show judges is online here. You MUST look at slides 8 and 9 to really “get it”. The cost savings associated with the electricity, cooling, space and, in theory, network administration, are enormous. A 100 CPU Intel cluster could be replaced with three PlayStations for the same eHITS throughput. It’s definitely a “Green Solution”.

With so much attention being given to coding on Graphics Processing Units it’s quite surprising that no one is yet talking about the advantages of the Cell processor. Well, maybe they will now!

The Curation of Almost 5000 Structures on Wikipedia

I recently commented on the statement made by Eric Shively of CAS about the CAS Validation Project going on at Wikipedia. The basic premise of the work is the need to validate CAS numbers to ensure that the CAS numbers listed in a chemical box are associated with the appropriate structure shown in the chemical box. So, if the structure has stereochemistry, make sure that the CAS number is for the form of the structure with stereochemistry. If the CAS Number is for a neutral compound then the structure displayed should not be the salt. And so on, and so on. There are many sources of CAS Numbers online. In fact there are many places to search for them to confirm. Type in “CAS Number search” online and you’ll find a lot of hits, though admittedly not all of them related to Chemical Abstracts Service.
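As an aside, there is one sanity check that can be run on a CAS number before any structure comparison: the final digit of a CAS Registry Number is a check digit. The Python sketch below (the function name is hypothetical) verifies only that check digit. It catches typos and malformed numbers, but it says nothing about whether the number belongs to the structure shown in the chemical box, which is the harder problem the validation work addresses.

```python
import re

# CAS Registry Number layout: 2-7 digits, 2 digits, 1 check digit, hyphen separated.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_check_digit_ok(cas_number):
    """Return True if the string is well formed and its check digit is consistent."""
    match = CAS_PATTERN.match(cas_number.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    expected = int(match.group(3))
    # Weight the digits 1, 2, 3, ... from right to left, then take the sum modulo 10.
    total = sum(weight * int(digit) for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == expected


print(cas_check_digit_ok("7732-18-5"))  # water
print(cas_check_digit_ok("64-17-5"))    # ethanol
print(cas_check_digit_ok("7732-18-4"))  # water with a mistyped check digit
```

For water, 7732-18-5, the weighted sum is 8×1 + 1×2 + 2×3 + 3×4 + 7×5 + 7×6 = 105, and 105 mod 10 gives the check digit 5.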

Some examples of “online CAS number searches” are excellent. In the order that I see them in my search:

The NIST Webbook - much loved by many scientists and very useful.

ChemIndustry - An excellent resource for chemists and gaining a good following in the market I believe

ChemFinder - Cambridgesoft’s online search system

A Buyers Guide - A German Chemical Buyer’s Guide.

Pennsylvania Department of Environmental Protection

California Department of Pesticide Regulation

And on and on. There are likely legal reasons for a number of these databases to have CAS Numbers. As I continued to peruse the list I was more than impressed by the number of databases serving up CAS numbers online, and, I believe, a number of them containing over 10,000 numbers which, as I have commented before, is rather a magic number. Should Wikipedia be concerned about the 10,000 CAS number issue with some of the other issues being discussed now?

PMR recently commented on my blogpost here. He said “PMR: Wikipedia has between 1000 and 2000 chemical substances (ca 0.01% of the total number of substances in CAS).”

The number of chemical substances in Wikipedia is actually MUCH higher than that…I know since I’ve been looking at them in detail, as described here. To clarify, I am building an SDF file from the chemicals on Wikipedia so that it can be deposited on ChemSpider and linked back to Wikipedia. This was done earlier by linking up chemical names but it was far from perfect so we are doing it in this more “curated” manner. The outcome of the work, thanks to multiple other sets of eyes from WP:CHEM, will be a curated SDF file. I will return the SDF file with the following fields generated: SMILES string, Systematic Name, InChIString, InChIKey. These can then be used to homogenize the available fields in the Chemical Boxes etc.
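For readers who have not looked inside an SDF file, here is a minimal Python sketch of what one record with those four fields might look like. The field tags (SMILES, SystematicName, InChI, InChIKey), the helper name and the hand-written methane molblock are all assumptions made for illustration. They are not the actual tags or tooling used for the Wikipedia file.

```python
def sdf_record(molblock, fields):
    """Build one SDF record: molblock, then '> <TAG>' data items, then '$$$$'."""
    lines = [molblock.rstrip("\n")]
    for tag, value in fields.items():
        lines.append(f"> <{tag}>")
        lines.append(str(value))
        lines.append("")  # a blank line closes each data item
    lines.append("$$$$")  # record terminator
    return "\n".join(lines) + "\n"


# Hand-written V2000 molblock for methane (assumption: placeholder, single carbon atom).
METHANE_MOLBLOCK = "\n".join([
    "methane",
    "  hand-written placeholder",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
])

# Hypothetical field tags; the identifier values are those of methane.
record = sdf_record(METHANE_MOLBLOCK, {
    "SMILES": "C",
    "SystematicName": "methane",
    "InChI": "InChI=1S/CH4/h1H4",
    "InChIKey": "VNWKTOKETHGBQD-UHFFFAOYSA-N",
})
print(record)
```

Keeping the identifiers as named data items next to the connection table is what makes it straightforward to copy them back into the Chemical Boxes field by field.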

In doing the work (I have already worked through the whole alphabet) I have over 4900 compounds already curated at a first level. I have disregarded the majority of inorganics and organometallics for this pass. Manually curating ca. 5000 organics is ENOUGH of a challenge. I estimate the number of chemical compounds to be about 6500-7000, and it’s growing. So, it’s about a factor of 3-4 bigger than the upper end of PMR’s estimate. The vast majority do have CAS numbers. While it hasn’t hit 10,000 yet…it’s coming.