Hamburger PDFs and Making Them Structure Searchable


There have been numerous conversations about “Hamburger PDFs” over the past months, the most recent being the exchange between Chris Rusbridge and Peter Murray-Rust. Another conversation I have seen has been about making Word documents structure searchable (I cannot track down the appropriate blog postings at present).

This is just an FYI comment for the community, since there is a general assumption that Word documents and PDFs cannot be made structure searchable. The truth is that both can. How? Well, you need to write the correct information into the file to enable it, but it’s possible. There are a number of solutions out there allowing structure-based searching of Word document files. I believe the first one was originally from Oxford Molecular, before it was acquired by Accelrys. I think there are now several, including, I believe, CambridgeSoft, ACD/Labs and probably others.

The only PDF structure searching capability I am aware of is that created by ACD/Labs a few years ago. Their website states “Our Search for Structure system allows you to seek out chemical structures in various file formats throughout your computer’s file systems. These formats include: SK2, MOL, SDF, SKC, CHM, CDX, RXN, and PDF (Adobe Acrobat); DOC (Microsoft Word), XLS (Microsoft Excel), and PPT (Microsoft PowerPoint), and ACD/Labs databases: CUD, HUD, CFD, NDB, ND5, and INT.”

For PDF, the structures had to be “tagged” appropriately when written to PDF by an embedded PDF generation capability. Since the PDF format can be extended, ACD/Labs did so. If we wanted to make the majority of PDF files structure searchable, then it seems the appropriate thing to do would be to extend the general PDF format for the life sciences, talk to Adobe about including the capabilities in their tools and get the publishers to support it. OK, there are details… but why isn’t anyone talking about extending PDF to support structures in this way? It was already proven years ago.
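
To make the idea concrete, here is a minimal sketch of exact-structure searching over files that carry an embedded structure identifier. Everything in it is an assumption for illustration: the `<chem-structure inchi="..."/>` marker is a hypothetical tag, not what ACD/Labs or any PDF tool actually writes, and the files are read as raw bytes with the marker stored uncompressed, which real PDFs often would not do. It only shows why writing the right information into the file makes the search easy.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical marker format, invented for this sketch only.
TAG = re.compile(rb'<chem-structure inchi="([^"]+)"/>')


def embedded_structures(data: bytes) -> set[str]:
    """Return every InChI string found in the hypothetical markers."""
    return {match.decode("ascii") for match in TAG.findall(data)}


def search_exact(folder: Path, inchi: str) -> list[str]:
    """List the files in `folder` that embed exactly this InChI."""
    hits = []
    for path in sorted(folder.iterdir()):
        if path.is_file() and inchi in embedded_structures(path.read_bytes()):
            hits.append(path.name)
    return hits


if __name__ == "__main__":
    benzene = "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"
    ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        # Two "tagged" files and one "dumb" file with no structure marker.
        (folder / "a.pdf").write_bytes(
            b'text <chem-structure inchi="' + benzene.encode() + b'"/> more'
        )
        (folder / "b.pdf").write_bytes(
            b'<chem-structure inchi="' + ethanol.encode() + b'"/>'
        )
        (folder / "c.pdf").write_bytes(b"just an image, no annotation")
        print(search_exact(folder, benzene))
```

Note that this covers exact matching only. Substructure searching, which the commercial tools also offer, would need a real cheminformatics toolkit rather than string comparison.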

Next thing will be that structures will be getting embedded into Word documents and made searchable as if it is something novel. It’s been done many times already. The ACD/Labs website states “Microsoft Word documents with structures created in ChemDraw or MDL ISIS can also be retrieved. Not only can you perform exact structure searches, but you can also search by substructure. Added options allow you to preview search results, open search result documents in ChemSketch as well as in other applications, and store search results for later access.” There are other products doing this too.

Strangely, people don’t seem to know about these capabilities. They will… as we move forward to index the web for structures, we hope to build the capability to search structures inside Word documents directly.

  1. #1 by Chris Rusbridge on May 16, 2008 - 7:59 am

    I had a look at the PDF searching capability in ChemSketch (via their tutorial); sadly, it’s in the commercial add-on rather than the free part. Still, it is in some sense an existence proof!

    A number of folk have pointed me to tagged PDFs. In the wild of the journal literature they seem to be rather scarce at the moment; I plan to survey a little better than I have done so far. It also looks like the default tagset could do with extension. I’m not sure if tagging would make sense at the “chemical name within sentence” level, as might be done with a micro-formats approach. But I’m still chasing this tasty-looking rabbit!

  2. #2 by Tim Aitken on November 14, 2008 - 5:11 am

    Of course structure searching PDFs is a nice idea, and the ACD add-on does it, but the problem is that one can only search documents that carry the chemical annotation. There are a _lot_ of documents out there which are just ‘dumb’ images. The interesting approach for this is to use CLiDE (or a similar tool), which should, I believe, be able to convert them to live chemistry. I know of one UK consultancy company which did something of that sort: it extracted the structures and indexed them in a cartridge system for searching. I’d be interested to know your experiences with that tool.

  3. #3 by tony on December 11, 2008 - 9:04 am

    Tim – there are a LOT of documents out there with just dumb images. I have experience with both OSRA and CLiDE. See the ChemSpider Blog for info (http://www.chemspider.com/blog/?s=CLIDE&x=0&y=0). We use OSRA on ChemMantis at present; it is “ok” at best. CLiDE is better, but we have not integrated it into ChemMantis as it is commercial and OSRA is Open Source.
