Following on from the theme of the past couple of evenings of blogging about data quality, here is a small example of the issues that come with text-mining. Algorithms for finding chemical names in text are already well established and their performance is well proven. The algorithms for converting chemical names to structures are also well proven. The results can be filtered and errors minimized to a certain level. However…some errors always slip through.
As a result of loading PubChem content into ChemSpider in its early days, we ended up with the following compound… Mercury Argon.
Notice that it links out to patents, and if you follow the link through, via an InChI lookup on SureChem, you find out why. Yes…mercury and argon appearing adjacent in the text does give Hg Ar as a chemical, but it's purely coincidental. This is a simple example of the one-compound-at-a-time, iterative nature of understanding how large databases are assembled.
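For illustration only, here is a minimal sketch of the kind of filter that could flag a record like Hg Ar for a manual look. It is not how ChemSpider, PubChem or SureChem actually curate their data; the function names and the noble-gas rule are assumptions made for this example, and the rule only flags records for review because a few genuine noble gas compounds (xenon difluoride, for instance) do exist.

```python
import re

# Noble gases rarely form stable compounds, so a multi-element "compound"
# containing one is worth a manual check (not an automatic rejection).
NOBLE_GASES = {"He", "Ne", "Ar", "Kr", "Xe", "Rn"}


def element_symbols(formula):
    """Pull out element-like tokens: a capital letter plus an optional lowercase letter.

    Hypothetical helper: it does not check that each token is a real element.
    """
    return re.findall(r"[A-Z][a-z]?", formula)


def needs_review(formula):
    """Flag formulas that combine a noble gas with at least one other element."""
    symbols = set(element_symbols(formula))
    return len(symbols) > 1 and bool(symbols & NOBLE_GASES)


candidates = ["Hg Ar", "HgCl2", "Ar", "XeF2"]
flagged = [f for f in candidates if needs_review(f)]
print(flagged)
```

Note that XeF2 gets flagged alongside Hg Ar, which is exactly why a rule like this can only queue records for a person to check, one compound at a time.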