I have blogged previously about the fact that we are willing to share the entire ChemSpider structure collection with anyone who wants it. Specifically, PubChem is willing to accept it, and YES, I have pointed out that they will be receiving back their own structure collection too, since we deposited it on ChemSpider.
What’s the value of redepositing the same structures to PubChem? There are actually many benefits. Structures on PubChem that link out to ChemSpider will now be connected to patents, to analytical data, to additional identifiers not available on PubChem, to curated identifiers (compare the list of names for methane on PubChem versus ChemSpider), to Supplementary Data, and to additional predicted properties. So, there is actually a LOT of value in having the links back to ChemSpider from PubChem.
Our best estimate is that there will be about 8 million new structures finding their way to PubChem from ChemSpider.
Now, I was close to thinking that we could declare that the ChemSpider structure collection was Open Data. I’ve posted about Open Data and its definitions and challenges previously (1,2). PMR, one of the primary evangelists of Open Data, is continuing to refine its definitions. Extracting from Peter’s post:
“I reiterate some guidelines. I’m still working these out and would welcome comment. (I don’t feel we should stray too far from the Open Knowledge Foundation guidelines.) As a start I would suggest the following:
- There must be some mechanism whereby the community could, if it wished, capture the resource for public archival without permission. This could be as simple as spidering the site, or a relational dump, or a massive file, or an iterator.
- There must be no permission barriers to re-use including commercial re-use.
- The data must either be the whole work (at a given point in time) or be clearly bounded (i.e. there should be no hidden data that the world cannot get access to in the same way).
- There should be no time limits on access and re-use.”
For right now I am giving up on trying to track where Open Data might end up. Based on my previous discussions with Peter Suber about navigating the complexities of Open Access definitions, I understand there is a need to define our own policies. I’m not going to do that here, but what I will be clear about is that once the ChemSpider structure set is deposited in PubChem, we are at the mercy of THEIR data sharing policies. I believe Peter holds up PubChem as the primary example of Open Data (but maybe not). So, I believe it should be true to say that the ChemSpider structure set IS Open Data when accessed/downloaded/shared from PubChem. I understand that it will then be the PubChem data set and all association with us will likely be lost. But that is fully acceptable!