ChemSpider as a part of Web 2.0 – and what is that Web 2.0 anyways?

10 May

In this post I am going to excerpt from another blog (the excerpts are bolded to identify them) regarding ChemSpider and its non-Web 2.0 status. Based on my previous post this is the way of the blogosphere, and pages from the ChemSpider blog are being excerpted in the same way.

The question I posted for ChemSpider bloggers was whether or not the curation of data should be supported by the community. Whatever the answer should be, the reality is that curation is already underway and continues. Here I share comments posted elsewhere, with parts of the material extracted for discussion.

To the question “Should the curation of data on ChemSpider be supported by the community?” the comments made were:

“…only if the community has time on its hands and wants to donate significant goods to commercial organisation(s) who will then own and control the content. (People already do this, of course – they are called scientists and as authors they donate their goods to commercial publishers.) Put simply Chemspider is Web 1.0; The chemical blogosphere, Pubchem, Blue Obelisk, CrystalEye is Web 2.0. Chemspider’s business model was fine for the early web. No public content, significant effort to extract it, few alternative sites.”

So, some comments.

Yes, we scientists do donate our goods to commercial publishers. Over the past 12 months I’ve been author or co-author of almost a dozen peer-reviewed publications in some of the top journals in the world, for some of the top publishers (ACS, Wiley and Elsevier, for example). Some of the review processes have been slower than hoped, and I do take issue with situations where editors receive two “Publish as is” reviews and then hold the paper up for months for one reviewer who comments “It’s too long.” The articles, once published, have been exposed to many people and have resulted in follow-up from many scientists. I like the results, and I feel that the publishers do a stellar job of creating quality output through a generally seamless process. I’m not going to comment on profit margins for the publishers…you can find those rants elsewhere. By contrast, the publications we have submitted to Open Access journals have produced no interest…yet the work was of similar caliber. The time of Open Access journal exposure is here though, and I judge that interest will increase. I believe ChemSpider will help this and will explain why in a later post.

Web 2.0. I’ve asked people what it is and they generally all point to “community web”. Asked for examples, they talk about reviews on Amazon, voting on eBay, Flickr, YouTube, blogs, Wikipedia and so on. I’m sure you can add a few of your own “Web 2.0 definitions”. The general feeling is that Web 2.0 is about building community.

From the comments above about “The chemical blogosphere, Pubchem, Blue Obelisk, CrystalEye is Web 2.0” I have to assume that the intent here is to identify Web 2.0 as being connected to Open Source, downloadable content and integration.

With MySpace, YouTube and Flickr as the poster children of Web 2.0, I’m not sure how this matches up with that intent. Certainly these sites are big business. They are not Open Source to the best of my knowledge. Downloadable content…I don’t think it’s possible to download the database. But these sites are major contributors to community building on Web 2.0.

I turned to Wikipedia for a more formal definition and extract it below. Wikipedia gives the definition of Web 2.0 as:

1) The transition of web sites from isolated information silos to sources of content and functionality, thus becoming computing platforms serving web applications to end-users
2) A social phenomenon embracing an approach to generating and distributing Web content itself, characterized by open communication, decentralization of authority, freedom to share and re-use, and “the market as a conversation”
3) Enhanced organization and categorization of content, emphasizing deep linking

Relative to these definitions ChemSpider delivers. Specifically:

1) ChemSpider INTEGRATES information silos. We connect containers of content via the indexed chemical structures and associated identifiers leading into the silos of information. It is this integration that has encouraged data providers to look favorably on our activities. A search of ChemSpider leads scientists to their content and we do NOT replicate it except at the chemical structure and link level. We serve web applications to end-users…visit our services page
2) We are becoming a social environment for chemists…and we have only just started. Six weeks into our beta release we have openly communicated our intentions and will continue this pattern. The decentralization of authority will come as we allow peer-reviewed curation of the data. This is NOT complete at the site yet. As declared previously, we will enable a wiki-like environment for chemists to contribute to and edit the database. The freedom to share and re-use will be enabled shortly; the level at which this will happen is under discussion. For many it will suffice, for some it will likely be a cause for discussion.
3) In our opinion we are enhancing the organization of data and enabling deep linking to an individual structure; each structure, for example the 10 millionth, can be linked to directly. Click on any structure on the Spinneret webzine as an example.

Also extracted from the Wikipedia article: in the opening talk of the first Web 2.0 conference, Tim O’Reilly and John Battelle summarized what they saw as key principles of Web 2.0 applications. Some are excerpted below:

1) the web as a platform
2) data as the driving force
3) network effects created by an architecture of participation
4) innovation in assembly of systems and sites composed by pulling together features from distributed, independent developers (a kind of “open source” development)
5) lightweight business models enabled by content and service syndication
6) the end of the software adoption cycle (“the perpetual beta”)
7) software above the level of a single device leveraging the power of The Long Tail.
8) ease of picking-up by early adopters

We believe ChemSpider delivers on many of these principles also. We certainly LIVE number 6 above.

From Wikipedia again: “While interested parties continue to debate the definition of a Web 2.0 application, a Web 2.0 web-site may exhibit some basic common characteristics. These might include:
1) “Network as platform” — delivering (and allowing users to use) applications entirely through a browser. See also Web operating system.
2) Users owning the data on the site and exercising control over that data.
3) An architecture of participation and democracy that encourages users to add value to the application as they use it. This stands in sharp contrast to hierarchical access-control in applications, in which systems categorize users into roles with varying levels of functionality.
4) A rich, interactive, user-friendly interface based on Ajax or similar frameworks.
5) Some social-networking aspects.
6) Enhanced graphical interfaces such as gradients and rounded corners (absent in the so-called Web 1.0 era).”

Again, we deliver on the majority of these at present. Relative to 2), we do NOT have permission from our collaborators to hand over their data. Please don’t thrash us over their decisions to contribute and not share. Relative to 4), Ajax is NOT yet implemented at the site…but will be…watch this space. Relative to 6)…we have ROUNDED CORNERS…ooohhhh.

Other comments include “I see very little difference between Chemfinder and Chemspider. They are both closed, proprietary, do not expose data, or metadata, or algorithms; have closed code, do not allow downloads or re-use. They lose metadata in their aggregation process. I have nothing personal against Chemspider (or, if they are associated, ACDLabs) – I just think the Web 1.0 model is out of date for chemistry.”

To respond…yes, the code is proprietary and closed…we don’t know of any Open Source code that would quickly search >10 million structures by structure and substructure (that will be covered in a separate post, as I have the utmost respect for the commercial entities that do this well! It’s DIFFICULT.) Oh…but Open Source isn’t part of the Web 2.0 definition. We don’t expose algorithms…correct…many are provided by collaborators and we do not have the right to expose their code. But that isn’t part of Web 2.0 either.

And next…the beloved “metadata” term. What exactly IS metadata? Let’s refer again to our web-friendly Wikipedia regarding metadata. In brief it’s “data about data”, and a perfect example is an XML schema vs XML. An XML schema is metadata. According to my interpretation this means InChI and SMILES are not metadata, since these data can be interchanged with the structure itself. I may be wrong. Metadata is not simply data related somehow to the structure; it is more general data describing the data model. A hypothetical entity describing what data can be bound to a structure would be metadata, and so is the source of the data…this IS metadata. ChemSpider doesn’t lose the metadata…we retain the only metadata currently available, the data source, and use it as our link out to the provider. Our primary role again, for now, is to connect silos of information via chemical structures.
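To make that distinction concrete, here is a minimal sketch in Python. It is not ChemSpider’s actual data model: the class name `SourceRecord`, the provider names and the example.org links are all hypothetical, and benzene is used purely as an illustration. The structure representations (InChI, SMILES) are treated as data, the source and link-out are treated as metadata, and records from different silos are connected through the structure while every source is retained.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRecord:
    """One record as delivered by a (hypothetical) data provider."""
    inchi: str    # structure representation: data, interchangeable with the structure
    smiles: str   # another structure representation: also data
    source: str   # where the record came from: metadata
    link: str     # link out to the provider's own page: metadata


def aggregate(records):
    """Group provider records by structure, keeping every source and link."""
    index = defaultdict(list)
    for record in records:
        index[record.inchi].append((record.source, record.link))
    return dict(index)


BENZENE = "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"

# Two hypothetical providers holding the same structure, written with
# different but equivalent SMILES strings.
records = [
    SourceRecord(BENZENE, "c1ccccc1", "Provider A", "https://example.org/a/benzene"),
    SourceRecord(BENZENE, "C1=CC=CC=C1", "Provider B", "https://example.org/b/241"),
]

index = aggregate(records)
for inchi, sources in index.items():
    print(inchi, sources)
```

The design choice in this sketch is to key the index on one structure identifier so that two silos collapse into a single entry, while the list of (source, link) pairs preserves where each record came from.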

In a related vein, ChemSpider just published data to PubChem and the same thing occurred: metadata was purposely removed. Regardless of what is uploaded to PubChem in the SDF files, all except a very small number of data fields are removed, and the structure record is then filled with properties calculated by commercial software (CACTVS and OpenEye). By the definition of “losing metadata in the aggregation process” PubChem is part of the Web 1.0 model. It’s no issue for us…we’re proud to be working under the same model, as both efforts provide value. If there is interest we can certainly publish our data model, and likely will in the near future when we submit a publication about ChemSpider.

For details about PubChem’s use of CACTVS, see slide 28 of this presentation. For OpenEye, see Richard Apodaca’s comments on this, from which I quote: “Why did PubChem, the granddaddy of all open chemistry databases, choose a closed, proprietary toolkit for its software infrastructure?”

Continuing with the comments…“99% of Chemspider’s data appears to come from Pubchem. If so, surely it is better to curate Pubchem directly. There are mechanisms for this and as Pubchem is effectively the normalised source it gives less problems for maintenance.”

Yes, what is posted on the beta version is primarily PubChem: 96% of it. This was made clear at the release of the beta. It’s the largest publicly available database. However, at this point we have over 7 million further structures to index and deduplicate. That will reduce the contribution of PubChem significantly…but PubChem IS growing daily (we just contributed data), so we also need to download the new data. Our estimate is that by the end of June PubChem will be about 60% of the ChemSpider database.
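As a rough check on that estimate, here is a back-of-envelope sketch in Python. The inputs are assumptions rather than published figures: a beta database of about 10 million structures, 96% of them from PubChem, 7 million new structures that are all non-PubChem, and no PubChem growth in the meantime. The `unique_fraction` parameter and its 0.85 value are hypothetical, standing in for whatever proportion of the new structures survives deduplication.

```python
def pubchem_share(current_total, pubchem_fraction, new_structures, unique_fraction=1.0):
    """Estimate PubChem's share of the database after new structures are added.

    unique_fraction is the assumed fraction of the new structures that
    survive deduplication and are not already in the database.
    """
    pubchem_count = current_total * pubchem_fraction
    added = new_structures * unique_fraction
    return pubchem_count / (current_total + added)


# Assumed inputs: ~10 million structures in the beta, 96% from PubChem,
# 7 million new non-PubChem structures awaiting indexing.
no_duplicates = pubchem_share(10_000_000, 0.96, 7_000_000)
some_duplicates = pubchem_share(10_000_000, 0.96, 7_000_000, unique_fraction=0.85)
print(f"{no_duplicates:.0%} to {some_duplicates:.0%}")
```

Under these assumptions the share falls between roughly 56% and 60%, which is in line with the estimate above. Any growth in PubChem before the end of June would push the figure up.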

The conclusion of the post was “They will own the results and the results will not be made Openly available but served through their gateway. You are invited to contribute. The Web 2.0 community will use a different mechanism.”

Yes, we will own the results. But we have committed to return all data provided by public sources to the providers, so that entities such as PubChem can update their records and provide them to the community. We will also provide feedback to all contributors. It is their choice to provide access.

I have to wonder why PubChem is being declared as Web 2.0…I don’t care, I just have to wonder. PubChem certainly has downloadable content…it is an incredible data collection. Integration is clear…it’s excellent. But what else makes it Web 2.0? It’s not Open Source to the best of my knowledge: it uses components from CACTVS and OpenEye, both commercial concerns as far as I know. (This makes PubChem just as dependent as other concerns on the longevity of its providers, by the way.) The social environment is where? I’m not aware that Ajax is on PubChem, at least not yet. It has been stated that the PubChem Sketcher is Ajax, but an email exchange with the author suggests otherwise. So, is it provision of downloadable data that makes it Web 2.0? They DO have some rounded corners!

I note that eMolecules declares itself as Web 2.0 at its home page…“eMolecules is bringing the power of Web 2.0 to chemists around the world.” Clearly they have identified what Web 2.0 is. But I don’t see it. How? Where is the community building? That said, they also comment that they are “the world’s largest repository of publicly accessible chemistry information”, but I believe that particular accolade belongs to PubChem at present. Oh…major faux pas…NO rounded corners! It’s a public blog so maybe they will tell us? Maybe a tickle of Ajax on the site?

We are not declaring ChemSpider as Web 2.0…though it seems generally compatible based on the definitions…I’ll go more for Web 1.72.

We’re very sensitive to a statement made in the Wikipedia article: “…when a website is proclaimed “Web 2.0” for the use of some trivial feature (such as blogs or gradient-boxes) observers may generally consider it more an attempt at promotion than an actual endorsement of the ideas behind Web 2.0. “Web 2.0” in such circumstances has sometimes sunk simply to the status of a marketing buzzword”. Yup, I can see that happening.

We’re busy delivering a functional system…we’ll let the community judge us on our compatibility as we creep from Web 1.72 to Web 2.0. For some reason I think this particular blog posting will be judged, again. I just hope this time it isn’t copied and posted elsewhere. I think we’re saying important things here too!

One comment from the Wikipedia definition of Web 2.0 that resonates with me is: “The impossibility of excluding group-members who don’t contribute to the provision of goods from sharing profits gives rise to the possibility that rational members will prefer to withhold their contribution of effort and free-ride on the contribution of others.” It would be a shame if people who see issues on ChemSpider regarding performance or content did not curate the data for others to benefit from, or at least send their comments to us directly so that we can resolve them.

Readers…ChemSpider is still in beta and will be for the foreseeable future. (That’s so Web 2.0…remember that the principles summarized by Tim O’Reilly and John Battelle include “the end of the software adoption cycle”, that is, “the perpetual beta”.) Unlike some, we chose to “go big or go home” and went live with the beta…and then got pounded, not once but twice. It was a great introduction to the power of the blogosphere and the catalyst for putting up our own blog.

We got into this discussion about Web 2.0 as a result of the question “Should the curation of data on ChemSpider be supported by the community?” Whatever the answer, the reality is that curation is already underway and continues unabated. Thanks all.

By the way, we did try to validate ourselves against the Web 2.0 Validator…we didn’t score very well (9/66) but it did say we were Web 3.0 compliant! WWMM, PubChem and eMolecules all got 4/66. We’re glad to be 9/66 but read the full story first…There’s fun stuff out there…


