Data Mining Dissertations and Adventures and Experiences in the World of Chemistry
This presentation was given at the CLIR/DLF Postdoctoral Fellowship Summer Seminar at Bryn Mawr College in Pennsylvania on July 29th 2014. The intention was to communicate what we are doing in text and data mining in the domain of chemistry, specifically around mining the RSC publication archive and chemistry dissertations and theses. How would these experiences map over to the humanities?
I presented at the Food and Drug Administration today regarding some of our efforts to develop a research data repository for the community. The abstract and presentation from Slideshare are below.
Current Initiatives in Developing Research Data Repositories at the Royal Society of Chemistry
Access to scientific information has changed in a manner that was likely never even imagined by the early pioneers of the internet. The quantities of data, the array of tools available to search and analyze, the devices and the shift in community participation continue to expand while the pace of change does not appear to be slowing. RSC hosts a number of chemistry data resources for the community including ChemSpider, one of the community’s primary online public compound databases. Containing tens of millions of chemical compounds and their associated data, ChemSpider serves data to tens of thousands of chemists every day and serves as the foundation for many important international projects to integrate chemistry and biology data, facilitate drug discovery efforts and help to identify new chemicals from under the ocean. This presentation will provide an overview of the expanding reach of this cheminformatics platform and the nature of the solutions that it helps to enable, including structure validation, text mining and semantic markup. ChemSpider is limited in scope as a chemical compound database and we are presently architecting the RSC Data Repository, a platform that will enable us to extend our reach to include chemical reactions, analytical data, and diverse data depositions from chemists across various domains. We will also discuss the possibilities it offers in terms of supporting data modeling and sharing. The future of scientific information and communication will be underpinned by these efforts, influenced by increasing participation from the scientific community.
We have been working with Vincent Scalfani from the University of Alabama towards supporting a community of 3D printing crystal structure enthusiasts. There is a listserv, [3DP-XTAL], hosted by the University of Alabama and if you would like to be added to the listserv, simply email Vincent at vfscalfaniATuaDOTedu. They are also in the process of creating a 3D printing crystal structure wiki/blog for the community.
With Vincent as the driver we are creating a public on-line repository for 3D printable structure files (.stl and .wrl). He used Jmol to prepare ~30,000 molecules and solids in .wrl and .stl format and we will be hosting them on part of our data repository. We are very excited about this project and there will be more information at the upcoming 248th American Chemical Society Meeting in San Francisco, CA. See CINF Abstract # 125.
The flier that will be distributed at the IUCr meeting in Montreal in August is available on Slideshare here:
I give a lot of presentations. A lot. Maybe too many. At the impending ACS meeting in San Francisco I am giving nine presentations. When I give a presentation I like to share it afterwards. I need the distribution method to be quick and easy to use, and hopefully to let users of the platform find the talk if they are interested in it. I have used various platforms to disseminate my talks. There are really no usability issues with any of them…the various groups have done a good job building their platforms. I am a user of both Slideshare and Figshare and my accounts are here: Slideshare and Figshare. This week I received my weekly stats email and the numbers are below…>3000 views in one week and a total of 400,000 views of my talks, preprints etc.
Compare this with my Figshare stats of >6600 views ever.
The majority of talks I upload to Slideshare have about 3000 views in 2 months as shown below…some have over 25000 now.
If I compare this with Figshare the most views I have is around 500 but that was over 18 months.
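To put the two sets of numbers on the same footing, the figures quoted above can be turned into views per month. This is only a rough sketch: it assumes the quoted numbers are representative, and it compares a typical Slideshare talk against my most-viewed Figshare item, so if anything it understates the gap. The function name is made up for illustration. It works out to roughly 1,500 versus 28 views per month.

```python
def views_per_month(total_views: float, months: float) -> float:
    """Average number of views per month over the stated period."""
    if months <= 0:
        raise ValueError("months must be positive")
    return total_views / months


# Approximate figures quoted in the post.
slideshare_rate = views_per_month(3000, 2)   # a typical Slideshare talk
figshare_rate = views_per_month(500, 18)     # my most-viewed Figshare item

# How many times more views per month the Slideshare talk receives.
ratio = slideshare_rate / figshare_rate

print(f"Slideshare: {slideshare_rate:.0f} views/month")
print(f"Figshare:   {figshare_rate:.0f} views/month")
print(f"Ratio:      about {ratio:.0f}x")
```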
Clearly my presentations on Slideshare get way higher exposure. However, the usual question of quality vs quantity comes to bear. The audience on Figshare, primarily scientists, is likely closer to my intended audience than the one on Slideshare. What I should do, though it is time-consuming (only a few additional minutes per presentation), is post the presentation to Slideshare, to Figshare, to my Academia.edu account, to my ResearchGate account, to Vimeo, to YouTube etc. But I only have so much time and right now my easiest deposition route is Slideshare. In terms of my actual prioritization of places to deposit, based on the number of views and downloads the order is
I have been working with the Kudos platform for a few weeks now…see here. Two weeks ago I chose to run an experiment. Here it is… (you may want to watch the video on the previous post first to understand what enriching an article is and why I feel the platform is of value)
1) I enriched an article that I had authored in 2013. GENERALLY after I enrich an article I tweet it out and then look for the response… you can see some of the results below for the articles I have done…I am starting from most recent and going back to the 80s but with 150 articles to do it’s a long journey…
The important stats to take a look at are Kudos views, clickthroughs and Share referrals. ULTIMATELY we want clickthroughs and views on the publisher platform. Kudos views are good but Share Referrals are very useful I believe. In the list below notice that for the fifth article in the list the referrals are ZERO and the Kudos Views are low relative to the others…but this is the only one I haven’t “shared”…i.e. no tweets and no Facebook posts. My hypothesis was “Ok, so it’s not Kudos itself that is helping to drive the views/shares/clickthroughs but MY work to share…Can I prove this?”
2) In order to prove the hypothesis…and I think it’s done…I did the following.
- Choose one article that had been on Kudos for a while and had low views/shares (as do all articles that have not been enriched)
- Enrich the article in increments and see if it makes a difference…see the A’s shown on the chart below as those are enriching activities
- Monitor the views and see if any enriching activities made a difference.
- Wait two weeks and share the article and see what happens
3) The chart below proves the point.
- Enrichment, while useful for me as it helps aggregate information of value to the article, does NOTHING to drive attention to the article…i.e. the community doesn’t know what I’ve done without me telling them
- Once I share then BOOM…views/accesses/share referrals go through the roof. I went from 7 to 42 Kudos views in <2 hours
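The size of that jump is easy to quantify from the two numbers quoted above. This is a back-of-the-envelope sketch rather than anything Kudos reports: the helper name is made up, and since the jump took less than two hours, the hourly rate is only a lower bound.

```python
def fold_change(before: int, after: int) -> float:
    """How many times larger `after` is than `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return after / before


views_before_share = 7    # Kudos views just before sharing
views_after_share = 42    # Kudos views less than 2 hours after sharing
hours_elapsed = 2         # upper bound on the elapsed time ("<2 hours")

new_views = views_after_share - views_before_share
# Dividing by the upper bound on time gives a lower bound on the rate.
min_views_per_hour = new_views / hours_elapsed

print(f"Fold change: {fold_change(views_before_share, views_after_share):.1f}x")
print(f"New views: {new_views}")
print(f"At least {min_views_per_hour:.1f} new views per hour")
```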
So, an article languished on Kudos for two weeks with no real traffic. I enriched it…no real impact. Not until I released it to my networks, and it got retweeted and passed on to others, did traffic increase. I have fairly good followings on the different social network tools, built up over a number of years. But what will Kudos do for those people who don’t use Facebook or Twitter? Yes, they can enrich the article, but the only way to let people know then is via email. Pushing the Kudos articles out to networks on an author’s behalf would be very useful of course. Things will get exciting if and when Kudos uses intelligent algorithms to deliver updates to people interested in specific article topics. Google Scholar Citations does this for me now…it uses my published articles to provide me with notifications and pointers to related articles, not just articles that cite me. If Kudos could send me an email with “You might be interested in these new articles claimed on Kudos…” then that may be of value also. I think a Follow button would make sense, whereby I can follow an article and, if it is enriched further by the author, Kudos informs me of what new enrichment has been added.
This presentation was given at the JC Bradley Memorial Symposium on 14th July 2014
Jean-Claude Bradley had an incredible passion for providing open science tools and data to the community. He had boundless energy, no shortage of ideas and ran so many projects in parallel that it was often difficult to keep up. But at RSC we tried. We provided access to our data, our application programming interfaces and lots of our out-of-hours time to help turn his vision into reality. As a result we helped in the delivery of the SpectralGame to help people learn about NMR and we supported the integration of our services into GoogleDocs, underpinning the management and curation of physicochemical property data. We tweaked a number of our services based on JC’s input and as a result we have ended up with a suite of capabilities that serve many of our existing efforts to integrate with electronic lab notebooks and support the ongoing shift towards Open Chemistry. JC was very much ahead of his time…and we were glad to have supported his work. This presentation will give a snapshot of some of the work we did to support his vision.
On July 14th 2014 the Jean-Claude Bradley Memorial Symposium was held to celebrate the life and work of Professor Jean-Claude Bradley of Drexel University. This slide deck highlighting dedications made to JC on various blogs and the memorial symposium wiki helps to capture JC’s contributions to science and how we felt about him.
On 14th July 2014 a memorial symposium to celebrate the life and work of Professor Jean-Claude Bradley, the father of Open Notebook Science, used this photo loop to connect us to some of his activities and give us a glimpse into his personal life.
Next week I am looking forward to co-hosting the JC Bradley Memorial Symposium. How did this come about? The symposium is of course catalyzed by the tragic loss of our friend and colleague Jean-Claude. This hit Andy Lang and me very hard (and of course many others) because for a number of years we had been collaborating on a number of projects regarding Open Science, many of these to be discussed in some detail next week at the symposium. When we discussed ways to memorialize JC we happened upon an instance where we would both be in the UK at the same time and, since JC had so many interactions in place with European scientists and advocates for Open Science, we decided to try and make a go of a symposium to celebrate his work.
We received general support for a gathering and went seeking a venue that would be kind enough to host us. Thank you so much to Christoph Steinbeck for trying to make this work at EBI but because of the popularity of the venue no rooms were available. We extended our hand to Bobby Glen at the University of Cambridge and, gentleman that he is (!), he immediately gave us a home for the gathering. Bobby is Director of the Unilever Centre for Molecular Science Informatics at the university and many well-known scientists and open science evangelists work there, one of these of course being Peter Murray-Rust. Peter threw his support behind the symposium 100% and, together with Susan Begg, has taken all responsibility for local coordination. Despite Peter’s demanding travel schedule we have been able to coordinate the event and we owe a debt of gratitude to Susan for all the work she has done in the background to bring this together in such a short time. Literally, this event will come together as a result of a few Skype calls between Andy and me and a series of email exchanges between us, Peter and Susan. When the event comes together, starting Sunday evening with a social gathering, and finally on Monday morning when the formal gathering kicks off, then intention, collaboration, trust and willingness to get it done will be the underpinnings of the meeting.
“Intention, collaboration, trust and willingness to get it done” speak volumes regarding how JC Bradley approached science. He was a get-it-done type of guy. The speakers that will gather next week, listed here, operate in the same way in my mind. They are driven, passionate and getting it done. We thank every one of them for taking their time to come and celebrate JC.
The gathering will honor his work and enormous contributions to open science. He was ahead of his time. With this gathering of people, and the support of the attendees, we hope that we will be able to discuss how to drive forward what he had put so much effort into…OPEN NOTEBOOK SCIENCE. Peter Murray-Rust has already outlined his thoughts and will expand at the gathering. What we will need to do is consider how to turn discussions into actions and deliverables to get it done. The symposium will be a start…the “networking events” (call them pub gatherings) will continue the discussions and what we do afterwards makes it work. Hope to see you in Cambridge!!!
For those of you who CANNOT attend on Monday of next week….you can still contribute…
If you have any photos of JC please send them through to me at tony27587ATgmailDOTcom for a photo loop.
If you want to send a dedication to JC, send a few words that I will show on a dedication loop sometime during the meeting.
Recently I had a chance to sit with Fiona McKenzie and discuss why we both find Kudos to be of so much interest…it was just released as a movie on the RSC YouTube Channel and is embedded below. Will Russell, who is often behind the scenes leading the way on interesting engagements and collaborations, is behind the camera…