A conference is a conference, but one can’t help but be inspired to think a little bigger when a guest of the organisation that has engineered a 27km vacuum, has consistently made mind-boggling breakthroughs in particle physics over the past 30 years… oh, and invented the world wide web as a side project.
I’m by no means clever enough to understand most of what goes on at CERN – despite a very personable physicist doing his best to explain what matter is during a short bus journey – but the CERN workshop on innovations in scholarly communication (aka OAI8) covered familiar territory.
The conference, which ran 19th-21st June, was held at, and in partnership with, the University of Geneva. The plenaries covered “technical” issues, metrics, data and document semantics, and research data – plus sessions on arts, humanities and social sciences and Gold OA infrastructure. In a nutshell? Keeping track of our stuff, this side of the Gutenberg Parenthesis.
Before the plenaries got started I attended a tutorial “Metadata: from records to graphs” given by Stefan Gradmann of KU Leuven. Some interesting stuff in there which I will explore in another post.
The idea of a web of resources linked by assertions, as opposed to a web of documents, has been around for a little while, and the technical plenary traced an interesting route around the notion. It started with Paul Groth from the University of Amsterdam looking at the possibilities for scholarly communication using semantic representations of metadata. Traditional academic output is formatted to be understandable to librarians and other academics. Semantic representations can help with the “machine understandability” of one’s work, the pay-off being that this is a relatively easy way to increase the channels through which work is disseminated – and indeed the quality of that dissemination. Rob Sanderson from LANL then explained how the Open Annotations standard can be used to manage the expressions of knowledge or opinion that implicitly surround a resource, in a way that is transferable, transparent and therefore (amongst other things) more preservable. As it turned out, Rob’s talk on Open Annotations was nicely relevant to that morning’s tutorial, so I’ll explore it in more depth in that other post.
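To make that a little more concrete, here is a rough sketch of what an annotation in this spirit might look like, built as JSON-LD with nothing but Python’s standard library. The field names loosely follow the Open Annotation vocabulary, but the target URI and body text are invented for illustration, and the exact context URI and property names should be checked against the published spec.

```python
import json

# A minimal annotation in the Open Annotation spirit: a free-standing,
# transferable record of an opinion about a resource. All URIs here are
# invented for illustration.
annotation = {
    "@context": "http://www.w3.org/ns/oa.jsonld",
    "@type": "oa:Annotation",
    "oa:hasBody": {
        "@type": "oa:TextualBody",
        "value": "This figure appears to contradict the 2011 result.",
    },
    "oa:hasTarget": "http://example.org/articles/12345#figure-2",
    "oa:motivatedBy": "oa:commenting",
}

print(json.dumps(annotation, indent=2))
```

Because the record stands alone rather than living inside any one system, it can move between repositories with its target reference intact – which is where the transferability and preservability come from.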
The session ended with a timely reminder from Henry S Thompson of the University of Edinburgh that all this work to record the fact that something “somethings” something else is for nothing if the system of reference used is not sustainable. Without it, one may as well literally record that something “somethings” something else – and as I’m sure anyone reading this can see, that is not much use to anyone. Henry made the point that there is absolutely no obligation on anyone to ensure that a URI will mean the thing it was originally intended to. I’m using the word “mean” in a very loose sense and deliberately not using the phrase “resolve to”. This will break a web of documents, but it has the potential to make a nonsense of the semantic web. Henry cited Carl Linnaeus’ binomial classification – and indeed the Linnean Society, which has spent the past 250-odd years policing his naming convention – as a laudable example of longevity in the business of maintaining persistent references to representations of things.
The next session covered the theme of metrics. I can’t admit to getting too excited about measuring impact factor, with or without Twitter. Nonetheless the session raised interesting questions about the continued influence of large publishers on scholarly communication (negative and positive) and about how to manage issues of trust, accountability and authority as the methods academics use to share their ideas move into uncharted territory.
The next morning returned to the semantics of scholarly communication. Olivier Bodenreider from the Lister Hill National Center for Biomedical Communications showed off PubMed’s semantic indexing: a nice example of what can be achieved when a very large corpus and a well-structured information resource such as MeSH are used as the basis of automatic semantic indexing. This put me in mind of MERLIN, a project ULCC collaborated on with UCL in 2009, which involved term extraction using TerMine and automatic indexing presented to users alongside thesaurus terms from the (now defunct) HILT. PubMed’s offering had rather more structure, partly because the body of work being indexed is more tightly scoped and partly because they eschewed extracting terms from full texts, preferring abstracts and titles; the resulting exemplar is somewhat more usable.
Further talks in this session included “Detecting knowledge level claims in research articles”. Ágnes Sándor of Xerox Research Centre Europe presented an NLP technique to identify not just the fact asserted, but how that fact is asserted. Phrases like “However, X remains unclear” tell us that the assertion X is couched as an “open question”. X may say that something “somethings” something else, but “In contrast with previous hypothesis, X” tells us that this is a “contrasting idea”. A compelling addition to the types of knowledge categorisation achievable by automated systems.
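A toy sketch of the idea – far cruder than the technique presented – is to classify how a claim is couched by matching surface cue phrases. The cue list and category labels below are my own illustrative inventions, not Sándor’s:

```python
import re

# Illustrative cue phrases mapped to how a claim is couched.
# Real NLP systems use far richer linguistic analysis than this.
CUES = [
    (r"\bremains unclear\b", "open question"),
    (r"\bin contrast (with|to)\b", "contrasting idea"),
    (r"\bwe (show|demonstrate)\b", "claim"),
]

def classify(sentence: str) -> str:
    """Return a rough label for how the sentence couches its assertion."""
    for pattern, label in CUES:
        if re.search(pattern, sentence, re.IGNORECASE):
            return label
    return "plain statement"

print(classify("However, the role of X remains unclear."))
print(classify("In contrast with previous hypothesis, X binds Y."))
```

Even this crude version hints at why the approach is useful: the same bare fact can be downgraded to a hypothesis, or flagged as a challenge to earlier work, purely by the framing around it.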
The semantic plenaries were very much focused on scientific scholarly communication, a realm that is an interesting confluence of verifiable facts and the vagaries and implications of traditional human communication. The scientific theme continued with Anita De Waard’s talk on ways to connect research publications, and the claims made therein, with the research data that represents the evidence for those claims. There were some lovely examples of the real world of scientific data, ranging from printed-out tables stuck into lab logbooks replete with scribbled diagrams, to a fridge full of antibodies with handwritten labels. Quite a task to manage and cite in the digital realm, but with significant potential for the discovery, dissemination and accountability of research if done well.
The fourth session covered issues of research data. Wolfram Horstmann of Oxford University took a quick tour of the various policies on research data: from cast-iron preservation and access mandates to woolly sentiment, it was all in there. Donatella Castelli of CNR-ISTI got technical again as she spoke on the importance of interoperability between the devices and systems that collect and manage research data. The notion of a data infrastructure was illustrated using the iMarine project.
Kevin Ashley of the DCC was up next on the issue of data quality and curation. Some nostalgia here for me, as Kevin presented screenshots of NDAD, the project he led and I worked on here at ULCC back around the turn of the millennium. His points were interesting, especially to us at ULCC, as he admitted that NDAD might have benefited from a different approach – one that tied the data and very high quality metadata together more closely. When it comes to quality, different data users have different requirements; this notion is well embedded in some markets, which offer various “versions” of the same data, but much less so for research data.
Finally, Tim Smith from CERN provoked gasps of disbelief as he ran through some of the data volume numbers that the LHC experiments throw up. His talk, rather quaintly, referred to “large datasets”, but he quickly replaced that term with “big data”. And this really is big data, especially for a single non-commercial institution: even after electronic and computational filtering they still need to find somewhere to put 6GB of data every second. CERN’s solution for where to put it all is the Worldwide LHC Computing Grid which, after a few iterations, turned out to be – to all intents and purposes – a cloud storage solution. CERN are now busy putting this know-how to use for much smaller datasets (but lots of them) with Zenodo, a research output repository that takes an extremely agnostic approach to what goes in.
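For a sense of scale, a back-of-the-envelope calculation of what that filtered rate would amount to, assuming – unrealistically, since the experiments don’t record around the clock – that the 6GB/s were sustained for a full day:

```python
# Back-of-the-envelope: what 6GB/s amounts to per day, if sustained.
rate_gb_per_s = 6
seconds_per_day = 24 * 60 * 60      # 86,400 seconds
daily_gb = rate_gb_per_s * seconds_per_day
daily_tb = daily_gb / 1024          # binary terabytes
print(f"{daily_gb} GB/day, i.e. about {daily_tb:.0f} TB/day")
```

Half a petabyte a day, give or take, is the kind of number that makes a distributed, cloud-like storage grid look less like a luxury and more like the only option.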
The Web that connected academic documents to each other has already changed beyond recognition and is set to change further. The amazing thing is that the comment written at the top of Tim Berners-Lee’s original proposal by his supervisor – “Vague, but exciting” – is as relevant today as it ever was.