This is a brief and undigested report from day 1 of the DCC’s International Digital Curation Conference, taking place in Edinburgh. After a welcome from Chris Rusbridge (DCC director) and Professor Peter Clarke (NeSC director) we had a keynote from Professor David Porteous, a professor of human molecular genetics and medicine and a key player in Generation Scotland.
He began by illustrating the changes in health, disease and knowledge of causes that lie behind some of his work. Changes in Scottish demography show this: in 1911 everyone is young, and numbers decline with age in a smooth curve. After WWII, in 1951, there is a flat bulge from ages 10 to 50 with a decline thereafter, whereas 2001 and 2031 see a bulge in pensioners. There is a consequent rise in chronic disease: disease for which today’s treatments are not very effective, unlike the killers of the past, where effective treatments contributed to the changes in the age profile of the population that we are now seeing. He illustrated this with reference to The Grim Reaper’s Road Map: an atlas of mortality in the UK (and, as he said, a fine book title). It showed cancer, heart and lung disease unequally distributed over the UK. Glasgow is particularly bad. Why? Nature and nurture both play a part, but other than smoking, we have very little evidence for the real effects underlying nurture causes such as diet variation. So his research concentrates on the nature aspect: what part our genetic inheritance plays.
He then looked at changes in sequencing costs. From 1990 to 2003 the Human Genome Project spent $3bn to do the first genome, with machines spread over aircraft hangars; now one machine can do a genome for $500k, and within the next year for $5k. Complete Genomics plans 20,000 genomes at $5k each in 2010, using 60,000 processors and 30PB of storage. One goal of all this is personalised medicine: drugs that work for your genetic makeup. In the USA, adverse drug reaction is the 4th leading cause of hospital mortality (but I’m thinking that only some of this is genome-related; some of it must surely be because some drugs are just downright toxic, with prescription involving a balanced risk that sometimes doesn’t go the way we want, and because prescribing errors are still all too frequent). Bringing together mass genomics and automated drug screening is key, and involves two big sets of data. Generation Scotland plans to do this: its competitive advantage is an unhealthy but stable population. Large-scale family-based studies are possible, and there is a supportive attitude in Scotland to doctors and medical schools. Expertise in health informatics and in ethical, legal and social science is essential. It’s all volunteer-based (a striking contrast to Iceland and the UK genome bank); grandmothers are key influencers. Recruits give blood, serum and urine samples and have tests of lung, bone, etc. and of mental health status, so it’s more than just aggregating health records. Mental health is one area where drugs generally don’t work, and the interaction between nature, nurture, aetiology and drugs needs to be explored in much more depth. The system is linked to medical records; subjects have the right to withdraw.
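As a rough sanity check on the sequencing figures quoted in the talk, the arithmetic can be sketched in a few lines of Python. The input numbers are the ones I noted down; the variable names and the decimal petabyte-to-terabyte conversion are my own assumptions, and the derived values are only back-of-envelope.

```python
# Figures as quoted in the talk (see notes above).
HGP_COST_USD = 3_000_000_000     # Human Genome Project, first genome
GENOMES_PLANNED = 20_000         # planned genomes for 2010
COST_PER_GENOME_USD = 5_000      # quoted price per genome
PROCESSORS = 60_000
STORAGE_PB = 30

# Assumption: decimal units, 1 PB = 1000 TB.
TB_PER_PB = 1000

# Total spend implied by the plan.
total_cost_usd = GENOMES_PLANNED * COST_PER_GENOME_USD

# Storage and compute implied per genome.
storage_tb_per_genome = STORAGE_PB * TB_PER_PB / GENOMES_PLANNED
processors_per_genome = PROCESSORS / GENOMES_PLANNED

# How many times cheaper a $5k genome is than the first one.
cost_reduction_factor = HGP_COST_USD / COST_PER_GENOME_USD

print(total_cost_usd, storage_tb_per_genome,
      processors_per_genome, cost_reduction_factor)
```

On these assumptions the plan implies about $100m in total, roughly 1.5TB of storage per genome, and a 600,000-fold fall in cost per genome since the first one.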
There were 10 years of consultation before it started. But then there’s Google Health and Google health trends: both are ways of using large amounts of data to gain knowledge, though they work in different ways.
I missed the next morning sessions because I had to attend another meeting, and rejoined to see the minute-madness presentations for the posters, about which I hope to write later this week.
After lunch we had Dr Bryan Lawrence of STFC talk about big science data curation at the STFC environmental data archive. There were lots of numbers in this presentation and I only captured some of them: petabytes of data overall, 50TB from the Met Office, 4000 years’ work to ‘look’ at it all. That’s why you need metadata, because you can’t examine it all at ingest. At 2 minutes to find an image and do something simple with it, one person can handle 60,000 images a year. So they need to automate metadata creation and extraction. Google needs this metadata to help; it can’t deal with non-cited data directly. Most data mining processes text, not image data. We find data with discovery and ontology metadata; then look at context, character and discipline stuff; then also archival metadata. He mentions ISO 19115, which should be derived from browse metadata. (There was a much more formal classification of metadata in his presentation which I haven’t captured in these notes.)
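The workload figure is easy to reproduce. Here is a minimal sketch, assuming a working year of about 2000 hours (my assumption, not a number from the talk); the function names are hypothetical and only illustrate the arithmetic.

```python
MINUTES_PER_ITEM = 2            # from the talk: 2 minutes per image
HOURS_PER_WORKING_YEAR = 2000   # assumption: roughly 250 days x 8 hours


def items_per_person_year(minutes_per_item=MINUTES_PER_ITEM,
                          hours_per_year=HOURS_PER_WORKING_YEAR):
    """Number of items one person can look at in a working year."""
    return hours_per_year * 60 // minutes_per_item


def person_years(n_items, minutes_per_item=MINUTES_PER_ITEM,
                 hours_per_year=HOURS_PER_WORKING_YEAR):
    """Person-years needed to look at n_items by hand."""
    return n_items / items_per_person_year(minutes_per_item, hours_per_year)


print(items_per_person_year())   # items per person per year
print(person_years(1_000_000))   # person-years for a million items
```

With these assumptions the rate comes out at the 60,000 images/year/person quoted above, which is why manual inspection at ingest does not scale and metadata extraction has to be automated.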
Data scientists can’t do their job unless the scientist has done theirs. They can choose not to take stuff, though, because the scientist hasn’t done their job. But even not taking something consumes resources to make the assessment and decision. He makes the point that you can automate streams, but you can’t automate jobs away (10 things still need to be done, even if they are automated, so there’s still a linear resource relationship to the number of objects).
They charge for 3 years’ storage up-front at ingest time; if volumes continue to grow, historical data storage then lives on the margins of current business. The core budget pays for management and access systems, data management, network access, etc.; per-data-stream costs are then charged to projects. Core covers some projects already; with 25 FTE they can support 10 new types of data per year, 100 things of a few hours’ work and 1000 things of a few minutes’ work; beyond that it must be automated. The next IPCC requirements change the scale and thus the cost models. The interesting problems are in the browse, ontology and extra metadata space. Preserving metadata now presents its own challenges; real data publication is the way forward. In questions we determine that the deluge is necessary for the cost model, as is storage cost reduction. And at the moment it isn’t worth bothering about things to throw away, but that could change if those cost assumptions turn out not to be true (such as if storage costs plateau at some point).
Neal Beagrie then speaks about research data costs. 4 case studies, 12 interviews, a literature review and a detailed look at 2 cost models against OAIS and UKHE TRAC led to the report. It produced a 3-part activity model which supports a cost framework (pre-archive, archive, support services). There are key cost variables and a resource template. It separates economic adjustments from service adjustments. He contrasts repository costs for publications with costs for data repositories, and cites data from elsewhere about the much bigger cost of repairing metadata later on as opposed to doing it right at creation. The study looks at efficiency curve effects, economies of scale and problems with first mover costs. He mentions the ADS 20-year rule (that all-time costs of preservation are essentially accounted for in the first 20 years), but points out there are a number of assumptions behind that. He points out the usefulness of the NSB/NSF distinction between research, community and reference collections. The study is new in using FEC (full economic costs), a UK HE funding model which comes from TRAC: the transparent approach to costing (not trustworthy archive certification!).
The study is not just about DIY; it can also account for partial or full outsourcing. An OSI study shows that 1.4% or 1.5% of research funding goes on data preservation and access.
Brian Lavoie talks about economic sustainability, from the Blue Ribbon Task Force. Resources aren’t just ‘available’: meaningful engagement is necessary. They need to be comprehensive (or at least a critical mass), actionable and sustainable (hence persistent). Sustainability is economic, technical and social. The task force is supported by NSF, Mellon, LoC, JISC, CLIR and NARA. Its mission is to frame digital preservation as a sustainable economic activity. We need to articulate benefits and incentives for decision makers (parallels with PoWR and the digital preservation policy study). The first (benefits) gives a willingness to pay, the second (incentives) a willingness to provide.
We need selection and efficiency, and reliable predictability. Then we need to choose an organisational form and governance. The organisation may have no interest (3rd party provider), a private interest (university library/archive) or a statutory/mandate interest (national library/archive). Issues: separating the costs of access now from access in the future, and monetizing public good. He mentions “spend now for future value”, which appears to resonate with the DELOS/NSF “Invest to Save” message (in which I must declare an interest). The first report is due this month.
At this point my laptop battery gave up the ghost. The final presentation from John Wilbanks was a real highlight in many ways, but it will take until tomorrow for me to transcribe my handwritten notes.
Day 1 ended with a conference dinner in the splendid setting of Edinburgh Castle. The harpist was a particularly fine touch.