# A repository for pi(es)

As you may have read recently, Fabrice Bellard has announced the computation of π to almost 2.7 trillion decimal places using a faster algorithm that runs on desktop hardware rather than the supercomputers usually needed to break this particular record. Bellard is an extremely talented programmer who has made a useful contribution to one area of digital preservation with his emulation and virtualisation system QEMU. But it’s a comment by Les Carr that set me thinking about costs, research data and repositories.

“Would you want to put that in your repository?” asked Les. And this is a particularly extreme example where we can do some calculations to give us a fairly good answer. Scientific data centres and the researchers that use them have been considering this question for many years, and one way of looking at it is to see if the cost of recomputation exceeds the cost of storage over a particular time period. We’re assuming here that the initial question – is this worth keeping at all – has been answered at least vaguely positively.

Let’s look first at the cost of recomputation. Fabrice says the equipment used for this task cost no more than €2000. If we assume that it has a life of 3 years, that gives us a cost per day of €1.83. I’m avoiding the usual accounting practice of allowing for inflation, or lost interest on capital, in calculating the true depreciation value of the asset – there are a number of different schemes and they all give similar results. I’ve just divided the capital cost by the number of days of use we’ll get. But computers use electricity, and that costs money as well. Let’s assume this is a power-hungry beast that draws 400W and that power costs us 13.5¢ per kWh (which is what my domestic tariff is if we assume a euro/sterling rate of €1.10 = £1 and 5% VAT.) That adds €1.30/day to the cost of running the system, for a total cost of €3.13/day.

Fabrice’s announcement says that it took 131 days of system time to calculate and verify his results, which gives a computational cost of €410.03 – which I’ll round to €410 since I’ve only been using 3 significant figures so far in the computations, and because there’s a lot of hand-waving involved in lots of these figures. So, we know how much it would take to recompute this result given the software, machine and instructions. (And the computational cost is likely to decline over time in the short term.)
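For anyone who wants to vary the assumptions, the arithmetic so far can be sketched in a few lines of Python. The constants are the figures quoted above; the function names are mine and purely illustrative. Note that this version doesn't round the daily figures before multiplying, so the total lands about a euro below the figure above, which is well inside the hand-waving margin.

```python
# Back-of-the-envelope recomputation cost.
# Constants are the figures quoted in the post; names are illustrative only.
HARDWARE_COST_EUR = 2000.0   # upper bound on the equipment cost
LIFETIME_YEARS = 3           # assumed useful life, straight-line, no inflation
POWER_DRAW_WATTS = 400.0     # assumed power draw
TARIFF_EUR_PER_KWH = 0.135   # assumed domestic tariff, 13.5 cents per kWh
RUN_DAYS = 131               # system time to calculate and verify


def hardware_cost_per_day() -> float:
    """Capital cost divided by the number of days of use."""
    return HARDWARE_COST_EUR / (LIFETIME_YEARS * 365)


def energy_cost_per_day() -> float:
    """Electricity cost of running the machine for 24 hours."""
    kwh_per_day = POWER_DRAW_WATTS / 1000.0 * 24
    return kwh_per_day * TARIFF_EUR_PER_KWH


def recomputation_cost() -> float:
    """Total cost of rerunning the whole computation once."""
    return (hardware_cost_per_day() + energy_cost_per_day()) * RUN_DAYS


if __name__ == "__main__":
    print(f"hardware: {hardware_cost_per_day():.2f} EUR/day")
    print(f"energy:   {energy_cost_per_day():.2f} EUR/day")
    print(f"total:    {recomputation_cost():.0f} EUR for {RUN_DAYS} days")
```

Changing the assumed life of the machine or the tariff is a one-line edit, which is about as much sensitivity analysis as an envelope this small deserves.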

The answer needs a Terabyte of storage. What will it cost to keep that in a repository? That’s a slightly more difficult question to answer, but we can give a number of figures that provide upper and lower bounds. SDSC quote \$390/Tbyte/year for archival tape storage (dual copies), excluding setup costs and assuming no retrieval. Moore et al quote \$500/year as a raw figure, obtained by dividing total system costs by usable storage within it. At current rates of \$1 = €0.67, that gives us a cost of €261/year or €335/year. SDSC are likely to be at the cheap end of the scale. ULCC’s costs, given our lower total volumes, would be closer to €1500/year for a similar service (dual archival tape copies on separate sites) although that does include retrieval costs. Amazon’s AWS would be about €100/year for a single copy. You would want two copies, so it’s twice that, and the cost of transferring the data in would be about 25% more than the storage cost. Since I haven’t factored in ingest costs for any of the other models, I’ll ignore it for AWS as well. (And yes, AWS isn’t a repository, and there’s no metadata, and… This is a back-of-the-envelope calculation. It’s a small envelope.)

Which means, at a very rough level and ignoring many pertinent factors, that after about two years of storage in the repository (sooner at all but the cheapest rates quoted above), we would have been better off recalculating the data rather than storing it. There are a lot of assumptions hidden there, however. For one, we’re assuming that this data will rarely, if ever, be required. If many people want it, the recalculation cost rapidly becomes prohibitive (and so does the 131 days they have to wait for their request to be satisfied!)
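The break-even point for each storage option is simply the recomputation cost divided by the annual storage cost. Here is a minimal sketch, assuming the rounded €410 recomputation figure and the annual storage figures quoted earlier (with AWS doubled for two copies). It ignores ingest and retrieval costs, falling prices and repeated requests, just as the prose does, and the names are illustrative only.

```python
# Years of storage after which recomputing once would have been cheaper.
# All figures are the rough ones quoted in the post, in euros.
RECOMPUTE_COST_EUR = 410.0

ANNUAL_STORAGE_EUR = {
    "SDSC archival tape (dual copies)": 261.0,
    "Moore et al. raw figure": 335.0,
    "ULCC dual tape copies": 1500.0,
    "AWS, two copies": 200.0,
}


def break_even_years(annual_storage_cost: float) -> float:
    """Storage time at which cumulative storage cost equals one recomputation."""
    if annual_storage_cost <= 0:
        raise ValueError("annual storage cost must be positive")
    return RECOMPUTE_COST_EUR / annual_storage_cost


if __name__ == "__main__":
    for name, cost in ANNUAL_STORAGE_EUR.items():
        print(f"{name}: {break_even_years(cost):.1f} years")
```

The cheapest option sets the upper bound on how long storage stays the better deal; every other option tips in favour of recomputation sooner.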

One of the other problems is more subtle. I said that, in the short term, recalculation costs would be likely to fall as computational power becomes cheaper. The energy costs involved will rise, of course, but there’s still a significant downward trend. But after a sufficient period of time, it becomes non-trivial to reconstruct the software and the environment it needs in order to allow the computation to happen. Imagine trying to recalculate something now where the original software is a PL/I program designed to run under OS/360. It’s not impossible by any means, but the cost involved and expertise required are non-trivial. At least with our example we won’t have any doubts about whether the right answer has been produced – the computation of π produces an exact, if never-ending, answer. Most scientific software doesn’t do this and the exact answers produced can depend on the compiler, the floating-point hardware, mathematical libraries and the operating system. Over time, it becomes harder and harder to recreate these faithfully, and we often don’t have any means of checking whether or not we have succeeded. (Keeping the original outputs would help in this, of course, but that’s exactly what we’re trying to avoid.) That’s part of the problem that Brian Matthews and his colleagues examine in the SigSoft project and there’s still a great deal of work to be done there.

So have we answered Les’s question? My feeling is that in this case we have – there’s a fair amount of evidence that suggests that keeping this particular data set isn’t cost-effective. But in general, the question is far harder to answer. Yet we must strive harder for more general answers as the cost of not doing so is not trivial. Even if money did grow on trees, it still wouldn’t be free and at present we need to be very careful how we use it.

# DPC AGM – and thoughts on preserving research data

Last Monday (2009-11-23) saw DPC members travel to Edinburgh for a board meeting and for the annual general meeting of the company. We elected a new chair – Richard Ovenden – and offered our thanks to Bruno Longmore for the effective leadership he has offered as acting chair following the departure of Ronald Milne for New Zealand earlier this year.

We had a brief preview of the new DPC website, which promises to be a much more effective mechanism for the membership to engage with each other and the wider world, and confirmed recommendations emerging from a planning day earlier in November which should keep the DPC busy (and financially secure) for a few years to come.

Finally, we had an entertaining and thought-provoking talk from Professor Michael Anderson. Professor Anderson touched on many issues relating to digital preservation from his research career, past and present. He mourned the loss of Scottish census microdata from 1951 and 1961, painstakingly copied to magnetic tape from round-holed punch cards for 1951 and standard cards for 1961, which had to be destroyed when ONS realised the potential for inadvertent disclosure of personal information.

# Moving Home

It’s been quiet here recently. Partly because people have been busy with projects such as CLASM and ArchivePress, but also because we’ve been busy readying ourselves for a move. After nearly 40 years in the same purpose-built premises, we’re relocating to Senate House, the home of the University of London’s federal activity. Many staff members won’t be in the office on Monday or Tuesday 28/29 September and those that are will be fully occupied making sure that everything moves across in one piece and ends up in the right place with the right cables plugged into it. Please bear with us if you’re trying to contact us then and it takes a bit longer than usual to get a reply. Our email addresses stay the same but our telephone numbers are changing.

We’ve had to lose a lot of material relating to past computing technologies that is now of limited or no value to us – stuff we were keeping really just because we had the space. That includes extensive documentation on IBM and CDC systems of the past, as well as DEC systems and a huge variety of micros. We’ve kept a few gems (the reference card for the SNUFF editor is a particular personal favourite) and discovered a few as well. They include what’s probably the earliest evidence of ULCC’s web presence in 1994. I’ll be writing about that soon over on the JISC-PoWR blog.

# Good news from the DPC

My day today began with one of those moments that remind us how technology, and the world, changes. On the train I sat next to someone reading and scribbling on an academic text of some sort on which the words “network research” and “SNA” appeared prominently. I began reading, as one does (yes, I shouldn’t, but I always do.) The first paragraph or so made sense and then I was brought up short. If you worked in computer networking during the 1970s, 1980s or 1990s (as I did) then seeing “SNA” and “network” within a few paragraphs of each other could only mean one thing, and it came from IBM. (Google still thinks so.) But in this case, SNA meant social network analysis, an entirely different field. (And one possibly related to Erdős numbers, a favourite of mine.) There are even some Perl modules for it, which is more than could be said for ACF/VTAM.

But I digress. I’m here to write about some outcomes from Friday’s DPC board meeting. Encouragingly, it looks likely that the digital preservation award will return in November 2010, although some hurdles remain to be overcome. It’s quite possible that some aspects (such as eligibility or marking criteria) may change. Watch out for news late this year or early next. In the meantime, if you have views on what would make the awards more interesting or relevant to you, and particularly on what might encourage you to enter, do let me or the DPC know.

The joint Society of Archivists’ digital preservation roadshows (supported by DPC, TNA, Planets and Cymal) have been extremely popular, with some events over-subscribed. They are proving a great way to get basic, practical information about digital preservation tools and methods into the hands of working archivists and records managers. The problems, and the reception, sound reminiscent of similar work I did for the SoA about 10 years ago, as part of their occasional training days for newly-qualified archivists.

I’m also pleased to say that the Board approved a proposal to allocate more money to training scholarships in 2009/10, which can be used to support attendance at DPTP or other member-provided courses such as DC 101 (which is currently free.) We’re also looking forward to a joint training showcase in Belfast with the DCC’s DC 101, facilitated by JISC and PRONI, in September. More details will appear here and elsewhere when we have them.

We’re expecting an increased number of DPC techwatch reports in the coming year. The latest, released on preview to DPC members yesterday (2009-07-10), covers geospatial data, and there’s a long list of candidate topics for the next couple of years.

Finally, the board said thanks and farewell to its current chair, Ronald Milne, who is taking up a new post at the National Library of New Zealand next month. The Vice Chair, Bruno Longmore, will act as DPC chair until elections are organised for the AGM in November.

# On the limits of preservation

A recent article in New Scientist on the outer fringes of the chiptune scene prompted me to think about preservation, emulation and the fact that some digital things simply aren’t preservable in any useful sense.

Chiptunes are typically created using early personal computers or videogames and/or their soundchips. In that respect, they depend on technology preservation – the museum approach to digital preservation. Chiptune composers either use the systems as designed, programming them directly to create their music, or alter them in some way using techniques collectively known as ‘circuit-bending’, which makes the machines capable of producing sounds that they could not have originally produced. Some aspects of the chiptune scene utilise more modern synthetic techniques to recreate the sounds produced by these early chips – these are, in a loose sense, emulating the original systems, although not in a way that would allow you to use original software to create your sounds. But some adherents of the chiptune genre are going further, using the sounds of the systems themselves in their compositions.

The article which set my train of thought going covered Matthew Applegate’s (aka Pixelh8) concert in late March 2009 at the National Museum of Computing,

# rpmeet – the JISC Repositories and Preservation Programme Meeting

Some of us at ULCC, and over 100 other people from around the UK, spent a couple of days this week at the Aston Business School reviewing the outcomes of JISC’s repositories and preservation programme and looking forward to what comes next. It was a useful and stimulating couple of days – the best programme meeting I’ve attended so far. The few projects that weren’t represented at the meeting missed out in a lot of ways. If you’re involved in a JISC project, make sure you, your project manager, or both of you go to a programme meeting when you are invited. You’ll learn a lot, make some useful contacts, save some time, get some useful ideas and possibly lay the groundwork for future projects or collaborations.

I began the day by chairing the final meeting of RPAG (the repositories and preservation advisory group).

# DPC sponsors DPTP scholarships for May

We’re pleased to say that the DPC has agreed to sponsor two places at the forthcoming open run of the Digital Preservation Training Programme (DPTP) at SOAS, 18-20 May 2009. Attendance at DPTP itself is open to everyone, but the sponsored places are only available to staff of DPC member institutions. We’re pleased that this continues the valuable relationship we’ve had between the training programme and DPC since its inception. It also gives us the ideal excuse to welcome William Kilbride back as one of the tutors on the course – he’s a talented teacher and a joy to work with.

DPTP is of value to anyone with responsibility for digital preservation in an institutional context – its aim is to equip you with the knowledge to effect change in the organisation to allow the right things to happen. (If your primary responsibility is scientific data curation, you may find the DCC’s DC 101 course more applicable.)

Applications need to be in by May 5th – it’s not an onerous process, so don’t delay.

# International Repositories Infrastructure Workshop: public wiki now open

About a month ago (March 15-17) I attended an invitation-only event entitled “An International Repositories Infrastructure Workshop” in Amsterdam. Others have already blogged more contemporaneously about this event, including Chris Rusbridge, Amanda Hill and Jeremy Frumkin. They all provide a good summary of some of what took place, the activities which led up to the workshop and some sources of other information.

What’s prompted me to write about it now is the news that the outputs from that workshop are now visible, and the ongoing process of revising and amending them is taking place in a far more public forum on pbwiki. repinf.pbwiki.com is somewhere you should visit if you are, in the words of its homepage:

…. interested in:

1. developing coordinated action plans for specific areas of repository development

# Marking and writing JISC proposals

There’s been quite a bit of online discussion around writing and marking of proposals in JISC’s recent 12/08 call, including discussion of how Twitter can help you prepare a bid and how it was used (and perhaps abused) during the marking process. Andy Powell has vented his frustration on some aspects of the process (and people who can’t stay within the page limits!) (Updated to add: I also intended to mention Lorna Campbell’s post, written earlier in the process before marking had begun – lots of good advice there about writing a proposal.)

Rating a JISC bid

The marking process isn’t a secret – it’s exposed on the JISC website, along with some concise guidance on what makes a good bid, and examples of past winning bids. This advice is reinforced at the town meetings that accompany large funding rounds, so none of us have any excuse for not knowing what to do. Yet we continue to see some bids that don’t provide the information requested, or fail to demonstrate how they meet the requirements of the call. (I will readily admit that I’ve been guilty of writing bids like this as well.) More openness about the process can’t hurt, although it may not help. So I’ll say a little about the way we (or rather I) mark, and then speculate a bit about things we might want to do to improve it. If you’ve marked JISC bids yourself, you probably want to skip the next bit and go straight to idle thoughts

# Entertainment from seasons past

At this time of year (or any other) a bit of levity doesn’t go amiss. In the interests of saving paper, electrons and brain cells I’ve recycled some levity from the past rather than constructing something new. The image here links to a higher-resolution version of a page from ULCC’s newsletter in December 1983, suitable for printing in the comfort of your home and office. Yes, it’s a board game! Fun for all the family, colleagues, friends and neighbours! (Dice, batteries, playing pieces, rules not included.)

As well as providing entertainment, it’s a useful historical document. Although it features punch cards, we can see from the penalty square on the left hand side that the ability to describe a punch card, even 25 years ago, marked you as old enough to be collecting your pension. (Possibly some exaggeration there, but on such observations history is constructed.) There’s a handy, if again just-so-slightly inaccurate, diagram of the London network from the days just before the emergence of JANET.

To bring this ever-so-slightly on-topic, it should be noted that the most severe penalty in the game is reserved for those who failed to ensure that their datasets were properly archived – a sentiment which I wholeheartedly support! Seasonal greetings to our readers, and to Alan Knott, the author/artist of the game and my immediate predecessor at ULCC.