On 1st December Malcolm Todd of The National Archives gave a good account of the work he’s been doing on File Formats for Preservation, resulting in a substantial new Technology Watch report for the DPC. It was a seminar hosted by William Kilbride, with participants from the BBC, the BL, NLW and others. The afternoon was useful and interesting for me since I teach an elementary module on file formats in a preservation context for our DPTP courses.
My naïve thinking in the area has been characterised by the assumption that the process is rather static or linear, and that the problem we’re facing is broadly the same every time: migrate data from a format that’s about to become obsolete or unsupported onto another format that’s stable, supported, and open. MS Word document to PDF or PDF/A… now that I can understand!
In fact, I learned at least two ways of thinking about formats that hadn’t occurred to me before. One simple one is cost: some formats cost more to preserve than others. This can be calculated from the storage costs, multiplied over time, plus the costs associated with migrations to new versions of that format. For example, we’ve tended to pin our faith on the TIFF format for images for many reasons, but there’s a high storage price to be paid for all that wonderful losslessness. This may be one reason why the DP world is looking with more favour on the JPEG2000 format, which is ‘virtually’ lossless and smaller in size.
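That cost calculation can be sketched as a simple model. A minimal back-of-envelope illustration, assuming entirely hypothetical figures (the collection sizes, the price per terabyte per year, and the migration costs are made up for the sake of the example, not drawn from the seminar):

```python
# Back-of-envelope preservation cost model. All figures below are
# hypothetical, chosen only to illustrate the shape of the calculation.

def preservation_cost(size_tb, cost_per_tb_year, years,
                      migrations=0, migration_cost=0.0):
    """Total cost = (storage cost per year, multiplied over time)
    plus any one-off format-migration costs."""
    return size_tb * cost_per_tb_year * years + migrations * migration_cost

# Hypothetical image collection: 10 TB as lossless TIFF, or roughly
# half that as JPEG2000, kept for 10 years at an assumed 50 per TB/year.
tiff_cost = preservation_cost(size_tb=10, cost_per_tb_year=50, years=10)
jp2_cost = preservation_cost(size_tb=5, cost_per_tb_year=50, years=10,
                             migrations=1, migration_cost=500)

print(tiff_cost)  # 5000.0
print(jp2_cost)   # 3000.0
```

Even with a one-off migration charge folded in, the smaller format comes out cheaper over the decade in this toy example; the point is only that format choice is a cost decision as well as a technical one.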
Secondly, there is the problem of preserving digital data which doesn’t actually have a specified, stable preservation format. Chris Puttick of Oxford Archaeology gave a vivid description of the problems he’s facing with CAD and GIS files, where the data can’t easily be tied to a single format in the first place (nor can a stable format for migration be identified). As the NLA put it on their PADI page, “At present there is little dealing specifically or comprehensively with the preservation of this particular type of data, although some aspects of database preservation are applicable to GIS. Some long term preservation issues include a lack of open source formats and metadata standards, large data volume and complex data objects.” Puttick suggests that his data doesn’t really perform at all unless it runs within a very specific environment of hardware and software. How do we preserve an environment? This appears to be quite a distinct preservation problem and much harder to solve than Word to PDF, to put it mildly.
William Kilbride suggested that such cases (and websites too, arguably, because they are time-based) are more like a stream of data – a handy image which conveys something about the dynamic of such information packages, and shows us that it’s much harder to nail them down into a single format. You can never step into the same river twice.