File formats…or data streams?

On 1st December Malcolm Todd of The National Archives gave a good account of the work he’s been doing on File Formats for Preservation, resulting in a substantial new Technology Watch report for the DPC. It was a seminar hosted by William Kilbride, with participants from the BBC, the BL, NLW and others. The afternoon was useful and interesting for me since I teach an elementary module on file formats in a preservation context for our DPTP courses.

My naïve thinking in the area has been characterised by the assumption that the process is rather static or linear, and that the problem we’re facing is broadly the same every time: migrate data from a format that’s about to become obsolete or unsupported onto another format that’s stable, supported, and open. MS Word document to PDF or PDF/A…now that, I can understand!

In fact, I learned at least two ways of thinking about formats that hadn’t occurred to me before. One simple one is costs; some formats can cost more to preserve than others. This can be calculated in terms of storage costs, multiplied over time, and the costs associated with migrations to new versions of that format. For example, we’ve tended to pin our faith on the TIFF format for images for many reasons, but there’s a high storage price to be paid for all that wonderful losslessness. This may be one reason why the DP world is looking with more favour on the JPEG2000 format, which is ‘virtually’ lossless and smaller in size.
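As a back-of-envelope sketch of that calculation (in Python; every figure below is invented for illustration, and the function name and parameters are hypothetical rather than taken from the report), the lifetime cost is storage multiplied over time plus whatever the migrations cost:

```python
def lifetime_cost(size_gb, years, storage_cost_per_gb_year,
                  migrations, cost_per_migration):
    """Storage cost multiplied over time, plus the cost of each migration."""
    storage = size_gb * storage_cost_per_gb_year * years
    migration = migrations * cost_per_migration
    return storage + migration


# Hypothetical collection: the same images held as TIFF or as JPEG2000.
# The sizes, prices and migration counts are assumptions, not measurements.
tiff_cost = lifetime_cost(size_gb=1000, years=10, storage_cost_per_gb_year=0.5,
                          migrations=1, cost_per_migration=200)
jp2_cost = lifetime_cost(size_gb=400, years=10, storage_cost_per_gb_year=0.5,
                         migrations=1, cost_per_migration=200)

print(f"TIFF: {tiff_cost:.2f}  JPEG2000: {jp2_cost:.2f}")
```

With these made-up numbers the smaller format is cheaper simply because the storage term dominates; a format that needed more frequent migrations could easily reverse the result.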

The second is the problem of preserving digital data that doesn’t actually have a specified stable preservation format. Chris Puttick of Oxford Archaeology gave a vivid description of the problems he’s facing with CAD and GIS files, where the data can’t easily be tied to a single format in the first place (nor can a stable format for migration be identified). As the NLA put it on their PADI page, “At present there is little dealing specifically or comprehensively with the preservation of this particular type of data, although some aspects of database preservation are applicable to GIS. Some long term preservation issues include a lack of open source formats and metadata standards, large data volume and complex data objects.” Puttick suggests that his data doesn’t really perform at all unless it’s operated within a very specific environment of hardware and software. How do we preserve an environment? This appears to be quite a distinct preservation problem and much harder to solve than Word to PDF, to put it mildly.

William Kilbride suggested that such cases (and websites too, arguably, because they are time-based) are more like a stream of data – a handy image which conveys something about the dynamic of such information packages, and shows us that it’s much harder to nail them down into a single format. You can never step into the same river twice.

4 thoughts on “File formats…or data streams?”

  1. I think we have to think very carefully about this stuff. This is dangerous to say, as I wasn’t at the seminar, and maybe this was gone into. For example, you say “…migrate data from a format that’s about to become obsolete or unsupported, onto another format that’s stable, supported, and open. MS Word document to PDF or PDF/A…now that, I can understand!” So first, is Word about to become obsolete? really? Any format since Word 6 about to fall off the map of readable document formats? Somehow given the large number of alternative platforms (including OpenOffice), this does not seem likely. Obsolescence is a slooooooow phenomenon!

    Second, Word to PDF is an extremely lossy migration, unless your world view (or significant property view) is STRICTLY 2-d page oriented. You lose a lot of what might well be in the Word document. So it’s not a migration that should be done unless you have to (or at least, not without retaining the original, which begs the question of why migrate at all).

    You also quote the TIFF vs JPEG2000 matter. We’ve just published in IJDC an article by Wright et al from iPres 2008: Wright, R., Miller, A., & Addis, M. (2009). The Significance of Storage in the “Cost of Risk” of Digital Preservation. The International Journal of Digital Curation, 4(3), 104-122. Here’s a quote from a late section, after a long discussion of some of the many ways that storage can go bad on you: “Tests by Heydegger showed that corrupting only 0.01% of the bytes in a compressed JPEG2000 file, including lossless compression, could result in at least 50% of the original information encoded in the file being affected. In some cases, corrupting just a single byte in a JPEG2000 image would cause highly visible artefacts throughout the whole of that image.”

    Anyway, what I’m trying to get to is the need for some really careful thought and subtlety on the matter of migration!

  2. Chris, you are right to draw attention to these issues. But to be fair both to the original report, and Ed’s summary of the day, I think the tone of the event was more nuanced, and the difficult tradeoffs to be considered – such as that between storage costs and vulnerability of content to bitstream corruption – were part of the discussion.

    As for Word->PDF – I read Ed’s description not as approving it or recommending it, but merely as a relatively simple example where the issues can be considered. It’s one we include in the DPTP for that very reason. The property of editability, for instance, is critical if one is storing resources for reuse and refactorisation. But if the purpose of preservation is to preserve an inalienable record, then editability isn’t a significant property.

    Malcolm Todd’s report draws attention to these and other issues. I think the release of an earlier draft may be one of the reasons that the current redrafting of OAIS is going to explicitly address these concepts, albeit using new terminology.

    MS Word isn’t going to go away in the near future. Obsolescence is slow. But it does happen. 9-track magtapes continued to be a good interchange format for many years after the technology had been superseded by others precisely because 9-track was so widely adopted. But they are definitely obsolete now – the drives aren’t being manufactured any more. There’s still lots of content living on 9-track tape and it’s more endangered with each year that passes.

    < advertising interlude >

    ULCC happens to have a 9-track drive, by the way. How much longer that will be true, I can’t say. But while we have it, we’re happy to help others who need to recover information from such tapes.

    < /advertising interlude >

  3. As also discussed during the day and as yet imperfectly completed, the DPC has a new website – but the redirects to files and pages are not yet as reliable as we had hoped. So if you’re looking for the report that Ed mentions then let me recommend:

    As to whether a website is a collection of files or a data stream, my instinct is that we (DPC) are moving toward the latter and for good reasons. Just as well Ed made a copy of the old site for us in the UK Web Archive!

  4. I am glad the report has stirred this thread. Two things I feel I ought to say, rather belatedly:

    Chris – the second half of the report is concerned with the adequacy of information representation by candidate formats. If by saying we need to be very careful about that, you mean:

    [1] This is so important we need to get it right; and

    [2] We as a community have only really just begun to grapple with this head on ……

    …..then I couldn’t agree more. What I’d tried to do is to look at this from an archival science perspective and try to provide the hooks for other parts of the DP community to find their own equivalent of similar issues.

    Ed – at the event itself you asked a very valid question about web archiving formats that I could have answered a bit more intelligently and helpfully. I admitted that the report doesn’t discuss these in their own right, though it does discuss the problems with XML encoding as a “preservation strategy”. What I should have added is that the criteria in the first half of the report ought to be equally applicable to web archiving formats as to anything else, so no doubt the developers of WARC traded the inherent risks of a wrapper format advisedly for the ability to relate and render web pages into the future…..

    I’ve only got my tongue slightly in my cheek here – I honestly wouldn’t know, but what I do know is that one of the most interesting parts of researching the report was looking back at previous announcements of preservation approaches and reflecting on how these might be articulated and explained today.
