Web Archiving

DART Podcast – web archiving and research data

Our latest DART podcast contains a compelling and fascinating interview with Dr Marta Teperek, the Research Data Facilitator at Cambridge University Library's Research Operations Office. She attended our May Web Archiving 101 course (featuring guest speakers Dr Peter Webster and Sara Day Thomson) and seemed to derive a lot of benefit from it. She even published a very positive blog post on the subject.

DPTP: Web Archiving 101 Course

Yesterday we ran our first ever one-day course in a specialist area of digital preservation – Web Archiving 101. We looked at all aspects of web archiving, and had a great group of people attending. Our Storify from the day gives a flavour of some of the topics we covered, and some of the wider discussions around them.

Thanks to Dr Peter Webster of Webster Research & Consultancy, who shared his expert knowledge on all aspects of web archiving, including a really useful researcher/user perspective. Thanks also to Sara Day Thomson of the Digital Preservation Coalition, who gave a really useful insight into the archiving of social media, based on the research work she is currently undertaking in this area. And of course a huge thanks to all the participants, both in the room and on Twitter, who came with lots of questions, case studies and contributions.

Podcast Number 2 – Web Archiving 101

Image from the Internet Archive in Flickr Commons

As previously mentioned on the blog, we will be running a new one-day DPTP course on 12 May, 'Web Archiving 101'. Peter Webster of Webster Research and Consulting, who will be teaching on the course, joined Ed, Steph and Frank to discuss web archiving: what it is, why it's important, and how we hope to help people engage with this type of digital preservation through our forthcoming course.

Our discussion explored some of the ‘big’ reasons for archiving the web on a grand scale, including projects such as the British Library’s Web Archive and also the importance of web archiving for smaller projects, down to individual researcher level. We talked about the preservation of social media, the legal aspects of archiving content from the web and the growing importance of content on the web in all aspects of life.

As more and more information of all kinds – from blogs used as online 'notebooks' by researchers to company and organisational records, research data and documents of historical importance – becomes web-based, the decision to archive at least some of that content is becoming more pressing for everyone. Our course aims to equip people with the tools and knowledge to manage web archiving at whatever scale is appropriate for their own work – a theme that was central to our podcast discussion.

We hope the podcast will be a useful introduction to the subject, and we still have a few places left on the course if web archiving is something you need to know more about.

Why web archiving?

It is a great pleasure to be joining old colleagues, and working with new ones, at ULCC to help deliver Web Archiving 101 for the first time. But why archive the web at all? Isn't it all just pornography and pictures of cats?

My first answer is the reason I became involved with web archiving in the first place, initially at the Institute of Historical Research and then as part of the UK Web Archive team at the British Library. As a historian of contemporary Britain, I could see that much of the activity that a historian of the 1920s or the 1980s can study in printed literature has now migrated online, but that neither libraries and archives nor scholars had begun to catch up with what was a very rapid transition in the nineties and noughties. Both professional historians and anyone else with a need to know about the recent past will be unable to understand our times without reference to the archived web.

Web archiving is also now being recognised as a key part of the institutional memory of organisations in all sectors, both public and private. And there is also a recognition that web archiving is not easily bolted onto the existing tasks of archivists and records managers without careful thought, and that it is in many ways a distinctive part of the wider field of digital preservation.

Both these reasons would be less compelling if it were not for the fact that the web changes and decays at a remarkable rate – and so the web is not its own archive, as many people imagine. Recent episodes, such as the removal of historic speeches from the Conservative Party website, illustrate that content is constantly and intentionally being removed from the web. Other studies point to the speed with which content not only disappears but is also amended and updated, or migrates to new locations. (See a summary of some of these studies at Historians and Web Archives.)

Archivists, record managers, the leaders of their organisations, and scholars of every discipline need to engage with web archiving. I hope that Web Archiving 101 will help that engagement along.

DPTP: Web Archiving 101 – A New Course



The Sphinx does, of course, know the answers to all riddles, including, we assume, how to archive the web… Image from the British Library Flickr stream, no known copyright.

We are pleased to announce a new course in the Digital Preservation Training Programme – 'Web Archiving 101'. This one-day course will take place at Senate House on 12th May 2015. The day will be a mix of tutor-led learning, discussion and group exercises. We won't be offering 'hands-on' sessions in the sense of using tools, but the focus is very much on the practical. If you create and/or use web-based information and resources for research, and you have an interest in ensuring that such web content can persist and endure in an accessible and usable preservation environment, then this event will be of interest to you.

This course grew out of our re-working of our core DPTP course last year. At that time we decided to create two new courses, an introductory course and an intermediate course, and in doing so we reviewed all our existing content very thoroughly. We used to run a web archiving module on the original DPTP course, but times, we felt, had moved on. Web archiving had grown significantly since the original course was designed, and we no longer had the time within a broader course to do it justice. What to do? After much heart-searching, we took the module out of the two- and three-day courses. However, we still felt there was strong interest in the area, and so the idea of a '101' course, focusing solely on web archiving, was born.

We have been very lucky to work with Peter Webster, a long-established expert in this field, in the design of this new one-day course. Peter will also be teaching on the course, along with Steph Taylor and Ed Pinsent, the regular DPTP tutors, and Sara Day Thomson, a project officer with the DPC, who will be sharing her knowledge of, and research into, the issues of archiving social media.

There is more information about the course on our website, and you can now book in the ULCC shop.

Foiled by an implementation bug

I recently attempted to web-archive an interesting website called Letters of Charlotte Mary Yonge. The creators had approached us for some preservation advice, as there was some danger of losing institutional support.

The site was built on a WordPress platform, with some functional enhancements undertaken by computer science students, to create a very useful and well-presented collection of correspondence transcripts of this influential Victorian woman writer; within the texts, important names, dates and places have been identified and are hyperlinked.

Since I've harvested many WordPress sites before, I added the URL to Web Curator Tool, confident of success. However, I ran into problems right from the start. One concern was that the harvest was taking many hours to complete, which seemed unusual for a small text-based site with no large assets such as images or media attachments. One of my test harvests even reached the 3 GB limit. As I often do in such cases, I terminated the harvests to examine the log files and folder structures of what had been collected up to that point.

This revealed that a number of page requests were showing a disproportionately large size, some of them collecting over 40 MB for one page – odd, considering that the average size of a gathered page in the rest of the site was less than 50 KB. When I tried to open these 40 MB pages in the Web Curator Tool viewer, they failed badly, often yielding an Apache Tomcat error report and not rendering any viewable text at all.
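This kind of diagnosis can be partly automated by scanning the crawl log for disproportionately large fetches. As a minimal sketch, the function below assumes a whitespace-separated log layout in the style of Heritrix's crawl.log, where the third field is the document size in bytes and the fourth is the URI – check your own crawler's log format before relying on it; the sample lines and URLs are invented for illustration.

```python
# Sketch: flag suspiciously large fetches in a Heritrix-style crawl log.
# Assumption: fields are whitespace-separated, with size in bytes as the
# third field and the fetched URI as the fourth.

def oversized_fetches(log_lines, threshold_bytes=1_000_000):
    """Yield (size, url) pairs for fetches larger than the threshold."""
    for line in log_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or truncated line
        try:
            size = int(fields[2])
        except ValueError:
            continue  # header or malformed line
        if size > threshold_bytes:
            yield size, fields[3]

# Invented sample lines: one normal ~48 KB page, one 40 MB outlier.
sample = [
    "2015-05-12T10:00:01Z 200 48212 http://example.org/letters/1 L http://example.org/ text/html",
    "2015-05-12T10:00:02Z 200 41943040 http://example.org/letters/search?page=1 L http://example.org/ text/html",
]
for size, url in oversized_fetches(sample):
    print(f"{size:>12}  {url}")
```

Sorting the flagged URLs by size usually makes the culprit pattern (here, a runaway search or calendar URL space) obvious at a glance.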

BlogForever: Preservation in BlogForever – an alternative view

From the BlogForever project blog

I’d like to propose an alternative digital preservation view for the BF partners to consider.

The preservation problem is undoubtedly going to look complicated if we concentrate on the live blogosphere. It’s an environment that is full of complex behaviours and mixed content. Capturing it and replaying it presents many challenges.

But what type of content is going into the BF repository? Not the live blogosphere. What’s going in is material generated by the spider: it’s no longer the live web. It’s structured content, pre-processed, and parsed, fit to be read by the databases that form the heart of the BF system. If you like, the spider creates a “rendition” of the live web, recast into the form of a structured XML file.

What I propose is that these renditions of blogs should become the target of preservation. This way, we would potentially have a much more manageable preservation task ahead of us, with a limited range of content and behaviours to preserve and reproduce.

If these blog renditions are preservable, then the preservation performance we would like to replicate is the behaviour of the Invenio database, and not live web behaviour. All the preservation strategy needs to do is to guarantee that our normalised objects, and the database itself, conform to the performance model.
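To make the 'rendition' idea concrete, here is a toy sketch of what the spider's output might look like: a harvested post recast as a small structured XML document. The element names (post, title, author, published, body) are invented for illustration – the real BlogForever schema is not reproduced here.

```python
# Illustrative only: element names are invented, not the real
# BlogForever spider schema.
import xml.etree.ElementTree as ET

def render_post(post):
    """Recast a harvested blog post into a structured XML 'rendition'."""
    root = ET.Element("post", attrib={"url": post["url"]})
    for field in ("title", "author", "published", "body"):
        ET.SubElement(root, field).text = post[field]
    return ET.tostring(root, encoding="unicode")

xml_doc = render_post({
    "url": "http://example.org/blog/1",
    "title": "First post",
    "author": "A. Blogger",
    "published": "2012-06-01",
    "body": "Hello, world.",
})
print(xml_doc)
```

Preserving documents like this one – rather than the live pages they were derived from – is exactly the narrowing of scope the proposal describes: a fixed schema and a known database behaviour, instead of the open-ended behaviours of the live web.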

BlogForever: BlogForever and migration

From the BlogForever project blog

Recently I have been putting together my report on the extent to which the BlogForever platform operates within the framework of the OAIS model. Inevitably, I have thought a bit about migration as one of the potential approaches we could use to preserve blog content.

Migration is the process whereby we preserve data by shifting it from one file format to another. We usually do this when the “old” format is in danger of obsolescence for a variety of reasons, while the “target” format is something we think we can depend on for a longer period of time. This strategy works well for relatively static document-like content, such as format-shifting a text file onto PDF.
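A migration step is usually more than the conversion itself: archives also record provenance and fixity for the old and new files. As a minimal sketch – with a trivial Latin-1-to-UTF-8 transcode standing in for a heavier format-shift such as text-to-PDF, and a record structure invented for illustration – it might look like this:

```python
# Sketch of a migration step with fixity bookkeeping. The converter is a
# stand-in: it transcodes Latin-1 text to UTF-8, in place of a real
# format-shift such as text-to-PDF. The record fields are illustrative.
import hashlib

def migrate(data: bytes,
            source_format="text/plain; charset=latin-1",
            target_format="text/plain; charset=utf-8"):
    """Convert the bytes and return (new_bytes, provenance_record)."""
    converted = data.decode("latin-1").encode("utf-8")
    record = {
        "source_format": source_format,
        "target_format": target_format,
        "source_sha256": hashlib.sha256(data).hexdigest(),
        "target_sha256": hashlib.sha256(converted).hexdigest(),
    }
    return converted, record

new_bytes, record = migrate("caf\xe9".encode("latin-1"))
print(record["target_format"], new_bytes.decode("utf-8"))
```

The checksums let a future curator verify that the stored file is the one the migration produced – which works well for a single self-contained file, and is precisely what becomes awkward when, as the next paragraph argues, the content spans many files and formats at once.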

The problem with blogs, and indeed all web content, is when we start thinking of the content exclusively in terms of file formats. The content of a blog could be said to reside in multiple formats, not just one; and even if we format-shift all the files we gather, does that really constitute preservation?

The BlogForever survey is live!

After weeks of design work, the BlogForever survey is live, available in 6 languages and running for 28 days.

This survey is part of BlogForever, an EU-funded collaborative project in which ULCC participates through its Digital Archives department.

The results of the survey, available at the end of the summer, will help us develop digital preservation, management and dissemination facilities for weblogs. We are therefore keen to gather information about the content, context and usage patterns of current weblogs, so that we can identify blog users' views on their long-term preservation, management, analysis, access and use. If you would like to take part in the survey, please use the following link: