How do you archive a website built with a wiki? It’s worth looking into, as JISC projects are increasingly using wikis to manage and report on their work; of the available brands, MediaWiki is a popular one.
The challenge for me is how to bring in a good copy of a wiki site without causing Web Curator Tool to gather too many pages from it. We don’t want that, because (a) the finished result occupies unnecessary space in the archive and (b) it takes so long to complete that it can hold up the gather queue in the shared web-archiving service, delaying the work of other UKWAC partners.
I am not technical enough to tell you in great detail what’s causing this, although I sense it’s something to do with the Heritrix crawler requesting too many pages from the wiki. First, a wiki is database-driven, so it should not surprise us that it creates a lot of its pages on the fly. Secondly, since a wiki is editable by lots of contributors (that’s its core function, after all), numerous past versions of pages are presumably also stored somewhere in the wiki labyrinth, and it’s possible that the implacable Heritrix will not cease until it has faithfully requested and copied every single one of them.
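To make that concrete, take a hypothetical article called Project_Plan (the page name and revision numbers here are made up, but the URL pattern is the one MediaWiki really uses). Besides its main address, the wiki exposes a separate URL for every stored revision and every comparison between revisions:

  index.php?title=Project_Plan                        (the current page)
  index.php?title=Project_Plan&oldid=1234             (one stored revision)
  index.php?title=Project_Plan&diff=1240&oldid=1234   (a comparison of two revisions)

Every one of those is, as far as Heritrix is concerned, a distinct page to be requested and stored.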
Let’s look at the Repositories Research Team wiki (DigiRep) owned by UKOLN, which I tried to gather five times in 2008. WCT conveniently keeps a history of these attempts, information about which I can still access even if the actual gathered pages have been discarded or archived. The size problems were chronic. Of five 2008 gathers, one was aborted after it had reached a massive 16.87 GB; a second one was rejected at 14.69 GB. I have archived one impression at 5.31 GB, another at 736.26 MB and another at 157.36 MB. Quite large variations there, which was worrying enough in itself.
At first, my workaround was to adjust the Profile Setting in the title to override the maximum number of documents Heritrix can gather. Setting ‘Maximum Documents’ to 10,000 worked, but it was not ideal; all this really means is that Heritrix stops when it has collected 10,000 pages, whether we have everything we want or not. (I found that the copies in the archive seemed to render OK, however.)
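My understanding, and this is an assumption on my part rather than anything I have verified, is that this field simply feeds the document limit in the underlying Heritrix 1.x crawl order, something along the lines of:

  <long name="max-document-download">10000</long>

in which case the crawler just counts documents as it fetches them and stops dead at the limit, with no regard for whether the important pages arrived before the trivial ones.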
To get a closer look at what’s going on, I started to browse the Log Files created by WCT (complete records of every single client-server request), which show patterns that I can vaguely understand; when these Log Files are packed with near-identical request strings, I sense that something’s up. For example, a string containing
index.php?title=Repositories_Research&action=edit tells us that the crawler has requested a specific named page with the edit action applied to it. If you multiply that by the number of pages in the wiki, you can see how the problem builds up. (PHP is the scripting language MediaWiki is written in; index.php is the script that serves up its pages.)
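What I see in practice is long runs of near-identical requests: the same handful of query strings repeated for page after page. The second article name below is invented, purely to show the pattern repeating, but the shape of the thing is typical:

  index.php?title=Repositories_Research&action=edit
  index.php?title=Repositories_Research&action=history
  index.php?title=Repositories_Research&printable=yes
  index.php?title=Talk:Repositories_Research&action=edit
  index.php?title=Some_Other_Page&action=edit
  index.php?title=Some_Other_Page&action=history
  …

Each article drags in an edit view, a history view, a printable view and a discussion page (with its own edit view), and once the stored revisions are added in, a few hundred articles can easily turn into tens of thousands of requests.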
I follow this up by browsing the actual gathered pages in Web Curator Tool using the Tree View. From here I can click on the ‘View’ button to examine a page which I think is suspect, and compare it with other suspect pages. Lastly, I go back to the live DigiRep site to confirm in my mind what’s happening when certain links are followed.
All the above gave me just about enough information to experiment with exclusion filters. After a certain amount of trial and error, and working with other MediaWiki sites, I arrived at the following exclusion codes, which I can add to the Profile Setting:
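In outline, they are regular expressions which Heritrix matches against every URL it discovers; anything that matches is left out of the gather. The exact expressions will vary from one wiki to another, so treat this set as indicative rather than a recipe, but for a MediaWiki site they run along these lines:

  .*action=edit.*
  .*action=history.*
  .*oldid=.*
  .*printable=yes.*
  .*title=Talk:.*
  .*title=Special:.*

The first two catch the edit and history views, oldid= the stored revisions, printable=yes the printer-friendly views, and the Talk: and Special: patterns the discussion pages and the wiki’s housekeeping pages (login, Toolbox functions and so on).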
These have the effect of telling WCT to exclude certain pages and actions from Heritrix’s harvest. The expectation was that I would lose the discussion / edit / history functions of the wiki in the archive copy.
Gathered with the above exclusion profile, the title came in at just 63.41 MB and completed in under ten minutes. I would say that’s an improvement on 16.87 GB. The Log Files and the Tree View confirmed the success of this new “slimline” gather. As well as losing the discussion / edit / history functions, we have also eliminated the Toolbox functions, the ‘printable’ views, and the login pages.
This is no great loss at all for our purposes: scholars who browse the archived copy of DigiRep are not expecting to be able to edit pages, join in the discussions, or browse the history of stored versions of pages. Indeed, in a lot of cases they would require a login to do so. The users simply want to see the results of the DigiRep team’s work.