
Working with Web Curator Tool (part 2): wikis


How do you archive a website built with a wiki? It’s worth looking into, as JISC projects are increasingly using wikis to manage and report on their work; of the available platforms, MediaWiki is a popular one.

The challenge for me is how to bring in a good copy of a wiki site without causing Web Curator Tool to gather too many pages from it. We don’t want that, because (a) the finished result occupies unnecessary space in the archive, and (b) it takes so long to complete that it can hold up the gather queue in the shared web-archiving service, delaying the work of other UKWAC partners.

I am not technical enough to tell you in great detail what’s causing this, although I sense it’s something to do with the Heritrix crawler requesting too many pages from the wiki. Firstly, a wiki is database-driven, so it should not surprise us that it creates a lot of its pages on the fly. Secondly, since a wiki is editable by lots of contributors (that’s its core function, after all), numerous past versions of pages are presumably also stored somewhere in the wiki labyrinth, and it’s possible that the implacable Heritrix will not cease until it has faithfully requested and copied every single one of them.

Let’s look at the Repositories Research Team wiki (DigiRep) owned by UKOLN, which I tried to gather five times in 2008. WCT conveniently keeps a history of these attempts, information about which I can still access even if the actual gathered pages have been discarded or archived. The size problems were chronic. Of five 2008 gathers, one was aborted after it had reached a massive 16.87 GB; a second one was rejected at 14.69 GB. I have archived one impression at 5.31 GB, another at 736.26 MB and another at 157.36 MB. Quite large variations there, which was worrying enough in itself.

At first, my workaround was to adjust the Profile Setting on the title to override the maximum number of documents Heritrix can gather. Setting ‘Maximum Documents’ to 10,000 worked, but it was not ideal; all this really means is that Heritrix stops when it has collected 10,000 pages, whether we have everything we want or not. (I did find that the copies in the archive seemed to render OK, however.)

To get a closer look at what’s going on, I started to browse the Log Files created by WCT (complete records of every single client-server request), which show patterns that I can vaguely understand; when these Log Files are packed with near-identical strings of code, I sense that something’s up. For example, a string containing index.php?title=Repositories_Research&action=edit tells us that a specific named page is being requested from the wiki, with an edit action on that page. Multiply that by the number of pages in the wiki and you can see how the problem builds up. (PHP is the scripting language that MediaWiki is built on.)
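
To get a rough sense of which of these strings dominate a log, something like the following Python sketch can be used. This is not part of WCT; it assumes a log where each requested URL appears somewhere on its own line (roughly as in a Heritrix crawl.log) and simply counts the MediaWiki-style query parameters it finds.

import re
import sys
from collections import Counter

# Query-string keys that typically signal MediaWiki "working" pages
# (edits, old revisions, diffs, printable views) rather than ordinary
# article views.
ACTION_RE = re.compile(r"[?&](action|oldid|diff|limit|printable|redirect)=([^&\s]*)")

def tally(log_path):
    """Count MediaWiki-style query parameters across every URL in the log."""
    counts = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            for url in re.findall(r"https?://\S+", line):
                for key, value in ACTION_RE.findall(url):
                    counts["%s=%s" % (key, value)] += 1
    return counts

if __name__ == "__main__":
    # Usage: python tally_actions.py crawl.log
    for pattern, n in tally(sys.argv[1]).most_common(20):
        print("%8d  %s" % (n, pattern))

If a handful of parameters such as action=edit or oldid account for most of the hits, that is a fairly clear sign of where the bloat is coming from.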

I follow this up by browsing the actual gathered pages in Web Curator Tool using the Tree View. From here I can click on the ‘View’ button to examine a page which I think to be suspect, and compare it with other suspect pages. Lastly, I go back to the live DigiRep site to confirm in my mind what’s happening when certain links are followed.

All the above gave me just about enough information to experiment with exclusion filters. After a certain amount of trial and error, and working with other MediaWiki sites, I arrived at the following exclusion codes, which I can add to the Profile Setting:

.*&oldid.*
.*&diff.*
.*&limit.*
.*&direction.*
.*Recentchanges.*
.*/Special.*
.*?title=Special.*
.*&action=edit.*
.*&action=history.*
.*&section.*
.*&redlink.*
.*&printable=yes.*
.*&redirect=no.*

These tell WCT to exclude certain pages and actions from Heritrix’s harvest. The expectation was that I would lose the discussion / edit / history functions of the wiki in the archive copy.
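
For anyone who wants to check what a set of filters will do before running a gather, the patterns behave as ordinary regular expressions matched against each candidate URL. Here is a minimal sketch, using invented example URLs, of how the exclusions above would sort MediaWiki links into ‘keep’ and ‘exclude’:

import re

# The exclusion patterns exactly as entered in the WCT profile.
EXCLUSIONS = [
    r".*&oldid.*", r".*&diff.*", r".*&limit.*", r".*&direction.*",
    r".*Recentchanges.*", r".*/Special.*", r".*?title=Special.*",
    r".*&action=edit.*", r".*&action=history.*", r".*&section.*",
    r".*&redlink.*", r".*&printable=yes.*", r".*&redirect=no.*",
]

def excluded(url):
    """True if any exclusion pattern matches the URL."""
    return any(re.match(pattern, url) for pattern in EXCLUSIONS)

# Invented sample URLs, for illustration only.
samples = [
    "http://wiki.example.org/index.php?title=Repositories_Research",
    "http://wiki.example.org/index.php?title=Repositories_Research&action=edit",
    "http://wiki.example.org/index.php?title=Repositories_Research&oldid=1234",
    "http://wiki.example.org/index.php?title=Special:RecentChanges",
]

for url in samples:
    print("EXCLUDE" if excluded(url) else "KEEP   ", url)

The script only mimics the matching that the crawler does; in WCT itself the patterns simply go into the Profile’s list of exclusion filters. It is a handy way to sanity-check a new pattern against a few real URLs copied out of the Log Files.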

The title with the above exclusion profile gathered just 63.41 MB and completed in under ten minutes. I would say that’s an improvement on 16.87 GB. Log Files and the Tree View confirmed the success of this new “slimline” gather. As well as losing the discussion / edit / history functions, we have also eliminated the Toolbox functions, the ‘printable’ views, and the login pages.

This is no great loss at all for our purposes, as scholars who browse the archived copy of DigiRep are not expecting to be able to edit pages, nor join in the discussions, nor browse the history of stored versions of pages. Indeed in a lot of cases, they would require a login to do so. The users simply want to see the results of the DigiRep team’s work.

8 thoughts on “Working with Web Curator Tool (part 2): wikis”

  1. Maureen Pennock says:

    Hi Ed,

    this is really interesting, thanks for posting it. I’m particularly interested in your selection decision to capture the main pages and not the history/edit/discussion pages. I see where you’re coming from when you say that ‘scholars who browse the archived copy of DigiRep are not expecting to be able to edit pages, nor join in the discussions, nor browse the history of stored versions of pages.’ I agree insofar as the edit pages are concerned, but there’s an argument for capturing the history pages as an audit trail of who contributed what and when. This is interesting and valuable stuff – a particular (or should that be significant?) characteristic of wikis is that they are collaborative tools. The history pages are evidence of this and I imagine there will be people who want to know what was done by whom and when. You could make the same argument for the discussion pages, but in my experience the discussion pages are not widely used. I haven’t looked at the ones in the Digirep wiki so can’t say whether that’s the case there. But as I said, all interesting stuff, thank you!

  2. Ed Pinsent says:

    Hi Maureen

    Many thanks for adding a comment. You make a good point. I agree there is an argument for capturing history pages from a wiki, and it’s likely “there will be people who want to know what was done by whom and when.” But I wonder if the UKWAC user community are those people? Perhaps a full record of this blog’s change history is more likely to be of value to UKOLN (primarily), and to the wider community of information specialists who are interested in digital preservation. So maybe this raises another possibility for web-archiving; different levels of capture, and capture of different types of content, depending on the requirements of your user communities.

  3. I don’t know if many people ever actually try to follow the authorship trail on (say) a Mediawiki page from start to finish. For example, the notorious Ronnie Hazlehurst/S Club 7 controversy on Wikipedia: there’s solid audit info there, no doubt, but there’s probably work for a whole new breed of auditor in actually working it out 🙂

    Maybe a graphic, progressive change animator would be a useful tool in a wiki archive?

  4. Maureen Pennock says:

    Ed – I agree, it comes back to the user community. I guess I’m just not clear on who’s ‘in’ the UKWAC user community, let alone how it may change in the future. Oh, for a crystal ball… !

    Richard – hmm, a graphic, progressive change animator tool. Like a Dipity interface for the web archive? Neat idea 🙂

  5. Maureen – sort of… though I was thinking more of a view of the document, so that the timeline slider plays through the amendments as if they are being entered live – you know the sort of thing? Bit like this (but not too much 😉

  6. Hi Ed, Maureen:

    We’ve encountered pretty much the same wiki issue as Ed describes at the National Library of New Zealand. Our curators used exclusion filters much like Ed’s to limit the harvester (I will also be sending them this article in case they need to harvest MediaWiki).

    The biggest problem is not the edit history, but the differences between versions of a page. A page with 10 versions will have dozens of “diff” pages (Ed’s .*&diff.* pattern presumably takes care of this). So even if we want to capture the edit history, we don’t want the differences.

    We have similar but more frequent issues with blogs. On some platforms each post appears on the homepage and on its own page, but also separately on the archive pages for the relevant year, month and day, and also on the page for any tags. So you can wind up harvesting the same content many times over. Again, crafting exclusion filters is the solution.

    Gordon

  7. emijrp says:

    Hi;

    There is an excellent tool for extracting all the info (text, histories and images) of a MediaWiki wiki; it is called the WikiTeam tool. [1]

    It has been tested with several MediaWiki versions. You can see a list of preserved wikis at the downloads section.

    The output format is a big XML file with all the text and metadata (easy to import using MediaWiki import tools) and a directory with the wiki images.

    Regards,

    emijrp

    [1] http://code.google.com/p/wikiteam/
