This month I have been happily harvesting JISC project website content using my new toy, the Web Curator Tool. It has been rewarding to resume work on this project after a hiatus of some months; the former setup, which used PANDAS software, has been winding down since December. Who knows what valuable information and website content changes may have escaped the archiving process during these barren months?
Web Curator Tool is a web-based workflow database, one which manages the assignment of permission records, builds profiles for each ‘target’ website, and allows a certain amount of inter-facing with Heritrix, the actual engine that gathers the materials. The open-source Heritrix project is being developed by the Internet Archive, whose access software (effectively the ‘Wayback Machine’) may also be deployed in the new public-facing website when it is launched in May 2008.
Although the idiosyncrasies of WCT caused me some anguish at first, largely through being removed from my ‘comfort zone’ of managing regular harvests, I suddenly turned the corner about two weeks ago. The diagnostics are starting to make sense. Through judicious ticking of boxes and refreshing of pages, I can now interrogate the database to the finest detail. I learned how to edit and save a target so as to ‘force’ a gather, thus helping to clear the backlog of scheduled gathers which had been accumulating, unbeknownst to us, since December. Most importantly, with the help of UKWAC colleagues, we’re slowly finding ways of modifying the profile so as to gather less external material (or reduce collateral harvesting, to put it another way); or extend its reach to capture stylesheets and other content which is outside the root URL.
True, a lot of this has been trial and error, involving experimental gathers before a setting was found that would ‘take’. But WCT, unlike our previous set-up, allows the possibility of gathering a site more than once in a day. And it’s much faster. It can bring in results on some of the smaller sites in less than two minutes.
Now, 200 new instances of JISC project sites have been successfully gathered during March and April alone. A further 50 instances have been brought in from the Jan-Feb backlog. The daunting backlog of queued instances has been reduced to zero. Best of all, over 30 new JISC project websites (i.e. those which started around or after December 07) have been brought into the new system. I’ll be back in my comfort zone in no time…