Web-Archiving: Ian Milligan and His Derived Data-sets

Monday, July 10, 2017 01:38

by Ed Pinsent

Web-archiving is a relatively young side of digital archiving, yet it has already established a formidable body of content across the world, a large corpus of materials that could be mined by researchers and historians to uncover interesting trends and patterns about the 20th and 21st centuries. One advocate and enthusiast for this is Ian Milligan from the University of Waterloo's Faculty of Arts.

His home country has many excellent web archive collections, but he feels they are under-used by scholars. One problem is that scholars might not even know the web archives exist in the first place. The second problem is that many people find web archives really hard to use; quite often, the search engines which interrogate the corpus don't match the way that a scholar wishes to retrieve information. At a very simple practical level, a search can return too many hits, the hitlist appears to be unsorted, and the results are difficult to understand.


Milligan is personally concerned about the barriers facing academics, and he's actively working to lower them by devising ways of serving historic web archives that don't require massive expertise. His project, Web Archives for Longitudinal Knowledge (WALK), aims to create a centralised portal for access to web content. The main difference from most such approaches I've seen is that he does it by building derived data-sets.

As I understand it, a derived data-set is a new assemblage of data that has been created out of a web archive. To put this in context, it might help to understand that the basic building block of web-archiving is a file called a WARC. WARC is a file format whose contents are effectively a long sequence of records representing the harvesting session: the links visited, the responses received, and a representation of the web content. If you wanted to replay the WARC so that it looks like the original website, you would feed it to an instance of the Wayback Machine, which is programmed to read the WARC and serve it back as a rendition of the original web page.
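To make the idea of a WARC as a run of records a little more concrete, here is a minimal Python sketch. It assumes an uncompressed WARC 1.0 file in which every record carries a Content-Length header; the names iter_warc_records and make_record, and the two sample records, are invented for illustration. Real collections are usually gzip-compressed and are better handled with a dedicated WARC library.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines separate one record from the next
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The record body is exactly Content-Length bytes long.
        body = stream.read(int(headers["Content-Length"]))
        yield headers, body


def make_record(warc_type, uri, date, payload):
    """Build one sample record (hypothetical helper, for the demo only)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {warc_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"WARC-Date: {date}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + payload + b"\r\n\r\n"


# Two made-up records: what the crawler asked for, and what came back.
SAMPLE = make_record(
    "request", "http://example.org/", "2017-01-01T00:00:00Z",
    b"GET / HTTP/1.1\r\nHost: example.org\r\n\r\n",
) + make_record(
    "response", "http://example.org/", "2017-01-01T00:00:01Z",
    b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
    b"<html><body>Hello</body></html>",
)

records = list(iter_warc_records(io.BytesIO(SAMPLE)))
for headers, body in records:
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(body), "bytes")
```

Each record pairs a small block of named headers (what was fetched, and when) with the raw bytes that were exchanged, which is why a WARC can be either replayed as a page or mined as data.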

However, Milligan is more interested in parsing WARCs. He knows they contain very useful strings of data, and he has been working for some time on tools to do the parsing. He is interested in text strings, dates, URLs, embedded keywords and names, and more. One such tool is Warcbase, part of the WALK project. Broadly, the process is that he transfers data from a web archive in WARC form and uses Warcbase to create scholarly derivatives from that WARC automatically. Once the results are uploaded to the Dataverse platform, the scholar has a much more user-friendly web-archive dataset in their hands. The process is probably far more elaborate than I am making it sound, but what I do know is that simple text searches become much more rewarding and focussed, and that a graphical interface makes it possible to build visualisations out of the data.
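A rough sketch of what "deriving" might look like in practice follows. This is not Warcbase's code or API, only an illustration in plain Python under stated assumptions: the captured pages have already been pulled out of the WARC as (crawl date, URL, HTML) tuples, and the column names, the helper functions, and the two sample captures are all invented.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import urlsplit


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def derive_rows(captures):
    """Turn (crawl_date, url, html) tuples into flat, CSV-ready rows."""
    for crawl_date, url, html in captures:
        yield {
            "crawl_date": crawl_date[:10],      # keep only YYYY-MM-DD
            "domain": urlsplit(url).hostname,
            "url": url,
            "text": html_to_text(html),
        }


def write_csv(rows, out):
    writer = csv.DictWriter(
        out, fieldnames=["crawl_date", "domain", "url", "text"]
    )
    writer.writeheader()
    writer.writerows(rows)


# Invented sample captures standing in for pages extracted from a WARC.
CAPTURES = [
    ("2017-01-01T00:00:00Z", "http://example.org/about",
     "<html><head><style>p{color:red}</style></head>"
     "<body><p>About this site</p></body></html>"),
    ("2017-02-01T00:00:00Z", "http://example.com/news",
     "<html><body><h1>News</h1><script>var x=1;</script>"
     "<p>Nothing new.</p></body></html>"),
]

rows = list(derive_rows(CAPTURES))
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

The point of the sketch is that the derivative is a separate, much smaller object: one row per captured page, with the date, domain, URL and plain text that a scholar can search or load into a spreadsheet, while the WARC itself is left untouched.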


A few archival-ish observations occur to me about this.

  • What about provenance and original order? Doesn't this derived dataset damage these fundamental “truths” of the web crawl? Well, let's remember that the derived dataset is a new thing, another digital object; the "archival original" WARC file remains intact. If there's any provenance information about the date and place of the actual website and the date of the crawl, that won't be damaged in any way. If we want paper analogues, we might call this derived dataset a photocopy of the original; or perhaps it’s more like a scrapbook, if it's created from many sources.

  • That made me wonder if the derived dataset could be considered a Dissemination Information Package in OAIS Reference Model terms, with the parent WARC or WARCs in the role of the Archival Information Package. I'd better leave it at that; the terms “OAIS conformance” and “web-archiving” don't often appear in the same sentence in our community.

  • It seems to me rather that what Milligan is doing is exploiting the versatility of structured data. If websites are structured, and WARCs are structured, why not turn that to our advantage and see if we can create new structures? If it makes the content more accessible to otherwise alienated users, then I'm all for it. Instead of having to mine a gigantic quarry of hard granite, we have manageable building blocks of information carved out of that quarry, which can be used as needed.

  • The other question that crossed my mind is "how are Ian and his team deciding what information to put in these derivatives?" He did allude to the fact that they are "doing something they think Humanities Scholars would like", and since he himself is a Humanities scholar, he has a good starting point. Scholars hate a WARC, which after all isn't much more than raw data generated by a web crawl, but they do like data arranged in a CSV file, and text searches with meaningful results.

  • To play devil's advocate, I suspect that a traditional archivist would recoil from any approach which appears to smack of bias; our job has usually been to serve the information in the most objective way possible, and the actions of arrangement and cataloguing are intended to preserve the truth of original order of the resource, and to help the user with neutral finding aids that steer them through the collection. If we do the work of creating a derived dataset, making decisions in advance about date ranges, domains, and subjects, aren't we somehow pre-empting the research?

This may open up bigger questions than can be addressed in this blog post, and in any case I may have misunderstood Milligan's method and intention, but it may have implications for the archive profession and how we process and work with digital content on this scale.

Posted in Digital Preservation