Web-Archiving: Web-Archives as Institutional Records

Friday, July 14, 2017 09:00

By Ed Pinsent

In the last of this series of blog posts about IIPC 2017, I'll look at the work of Gregory Wiedeman at the University at Albany, SUNY.

He is doing two things that are sure to gladden the heart of any archivist. First, he is treating his institutional web archives as records that require permanent preservation. Second, he is attempting to apply traditional archival arrangement and description to his web captures. His experiences have taught him that neither of these things is easy.

Firstly, I'm personally delighted to hear someone say that web archives might be records; I would agree. One reason I like it is that mainstream web-archiving seems to have evolved in favour of something akin to the "library" model – where a website is treated as a book, with a title, author and subject. For researchers, that might be a model that's more aligned to their understanding and methods. Not that I am against it; I am just making an observation.

I first heard Seamus Ross ask the question "can websites be records?" some 11 years ago, and I think it is a useful way of regarding certain classes of web content, which I would encourage. When I worked on the Jisc PoWR project, one of the assumptions I made was that a University would be using its website to store, promote or promulgate content directly relevant to its institutional mission and functions. In doing this, a University starts to generate born-digital records, whether as HTML pages or PDF attachments. What concerns me is when these are the only copies of such records. Yet quite often we find that the University archivist is not involved in their capture, storage, or preservation.


The problem becomes even more poignant when we see how important record/archival material that was originally generated in paper form, such as a prospectus, is shifting over to the web. There may or may not be a clear cut-off point for when and how this happens. The archivist may notice that they aren't receiving printed prospectuses any more. Who is the owner of the digital version, and how can we ensure the pages aren't simply disposed of by the web master when expired? Later at the RESAW conference I heard a similar and even more extreme example of this unhappy scenario, from Federico Nanni and his attempts to piece together the history of the University of Bologna website.

However, none of this has stopped Gregory Wiedeman from performing his duty of archival care. He is clear: Albany is a public university, subject to state records laws; therefore, certain records must be kept. He sees the continuity between the website and existing collections at the University, even to the point where web pages have their paper equivalents; he is aware of overlap between existing printed content and web content; he knows about embedded documents and PDFs on web pages; and he is aware of interactive sites, which may create transient but important records through interactions.

In aligning University web resources with a records and archives policy, Wiedeman points out one significant obstacle: a seed URL, which is the basis of web capture in the first place, is not guaranteed to be a good fit for our existing practices. To put it another way, we may struggle to map an archived website or even individual pages from it to an archival Fonds or series, or to a record series with a defined retention schedule.

Nonetheless, Wiedeman has found that traditional archives theory and practice do adapt well to working with web archives, and he is addressing such key matters as retaining context, the context of attached documents, the relationship of one web page to another, and the history of records through a documented chain of custody – of which more below.

When it comes to describing web content, Wiedeman uses the American DACS standard, which is closely based on ISAD(G). With its focus on the intellectual content rather than the individual file format, he has found this works for large-scale collections and granular access to them. His cataloguing tool is ArchivesSpace, which is DACS-compliant, and which is capable of handling aggregated record collections. The access component of ArchivesSpace is able to show relations between record collections, making context visible, and showing a clear link between the creating organisation and the web resources. Further, there are visible relations between web records and paper records, which suggests Wiedeman is on the way to addressing the hybrid archive conundrum faced by many. He does this, I suggest, by trusting to the truth of the archival Fonds, which continues to exert a natural order on the archives, in spite of the vagaries of website structures and their archived snapshots.


It's in the fine detail of capture and crawling that Wiedeman is able to create records that demonstrate provenance and authenticity. He works with Archive-It to perform his web crawls; the process creates a range of technical metadata about the crawl itself (type of crawl, result, start and end dates, recurrence, extent), which can be saved and stored as a rather large JSON file. Wiedeman retains this, and treats it as a provenance record, which makes perfect sense; it contains hard (computer-science) facts about what happened. This JSON output might not be perfect, and at time of writing Albany don't do more than retain it and store it; there remains developer work to be done on parsing and exposing the metadata to make it more useful.
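To give a flavour of what that parsing work might involve, here is a minimal sketch in Python. It is not Albany's code and it does not reflect the real structure of the Archive-It output; the field names (`crawl_id`, `type`, `result`, `start_date`, `end_date`, `recurrence`, `extent_bytes`) and the sample values are invented for illustration, loosely following the kinds of metadata listed above. The idea is simply to boil a crawl record down to a short summary that could sit alongside a provenance note.

```python
import json
from datetime import datetime

# Hypothetical crawl metadata. The keys and values are assumptions made for
# this sketch, not the actual schema of any crawl service's JSON output.
SAMPLE_CRAWL_JSON = """
{
  "crawl_id": "example-001",
  "type": "scheduled",
  "result": "finished",
  "start_date": "2017-06-01T02:00:00",
  "end_date": "2017-06-01T05:30:00",
  "recurrence": "monthly",
  "extent_bytes": 1572864000
}
"""


def summarise_crawl(raw_json: str) -> dict:
    """Reduce one crawl record to a few human-readable provenance facts."""
    record = json.loads(raw_json)
    start = datetime.fromisoformat(record["start_date"])
    end = datetime.fromisoformat(record["end_date"])
    return {
        "crawl_id": record["crawl_id"],
        "crawl_type": record["type"],
        "result": record["result"],
        # Start and end dates joined with a slash, as a simple date range.
        "date_range": f"{start.date().isoformat()}/{end.date().isoformat()}",
        "duration_hours": (end - start).total_seconds() / 3600,
        "recurrence": record["recurrence"],
        # Extent converted from bytes to gigabytes (10^9 bytes).
        "extent_gb": round(record["extent_bytes"] / 1e9, 2),
    }


summary = summarise_crawl(SAMPLE_CRAWL_JSON)
for field, value in summary.items():
    print(f"{field}: {value}")
```

The original JSON would still be retained in full as the provenance record; a summary like this would only be a convenience for cataloguers and users.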

Linked to this, he maintains what I think is a stand-alone record documenting his own selection decisions, as to the depth and range of the crawl; this is linked to the published collection development policy. Archivists need to be transparent about their decisions, and they should document their actions; users need to know this, in order to make any sense of the web data. None of these concepts is new to the traditional archivist, but this is the first time I have heard the ideas so well articulated in this context, and applied so effectively to collections management of digital content.

Gregory’s work is described at https://github.com/UAlbanyArchives/describingWebArchives
