How to capture and preserve electronic newsletters in HE and beyond

Friday, March 10, 2017 09:30

Ed Pinsent by Ed Pinsent

This blog post is based on a real-world case study. It happens to have come from a UK Higher Education institute, but the lessons here could feasibly apply to anyone wishing to capture and preserve electronic newsletters.

The archivist reported that the staff newsletter started to manifest itself in electronic form “without warning”. Presumably they’d been collecting the paper version successfully for a number of years, then this change came along. The change was noticed when the archivist (and all staff) received the Newsletter in email form. The archivist immediately noticed the email was full of embedded links, and pictures. If this was now the definitive and only version of the newsletter, how would they capture it and preserve it? 

I asked the archivist to send me a copy of the email, so I could investigate further.  

It turns out the Newsletter in this case is in fact a website, or a web-based resource. It’s being hosted and managed by a company called Newsweaver, a communications software company who specialise in a service for generating electronic newsletters, and providing means for their dissemination. They do it for quite a few UK Universities; for instance, the University of Manchester resource can be seen here. In this instance, the email noted above is simply a version of the Newsletter page, slightly recast and delivered in email form. By following the links in the example, I was soon able to see the full version of that issue of the Newsletter, and indeed the entire collection (unhelpfully labelled an “archive” – but that’s another story). 

READ MORE: Metadata creation for digitisation - counting the costs

What looked at first like it might be an email capture and preserve issue is more likely to be a case calling for a web-archiving action. Only through web-archiving would we get the full functionality of the resource. The email, for instance, contains links labelled “Read More”, which when followed take us to the parent Newsweaver site. If we simply preserved the email, we’d only have a cut-down version of the Newsletter; more importantly, the links would not work if Newsweaver.com became unavailable, or changed its URLs. 404 error.png

Since I have familiarity with using the desktop web-archiving tool HTTrack, I tried an experiment to see if I could capture the online Newsletter from the Newsweaver host. My gather failed first time, because the resource is protected by the site robots (more on this below), but a second gather worked when I instructed the web harvester to ignore the robots.txt file. 

My trial brought in about 500-600MB of content after one hour of crawling – there is probably more content, but I decided to terminate it at that point. I now had a working copy of the entire Newsletter collection for this University. In my version, all the links work, the fonts are the same, the pictures are embedded. I would treat this as a standalone capture of the resource, by which I mean it is no longer dependent on the live web, and works as a collection of HTML pages, images and stylesheets, and can be accessed and opened by any browser. 

Of course, it is only a snapshot. A capture and archiving strategy would need to run a gather like this on a regular basis to be completely successful, to capture the new content as it is published. Perhaps once a year would do it, or every six months. If that works, it can be the basis of a strategy for the digital preservation of this newsletter. 

Such a strategy might evolve along these lines: 

  • Archivist decides to include electronic newsletters in their Selection Policy. Rationale: the University already has them in paper form. They represent an important part of University history. The collection should continue for business needs. Further, the content will have heritage value for researchers. 
  • University signs up to this strategy. Hopefully, someone agrees that it’s worth paying for. The IT server manager agrees to allocate 600MB of space (or whatever) per annum for the storage of these HTTrack web captures. The archivist is allocated time from an IT developer, whose job it is to programme HTTrack and run the capture on a regular basis. 
  • The above process is expressed as a formal workflow, or (to use terms an archivist would recognise) a Transfer and Accession Policy. With this agreement, names are in the frame; tasks are agreed; dates for when this should happen are put into place. The archivist doesn’t have to become a technical expert overnight, they just have to manage a Transfer / Accession process like any other. 
  • Since they are “snapshots”, the annual web crawls could be reviewed – just like any other deposit of records. A decision could be made as to whether they all need to be kept, or whether it’s enough to just keep the latest snapshot. Periodic review lightens the burden on the servers. 

This isn’t yet full digital preservation – it’s more about capture and management. But at least the Newsletters are not being lost. Another, later, part of the strategy is for the University to decide how it will keep these digital assets in the long-term, for instance in a dedicated digital preservation repository - a service which they University might not be able to provide themselves, or even want to. But it’s a first step towards getting the material into a preservable state. 

mobile-phone-1875813_1280.jpg

There are some other interesting considerations in this case: 

The content is hosted by Newsweaver, not by the University. The name of the Institution is included in the URL, but it’s not part of the ac.uk estate. This means that an intervention is most certainly needed, if the University wants to keep the content long-term. It’s not unlike the Flickr service, who merely act as a means of hosting and distributing your content online. For the above proposed strategy to work, the archivist would probably need to speak to Newsweaver, and advise them of the plan to make annual harvests. There would need to be an agreement that robots.txt is disabled or ignored, or the harvest won’t work. There may be a way to schedule the harvest at an ideal time that won't put undue stress on the servers. 

Newsweaver might even wish to co-operate with this plan; maybe they have a means for allowing export of content from the back-end system that would work just as well as tis pull-gather method, but then it’s likely the archivist would need additional technical support to take it further. I would be very surprised if Newsweaver claimed any IP or ownership of the content, but it would be just as well to ascertain what's set out in the contract with the company. This adds another potential stakeholder to the mix: the editorial team who compile the University Newsletter in the first place. 

Operating HTTrack may seem like a daunting prospect to an archivist. There is a simpler option, which would be to use PDFs as a target format for preservation. One approach would be to print the emails to PDFs, an operation which could be done direct from the desktop with minimal support, although a licensed copy of Adobe Acrobat would be needed. Even so, the PDF version would disappoint very quickly; the links wouldn’t work as standalone links, and would point back to the larger Newsweaver collection on the live web. That said, a PDF version would look exactly like the email version, and PDF would be more permanent than the email format. 

pdf.pngThe second PDF approach would be to capture pages from Newsweaver using Acrobat’s “Create PDF from Web Page” feature. This would yield a slightly better result than the email option above, but the links would still fail. For the full joined-up richness of the highly cross-linked Newsletter collection, web-archiving is still the best option. 

To summarise the high-level issues, I suggest an Archivist needs to: 

  • Define the target of preservation. In this case we thought it was an email at first, but it turns out the target is web content hosted on a domain not owned by the University. 
  • Define the aspects of the Newsletter which we want to survivesuch as links, images, and stylesheets. 
  • Agree and sign off a coherent selection policy and transfer procedure, and get resources assigned to the main tasks. 
  • Assess the costs of storing these annual captures, and tell your IT manager what you need in terms of server space. 

If there’s a business case to be made to someone, the first thing to point out is the risk of leaving this resource in the hands of Newsweaver, who are great at content delivery, but may not have a preservation policy or a commitment to keep the content beyond the life of the contract.  

This approach has some value as a first step towards digital preservation; it gets the archivist on the radar of the IT department, the policy owners, and finance, and wakes up the senior University staff to the risks of trusting third-parties with your content. Further, if successful, it could become a staff-wide policy that individual recipients of the email can, in future, delete these in the knowledge that the definitive resource is being safely captured and backed up.  


Learn more about CoSector

Posted in Digital Preservation, Higher Education, IT and Digital