Research Datasets: Lockdown or Snapshot?

Wednesday, January 18, 2017 09:00

by Ed Pinsent

In today’s blog post we’re going to talk about digital preservation planning in the context of research data. We’re planning a one-day course for research data managers, where we can help with making preservation planning decisions that intersect with and complement your research data management plan.

When we’re dealing with datasets of research materials, there’s often a question about when (and whether) it’s possible to “close” the dataset. The dataset is likely to be a cumulative entity, especially if it’s a database, continually accumulating new records and new entries. Is there ever a point at which the dataset is “finished”? If you ask a researcher, it’s likely they will say it’s an ongoing concern, and they would rather not have it taken away from them and put into an archive.


For the data manager wishing to protect and preserve this valuable data, there are two possibilities.

The first is to “lock down” the dataset

This would involve intervening at a suitable date or time, for instance at the completion of a project, and negotiating with the researcher and other stakeholders. If everyone can agree on a lockdown, it means that no further changes can be made to the dataset: no new records can be added, and existing records cannot be changed.
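One way to check that a lockdown is being honoured is to record a checksum for every file at the moment of lockdown and compare against it later. The following is only a minimal sketch, assuming the dataset is held as files in a directory; the function names build_manifest and changes_since_lockdown are hypothetical, and a real repository would probably rely on its own fixity tools.

```python
import hashlib
from pathlib import Path


def build_manifest(dataset_dir):
    """Return {relative file path: SHA-256 hex digest} for every file."""
    root = Path(dataset_dir)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def changes_since_lockdown(dataset_dir, manifest):
    """Compare the directory with a manifest recorded at lockdown."""
    current = build_manifest(dataset_dir)
    before, now = set(manifest), set(current)
    return {
        "added": sorted(now - before),
        "removed": sorted(before - now),
        "modified": sorted(
            name for name in before & now if current[name] != manifest[name]
        ),
    }
```

If all three lists come back empty, the files are byte-for-byte what they were when the manifest was recorded. The manifest itself would need to be stored somewhere the dataset's users cannot edit it.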

A locked-down dataset is somewhat easier to manage in a digital preservation repository, especially if it’s not being requested for use very frequently. However, this approach doesn’t always match the needs of the institution, or those of the researcher who created the content. This is where the second possibility comes into play.


The second possibility is to take “snapshots” of the dataset

This involves a capture action: extracting records from the dataset and preserving them as a “view” of the dataset at a particular moment in time. The dataset itself remains intact and can continue to be used as live data; it can still be edited and updated.

Taking dataset snapshots is a more pragmatic way of managing and preserving important research data, while meeting the needs of the majority of stakeholders. However, it also requires more effort: a strategic approach, more planning, and a certain amount of technical capability. In terms of planning, it might be feasible to take snapshots of a large and frequently updated dataset on a regular basis, e.g. every year or every six months; doing so will tend to create reliable, well-managed views of the data.

Another valid approach would be to align the snapshot with a particular piece of research

For instance, when a research paper is published, the snapshot of the dataset should reflect the basis on which the analysis in that paper was carried out. The dataset snapshot would then act as a strong affirmation of the validity of the dataset. This is a very good approach, but it requires the data manager and archivist to have a detailed knowledge of the content and, more importantly, of the progress of the research cycle.

The ideal scenario would be to have your researcher on board with your preservation programme, and get them signed up to a process like this; at crucial junctures in their work, they could request snapshots of the dataset, or even be empowered to take the snapshots themselves.

In terms of the technical capability for taking snapshots, it may be as simple as running an export script on a database, but it’s likely to be a more delicate and nuanced operation. The parameters of the export may have to be discussed and managed quite carefully.
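To make the simple case concrete, here is a minimal sketch of an export script, assuming the live dataset is a SQLite database. The function name snapshot_table, the dated file naming, and the optional filter are all hypothetical choices for illustration; a real export would depend on the database system in use and on the parameters agreed with the researcher.

```python
import csv
import hashlib
import sqlite3
from datetime import datetime, timezone
from pathlib import Path


def snapshot_table(db_path, table, out_dir, where=None):
    """Export one table to a dated CSV file; return (csv_path, sha256)."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / f"{table}_{stamp}.csv"

    # The table name and optional filter are the "parameters of the export".
    # They are assumed to come from the data manager, not from untrusted input.
    query = f'SELECT * FROM "{table}"'
    if where:
        query += f" WHERE {where}"

    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute(query)
        header = [column[0] for column in cursor.description]
        with open(csv_path, "w", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh)
            writer.writerow(header)
            writer.writerows(cursor)
    finally:
        conn.close()

    # A checksum recorded at capture time lets the repository verify later
    # that the snapshot has not changed.
    digest = hashlib.sha256(csv_path.read_bytes()).hexdigest()
    return csv_path, digest
```

The script only reads from the database, so the live dataset can carry on being edited afterwards. A call such as snapshot_table("live.db", "records", "snapshots") would write one dated file per run, which suits a regular schedule; the where argument could narrow the export to the records behind a particular paper.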


Lastly we should add that these operations by themselves don’t constitute the entirety of digital preservation. They are both strategies to create an effective capture of a particular resource; but capture alone is not preservation.

That resource must pass into the preservation repository and undergo a series of preservation actions in order to be protected and usable in the future. There will be several variations on this scenario, as there are many ways of gathering and storing data. We know that institutions struggle with this area, and there is no single agreed “best practice.”

Posted in Digital Preservation