In this blog post I should like to disambiguate uses of the word “archive”. I have found that the term is often open to misunderstanding and misinterpretation. Since I come from a traditional archivist background, I will begin with a definition whose meaning is clear to me.
At any rate, it is a definition that pre-dates computers, digital content, and the internet; the arrival of these agencies has brought us new, ambiguous meanings of the term. Some instances of this follow below. In each instance, I will be looking for whether these digital "archives" imply or offer "permanence", a characteristic I would associate with a traditional archive.
- In the paper world: an archive is any collection of documents kept for long-term preservation, e.g. for historical, cultural heritage, or business purposes. It can also mean the building where such documents are permanently stored in accordance with archival standards, or even the memory institution itself (e.g. The National Archives).
- In the digital world: a "digital archive" ought to refer to a specific function of a much larger process called digital preservation. This offers permanent retention, managed storage, and a means of keeping content accessible in the long term. An organisation might use a service like this for keeping content that has no current business need, but is still needed for historical or legal reasons. Therefore, the content is no longer held on a live system.
The OAIS Reference Model devised the term "Archival Storage" for describing this, and calls it a Functional Entity of the Model; this means it can apply to the function of the organisation that makes this happen, the system that governs it, or the servers where the content is actually stored. More than just storage, it requires system logging, validation, and managed backups on a scale and frequency that exceeds the average network storage arrangement. The outcome of this activity is long-term preservation of digital content.
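To give a flavour of what "validation" can mean in practice, here is a minimal sketch of a checksum (fixity) check in Python. It is my own illustration, not something OAIS prescribes: the function names and the simple path-to-checksum manifest are assumptions made for the example, and a real preservation system would do a great deal more (logging, scheduling, repair from replicas).

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a checksum for every file under root.

    The {relative path: checksum} layout is a hypothetical manifest format.
    """
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose checksum has changed."""
    problems = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel)
    return problems
```

The idea is that the manifest is built once, when content is deposited, and `verify` is then run on a schedule; an empty list means every file is still present and unchanged. Ordinary network storage rarely performs this kind of routine check.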
- In the IT world: a sysadmin might identify a tar, zip or gz file as an “archive”. This is an accumulation of multiple files within a single wrapper. The wrapper may or may not perform a compression action on the content. The zipped “archive” is not necessarily being kept; the “archiving” action is the act of doing the zipping / compression.
- On a blog: a blog platform, such as WordPress or Google Blogger, organises its pages and posts according to date-based rules. WordPress automatically builds directories to store the content in monthly and annual partitions. These directories are often called “archives”, and the word itself appears on the published blog page. In this context the word “archives” simply designates “non-current content”, in order to distinguish it from this month's current posts. This "archive" is not necessarily backed up, or preserved; and in fact it is still accessible on the live blog.
- In network management: the administrator backs up content from the entire network on a regular basis. They might call this action “archiving”, and may refer to the data, the tapes/discs on which the data are stored, or even the server room as the “archive”. In this instance, it seems to me the term is used to distinguish the backups from the live network. In case of a failure (e.g. accidental data deletion, or the need for a system restore), they would retrieve the lost data from the most recent “archive”. However: none of these “archives” are ever kept permanently. Rather, they are subject to a regular turnover and refreshment programme, meaning that the administrator only ever retains a few weeks or months of backups.
- In cloud storage: providers may offer services called "Data Archive" or "Cloud Archive". In many cases this service performs the role of extended network storage, except that it might be cheaper than storing the data on your own network. Your organisation might also decide to use this cheaper method to store "non-current" content. In neither case is the data guaranteed to be preserved permanently, unless the provider explicitly states it is, or the provider is using cloud storage as part of a larger digital preservation approach.
- For emails: in MS Outlook, there is a feature called AutoArchive. When run, this routine will move emails to an “archive” directory, based on rules (often associated with the age of the email) which the user can configure. The action also does a “clear out”, i.e. a deletion, of expired content, again based on rules. There is certainly no preservation taking place. This “AutoArchive” action is largely about moving email content from one part of the system to another, in line with rules. I believe a similar principle has been used to “archive” a folder or list in SharePoint, another Microsoft product. Some organisations scale up this model for email, and purchase enterprise “mail archiving” systems which apply similar age-based rules to the entire mail server. Unless explicitly applied as an additional service, there is no preservation taking place, just data compression to save space.
- The term “archive” has been used in a rather diffuse manner in the IT and digital worlds, and can mean variously “compression”, “aggregated content”, “backing up”, “non-current content”, and “removal from the live environment”. While useful and necessary, none of these are guaranteed to offer the same degree of permanence as digital preservation.
- Of these examples, only digital preservation (implementation of which is a complex and non-trivial task) offers permanent retention, protection, and replayability of your assets.
- If you are an archivist, content owner, or publisher: when dealing with vendors, suppliers, or IT managers, be sure you take the time to discuss and understand what is meant by the term “archive”, especially if you're purchasing a service that includes the term in some way.
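Two of the meanings listed above, “aggregated content” and “compression”, are easy to see in the sysadmin's tar file. The sketch below is only an illustration, using Python's standard `tarfile` module; the file names and contents are invented. It bundles the same files into a single wrapper twice, once without compression and once with gzip.

```python
import io
import tarfile


def make_tar(files: dict[str, bytes], compress: bool) -> bytes:
    """Bundle in-memory files into one tar wrapper, optionally gzip-compressed."""
    buffer = io.BytesIO()
    mode = "w:gz" if compress else "w"
    with tarfile.open(fileobj=buffer, mode=mode) as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def list_members(blob: bytes) -> list[str]:
    """Open the wrapper again and list the file names it contains."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:*") as tar:
        return tar.getnames()


# Invented sample content, repetitive so that it compresses well.
sample = {"report.txt": b"minutes " * 2000, "notes.txt": b"agenda " * 2000}
plain = make_tar(sample, compress=False)
packed = make_tar(sample, compress=True)

print(len(plain), len(packed))   # the gzip version is smaller
print(list_members(plain))       # same members in either wrapper
print(list_members(packed))
```

Both results are “archives” in the IT sense, yet nothing in either of them records how long the bundle should be kept, checks that it is still intact next year, or ensures that the formats inside remain readable. Those are the jobs of digital preservation.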