Digital Archiving in the Age of Cloud Computing

Computers in Libraries [March 2013] The Systems Librarian

By now we are all aware of the fragility of digital information. Any number of events can happen with the potential to destroy hours of work. And as more of our personal and professional existence takes place online, a lifetime of accumulated work can be irretrievably lost without adequate layers of proactive measures. It's also extremely challenging to ensure the long-term survival of digital content. What trace of our accomplishments will endure to be handed over to future generations?

As I have shifted from working for a major university with industrial-strength equipment, processes, and procedures for the protection and recovery of any data to working on my own as an independent consultant, I have to be especially careful about ensuring a similar level of protection on my own. I've had separate professional activities for quite some time, so this isn't necessarily entirely new. I think that most individuals have a mixture of digital data they create in the course of their primary employment, some that relates to professional projects that might be independent from their employer and some that relates to their personal and recreational interests.

Each of these scenarios highlights the need to take a reasonable amount of care to protect and preserve digital assets that represent investments of time and creative energies. At the institutional level, especially in libraries and cultural institutions, specialists are in place to safeguard data. IT departments institute procedures for the protection of routine business data as part of their disaster planning. A professional archivist, or even a team of them, would take responsibility for the long-term preservation of digital as well as physical assets that rise to a given level of significance. Protecting data created outside an institutional setting places the onus of responsibility on the shoulders of the individual. It's in this area that I think a lot of digital content remains under-protected. These days we all need to take on the role, to a certain extent, of personal archivists of the content that we create.

Fortunately, in this age of cloud computing, more options are available than ever before to help protect data. Whether the concern is protecting data at the personal, professional, or institutional level, digital storage services provided through cloud computing technologies can deliver extraordinarily high levels of protection.

No Time for Complacency

A sense of urgency to take proactive procedures for data protection often doesn't become instilled until after something goes wrong. It does seem that computer devices are more reliable than ever. The kinds of computer failures that were much more common earlier in my career rarely happen in recent years. Yet, this perception of reliability should not lull anyone into complacency regarding safeguarding the data of their digital life.

The need for proactive data security processes applies to our individual personal and professional work. The stakes are an order of magnitude higher for organizations and institutions. Those charged with curating the cultural heritage of any given aspect of society face incredible challenges to ensure that the increasingly digital proportions of those materials endure for generations to come. Long-term digital preservation brings together a cluster of complications that make it much more difficult than the preservation of physical objects.

Data Security Basics

Responsible care of digital data requires several levels of proactive attention and planning. It's worth stating that waiting until the time of a failure is far too late to deal with the issue. At a minimum, it's essential to have backup and disaster recovery measures in place to protect against equipment failures. But the real challenge lies in the next step of ensuring the long-term survival of your data.

For those using desktop and laptop computers, we know that measures must be in place to back up our data so that it can be restored in case of equipment failure, destructive actions of malware, device theft, or catastrophic events. Whether in physical or digital form, most of us would never want to lose the photos that represent a lifetime of memories or artistic or professional accomplishments. Through fires, floods, or other life-changing events, a bit of advance planning can help avoid the loss of irreplaceable digital objects.

Even our mobile devices demand some attention to data security. For most of us, the loss of our smartphones and other mobile devices can be an enormous inconvenience. In most cases the kinds of digital assets at stake are not critically important. Calendar appointments and contacts can be reconstructed. But we may feel differently about photos. Fortunately, relatively good practices of disaster recovery are built into the ecosystem of most smartphones. With a few configuration options, all data can be set to be backed up or synchronized onto some kind of cloud-based storage. But keep in mind that this storage may not necessarily come with any guarantees. In most cases, access to this data is only as good as your standing with the company involved. Account disputes can be catastrophic when it comes to the data associated with any online provider.

The classic approach to disaster recovery planning includes creating routine backups of all data files onto media that is separate from the computer on which it normally resides. At one time, this might have involved copying data onto some flavor of tape media. But as the cost of disk storage has plummeted, most security procedures involve disk-to-disk backups. Even for business and large enterprise data centers, disk-based redundancy has largely displaced tape. Only at the very largest scale implementations does tape offer cost and capacity benefits over disk for disaster recovery purposes. For long-term preservation, however, tape storage still plays an important role, since disk systems depend on continuous power to maintain the integrity of their contents.

Maintaining an extra copy of data represents only the most basic level of safety. Geographic distribution of data covers a much wider range of disaster possibilities. At the very least, keep an up-to-date copy of the data in another physical location. It used to be common to keep backup tapes or drives for work at home, and vice versa. These days, cloud storage provides a convenient way to distribute your data geographically.

Using Cloud Computing to Enhance Security

The advent of cloud computing opens up a variety of options that can be employed in the care of data. For data stored on a physical device, such as a laptop or desktop computer, it's a common practice now to keep copies on one of the many cloud-based storage services. This technique is easy and inexpensive. For personal use, it's likely that this safety net will fall within the free levels of storage offered. Dropbox, Microsoft's SkyDrive, Google Drive, and many others offer limited amounts of storage at no cost, with paid service offerings for larger quantities of data or for additional services. I regularly use Dropbox, for example, not only for files that I want to share with others but as a convenient additional layer of backup for important data files. Most cloud storage services come with a component that can be installed on your Mac or PC that automatically synchronizes any files stored in designated folders with the copy on the cloud storage service.

One could choose to keep the primary copy on a cloud storage service, but that might not always be ideal for the times when internet access isn't available. Until we can count on continuous connectivity, the need to store at least some data on local devices will remain. But more and more, it seems that cloud-based storage is becoming the default approach for storing and sharing data, which brings its own set of concerns for data protection and preservation.

It's important to keep in mind that in most cases, cloud-based storage services have their own vulnerabilities and will not take responsibility for your data if anything goes wrong. In the same way that you would never entrust important data to a single physical device, don't become vulnerable to any one cloud storage service.

One of the key precepts in data security involves keeping many separate copies. For important data, replicating across multiple cloud services provides added security. It's a good practice to deposit important files to at least two different cloud storage services.
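As a rough illustration of keeping several verified copies, the sketch below copies one file into more than one folder and confirms each copy with a checksum. It assumes the destination folders are the locally synchronized folders of two different storage services; the function names and folder names are hypothetical, not part of any provider's software.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate(source: Path, destinations: list[Path]) -> dict[Path, bool]:
    """Copy `source` into each destination folder and verify every copy.

    Returns a mapping of each copied file to True when its checksum
    matches the original, False otherwise.
    """
    expected = sha256_of(source)
    results = {}
    for folder in destinations:
        folder.mkdir(parents=True, exist_ok=True)
        copy_path = folder / source.name
        shutil.copy2(source, copy_path)  # copy2 also preserves timestamps
        results[copy_path] = sha256_of(copy_path) == expected
    return results


if __name__ == "__main__":
    # Temporary folders stand in for two synchronized cloud folders
    # (hypothetical names; substitute the real sync folders in practice).
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        original = root / "manuscript.txt"
        original.write_text("final draft")
        outcome = replicate(original, [root / "service_a", root / "service_b"])
        for copy_path, verified in outcome.items():
            print(copy_path.parent.name, "verified" if verified else "MISMATCH")
```

The checksum comparison only confirms that the local copy matches; whether the service has finished uploading it is a separate question that this sketch does not address.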

The traditional procedures for disaster preparation center on the possibilities of equipment failures. As we make use of cloud-based services, however, any related data resides well out of sight. It shouldn't be out of mind. In the same way we know that any one computer device will eventually become obsolete or fail, one must keep a close watch on any cloud storage services used to be sure that they continue to be viable and that there are not alternative services with more appropriate features.

While it is unlikely that any of the major providers of cloud storage would abruptly go out of business, it's best not to be vulnerable to any single business. Again, keeping data on multiple providers ensures some safety, regardless of what happens on the business front. Both in the virtual and physical storage realm, avoid being vulnerable to a single point of failure.

For all of my projects, I make use of both cloud and local storage. For those based in the cloud, I keep fresh copies on local storage, and for those based on local storage, I keep copies on cloud storage services. Some, such as my Library Technology Guides site, operate on a hosted service on computers that I have never seen. I take advantage of backup services offered, but I also have an automated script that moves a copy of the databases and program files to a local computer at home, which is then backed up to a separate hard drive. For files that I create on my local computers, I archive finished files to a cloud-based repository with the corresponding descriptive metadata, and I keep copies of working files on multiple storage services.
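The column does not describe the automated script itself, so the following is only a minimal sketch of one way such a routine might look, assuming the files to protect have already been downloaded into a local folder. It writes a timestamped compressed archive into a backup folder and prunes older archives. The function name, the folder names, and the retention count are all hypothetical.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def archive_directory(source_dir: Path, backup_dir: Path, keep: int = 7) -> Path:
    """Write a timestamped .tar.gz of source_dir into backup_dir.

    Only the newest `keep` archives for this source are retained.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    base_name = backup_dir / f"{source_dir.name}-{stamp}"
    # make_archive appends ".tar.gz" and returns the archive's path
    archive = Path(shutil.make_archive(str(base_name), "gztar", root_dir=source_dir))

    # Timestamps in this format sort chronologically as plain strings.
    archives = sorted(backup_dir.glob(f"{source_dir.name}-*.tar.gz"))
    for old in archives[:-keep]:
        old.unlink()
    return archive


if __name__ == "__main__":
    # Temporary folders stand in for the site files and the backup drive.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        site = root / "site"
        site.mkdir()
        (site / "index.html").write_text("<p>example</p>")
        print("wrote", archive_directory(site, root / "backup_drive").name)
```

Running the same function a second time with a folder on a separate hard drive as `backup_dir` would give the second layer of protection described above; scheduling it is left to the operating system's task scheduler.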

Cloud Computing for the Enterprise

Cloud storage can be helpful at the organizational level as well. Most large organizations have what are considered enterprise-level storage management strategies that include prescribed drives or folders for employees to place data files, which are then automatically backed up with multiple layers of redundancy. This approach ensures that all institutional data have adequate disaster recovery protection and that they are secured for access to authorized individuals and subject to any applicable policies and business rules. Many organizations are subject to freedom of information requests and internal or external audit. They may also enforce various levels of security clearance or may be regulated by such statutes as HIPAA for medical information or FERPA for student information. It's important for those who work in an institutional setting to follow the prescribed methods for storing their data files and not circumvent these systems by using personal cloud storage accounts.

Organizations that maintain enterprise networks, even those with complex business requirements, increasingly make use of cloud storage technologies. Rather than those designed for the consumer and small-business environment, they would make use of industrial-strength services that provide such features as encrypted transport and storage, privately segregated cloud storage, data integrity validation, and the appropriate equivalents of other methods and procedures that would be applied to locally managed storage.

Managed storage is not inexpensive, either when operated locally or when accessed through a service provider. It's tempting to think of the cost of storage in terms of raw storage devices. Today, drives with capacity of multiple terabytes are sold for less than $100. But the layers of additional hardware and software involved in creating storage with ultra-high reliability, security, and other characteristics of managed storage are expensive. The raw disks themselves can represent a relatively small portion of the cost of an enterprise storage environment.

Cloud services providers and large institutional data centers implement storage on a very large scale and are able to achieve high levels of reliability with low costs per unit. High-capacity disk-based storage is generally implemented using RAID (redundant arrays of independent disks) and related techniques that automatically recover data when any individual component fails. Even in fault-tolerant environments, data can be lost through many human and software failures or access can be interrupted due to communications or business issues. It's essential to know what additional layers of data protection come with any given data storage arrangement, what additional services might be available for enhanced data protection, and when additional measures are needed to protect critical digital assets.
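To give a sense of how a disk array can recover from the loss of one component, the sketch below shows the XOR parity idea used by parity-based RAID levels such as RAID 5. It is a deliberately simplified illustration on short byte strings, not a description of how any particular storage product is built; the function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length byte blocks together, position by position."""
    if len({len(block) for block in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def rebuild_missing(surviving: list[bytes], parity: bytes) -> bytes:
    """Recover the single missing data block from the survivors plus parity."""
    return xor_blocks(surviving + [parity])


if __name__ == "__main__":
    # Three data blocks, as if striped across three disks.
    data = [b"AAAA", b"BBBB", b"CCCC"]
    parity = xor_blocks(data)  # stored on a fourth disk

    # Simulate losing the second disk and rebuilding its contents.
    recovered = rebuild_missing([data[0], data[2]], parity)
    print(recovered == data[1])
```

Parity of this kind covers the failure of one block per stripe. It does nothing for a file deleted by mistake or corrupted by software, which is why redundancy inside the array is not a substitute for separate backups.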

Know the Terms of Service

When subscribing to a cloud-based storage solution, it's essential to understand the terms of service and features offered. When using any service available without cost, you can assume that it's offered as is, with no guarantees, no automated backup, and with no ability to recover lost files, even when it's the fault of the provider. In most cases the free levels of service are made available with the hope that at least some proportion of the subscribers would opt for paid services. Even at the free level, the services can be quite reliable, but not to the extent that you would want to rely on them without additional layers of protection.

For most individuals, it should be possible to devise a personal storage strategy, such as using a combination of local and cloud-based replicates, which will offer a high degree of confidence of recovery for all reasonable contingencies. That's not a realistic expectation for individual professionals and for organizational settings. A paid storage service with additional backup and redundancy features and guaranteed levels of service can be a good investment. At the institutional level, issues of scale and capacity, reliability needed, and bandwidth concerns will factor into whether cloud-based storage offers the best alternative.

Regardless of whether the setting is personal, professional, or institutional, media objects such as digital photographs and video present special challenges. The size of the files involved makes them much more difficult and expensive to manage. Now that most digital cameras produce photos at 10-megapixel or higher resolution, individuals can easily accumulate collections in the hundreds of gigabytes, and professionals may need to manage multiple terabytes. The challenge is twofold. The size of the files requires higher-capacity storage devices, and they require significant bandwidth to transport among storage options. Any individual or organizational project dealing with thousands of photographs also needs to deal with metadata. Libraries regularly manage these issues through digital collection management platforms, and outside the library space, a variety of digital asset management systems are available for either general or highly specialized materials.

During my tenure as the executive director of the Vanderbilt University Television News Archive, we worked through the available scenarios for supporting the storage of a collection of video assets that totaled more than 150 terabytes. Especially when we began digitizing the collection, high-capacity storage was quite expensive. Even today the costs for storage of this quantity of materials on cloud-based storage would be in the neighborhood of $20,000 per month, illustrating that relying on cloud storage for very large projects may not necessarily be feasible.
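The arithmetic behind an estimate of this kind is simple to work through. The sketch below takes only the two figures given above (roughly 150 terabytes and roughly $20,000 per month) and derives the per-gigabyte rate and yearly total they imply. The decimal conversion of 1,000 gigabytes per terabyte is an assumption, and the result is not a quote of any provider's actual pricing.

```python
def implied_rate_per_gb(monthly_cost_usd: float, terabytes: float,
                        gb_per_tb: int = 1000) -> float:
    """Monthly cost per gigabyte implied by a total monthly bill."""
    return monthly_cost_usd / (terabytes * gb_per_tb)


def yearly_cost(monthly_cost_usd: float) -> float:
    """Twelve months at the same monthly bill, ignoring price changes."""
    return monthly_cost_usd * 12


if __name__ == "__main__":
    monthly = 20_000      # "in the neighborhood of $20,000 per month"
    collection_tb = 150   # "more than 150 terabytes"
    rate = implied_rate_per_gb(monthly, collection_tb)
    print(f"about ${rate:.2f} per GB per month")
    print(f"about ${yearly_cost(monthly):,.0f} per year")
```

Because the collection is described as more than 150 terabytes, the implied rate is an upper estimate. Charges for transferring data in and out, which the figures above do not break out, would come on top of it.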

Beyond Disaster Recovery to Digital Preservation

In the realm of managed digital collections, the issues surrounding backups, disaster recovery, and digital preservation apply at the highest level of strategic concern. These implementations would of course be subject to rigorous backup and disaster recovery procedures. But especially for very large-scale projects involving large collections of multimedia objects, we have seen that the cost of maintaining multiple replicates can be extraordinarily expensive.

We can think of backup and disaster recovery as focused primarily on maintaining daily operational continuity with any activity involving data. We've seen an ascending scale of complexity and cost relative to whether the activity is at the personal level, at the small-scale professional level, within a large organization, or involves irreplaceable cultural heritage materials.

The real challenge lies in finding ways to pass our digital assets to future generations. Digital technology changes so fast that even within a single decade, a given digital file may be migrated through several devices or storage services. Extending the horizon of concern through multiple decades or even centuries presses beyond what we can realistically anticipate in terms of any specific technology.

Digital preservation in practice means implementing all measures available during each current era to maintain the materials intact and to pass along all the data and metadata that the next generation might need to maintain access to the materials. In turn, that data would be handed off to the next generation, knowing that each successive generation will not only make use of different kinds of computing and database and storage technologies but also different file formats, transmission, and metadata standards. Over a very long time span, it's not really possible to anticipate all the technical and societal changes that might have major impacts on the viability of digital objects created in generations past.

Those involved in digital preservation are familiar with the OAIS (Open Archival Information System) reference model that describes the processes involved for developing a system capable of ensuring the long-term survival of digital materials. A trusted digital archive would follow this reference model to be able to migrate data files forward, through successive generations of migrations into applicable formats.

In the broader library arena, there are a couple of good examples of using distributed replicates of data for security and preservation. The LOCKSS (Lots Of Copies Keep Stuff Safe) project, initiated by Stanford University for the preservation of primarily e-journal content, operates on the basis of a distributed set of replicates across many low-cost computers and associated storage devices among participating institutions. DuraCloud Services, offered by the DuraSpace organization that brings together the Fedora Commons and DSpace repositories, relies on multiple cloud storage services for highly reliable storage and preservation infrastructure for digital collections.

Organizations interested in establishing their own digital preservation environment can implement infrastructure components that work together to accomplish the processes described in the OAIS reference model. Many libraries have created their own trusted digital archive infrastructure using a combination of commercial and open source components. One of the few commercial products that I am aware of that offers this capability as an integrated package tailored for libraries and cultural heritage organizations is Rosetta, which was developed by Ex Libris originally for the National Library of New Zealand.

Long-term digital preservation is just as much organizational as it is technical. The key component involves an institutional commitment to do what it takes to preserve the materials through a never-ending cycle of technology platforms by institutions that can realistically be expected to endure into the distant future, with the resources to support that commitment. What future generations inherit from our legacies, now encoded mostly in various types of digital assets, will depend on how well we tend to these principles of data security and preservation.

Publication Year: 2013
Type of Material: Article
Language: English
Published in: Computers in Libraries
Publication Info: Volume 33 Number 02
Issue: March 2013
Publisher: Information Today
Series: Systems Librarian
Place of Publication: Medford, NJ
Notes: Systems Librarian Column
Record Number: 17760
Last Update: 2024-04-19 11:05:25
Date Created: 2013-03-17 19:36:38