What kind of disaster response plan should we have in place with regard to data management and security?
Protecting data remains a top priority for libraries, especially since they rely on technology for almost all aspects of their operations and service delivery. Interruptions due to technology failures are not only inconvenient to library users and personnel but also harm the library's reputation. All technologists should be well versed in the core precepts of disaster planning and recovery and should institute proactive measures for their organizations.
It's essential for libraries to safeguard their data against any possible failure in hardware or software and against human error. The relative proportion of these risks has changed dramatically over the course of my career in library technology systems. In earlier times, most failures were related to hardware. Hard drives, storage arrays, and other devices were prone to failure, making it important to ensure that all data were copied as frequently as possible to another medium. Today, organizations are less dependent on local hardware components and rely more on highly redundant cloud-based storage services. Malicious attacks and human error now represent some of the worst threats to data security.
Any effective disaster plan must work toward minimizing the operational impact of any type of technology failure. These strategies are based on a technical infrastructure designed to remain operational even if one or more components fail and to prevent loss of data even in the event of failure. To achieve these goals, systems need multiple layers of redundancy, with data synchronized across multiple systems or repositories. Ideally, organizations will monitor systems to catch potential issues before they result in data loss and will have automated and human processes to recover data and systems when failures happen.
Data protection depends on redundancy and replication across devices so that the operational systems can withstand failures of underlying components. In the earlier days of storage, organizations often depended on RAID (redundant array of independent disks) technology to remain operational, sometimes with degraded performance, even if a disk drive failed. Despite this initial layer of protection, storage devices could still fail, often because of errors in controllers, drivers, or software that caused wholesale data corruption. RAID continues to be a viable storage technology, though many other alternatives are implemented in large-scale data centers.
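The way a parity-based RAID level such as RAID 5 survives the loss of a single drive can be illustrated with XOR parity. The following is only a minimal sketch: short in-memory byte strings stand in for drives, and the function names are hypothetical rather than part of any real storage product.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the one missing data block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Three illustrative "drives", each holding one block of a stripe.
drives = [b"MARC", b"ILS1", b"OPAC"]
parity = xor_blocks(drives)

# Simulate losing the second drive and rebuilding its contents.
recovered = rebuild_missing([drives[0], drives[2]], parity)
```

Parity of this kind covers only the loss of one drive at a time. It offers no defense against a controller or software fault that writes bad data to every drive, which is why backups remain necessary.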
Given the inherent vulnerability of primary storage devices, organizations necessarily implement backup schemes involving the regular transfer of data to secondary, usually offline, media. Secondary media are often kept offsite as an additional safeguard in case of catastrophic events at the organization's primary business location. Systems administrators often implement software designed to automate backup procedures, including incremental and comprehensive data backups, rotation of media, and routing of offsite copies.
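The difference between a comprehensive (full) backup and an incremental one can be sketched in a few lines. This example assumes that file modification times are a reliable signal of change, and the seven-day rotation is an arbitrary illustration; production backup software tracks changes far more carefully. All function names here are hypothetical.

```python
from pathlib import Path


def files_to_back_up(root, last_backup_time=None):
    """Return the files under root that this backup run should copy.

    last_backup_time is a POSIX timestamp. None means a full backup,
    so every file is selected.
    """
    selected = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        if last_backup_time is None or path.stat().st_mtime > last_backup_time:
            selected.append(path)
    return selected


def backup_kind(day_index, full_every=7):
    """Assumed rotation: one full backup every `full_every` days."""
    return "full" if day_index % full_every == 0 else "incremental"
```

A scheduler could call `backup_kind` each night and pass `None` to `files_to_back_up` on full-backup days, or the timestamp of the previous run otherwise.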
The era of cloud computing has drastically altered the nature of disaster planning and recovery. Many organizations have shifted from local computing and storage equipment to cloud services or to hybrid environments. Cloud technologies provide many options for security that can offer even greater protection for data than was possible through the traditional processes used for on-premises infrastructure. Many organizations now rely more on outsourced or contractual enforcement of data protection strategies than on hands-on backup procedures.
Organizations increasingly rely on cloud-based data management and storage services to house their operational data. Amazon S3 (Simple Storage Service), for example, is a highly reliable and widely used service for data storage. These cloud-based storage services implement multiple layers of hardware and software redundancy that automatically work around component failures, usually with no interruption of service or performance degradation.
Although cloud-based storage services have incredibly low failure rates, they must also be supplemented by additional layers of data protection. In the same way that an organization would never keep only one copy of its data on a single local storage device, multiple layers of redundancy should also be implemented when relying on cloud-based storage services.
Many organizations will deploy data strategies with simultaneous replication to multiple storage services in separate data centers, ideally in multiple geographic regions. Such a data architecture can be designed to withstand widespread component failures within a data center or even the complete loss or unavailability of an entire facility.
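Replication is only useful if each copy is known to be intact. The sketch below copies a file to several targets and verifies each copy against a SHA-256 checksum of the source. Local directories stand in for storage services in separate regions, which is purely an assumption for illustration; a real deployment would use the provider's own replication features or client libraries. The function names are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate(source, replica_dirs):
    """Copy source into each replica directory and verify every copy.

    Returns the replica paths whose checksum matched the source.
    """
    expected = sha256_of(source)
    verified = []
    for directory in replica_dirs:
        target = Path(directory) / Path(source).name
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)
        if sha256_of(target) == expected:
            verified.append(target)
    return verified
```

If the returned list is shorter than the list of targets, at least one replica failed verification and should be investigated before it is relied upon.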
Other precautionary measures include routine archiving of data to offline storage services, such as Amazon Glacier. These measures support both disaster planning and recovery and organizational requirements for archiving and data retention. Archived copies of database files, for example, may be needed to satisfy requests for records that were active in a previous period but have since been deleted from the active business applications.
Disaster planning in a cloud environment should also take into account business issues related to the service providers. Some organizations may want to replicate their data on storage services from multiple providers. This layer of protection may incur significant cost, including duplicate storage service subscriptions and connectivity fees. It is also possible to implement automated processes that regularly transfer data to on-premises storage devices. These measures ensure ongoing access to data in the event of a provider's business failure or an account or contract dispute.
Both local and cloud-based data strategies can be crafted and implemented to protect operational data against almost any technical failure. It is much more difficult to guard against human error and malicious attacks.
Human error can introduce corruption or loss of data that can be extremely difficult to prevent or remediate. In the context of library systems, for example, a script or command that performs global changes can introduce widespread errors. Such problems can be especially difficult to repair if they are not detected quickly. If they go undetected long enough, the errors will propagate to the online and offline replicates of the databases involved, leaving no clean copy from which to restore. Protection against these scenarios requires careful testing of all procedures that update operational databases and produce replicate or backup copies.
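One common safeguard for global changes is to compute the change as a dry run first and refuse to proceed if it touches far more records than expected. The following sketch assumes records are simple dictionaries held in memory and uses an arbitrary 5 percent limit; both are illustrative assumptions, and the function name is hypothetical.

```python
def apply_global_change(records, transform, max_changed_fraction=0.05):
    """Apply transform to every record, but only if the change is small enough.

    The change is computed on copies first (a dry run). If more records
    would change than max_changed_fraction allows, nothing is returned and
    a ValueError is raised so that a person can review the script.
    """
    proposed = [transform(dict(record)) for record in records]
    changed = sum(1 for old, new in zip(records, proposed) if old != new)
    if records and changed / len(records) > max_changed_fraction:
        raise ValueError(
            f"refusing to update: {changed} of {len(records)} records would change"
        )
    return proposed, changed
```

Because the transform runs on copies, the original records are untouched whether the check passes or fails. A script expected to fix a handful of location codes, for example, would be stopped if a faulty condition caused it to rewrite every record.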
Ransomware also presents a serious challenge to data protection strategies. Introduced through malware, these attacks attempt to encrypt the data of the organization, often including critical operational databases. Once the data are encrypted, only the attacker holds the digital key needed for decryption, which is provided only if the ransom demand is met, usually in the form of large payments made via Bitcoin or another hard-to-trace cryptocurrency. Should the encrypted version of the database be propagated to replicate and backup copies, recovery without paying the ransom can be complex and sometimes impossible. Protection against this type of attack can be accomplished by making copies of critical data on devices not directly connected to the filesystems or business applications of the primary business environment and by implementing procedures that test for unauthorized mass encryption.
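One simple way to test for unexpected mass encryption is to measure the byte entropy of files before they are copied into backups, since encrypted data looks statistically random. The sketch below is a heuristic only, and the thresholds are assumed values chosen for illustration. Compressed formats such as ZIP or JPEG also show high entropy, so a real check would compare against a baseline for each file type. The function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(data):
    """Return the Shannon entropy of a bytes object in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_encrypted(data, threshold=7.5):
    """Heuristic: encrypted or compressed data has near-maximal entropy."""
    return shannon_entropy(data) > threshold


def mass_encryption_suspected(samples, max_fraction=0.5):
    """Flag a backup set when too many sampled files look encrypted."""
    if not samples:
        return False
    flagged = sum(1 for data in samples if looks_encrypted(data))
    return flagged / len(samples) > max_fraction
```

A backup job could sample the files it is about to copy and pause for human review when `mass_encryption_suspected` returns true, so that an encrypted database is not silently written over the last good copy.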
Libraries increasingly depend on vendor hosting for core business applications that involve critical operational data, such as their integrated library system, library services platform, or digital collections management applications. It is therefore important to understand the redundancies in the vendor's technical infrastructure and the disaster recovery procedures it has instituted. Any contract or subscription agreement for a hosted service should include a service level agreement (SLA) specifying these details. The SLA will state the required level of system availability and the penalties for excessive downtime. It should also detail the procedures that protect customer data, including operational replicates, offline backups, and processes for providing copies of the data upon termination of the service contract or other triggering events.
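When reviewing an SLA, it helps to translate the availability percentage into the amount of downtime it actually permits. The short calculation below assumes a 30-day measurement period and treats all downtime as countable; actual agreements define their own periods and often exclude scheduled maintenance. The function names are hypothetical.

```python
def allowed_downtime_minutes(availability_percent, period_days=30):
    """Minutes of downtime permitted in a period at a given availability level."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


def sla_met(observed_downtime_minutes, availability_percent, period_days=30):
    """True if the observed downtime is within what the SLA permits."""
    allowed = allowed_downtime_minutes(availability_percent, period_days)
    return observed_downtime_minutes <= allowed
```

Under these assumptions, 99.9 percent availability permits roughly 43 minutes of downtime in a 30-day period, while 99 percent permits about 7.2 hours, a difference worth noticing when comparing vendor agreements.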
Disaster planning and recovery strategies remain an essential component of a library's technology strategy. The latest shift toward cloud technologies has brought significant changes, with a new set of pragmatic methods for accomplishing the basic principles of data redundancy and easily recovered backups. Data management strategies must be reviewed periodically to ensure they remain viable as technologies evolve and as the organization makes changes in its infrastructure.