Trusted Digital Repository Development Lit Review

Trusted Digital Repositories | literature review and comprehensive exam

Abstract

Computer scientists who work with digital data that has long-term preservation value, archivists and librarians whose responsibilities include preserving digital materials, and other stakeholders in digital preservation have long called for the development and adoption of open standards in support of long-term digital preservation. Over the past fifteen years, preservation experts have defined “trust” and a “trustworthy” digital repository; defined the attributes and responsibilities of a trustworthy digital repository; defined the criteria and created a checklist for the audit and certification of a trustworthy digital repository; evolved these criteria into a standard; and defined a standard for bodies that wish to provide audit and certification to candidate trustworthy digital repositories. This literature review discusses the development of standards for the audit and certification of a trustworthy digital repository.

Citation

Ward, J.H. (2012). Managing Data: Preservation Standards & Audit & Certification Mechanisms (i.e., “policies”). Unpublished Manuscript, University of North Carolina at Chapel Hill.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.


Table of Contents

Abstract

Introduction

“Trust”

The Types of Audit and Certification

Trusted Digital Repositories: Attributes and Responsibilities

Trusted Digital Repositories
Attributes of a Trusted Digital Repository
Responsibilities of a Trusted Digital Repository
Certification of a Trusted Digital Repository
Summary

Trusted Digital Repositories: Audit and Certification

Trustworthy Repositories Audit & Certification: Criteria and Checklist
Organizational Infrastructure
Digital Object Management
Technologies, Technical Infrastructure, and Security
Audit and Certification of Trustworthy Digital Repositories Recommended Practice

Trusted Digital Repositories: Requirements for Certifiers

ISO/IEC 17021 Conformity Assessment
Requirements for Bodies Providing Audit and Certification of Candidate Trustworthy Digital Repositories Recommended Practice

Trusted Digital Repositories: Criticisms

Summary

References


Table of Figures

Figure 1 – TRAC, A1.1 (OCLC & CRL, 2007).

Figure 2 – Audit and Certification of Trustworthy Digital Repositories Recommended Practice, 3.1.1 (CCSDS, 2011).


Introduction

Computer scientists who work with digital data that has long-term preservation value, archivists and librarians whose responsibilities include preserving digital materials, and other stakeholders in digital preservation have long called for the development and adoption of open standards in support of long-term digital preservation (Lee, 2010; Science and Technology Council, 2007; Waters & Garrett, 1996). However, Hedstrom (1995) cautions that standards will provide a high-level solution to some of the obstacles that may prevent the preservation of digital materials only if the standards provide the conditions for the archive to conform to standard archival practices, software and hardware designers comply with the standards, and producers and users select and use the standards. The development of standards for the audit and certification of digital repositories as “trustworthy” is a major development towards ensuring that digital data will be curated and preserved for the indefinite long-term, as such standards provide the conditions under which all three of Hedstrom’s criteria may be met.

In 1996, the Commission on Preservation and Access and the Research Libraries Group released the now-seminal report, “Preserving Digital Information” (Waters & Garrett, 1996). The Research Libraries Group (RLG) (2002) noted three key points that led to the interest in developing standards for the “attributes and responsibilities” of a “trusted digital repository”: the requirement for ‘a deep infrastructure capable of supporting a distributed system of digital archives’; ‘the existence of a sufficient number of trusted organizations capable of storing, migrating, and providing access to digital collections’; and, ‘a process of certification is needed to create an overall climate of trust about the prospects of preserving digital information’. A few years later, the Consultative Committee for Space Data Systems (CCSDS) released the “Reference Model for an Open Archival Information System (OAIS)” (CCSDS, 2002). This document defined a set of common terms, components, and concepts for a digital archive. It provided not just a technical reference, but also outlined the organization of people and systems required to preserve information for the indefinite long-term and make it accessible (RLG, 2002).

However, experts and other stakeholders with an interest in preserving information for the long-term recognized that as part of defining an archival system, they also needed to form a consensus on the responsibilities and characteristics of a sustainable digital repository. In other words, they needed a method to “prove” (i.e., “trust”) that an organization’s systems were, in fact, OAIS-compliant. First, they would have to define the attributes and responsibilities of a “trusted” digital repository. Next, they would have to develop a method to audit and certify that a repository may be “trusted”. And, finally, they would have to create an infrastructure to certify and train the auditors.

The essay “Managing Data: the Emergence & Development of Digital Curation & Digital Preservation Standards” contains sections that provide the motivations for the development of standards and an overview of, and example applications of, the “Audit and Certification of Trustworthy Digital Repositories Recommended Practice” (CCSDS, 2011). That essay also covers the definitions of “reliable”, “authentic”, “integrity”, and “trustworthy”, et al. A very short discussion of this Recommended Practice and a detailed discussion of the OAIS Reference Model are available in the essay, “Managing Data: Preservation Repository Design (the OAIS Reference Model)“.

This essay on “preservation standards and audit and certification mechanisms” is an overview of “trust”; the types of audit and certification available generally; the development of standards for the audit and certification of a repository as “trustworthy”; a brief overview of the standards themselves; and, a very brief overview of the requirements for the certification of bodies that certify the auditors of said trusted digital repositories. Thus, the scope of this particular literature review is deliberately narrow to avoid the duplication of previously discussed topics.

“Trust”

Jøsang and Knapskog (1998) discussed “trust” as a “subjective belief” when they described a metric for a “trusted system”, while Lynch (2000) described “trust” as an elusive and subjective probability. Both the former and the latter wrote that a user trusts the evaluation of the certifier, not the actual system component. Jøsang and Knapskog drew attention to the fact that an evaluator only certifies that a system has been checked against a particular set of criteria; whether or not a user should or will trust those criteria is another matter. The two researchers pointed out that most end users of a certified system do not have the necessary expertise to evaluate the appropriateness and quality of the criteria used to audit the system. They must trust that the people who established the criteria chose relevant components, and that the evaluator had the skill and knowledge to assess the system.

This is similar to Lynch (2001), who wrote that users tend to assume digital system designers and content creators have users’ best interests at heart, which is not always the case; yet the idea of creating a formal system of trust “is complex and alien to most people”. Ross & McHugh (2006) posit that “trust” may be established with the various stakeholders affiliated with a repository by providing quantifiable “evidence” such as annual financial reports, business plans, policy documents, procedure manuals, mission statements, etc., so that a system’s “trustworthiness” is believable. The research goal of both Jøsang & Knapskog (1998) and Ross & McHugh (2006) was to provide a methodical evaluation of system components in order to define “trust” in a system that was in and of itself trustworthy (RLG, 2002).

Finally, Merriam-Webster (Trust, 2011) defines “trust” as “one in which confidence is placed”; “a charge or duty imposed in faith or confidence or as a condition of some relationship”; and, “something committed or entrusted to one to be used or cared for in the interest of another”.

The Types of Audit and Certification

Jøsang and Knapskog (1998) described four types of roles generally assigned to “government driven evaluation schemes”: accreditor, certifier, evaluator, and, sponsor. They defined the accreditor as the body that accredits the evaluator, the certifier, and, sometimes, evaluates the system itself. They noted that the certifier is accredited based on “documented competence level, skill, and resources”. They stipulated that the certifier might also be a “government body issuing…certificates based on the evaluation reports from the evaluators”. They defined the evaluator as “yet another government agency” that is “accredited by the accreditor”, and “the quality of the evaluator’s work will be supervised by the certifier”. They described the sponsor as the party interested in having their system evaluated (Jøsang & Knapskog, 1998). In other words, the authors wrote that someone who would like their system audited and certified against particular evaluation criteria (“the sponsor”) hires an auditor (“the evaluator”) whose work is certified (“the certifier”) by an accredited agency (“the accreditor”).

RLG (2002) defined four approaches to certification: individual, program, process, and data. They described “individual” as personnel certification. This is also called professional certification or accreditation, and it is often given to an individual when they meet some combination of work experience, education, and professional competencies. RLG noted that at the time of writing, there were no professional certifications for digital repository management or electronic archiving. They cited “program” as a type of certification for an institution or a program achieved through a combination of site visits and “self-evaluation using standardized checklists and criteria”.

RLG explained that the assessment areas included access, outreach, collection preservation and development, staff, facilities, governing and legal authority, and financial resources. They provided examples of this type of certification that included museums, schools and programs within a university, etc. They defined “process” as “quantitative or qualitative guidelines…to internal and external requirements” that use various methods and procedures, such as the ISO 9000 family of standards (RLG, 2002).

Finally, the authors designated the “data” approach to certification as addressing “the persistence or reliability of data over time and data security”. They wrote that this certification requires adherence to procedures manuals and international standards, such as ISO, that ensure both external and internal quality control. They noted that certification will require the managers of a repository to document migration processes, to create and maintain metadata, to authenticate new copies, and to update the data or files (RLG, 2002).

Trusted Digital Repositories: Attributes and Responsibilities

RLG (2002) defined a “trusted digital repository” as “one whose mission is to provide reliable, long-term access to managed digital resources to its designated community, now and in the future”. They described the “critical component” as “the ability to prove reliability and trustworthiness over time”. The authors’ stated goal for the report was to create a framework for large and small institutions that could cover different responsibilities, architectures, materials, and situations yet still provide a foundation with which to build a sustainable “trusted repository” (RLG, 2002).

Trusted Digital Repositories

The authors of the RLG document noted that repositories may be contracted to a third party or locally designed and maintained; regardless, the expectations for trust require that a digital repository must:

  • Accept responsibility for the long-term maintenance of digital resources on behalf of its depositors and for the benefit of current and future users;
  • Have an organizational system that supports not only long-term viability of the repository, but also the digital information for which it has responsibility;
  • Demonstrate fiscal responsibility and sustainability;
  • Design its system(s) in accordance with commonly accepted conventions and standards to ensure the ongoing management, access, and security of materials deposited within it;
  • Establish methodologies for system evaluation that meet community expectations of trustworthiness;
  • Be depended upon to carry out its long-term responsibilities to depositors and users openly and explicitly;
  • Have policies, practices, and performance that can be audited and measured; and
  • Meet the responsibilities detailed in Section 3 [sic] of this paper” (RLG, 2002).

Per the OAIS Reference Model (CCSDS, 2002), they noted that the repository’s “designated community” will be the primary determining factor in how the content is accessed and disseminated; how it is managed and preserved; and what is deposited, including content and format. The authors of the report discussed and defined “trust”, noting, “most cultural institutions are already trusted”. Regardless, they outlined three levels of trust that administrators of a repository must consider in order to be a “trusted repository”: the trust a cultural institution must earn from their designated community; the trust cultural institutions must have in third-party providers; and the trust users of the repository must have in the digital objects provided to them by the repository owner via the repository software.

The report authors wrote that archives, libraries, and museums must simply keep doing what they have been doing for centuries in order to maintain the trust of their user community; they do not need to develop that trust, because, as institutions, they have already earned it. RLG (2002) explained that while librarians, archivists, etc., are loath to use third-party providers who have not proven their reliability, the establishment of a certification program with periodic re-audits may overcome their reluctance. Finally, the authors stated that users must be able to trust that the digital items they receive from a repository are both authentic and reliable. In other words, the objects the users access must be unaltered and they must be what they purport to be (Bearman & Trant, 1998).

They established that this can be accomplished by the use of checksums and other forms of validation that are common in the Computer Science and digital security communities, although security does not equal integrity (Lynch, 1994). Waters & Garrett (1996) put forth that the “central goal” of an archival repository must be “to preserve information integrity”; this includes content, fixity, reference, provenance, and context.
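The checksum-based validation mentioned above can be illustrated with a short sketch using Python's standard library. This is only an illustration of the general technique; the function names (`compute_checksum`, `verify_fixity`) are hypothetical and not drawn from any of the standards discussed here.

```python
import hashlib
from pathlib import Path


def compute_checksum(path: Path, algorithm: str = "sha256") -> str:
    """Compute a checksum for a stored digital object, reading in
    chunks so that large files do not need to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path: Path, recorded_checksum: str, algorithm: str = "sha256") -> bool:
    """Return True if the object still matches the checksum recorded
    at ingest; a mismatch signals possible loss or corruption."""
    return compute_checksum(path, algorithm) == recorded_checksum
```

A repository would record the checksum at ingest and re-run the comparison on a schedule; as Lynch (1994) notes, though, passing such a check demonstrates bit-level integrity, not security or authenticity in the broader archival sense.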

For a discussion on “reliable”, “authentic”, “integrity”, and “trustworthy”, please see the essay, “Managing Data: the Emergence & Development of Digital Curation & Digital Preservation Standards“.

Attributes of a Trusted Digital Repository

RLG (2002) identified seven primary attributes of a trusted digital repository. They were and are: compliance with the OAIS Reference Model; administrative responsibility; organizational viability; financial sustainability; technological and procedural suitability; system security; and procedural accountability.

The authors defined “compliance with the OAIS” as the repository owners/administrators ensuring that the “overall repository system conforms” to the OAIS Reference Model. They described “administrative responsibility” as the repository administrators adhering to “community-agreed” best practices and standards, particularly with regards to sustainability and long-term viability. RLG (2002) explained “organizational viability” as creating and maintaining an organization and structure that is capable of curating the objects in the repository and providing access to them for the indefinite long-term. As part of this, they included maintaining trained staff, legal status, transparent business practices, succession plans, and relevant policies and procedures.

RLG (2002) designated “financial sustainability” as maintaining financial fitness, engaging in financial planning, etc., with an ongoing commitment to remain financially viable over the long-term. The authors outlined “technological and procedural suitability” as the repository owners/administrators keeping the archives software and hardware up to date, as well as complying with applicable best practices and standards for technical digital preservation. They traced an outline for “system security” by describing the minimal requirements a repository must follow regarding best practices for risk management, including written policies and procedures for disaster preparedness, redundancy, firewalls, back up, authentication, data loss and corruption, etc.

Finally, RLG (2002) defined “procedural accountability” as the repository owners/administrators being accountable for all of the above. That is, the authors wrote that maintaining a trusted digital repository is a complex set of “interrelated tasks and functions”; the maintainer of the repository is responsible for ensuring that all required functions, tasks, and components are carried out (RLG, 2002).

Responsibilities of a Trusted Digital Repository

RLG (2002) described two primary responsibilities for the owners and administrators of a trusted digital repository: high-level organizational and curatorial responsibilities, and, operational responsibilities. They subdivided organizational and curatorial responsibilities into three levels. The authors noted that organizations must understand their local requirements, know which other organizations may have similar requirements, and understand how these responsibilities may be shared.

The authors of the report summarized five primary areas in support of those three levels: the scope of the collections, preservation and lifecycle management, the wide range of stakeholders, the ownership of material and other legal issues, and, cost implications (RLG, 2002).

  1. The scope of the collections: the repository owners and administrators must know exactly what they have in their digital collection, and how to adequately preserve the integrity and authenticity of the properties and characteristics of the individual items.
  2. Preservation and lifecycle management: the repository owners and administrators must commit to proactive planning with regards to preserving and curating the items in the repository.
  3. The wide range of stakeholders: the repository owners and administrators must take into account the interests of all stakeholders when planning for long-term access to the materials. In some instances, they will have to act in spite of their stakeholders’ wishes: some stakeholders have short-term views and will not care about the long-term preservation of, and access to, the materials, while others will want the material preserved for the long term. The repository owners and administrators will have to balance these competing interests.
  4. The ownership of material and other legal issues: digital librarians and archivists will have to take a proactive role with content producers. They must seek to preserve materials by curating the data early in its life cycle, while being cognizant of the copyright and intellectual property concerns of the content producers and owners.
  5. Cost implications: repository owners and administrators must commit financial resources to maintaining the content over the indefinite long-term, while bearing in mind that the true costs of doing so are variable.

In sum, RLG (2002) recommended incorporating preservation planning into the everyday management of the preservation repository.

Next, the authors of this RLG report defined operational responsibilities in more detail than the organizational and curatorial responsibilities, above. They wrote the operational responsibilities based on the OAIS Reference Model, and added to that the “critical role” of a repository in the “promotion of standards” (RLG, 2002). They defined these areas as:

  1. Negotiates for and accepts appropriate information from information producers and rights holders: this responsibility covers the submission agreement between a content Producer and the OAIS Archive. These responsibilities include preservation metadata, record keeping, authenticity checks, and legal issues. As part of fulfilling this role, a repository will have policies and procedures in place to cover collection development, copyright and intellectual property rights concerns, metadata standards, provenance and authenticity, appropriate archival assessment, and, records of all transactions with the Producer.
  2. Obtains sufficient control of the information provided to support long-term preservation: this responsibility refers to the “staging” process, where submitted content is stored after submission from a Producer and before the material is ingested into the archive. The responsibilities of a repository administrator at this point encompass best practices for the ingest of materials, which include an analysis of the digital content itself, including its “significant properties”; the requirements that must be fulfilled to provide continuous access to the material; a metadata check against the repository’s standards (including adding metadata to bring the current metadata up to par); the assignment of a persistent and unique identifier; integrity/fixity/authentication checks; the creation of an OAIS Archival Information Package (AIP); and, storage into the OAIS Archive.
  3. Determines, either by itself of [sic] with others, the users that make up its designated community, which should be able to understand the information provided: the repository administrators and owners must determine who their user base is so that they may understand how best to serve their Designated Community.
  4. Ensures that the information to be preserved is “independently understandable” to the designated community; that is, the community can understand the information without needing the assistance of experts: the repository owner and administrator must make the information available using generic tools that are available to the Designated Community. For example, documents might be made available via .pdf or .rtf because the software to render these documents is available for free to most users. A repository owner and/or administrator may not wish to preserve documents in the .pages file format, as this Apple file format is not commonly used and the software to render it is not free beyond a limited trial period.
  5. Follows documented policies and procedures that ensure the information is preserved against all reasonable contingencies and enables the information to be disseminated as authenticated copies of the original or as traceable to the original: the repository owners and administrators will document any unwritten policies and procedures, and follow best practice recommendations and standards where possible. These policies must include policies to define the Designated Community and its knowledge base; policies for material storage, including service-level agreements; policies for authentication and access control; a collection development policy, including preservation planning; a policy to keep policies updated with current recommendations, standards, and best practices; and, finally, links between procedures and policies, to ensure compliance across all collections in the repository.
  6. Makes the preserved information available to the designated community: the repository owners and administrators must comply with legal responsibilities such as licensing, copyright, and intellectual property regarding access to the content in the repository. Within that framework, however, they should plan to provide user support, record keeping, pricing (where applicable), authentication, and, most importantly, a method for resource discovery.
  7. Works closely with the repository’s designated community to advocate the use of good and (where possible) standard practice in the creation of digital resources; this may include an outreach program for potential depositors: the repository owners and administrators should work with all stakeholders to advocate the use of standards and recommended best practices (RLG, 2002). As the Science and Technology Council (2007) noted, using standards will reduce costs for all parties involved and better ensure the longevity of the material.
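The staging steps enumerated in responsibility 2 above (fixity checks, identifier assignment, metadata checks, AIP creation) can be sketched in miniature. Everything here is a simplified, hypothetical illustration: `stage_submission` is an invented helper, a UUID stands in for a true persistent identifier service, and the returned dictionary only gestures at an OAIS Archival Information Package, which in practice would carry structured preservation metadata.

```python
import hashlib
import uuid
from datetime import datetime, timezone


def stage_submission(content: bytes, producer_metadata: dict) -> dict:
    """Sketch of staging a Producer submission before ingest:
    compute fixity, assign an identifier, and assemble a minimal
    record standing in for an Archival Information Package (AIP)."""
    # Fixity: record a checksum at the moment of submission so later
    # integrity checks have a baseline to compare against.
    checksum = hashlib.sha256(content).hexdigest()

    # Identifier: a UUID here; a real repository would use a
    # persistent identifier scheme (e.g., a handle or DOI service).
    identifier = str(uuid.uuid4())

    return {
        "identifier": identifier,
        "checksum": {"algorithm": "sha-256", "value": checksum},
        # Metadata as supplied by the Producer; in practice it would be
        # checked and augmented against the repository's standards.
        "metadata": dict(producer_metadata),
        "ingest_date": datetime.now(timezone.utc).isoformat(),
        "provenance": [{"event": "ingest", "agent": "staging-sketch"}],
    }
```

The point of the sketch is the shape of the record, not the specific fields: each staged object leaves staging with fixity information, a unique identifier, checked metadata, and a provenance trail, mirroring the responsibilities RLG (2002) assigns to the repository administrator.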

In conclusion, the OAIS Reference Model has provided a useful framework “for identifying the responsibilities of a trusted digital repository” (RLG, 2002).

Certification of a Trusted Digital Repository

As part of the certification framework, the authors of the RLG report intended to support Waters & Garrett’s (1996) assertion that archival repositories “must be able to prove that they are who they say they are by meeting or exceeding the standards and criteria of an independently-administered program for archival certification”.

RLG (2002) described two types of certification then in use within the libraries and archives community: the standards model and the audit model. The “standards” model is an informal process. They stated that standards are created when best practices and guidelines are established by the consensus of the expert community and then “certified” by other practitioners’ acceptance and/or use of the “standard”. In other words, librarians, archivists, and computer scientists who work with libraries decide what constitutes a “standard”; only rarely does a standard become formalized via ISO or another international organization. The authors described the audit model as an output of legislation or policies and procedures established by national agencies, such as the U.S. Department of Defense. That is, a governing body passes laws or policies, and the information repository’s policies must conform to the governing body’s requirements (RLG, 2002).

For a discussion of other approaches to certification, please see the earlier section, “The Types of Audit and Certification”.

Summary

RLG (2002) described a framework for a trusted digital repository’s responsibilities and attributes. They noted that these apply to repositories both large and small that hold a wide variety of content. The authors summarized their work above with several recommendations.

  • Recommendation 1: Develop a framework and process to support the certification of digital repositories.
  • Recommendation 2: Research and create tools to identify the attributes of digital materials that must be preserved.
  • Recommendation 3: Research and develop models for cooperative repository networks and services.
  • Recommendation 4: Design and develop systems for the unique, persistent identification of digital objects that expressly support long-term preservation.
  • Recommendation 5: Investigate and disseminate information about the complex relationship between digital preservation and intellectual property rights.
  • Recommendation 6: Investigate and determine which technical strategies best provide for continuing access to digital resources.
  • Recommendation 7: Investigate and define the minimal-level metadata required to manage digital information for the long term. Develop tools to automatically generate and/or extract as much of the required metadata as possible (RLG, 2002).

The remainder of this essay focuses on the results of Recommendation 1, above, regarding the development of certification standards for digital repositories.

Trusted Digital Repositories: Audit and Certification

Several researchers have addressed the problem of audit and certification. For example, Ross & McHugh (2006) created the Digital Repository Audit Method Based On Risk Assessment (DRAMBORA) to provide a self-audit method for repository administrators that yields quantifiable results (Digital Curation Centre, 2011). Dobratz, Schoger, and Strathmann (2006) created nestor, the Network of Expertise in Long-Term Storage of Digital Resources. Other, lesser-known researchers, such as Becker et al. (2009), described a decision-making procedure for preservation planning that provides a means for repository administrators to consider various alternatives.

This section will examine the audit and certification method known as the “Trustworthy Repositories Audit & Certification (TRAC): Criteria and Checklist” and its follow-up document, the “Audit and Certification of Trustworthy Digital Repositories Recommended Practice”. Researchers and practitioners across the globe, including Ross, McHugh, Dobratz, et al., combined their efforts and contributed their expertise to developing TRAC from a draft into a final version (Research Libraries Group, 2005; Dale, 2007). Their efforts have led to the development and refinement of TRAC into a CCSDS “Recommended Practice”; this may eventually become an ISO standard.

The essay, “Managing Data: the Emergence & Development of Digital Curation & Digital Preservation Standards” describes some of the related work in this area not covered below.

Trustworthy Repositories Audit & Certification: Criteria and Checklist

The authors of TRAC created it as part of a larger international effort to define an audit and certification process to ensure the longevity of digital objects. They defined a checklist that any repository manager could use to assess the trustworthiness of the repository. The checklist provided examples of the required evidence, but these examples are illustrative rather than prescriptive; the authors did not try to list every possible type of example. It contained three sections: “organizational infrastructure”, “digital object management”, and, “technologies, technical infrastructure, and security”.

The authors provided a spreadsheet-style “audit checklist” called “Criteria for Measuring Trustworthiness of Digital Repositories and Archives”. They noted that the criteria are applicable to any kind of repository, and that trustworthiness is measured through documentation (evidence), transparency (both internal and external), adequacy (individual context), and measurability (i.e., objective controls). The authors stated that a full certification process must include not just an external audit, but tools to allow for self-examination and planning prior to an audit (OCLC & CRL, 2007). The terminology in the audit checklist conformed to the OAIS Reference Model.
A typical policy in TRAC followed the model of statement, explanation, and evidence (see Figure 1, below).


Figure 1 – TRAC, A1.1 (OCLC & CRL, 2007).

I. Organizational Infrastructure

The authors of TRAC considered the organizational infrastructure to be as critical a component as the technical infrastructure (OCLC & CRL, 2007). This reflected the view of the authors of the OAIS Reference Model, who consider an OAIS to be “an archive, consisting of an organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community” (CCSDS, 2002). OCLC & CRL (2007) considered “organizational attributes” to be a characteristic of a trusted digital repository, and these characteristics are reflected in RLG’s (2002) grouping of financial sustainability, organizational viability, procedural accountability, and administrative responsibility as four of the seven attributes of a trusted digital repository.

The authors of TRAC considered the following ten elements to be part of organizational infrastructure, but they did not limit it to only these elements.

  1. Governance
  2. Organizational structure
  3. Mandate or purpose
  4. Scope
  5. Roles and responsibilities
  6. Policy framework
  7. Funding system
  8. Financial issues, including assets
  9. Contracts, licenses, and liabilities
  10. Transparency (OCLC & CRL, 2007).

In addition, they grouped the above elements into five areas:

  1. Governance and organizational viability: the owners and managers of a repository must commit to established best practices and standards for the long term. This includes mission statements and succession/contingency plans.
  2. Organizational structure and staffing: the repository owners and managers must commit to hiring an appropriate number of qualified staff who receive regular, ongoing professional development.
  3. Procedural accountability and policy framework: the repository owners and managers must provide transparency with regard to documentation related to the long-term preservation and access of the archival data. This requirement provides evidence to stakeholders of the repository’s trustworthiness. This documentation may define the Designated Community, the policies and procedures in place, legal requirements and obligations, reviews, feedback, self-assessment, provenance and integrity, and operations and management.
  4. Financial sustainability: the repository owners and administrators must follow solid business practices that provide for the long-term sustainability of the organization and the digital archive. This includes business plans, annual reviews, financial audits, risk management, and possible funding gaps.
  5. Contracts, licenses, and liabilities: the repository owners and administrators must make contracts and licenses “available for audits so that liabilities and risks may be evaluated”. This requirement includes deposit agreements, licenses, preservation rights, collection maintenance agreements, intellectual property and copyright, and, ingest (OCLC & CRL, 2007).

II. Digital Object Management

The authors described this section as a combination of technical and organizational aspects. They organized the requirements for this section to align with six of the seven OAIS Functional Entities: Ingest, Archival Storage, Preservation Planning, Data Management, Administration, and Access (OCLC & CRL, 2007; CCSDS, 2002). The authors of the TRAC checklist defined these six sections as follows.

  1. The initial phase of ingest that addresses acquisition of digital content.
  2. The final phase of ingest that places the acquired digital content into the forms, often referred to as Archival Information Packages (AIPs), used by the repository for long-term preservation.
  3. Current, sound, and documented preservation strategies along with mechanisms to keep them up to date in the face of changing technical environments.
  4. Minimal conditions for performing long-term preservation of AIPs.
  5. Minimal-level metadata to allow digital objects to be located and managed within the system.
  6. The repository’s ability to produce and disseminate accurate, authentic versions of the digital objects (OCLC & CRL, 2007).

The authors further elucidated the above areas as follows.

  1. Ingest: acquisition of content

    This section covered the process required to acquire content; this generally falls under a Submission Agreement between the Producer and the repository. The Producer may be external or internal to the repository’s governing organization. The authors recommended considering the object’s properties; any information that must be associated with the submitted object(s); mechanisms to authenticate the materials; verification of each ingested object’s integrity; maintaining control of the bits so that none may be altered at any time; regular contact with the Producer as appropriate; a formal acceptance process with the Producer for all content; and an audit trail of the Ingest process.

  2. Ingest: creation of the archival package

    The actions in this section covered the creation of an AIP. These actions involved documentation: of each AIP preserved by the repository; that each AIP created is actually adequate for preservation purposes; of the process of constructing an AIP from a SIP; of the actions performed on each SIP (deletion or creation as an AIP); of the use of persistent and unique naming schemas/identifiers or, failing that, of the preservation of the existing unique naming schema; of the context for each AIP; of an audit trail of the metadata records ingested; of associated preservation metadata; of testing the ability of current tools to render the information content; of the verification of completeness of each AIP; of an integrity audit mechanism for the content; and of any actions and processes related to AIP creation.

  3. Preservation planning

    The authors recommended four actions a repository administrator should take to keep the archive current. The administrator must document the current preservation strategies; monitor for format and other forms of obsolescence; adjust the preservation plan if or when conditions change; and provide evidence that the preservation plan used is actually effective.

  4. Archival storage & preservation/maintenance of AIPs

    The actions in this section covered what is required to ensure that an AIP is actually being preserved. This involved examining multiple aspects of object maintenance, including, but not limited to, storage, tracking, checksums, migration, transformations, and copies/replicas. The repository administrator must be able to demonstrate the use of standard preservation strategies; that the repository actually implements these strategies; that the Content Information is preserved; that the integrity of the AIP is audited; and that there is an audit trail of any actions performed on an AIP.

  5. Information management

    This section addressed the requirements related to descriptive metadata. The repository owner must identify the minimal metadata required for retrieval by the Designated Community; create that minimal descriptive metadata and attach it to the described object; and demonstrate referential integrity between each AIP and its associated metadata, both at creation and through ongoing maintenance.

  6. Access management

    The authors designed this section to address methods for providing access to the content (i.e., DIPs) in the repository to the Designated Community; they wrote that the degree of sophistication would vary based on the context of the repository itself and the requirements of the Designated Community. They further subdivided this section into four areas: access conditions and actions, access security, access functionality, and provenance. To fulfill the requirements presented in this section, a repository owner must: provide information to the Designated Community as to what access and delivery options are actually available; require an audit of all access actions; provide access only to those Designated Community members agreed to with the Producer; ensure access policies are documented and comply with deposit agreements; fully implement the stated access policy; log all access failures; demonstrate that the DIP generated is what the user requested; ensure that access success or failure is made known to the user within a reasonable length of time; and demonstrate that all DIPs generated can be traced to an authentic original and are themselves authentic (OCLC & CRL, 2007).
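TRAC repeatedly requires integrity verification (of objects at ingest, and of stored AIPs via an ongoing audit mechanism) without prescribing how it should be done. One common implementation is fixity checking with cryptographic checksums, sketched below in Python; the manifest format and function names are illustrative assumptions, not part of the checklist.

```python
# Illustrative sketch only: TRAC requires integrity verification at ingest
# and ongoing integrity audits of AIPs, but does not prescribe a mechanism.
# A common implementation is fixity checking with cryptographic checksums.
# The manifest format and names here are hypothetical.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def audit_fixity(manifest: dict[str, str], root: Path) -> list[str]:
    """Compare checksums recorded at ingest against recomputed ones.

    `manifest` maps a relative file path to the checksum recorded when
    the object was ingested; any mismatch or missing file is reported
    for follow-up, since the checklist requires that integrity failures
    be logged and addressed.
    """
    failures = []
    for rel_path, recorded in manifest.items():
        target = root / rel_path
        if not target.exists():
            failures.append(f"MISSING: {rel_path}")
        elif sha256_of(target) != recorded:
            failures.append(f"CORRUPT: {rel_path}")
    return failures
```

A repository would run such an audit on a schedule and, per the checklist, report any failures to management along with the repair steps taken.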

In summary, OCLC & CRL (2007) designed this section to make it mandatory for a trustworthy digital repository to be able to produce a DIP, “however primitive”.
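The referential-integrity requirement in the information management area above lends itself to a mechanical check: every AIP identifier should resolve to a descriptive metadata record, and every metadata record should describe an existing AIP. A minimal sketch, assuming both stores can be reduced to sets of identifiers (a hypothetical simplification, not a structure TRAC defines):

```python
# Illustrative sketch: TRAC requires demonstrable referential integrity
# between each AIP and its descriptive metadata, but leaves the
# implementation open. Both stores are modeled here as sets of
# identifiers, a deliberate simplification.
def check_referential_integrity(aips: set[str], metadata: set[str]) -> dict[str, set[str]]:
    """Report AIPs lacking metadata, and metadata records lacking an AIP."""
    return {
        "aips_without_metadata": aips - metadata,
        "orphaned_metadata": metadata - aips,
    }
```

A report with two empty sets would serve as evidence, during an audit, that creation and maintenance of descriptive metadata have kept pace with the AIPs themselves.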

III. Technologies, Technical Infrastructure, and Security

The authors of TRAC did not want to specify particular software and hardware requirements, as many of these fall under standard computer science best practices and are covered by other standards. Therefore, they addressed general information technology areas as related to digital preservation. These areas fall under one of three categories: system infrastructure, appropriate technologies, and security (OCLC & CRL, 2007).

  1. System infrastructure

    This section addressed the basic infrastructure required to ensure the trustworthiness of any actions performed on an AIP. The repository administrator must be able to demonstrate that: the operating systems and other core software are maintained and updated; the software and hardware are adequate to provide backups; the number and location of all digital objects, including duplicates, are managed; all known copies are synchronized; audit mechanisms are in place to discover bit-level changes; any such bit-level changes are reported to management, including the steps taken to prevent further loss and to repair or replace the corrupted content; processes are in place for hardware and software changes (e.g., migration); a change management process is in place to mitigate changes to critical processes; there is a process for testing the effect of critical changes prior to actual implementation; and software security updates are implemented with an awareness of the risks versus the benefits of doing so.

  2. Appropriate technologies

    The authors recommended that a repository administrator look to the Designated Community for relevant standards and strategies. They proposed that the hardware and software technologies in place be appropriate for the Designated Community, and that monitoring be in place so that hardware and software are updated as needed.

  3. Security

    This section addressed non-IT security as well as IT security. The authors recommended that a repository administrator conduct a regular risk assessment of internal and external threats; ensure controls are in place to address any assessed threats; decide which staff members are authorized to do what, and when; and have an appropriate disaster preparedness plan in place, including off-site copies of the recovery plan (OCLC & CRL, 2007).
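Several of the system infrastructure requirements above (managing the number and location of copies, keeping known copies synchronized, and detecting bit-level changes) can be combined into a single replica audit. The sketch below flags replicas whose checksum disagrees with the majority; this is one plausible approach, not a mechanism the checklist specifies, and the data model is an assumption made for illustration.

```python
# Illustrative sketch: compare checksums across replicas to detect
# bit-level changes, as the system-infrastructure requirements demand.
# The input structure (object id -> {replica location -> checksum})
# is a hypothetical model, not something TRAC defines.
from collections import Counter

def find_divergent_replicas(replica_checksums: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """For each object, flag replicas whose checksum disagrees with the majority.

    With three or more replicas, a majority vote identifies the copy
    that suffered a bit-level change and should be repaired from a
    known-good replica.
    """
    divergent = {}
    for object_id, copies in replica_checksums.items():
        majority, _count = Counter(copies.values()).most_common(1)[0]
        bad = [loc for loc, csum in copies.items() if csum != majority]
        if bad:
            divergent[object_id] = bad
    return divergent
```

A report from such an audit would supply the evidence TRAC asks for: that bit-level changes are discovered, reported, and corrected before further loss occurs.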

In conclusion, the archivists, librarians, computer scientists, and other experts who contributed to the development of TRAC created a document that encompassed the minimum requirements for an OAIS Archive to be considered “trustworthy”.

Audit and Certification of Trustworthy Digital Repositories Recommended Practice

The CCSDS released the “Audit and Certification of Trustworthy Digital Repositories Recommended Practice” (CCSDS 652.0-M-1, the “Magenta Book”) in September 2011 (CCSDS, 2011). This section will discuss the Recommended Practice only with regard to its major differences from TRAC (OCLC & CRL, 2007), above, because the two documents are similar enough that repeating a description of each section would be redundant.

The CCSDS described the purpose of the Recommended Practice as that of providing the documentation “on which to base an audit and certification process for assessing the trustworthiness of digital repositories” (CCSDS, 2011). The essay “Managing Data: the Emergence & Development of Digital Curation & Digital Preservation Standards” contains an overview of this Recommended Practice. This section will cover areas not covered by the overview in that essay or earlier in this document.

The three major sections of the Recommended Practice are the same as for TRAC, except that the last section has been renamed. Instead of “organizational infrastructure”, “digital object management”, and “technologies, technical infrastructure, & security”, the authors of the Recommended Practice renamed the last section “infrastructure and security risk management”. Within that technology section, the number of sub-sections was reduced from three to two: instead of “system infrastructure”, “appropriate technologies”, and “security”, the Recommended Practice contains sub-sections on “technical infrastructure risk management” and “security risk management”. The sub-sections for “organizational infrastructure” and “digital object management” remained the same. The CCSDS re-worded, re-organized, and expanded the content of the sub-sections, but the general ideas behind each section stayed in place. For example, Figure 2, below, is the Recommended Practice version of the content shown from TRAC in Figure 1, above.


Figure 2 – Audit and Certification of Trustworthy Digital Repositories Recommended Practice, 3.1.1 (CCSDS, 2011).

In short, the members of the CCSDS evolved and expanded the original TRAC checklist to create the Recommended Practice, but overall, the ideas in the original version have held up well during the four-year transition to a Recommended Standard.

Trusted Digital Repositories: Requirements for Certifiers

Both Waters & Garrett (1996) and RLG (2002) recommended the creation of a certification program for trusted digital repositories. As a result, librarians, archivists, computer scientists and other experts and stakeholders in digital preservation created the “Trustworthy repositories audit & certification: criteria and checklist” in order to create a common set of standards and terminology by which a repository may be certified. These experts and others then took TRAC, via the CCSDS, and created the “Audit and Certification of Trustworthy Digital Repositories (CCSDS 652.0-M-1) Recommended Practice”. As part of the process of creating this Recommended Practice, these experts also determined the requirements for bodies that will provide the audit and certification of “candidate” trustworthy digital repositories.

They created a second Recommended Practice, “Requirements for bodies providing audit and certification of candidate trustworthy digital repositories CCSDS 652.1-M-1”. This Recommended Practice for bodies providing audit and certification is a supplement to an existing ISO Standard that outlines the requirements for a body performing audit and certification, “Conformity assessment — Requirements for bodies providing audit and certification of management systems” (ISO/IEC 17021, 2011).

ISO/IEC 17021 Conformity Assessment

The authors of this standard covered seven primary areas: principles, general requirements, structural requirements, resource requirements, information requirements, process requirements, and, management of system requirements for certification bodies. They defined “principles” as covering impartiality, competence, responsibility, openness, confidentiality, and responsiveness to complaints. They described “general requirements” as covering legal and contractual matters, management of impartiality, and liability and financing. They kept “structural requirements” simple — this is about the organizational structure and top management, and a committee for safeguarding impartiality.

The authors detailed “resource requirements” as covering the competence of management and personnel, the personnel involved in the certification activities, the use of individual auditors and external technical experts, personnel records, and outsourcing. They outlined “information requirements” as publicly accessible information, certification documents, directory of certified clients, reference to certification and use of marks, confidentiality, and the information exchange between a certification body and its clients. The authors delineated “process requirements” as covering general requirements, audit and certification, surveillance activities, recertification, special audits, suspending, withdrawing or reducing the scope of certification, appeals, complaints, and, the records of applicants and clients.

Finally, the authors provided three options for “management systems requirements for certification bodies” that includes general management requirements and management system requirements that are in accordance with ISO 9001. In document appendices, the authors discussed the required knowledge and skills to be an auditor, the possible types of evaluation methods, provided an example of a process flow for determining and maintaining competence, desired personal behaviors, the requirements for a third-party audit and certification process, and, considerations for the audit programme, scope or plan (ISO/IEC 17021, 2011).

Requirements for Bodies Providing Audit and Certification of Candidate Trustworthy Digital Repositories Recommended Practice

This section of this essay will address the areas in which the Recommended Practice for bodies providing audit and certification differs from “ISO/IEC 17021 Conformity Assessment”.

The CCSDS created the Recommended Practice, “Requirements for bodies providing audit and certification of candidate trustworthy digital repositories” as a supplement to “Conformity assessment — Requirements for bodies providing audit and certification of management systems” (ISO/IEC 17021, 2011). They created the document to provide additional information on which an organization that is assessing a digital repository for certification as trustworthy may base their operations for issuance of such certification (CCSDS, 2011). In other words, the CCSDS (2011) created the document to support the accreditation of bodies providing certification. They created the document with a secondary purpose of providing repository owners with documentation by which they may understand the processes involved in achieving certification. They wrote the document using terminology from the OAIS Reference Model.

The authors defined a “Primary Trustworthy Digital Repository Authorisation Body” (PTAB) as an organization that accredits training courses for auditors, accredits other certification bodies, and provides audit and certification of candidate trustworthy digital repositories. The membership consists of “internationally recognized experts in digital preservation” (CCSDS, 2011). They defined the primary tasks of the organization as: accrediting other trustworthy digital repository certification bodies; certifying auditors; making certification decisions; accrediting auditor qualifications; undertaking audits; and, last, maintaining a mechanism to add new experts to PTAB as needed. They noted that PTAB will also be accredited by ISO and will become a member of the International Accreditation Forum (IAF). Regarding possible conflicts of interest, the authors designated two activities that are not considered conflicts for members who serve as certifiers: lecturing, including in training courses, and identifying areas of improvement during the course of an audit (CCSDS, 2011).

The CCSDS outlined the criteria for the training of audit team members. This training must include: understanding digital preservation, including the technical aspects related to the audited activity; understanding of knowledge management systems; a general knowledge of the regulatory requirements related to trustworthy digital repositories; an understanding of the basic principles related to auditing, per ISO standards; an understanding of risk management and risk assessment with regards to digitally encoded information; and, finally, an understanding of the Recommended Practice, “Audit and Certification of Trustworthy Digital Repositories (CCSDS 652.0-M-1)”.

Furthermore, the authors specified that the audit team should have or find members with appropriate technical knowledge for the scope of the digital repository certification, the necessary comprehension of any applicable regulatory requirements for that repository, and knowledge of the repository owner’s organization, such that an appropriate audit may be conducted. The CCSDS wrote that the audit team might be supplemented with the necessary technical expertise, as needed. As well, the authors charged PTAB with assessing the conduct of auditors and experts and monitoring their performance, as well as selecting these experts and auditors based on appropriate experience, competence, training, and qualifications (CCSDS, 2011).

The CCSDS outlined the required levels of work experience for a trusted digital repository auditor. These auditors must have completed five days of training via PTAB or an accredited agency; have prior experience assessing trustworthiness, including participation in two certification audits for a total of 20 days; have four years of workplace experience focused on digital preservation; have remained current with digital preservation best practices and standards; and have received certification from PTAB. The authors stipulated three additional requirements for audit team leaders. They must be able to communicate effectively in writing and orally; have previously served as an auditor on two completed trustworthy digital repository audits; and have the capability and knowledge to manage an audit certification process (CCSDS, 2011).

The authors outlined additional recommendations, including a requirement that the auditor have access to the client organization’s records; if these records cannot be accessed, the audit may not be possible. The CCSDS defined the criteria against which an audit is performed as those in the Recommended Practice, “Audit and Certification of Trustworthy Digital Repositories (CCSDS 652.0-M-1)”. They required two auditors to be present on site; other auditors may work remotely. The authors noted in an appendix on security that all auditors must maintain confidentiality with respect to an organization’s systems, content, structure, data, etc., as required (CCSDS, 2011).

In conclusion, the CCSDS has created a method for a larger umbrella organization — PTAB — to certify the certifiers of a trusted digital repository by creating a “Recommended Practice for bodies providing audit and certification” as a supplement to the existing ISO/IEC standard for “Conformity assessment — Requirements for bodies providing audit and certification of management systems”. By creating both a certification program and the criteria for certification of trustworthiness, these experts believe they have ensured the availability of digital information over the indefinite long-term.

Trusted Digital Repositories: Criticisms

Gladney (2005; 2004) has been a vocal critic of the repository-centric approach to digital preservation, which he considers “unworkable”. He has proposed, instead, the creation of durable digital objects that encode all required preservation information within the digital object itself. R. Moore has reservations about the “top-down” approach, in which standards are handed down from a body of experts to be used by practitioners. He would like to know what policies preservation data grid administrators are actually implementing at the machine level (Ward, 2011).

Similar to R. Moore’s concerns, Thibodeau (2007) supports the development of standards for digital preservation, but he believes these standards should be supplemented by empirical data regarding the purpose of each repository. For example, practitioners should not assess a repository based solely on whether or not the repository is OAIS-compliant. He writes that practitioners should consider the purpose of the repository, its mission, and its user base, and whether or not the repository owners are fulfilling those requirements. Thibodeau (2007) defined a five-point framework for repository evaluation that considers service, collaboration, “state”, orientation, and coverage. He believes that this broader context, along with the OAIS Reference Model and the Recommended Practice for the Audit and Certification of Trustworthy Repositories, provides a more realistic determination of a repository’s “success” or “failure”.

Summary

Archivists, librarians, computer scientists and other stakeholders and experts in digital preservation wanted to create certification standards for trustworthy digital repositories, and they voiced this desire in a 1996 report, “Preserving Digital Information” (Waters & Garrett, 1996). As one part of this enthusiasm for standards, the CCSDS released the OAIS Reference Model (CCSDS, 2002). Experts recognized that a technical framework was only part of a preservation repository, and so they worked to define the attributes and responsibilities of a trusted digital repository (RLG, 2002). They created an audit and certification checklist based on these attributes and responsibilities, called TRAC (OCLC & CRL, 2007). After receiving feedback from the preservation community, the CCSDS evolved TRAC into the Recommended Practice for the Audit and Certification of Trustworthy Digital Repositories (2011), and released the Recommended Practice for Requirements for Bodies Providing Audit and Certification of Candidate Trustworthy Digital Repositories (2011).

Thus, after many years of work, stakeholders with an interest in the preservation of digital material now have criteria against which to judge whether or not a repository and its contents are likely to last for the indefinite long-term, as well as an umbrella organization that will provide certified and trained auditors. To reiterate these accomplishments, over the past fifteen years, preservation experts have defined “trust” and a “trustworthy” digital repository; defined the attributes and responsibilities of a trustworthy digital repository; defined the criteria and created a checklist for the audit and certification of a trustworthy digital repository; evolved this criteria into a standard; and defined a standard for bodies who wish to provide audit and certification to candidate trustworthy digital repositories.

The significance of these accomplishments cannot be overstated — at stake in the concerns over the preservation of digital objects and information are the cultural and scientific heritage, and personal information, of humanity.

References


Bearman, D. & Trant, J. (1998). Authenticity of digital resources: towards a statement of requirements in the research process. D-Lib Magazine. Retrieved April 14, 2009, from http://www.dlib.org/dlib/june98/06bearman.html

Becker, C., Kulovits, H., Guttenbrunner, M., Strodl, S., Rauber, A., & Hofman, H. (2009). Systematic planning for digital preservation: evaluating potential strategies and building preservation plans. International Journal of Digital Libraries, 10(4), 133-157.

CCSDS. (2011). Requirements for bodies providing audit and certification of candidate trustworthy digital repositories recommended practice (CCSDS 652.1-M-1). Magenta Book, November 2011. Washington, DC: National Aeronautics and Space Administration (NASA).

CCSDS. (2011). Audit and certification of trustworthy digital repositories recommended practice (CCSDS 652.0-M-1). Magenta Book, September 2011. Washington, DC: National Aeronautics and Space Administration (NASA).

CCSDS. (2002). Reference model for an Open Archival Information System (OAIS) (CCSDS 650.0-B-1). Washington, DC: National Aeronautics and Space Administration (NASA). Retrieved April 3, 2007, from http://nost.gsfc.nasa.gov/isoas/

Dale, R. (2007). Mapping of audit & certification criteria for CRL meeting (15-16 January 2007). Retrieved September 11, 2007, from http://wiki.digitalrepositoryauditandcertification.org/pub/Main/ReferenceInputDocuments/TRAC-Nestor-DCC-criteria_mapping.doc

Digital Curation Centre. (2011). DRAMBORA. Retrieved December 9, 2011, from http://www.dcc.ac.uk/resources/tools-and-applications/drambora

Dobratz, S., Schoger, A., & Strathmann, S. (2006). The nestor Catalogue of Criteria for Trusted Digital Repository Evaluation and Certification. Paper presented at the workshop on “digital curation & trusted repositories: seeking success”, held in conjunction with the ACM/IEEE Joint Conference on Digital Libraries, June 11-15, 2006, Chapel Hill, NC, USA. Retrieved December 1, 2011, from http://www.ils.unc.edu/tibbo/JCDL2006/Dobratz-JCDLWorkshop2006.pdf

Gladney, H.M. & Lorie, R.A. (2005). Trustworthy 100-Year digital objects: durable encoding for when it is too late to ask. ACM Transactions on Information Systems, 23(3), 229-324. Retrieved December 29, 2011, from http://eprints.erpanet.org/7/

Gladney, H.M. (2004). Trustworthy 100-Year digital objects: evidence after every witness is dead. ACM Transactions on Information Systems, 22(3), 406-436. Retrieved July 12, 2008, from http://doi.acm.org/10.1145/1010614.1010617

Hedstrom, M. (1995). Electronic archives: integrity and access in the network environment. American Archivist, 58(3), 312-324.

ISO/IEC 17021. (2011). Conformity assessment — Requirements for bodies providing audit and certification of management systems. Retrieved December 30, 2011, from http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=56676

Jøsang, A. & Knapskog, S.J. (1998). A metric for trusted systems. In Proceedings of the 21st National Information Systems Security Conference (NISSC), October 6-9, 1998, Crystal City, Virginia. Retrieved December 27, 2011, from http://csrc.nist.gov/nissc/1998/proceedings/paperA2.pdf

Lee, C. (2010). Open archival information system (OAIS) reference model. In Encyclopedia of Library and Information Sciences, Third Edition. London: Taylor & Francis.

Lynch, C. (2001). When documents deceive: trust and provenance as new factors for information retrieval in a tangled web. Journal of the American Society for Information Science and Technology, 52(1), 12-17.

Lynch, C. (2000). Authenticity and integrity in the digital environment: an exploratory analysis of the central role of trust. Authenticity in a digital environment. Washington, DC: Council on Library and Information Resources. Retrieved April 14, 2009, from http://www.clir.org/pubs/reports/pub92/pub92.pdf

Lynch, C. A. (1994). The integrity of digital information: mechanics and definitional issues. Journal of the American Society for Information Science, 45(10), 737-744.

OCLC & CRL. (2007). Trustworthy repositories audit & certification: criteria and checklist version 1.0. Dublin, OH & Chicago, IL: OCLC & CRL. Retrieved September 11, 2007, from http://www.crl.edu/sites/default/files/attachments/pages/trac_0.pdf

Research Libraries Group. (2005). An audit checklist for the certification of trusted digital repositories, draft for public comment. Mountain View, CA: Research Libraries Group. Retrieved April 14, 2009, from http://worldcat.org/arcviewer/1/OCC/2007/08/08/0000070511/viewer/file2416.pdf

Research Libraries Group. (2002). Trusted digital repositories: attributes and responsibilities an RLG-OCLC report. Mountain View, CA: Research Libraries Group. Retrieved September 11, 2007, from http://www.oclc.org/programs/ourwork/past/trustedrep/repositories.pdf

Ross, S. & McHugh, A. (2006). The role of evidence in establishing trust in repositories. D-Lib Magazine 12(7/8). Retrieved May 6, 2007, from http://www.dlib.org/dlib/july06/ross/07ross.html

Science and Technology Council. (2007). The digital dilemma strategic issues in archiving and accessing digital motion picture materials. The Science and Technology Council of the Academy of Motion Picture Arts and Sciences. Hollywood, CA: Academy of Motion Picture Arts and Sciences.

Thibodeau, K. (2007). If you build it, will it fly? Criteria for success in a digital repository. Journal of Digital Information, 8(2). Retrieved December 27, 2011, from http://journals.tdl.org/jodi/article/view/197/174

Trust. (2011). Merriam-Webster.com. Encyclopaedia Britannica Company. Retrieved December 30, 2011, from http://www.merriam-webster.com/dictionary/trust

Ward, J.H. (2011). Classifying Implemented Policies and Identifying Factors in Machine-Level Policy Sharing within the integrated Rule-Oriented Data System (iRODS). In Proceedings of the iRODS User Group Meeting 2011, February 17-18, 2011, Chapel Hill, NC.

Waters, D. and Garrett, J. (1996). Preserving Digital Information. Report of the Task Force on Archiving of Digital Information. Washington, DC: CLIR, May 1996.

OAIS Reference Model & Preservation Design Summary

OAIS Reference Model | Literature Review and Comprehensive Exams

Abstract

In 1995, the Consultative Committee for Space Data Systems (CCSDS) began to coordinate the development of standard terminology and concepts for the long-term archival storage of various types of data. Under the auspices of the CCSDS, experts and stakeholders from academia, government, and research contributed their knowledge to the development of what is now called the Open Archival Information System (OAIS) Reference Model. The conclusion from a variety of experienced repository managers is that the authors of the OAIS Reference Model created flexible concepts and common terminology that any repository administrator or manager may use and apply, regardless of content, size, or domain. This literature review summarizes the standard attributes of a preservation repository using the OAIS Reference Model, including criticisms of the current version.

Citation

Ward, J.H. (2012). Managing Data: Preservation Repository Design (the OAIS Reference Model). Unpublished manuscript, University of North Carolina at Chapel Hill. (pdf)

Creative Commons License
This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.


Table of Contents

Abstract

Introduction

The OAIS Reference Model: Definition

The OAIS Reference Model: Key Concepts

The OAIS Reference Model: Key Responsibilities

The OAIS Reference Model: Key Models

The OAIS Functional Model
The OAIS Information Model
I. The Logical Model for Archival Information
II. The Logical Model of Information in an Open Archival Information System
III. Data Management Information
Information Package Transformations

The OAIS Reference Model: Preservation Perspectives

Information Preservation
Access Service Preservation

The OAIS Reference Model: Archive Interoperability

The OAIS Reference Model: Compliance

The OAIS Reference Model: Example Deployments

The OAIS Reference Model: Other Criticisms

Conclusions and Future Work

References


Table of Figures

Figure 1 – Environment Model of an OAIS (CCSDS, 2002).

Figure 2 – Obtaining Information from Data (CCSDS, 2002).

Figure 3 – Information Package Concepts and Relationships (CCSDS, 2002).

Figure 4 – OAIS Archive External Data (CCSDS, 2002).

Figure 5 – OAIS Functional Entities (CCSDS, 2002).

Figure 6 – Composite of Functional Entities (CCSDS, 2002).

Figure 7 – High-Level Data Flows in an OAIS (CCSDS, 2002).


Introduction

Various organizations and the individuals who work for those organizations have a vested interest in keeping information accessible over time, although there may be reasons to delete or destroy some data and information once a certain amount of time has passed. The reasons for this interest are varied. Librarians and archivists have a professional expectation that they will do their best to curate and preserve cultural heritage data, scientific data, and other types of information for future generations of scholars and laymen. Some interest may be personal — most people would like to be able to view their children’s baby pictures, and their descendants may wish to know how their ancestors looked.

Regardless of the motivation for keeping this information available over time, most practitioners and laymen will agree that standards are one way to ensure this happens. Standards provide a common terminology that aid in discussions of repository infrastructure and needs (Beedham, et al., 2005; Lee, 2010). According to the members of the Science and Technology Council of the Academy of Motion Picture Arts and Sciences (2007), when preservationists and curators collaborate among and between industries and domains to create and use standards, the resulting economy of scale should reduce costs for all involved. For example, Galloway (2004) wrote that the proliferation of file formats increased costs, and that this problem must be solved in order to reduce preservation costs.

If costs are reduced, then the likelihood of a community having the resources to preserve and curate the material increases, or, by the same token, the amount of information that can be saved for the same price increases. This is true across the board, as standards beget other standards. If practitioners and researchers develop a standard terminology for a preservation repository, then common standards for metadata, file formats, filenames, metadata registries, and archiving and distribution are likely either to follow or to have preceded the preservation repository standard. In other words, standards development is an iterative process.

In 1995, the Consultative Committee for Space Data Systems (CCSDS) convened to coordinate “the development of archive standards for the long-term storage of archival data” (Beedham, et al., 2005). As part of this task, the members of the CCSDS determined that there was no common model or foundation from which to build an archive standard. Lavoie (2004) describes how the members realized they would have to create terminology and concepts for preservation, characterize the functions of a digital archiving system, and determine the attributes of the digital objects to be preserved. Thus, the members agreed to create a reference model that would describe the minimum requirements of an archival system, including terminology, concepts, and system components. The members of the CCSDS recognized from the beginning that the application of a common model extended beyond the space data system, and they involved practitioners and researchers from across a broad spectrum in academia, private industry, and government (Lavoie, 2004; Lee, 2010).

This essay summarizes the standard attributes of a preservation repository as defined by the CCSDS with the Open Archival Information System (OAIS) Reference Model, and addresses some of the weaknesses of the model.

The OAIS Reference Model: Definition

An Open Archival Information System (OAIS) is an electronic archive maintained as a system by a group or association of people and/or organizations. This member organization has accepted the responsibility of providing access to information for the stakeholders of the electronic archive. These stakeholders are referred to as the Designated Community. The owners and maintainers of the electronic archive have either implicitly or explicitly agreed to preserve the information in the electronic archive and make it available to the Designated Community for the indefinite long-term (CCSDS, 2002).

The CCSDS created the document for the OAIS Reference Model to outline the responsibilities of the owners and maintainers of the electronic archive. If they meet those responsibilities, then the electronic archive may be referred to as an “OAIS archive”. When the CCSDS members used the word “Open” as part of the name of the Reference Model, they referred to the fact that the standard was developed, and continues to be developed, in open forums. They are clear that the use of the word “open” does not mean that access to the OAIS system itself or its contents is unrestricted (CCSDS, 2002).

The OAIS Reference Model: Key Concepts

The members of the CCSDS created three OAIS concepts. They called these the “OAIS Environment”, the “OAIS Information”, and the “OAIS High-level External Interactions”.

The “OAIS Environment” consists of the “Producers”, “Consumers”, and “Management” in the environment that surrounds an OAIS archive. The “Producer” is a system or people who provide the information (data) that is ingested into the archive to be preserved. The “Consumer” is a system or people who use the archive to access the preserved information. “Management” is a role played by people who are not involved in the day-to-day functioning of the archive, but who set overall OAIS policy. Other OAIS or non-OAIS compliant archives may interact with the OAIS archive as either a “Producer” or a “Consumer” (CCSDS, 2002). The CCSDS represented these concepts in Figure 1, below.

Figure 1 – Environment Model of an OAIS (CCSDS, 2002).

The CCSDS defined the “OAIS Information” concept as consisting of the “information definition”, the “information package definition”, and the “information package variants”.

First, the CCSDS defined “information”. Information is “any type of knowledge that can be exchanged, and this information is always expressed (i.e., represented) by some kind of data” (CCSDS, 2002). A person or system’s Knowledge Base allows them to understand the received information (see Figure 2, below). Thus, the principle that “data interpreted using its Representation Information yields Information” means in practice that ASCII characters (the Data) interpreted through knowledge of a language such as English or French (the Representation Information, understood via the reader’s Knowledge Base) yield Information for the person. Therefore, in order for Information to be represented with any meaning to a Designated Community, the appropriate Representation Information for a Data Object must also be preserved.

Figure 2 – Obtaining Information from Data (CCSDS, 2002).

Second, whether data is disseminated to a Designated Community member, or ingested via a Producer, the information must be packaged. The CCSDS described an Information Package as consisting of the Packaging information, the Content Information (the information to be preserved and its representation information), and the Preservation Description Information (provenance, context, reference, and fixity). Provenance describes the source of the information; context provides any related information about the object; reference is the unique identifier or set of identifiers for the content; and fixity assures that the content has not been altered, either intentionally or unintentionally. The Packaging Information binds the Content Information and Preservation Description Information, per Figure 3, below.

Figure 3 – Information Package Concepts and Relationships (CCSDS, 2002).
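As a rough sketch of how these concepts fit together (the class and field names here are illustrative assumptions, not terms mandated by the standard), an Information Package can be modeled as Content Information plus Preservation Description Information, bound by Packaging Information:

```python
# Illustrative sketch only: one way to model an OAIS Information Package.
# Field names are hypothetical; the standard defines concepts, not code.
from dataclasses import dataclass, field
import hashlib

@dataclass
class PreservationDescriptionInformation:
    provenance: str  # source and custody history of the content
    context: str     # how the content relates to other objects
    reference: str   # unique identifier(s) for the content
    fixity: str      # e.g., a checksum showing the bits are unaltered

@dataclass
class InformationPackage:
    content_data: bytes                # the Data Object to preserve
    representation_information: str    # what makes the data interpretable
    pdi: PreservationDescriptionInformation
    packaging_information: dict = field(default_factory=dict)  # binds the parts

    def verify_fixity(self) -> bool:
        """Recompute the checksum and compare it to the recorded fixity value."""
        return hashlib.sha256(self.content_data).hexdigest() == self.pdi.fixity

# Example: build a package and confirm the content is unaltered.
data = b"1969-07-20 telemetry record"
package = InformationPackage(
    content_data=data,
    representation_information="ASCII text: ISO 8601 date followed by a label",
    pdi=PreservationDescriptionInformation(
        provenance="Received from mission archive staging area",
        context="Part of a (hypothetical) telemetry collection",
        reference="urn:example:telemetry:0001",
        fixity=hashlib.sha256(data).hexdigest(),
    ),
)
assert package.verify_fixity()
```

The fixity value here is a SHA-256 checksum; the Reference Model does not prescribe a particular mechanism, only that fixity information allow the archive to demonstrate the content has not been altered.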

Third, the CCSDS defined three variants of the Information Package: the Submission Information Package (SIP), the Archival Information Package (AIP), and the Dissemination Information Package (DIP). These three versions may be the same, but they may also be different. For example, a Producer may submit a SIP to an OAIS archive that is then augmented by the archive managers to meet their policies and standards. Once ingested, the AIP the repository owner stores may or may not be the same as the DIP accessed by the Consumer. Beedham, et al. (2005) criticize the developers of the OAIS Reference Model for assuming that all OAIS archives will have three different versions of an Information Package. The authors note that this concept is not practical for data archives, for example, because all relevant information about a data set must be gathered at the time of submission, and it is impractical to store different versions of an information object within an archive. Thus, a consumer may receive a DIP that is an exact copy of the AIP and the original SIP.

Finally, the CCSDS documented the concepts of the “OAIS High-level External Interactions”, in Figure 4, below. In short, they described the external data flows between and among the actors in an “OAIS Environment”: management, producer, and consumer. The CCSDS provided example interactions for Management, such as: funding, reviews, pricing policies, and “conflict resolution involving Producers, Consumers, and OAIS internal administration” (CCSDS, 2002).

Figure 4 – OAIS Archive External Data (CCSDS, 2002).

The members of the CCSDS described “Producer Interaction” as involving the initial contact, the establishment of a Submission Agreement (which lays out what is to be submitted, how, and the other expectations agreed between the two parties), and the Data Submission Session(s) (in which the SIPs are submitted to the OAIS). The authors of the Reference Model conceded that there might be many types of Consumer Interactions with the OAIS managers. They described a variety of interactions, which include catalog searches, orders, help, etc. Beedham, et al. (2005) again criticized the CCSDS for assuming that all OAIS archives will provide order functions to their Designated Communities. The authors point out that some repository owners’ policies require that data be available for free, particularly when the owner of the archive is a national government agency and the Designated Community comprises taxpayers.

The OAIS Reference Model: Key Responsibilities

The CCSDS established the minimal responsibilities required for a repository to be considered an OAIS archive. The OAIS must:

  • Negotiate for and accept appropriate information from information Producers.
  • Obtain sufficient control of the information provided to the level needed to ensure Long-Term Preservation.
  • Determine, either by itself or in conjunction with other parties, which communities should become the Designated Community and, therefore, should be able to understand the information provided.
  • Ensure that the information to be preserved is Independently Understandable to the Designated Community. In other words, the community should be able to understand the information without needing the assistance of the experts who produced the information.
  • Follow documented policies and procedures which ensure that the information is preserved against all reasonable contingencies, and which enable the information to be disseminated as authenticated copies of the original, or as traceable to the original.
  • Make the preserved information available to the Designated Community (CCSDS, 2002).

Beedham, et al. (2005) wrote that the authors of the OAIS created an “inbuilt limitation” because they assume “both an identifiable and relatively homogeneous consumer (user) community”. They note that this is not the case for national archives and libraries; their Consumers hold a wide variety of skills, educational levels, and knowledge.

The OAIS Reference Model: Key Models

The members of the CCSDS described the functional entities of the OAIS as three models: the “Functional Model”, the “Information Model”, and the “Information Package Transformations”. The authors of the Reference Model included this section to provide a common set of preservation system terminology, and to provide a model from which future systems designers may work.

The OAIS Functional Model

The functional model of the OAIS consists of “six functional entities and related interfaces” (CCSDS, 2002). The six functional entities are Ingest, Archival Storage, Data Management, Administration, Preservation Planning, and Access. A seventh, supporting entity, “Common Services”, is described in the document, but it is not included in the image of the OAIS Functional Entities (see Figure 5, below) because “it is so pervasive”.

Figure 5 – OAIS Functional Entities (CCSDS, 2002).

1. INGEST: Functions of the Ingest entity include accepting SIPs from internal or external Producers and then preparing the SIP(s) for management and storage within the repository. As part of preparing the SIP for storage within the repository, the repository employee in charge of ingest will check the quality of the SIP(s), create an AIP that complies with the standards of the repository and with the Submission Agreement, extract any Descriptive Information, and sync updates between Ingest and Archival Storage/Data Management.
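The ingest steps above can be sketched as a minimal pipeline. The function and field names are hypothetical assumptions for illustration; a real repository would also handle virus scanning, format validation, and the terms of the Submission Agreement:

```python
# Minimal ingest sketch: quality-check a submitted SIP, then build an AIP
# that adds archive-assigned metadata. All names are illustrative only.
import hashlib
import uuid

def quality_check(sip: dict) -> None:
    """Reject a SIP whose payload does not match its declared checksum."""
    actual = hashlib.sha256(sip["payload"]).hexdigest()
    if sip["checksum"] != actual:
        raise ValueError("SIP failed fixity check at ingest")

def create_aip(sip: dict) -> dict:
    """Wrap SIP content as an AIP conforming to a (hypothetical) archive policy."""
    quality_check(sip)
    return {
        "aip_id": str(uuid.uuid4()),                  # archive-assigned identifier
        "content": sip["payload"],
        "representation_information": sip["metadata"],
        "descriptive_information": {                  # extracted for Data Management
            "title": sip["metadata"].get("title", "untitled"),
        },
        "fixity": sip["checksum"],
    }

payload = b"survey responses, wave 3"
sip = {
    "payload": payload,
    "checksum": hashlib.sha256(payload).hexdigest(),
    "metadata": {"title": "Household survey, wave 3", "format": "CSV"},
}
aip = create_aip(sip)
```

Note how the quality check happens before the AIP is created, mirroring the Reference Model's ordering: a SIP that fails validation never reaches Archival Storage.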

Practitioners such as Beedham, et al. (2005) criticized the lack of detail available for the Ingest process; the authors of the Reference Model made it appear to be a very simple function, when, in fact, it can be a very complex process. As a result of this criticism, the CCSDS wrote a more detailed description of the Ingest Process in the Producer-Archive Interface Methodology Abstract Standard (CCSDS, 2004). However, many practitioners are clear that “pre-ingest functions are…essential for efficient and effective archiving” and the authors of the OAIS would serve the preservation repository community better by expanding the Ingest section of the OAIS Reference Model documentation, rather than creating a separate model and documentation (Beedham, et al., 2005).

Partially due to the lack of detail related to Ingest, much less the Ingest of records, archivists and records managers at Tufts University and Yale University applied the OAIS Reference Model and developed an Ingest Guide to aid practitioners in preserving university records (Fedora and the Preservation of University Records Project, 2006). (This project was discussed in a previous literature review on digital curation and preservation.)

2. ARCHIVAL STORAGE: Functions of archival storage include maintaining the integrity of the digital files, including the bits. Thus, the functions of this entity include not only receiving the AIP from Ingest and providing it to Access, but also refreshing and migrating the media and file formats on and in which the data is stored. Other tasks of this entity include error checking and disaster recovery.
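A simple illustration of the error-checking responsibility (the storage layout and field names are assumptions, not part of the standard) is a periodic fixity audit that recomputes each stored object's checksum and flags any object whose bits no longer match:

```python
# Sketch of a periodic fixity audit over stored AIPs. A failed object would
# be recovered from a replica or backup; that step is omitted here.
import hashlib

def audit_fixity(store: dict) -> list:
    """Return the identifiers of stored objects that fail their fixity check."""
    failed = []
    for object_id, record in store.items():
        if hashlib.sha256(record["bits"]).hexdigest() != record["fixity"]:
            failed.append(object_id)
    return failed

good = b"intact record"
store = {
    "aip-001": {"bits": good, "fixity": hashlib.sha256(good).hexdigest()},
    "aip-002": {"bits": b"bit-rotted record",
                "fixity": hashlib.sha256(b"original record").hexdigest()},
}
damaged = audit_fixity(store)  # flags aip-002 for recovery
```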

3. DATA MANAGEMENT: The data management entity provides the functions and services for accessing, maintaining, and populating administrative data and Descriptive Information. These include generating reports from result sets which are based on queries on the data management data; updating the database; and maintaining and administering archive database functions, such as referential integrity and view/schema definitions.

Beedham, et al. (2005) concluded that this entity is a simple idea that is messy in practice. When they mapped the different data management entities, their results produced an “explosion” of connections to all the different archival systems and processes.

4. ADMINISTRATION: The functions of this entity involve the overall management of the archive. This includes setting policies and standards; supporting and aiding the Designated Community; migrating and refreshing the archive contents, software, and hardware; soliciting, negotiating, and auditing Submission Agreements with both internal and external producers; and any other administration-related duties as required.

These functions are designed for large organizations with automated processes; the authors of the Reference Model did not design this entity for small-scale digital repositories (Beedham, et al., 2005). However, most of these functions are an organic part of many archives’ functioning, even if the roles are all performed by one or two people. Beedham, et al. (2005) wrote that the functions of this entity are sufficient for most archives, but the listed tasks do not stand on their own, as each archive has its own set of responsibilities, requirements, procedures, and policies.

5. PRESERVATION PLANNING: The Preservation Planning entity is related to the Administration entity, but it focuses purely on the preservation aspects of maintaining the archive for the indefinite long-term and ensuring the content is available to the Designated Community. The functions of the entity primarily involve monitoring the internal and external environments of the archive to ensure hardware and software are up to date; that the archive follows best practices with regard to the preservation of digital content; and that plans are in place to enable Administration goals, such as migration.

Repository managers criticized this entity because “real” archives do not operate as cleanly as the OAIS Reference Model authors envision; not all decisions and processes can or should be made proactively. Beedham, et al. (2005) concluded that the OAIS is at times overly bureaucratic and formalized.

6. ACCESS: This function provides the Designated Community with a method to obtain the desired information from the archive, assuming such access is not restricted and that the user in question is, in fact, allowed to access this particular information from this particular archive. The services and functions provided by the Access entity allow the Designated Community to determine the existence, location, availability, and description of the stored information. This function provides the information to the Designated Community as a DIP.

7. COMMON SERVICES: The “common services” functional entity refers to supporting services common in a distributed computing environment. These services involve operating systems, network services, and security services. Operating system services include the core services required to administer and operate an application platform and provide an interface. These include: system management, operating system security services, real-time extension, commands and utilities, and kernel operations. Network services provide the means for the archive to operate in a distributed network environment, including: remote procedure calls, network security services, interoperability with other systems, file access, and data communication. Security services protect the content in the archive from external and internal threats by providing the following capabilities and mechanisms: non-repudiation services (i.e., the sender and receiver log copies of the transmission and receipt of the information), data confidentiality and integrity services, access control services, and authentication (CCSDS, 2002).

A detailed mapping of the Ingest, Archival Storage, Data Management, Administration, Preservation Planning, and Access functional entities is included in Figure 6, below.

Figure 6 – Composite of Functional Entities (CCSDS, 2002).

Again, Common Services is not included because it is a supporting service of distributed computing (CCSDS, 2002).

The OAIS Information Model

The Information Model “defines the specific Information Objects that are used within the OAIS to preserve and access the information entrusted to the archive” (CCSDS, 2002). The CCSDS intended for this section to be conceptual, and it is written for an Information Architect to use when designing an OAIS-compliant system. The authors divided the Information Model into three sections: the logical model for archival information, the logical model of information in an open archival information system (OAIS), and data management information.

I. The Logical Model for Archival Information

The CCSDS defined information as a combination of data and representation information. The Information Object itself is either a physical or digital Data Object with Representation Information that “allows for the full interpretation of data into meaningful information” (CCSDS, 2002). The Representation Information provides a method for the data to be mapped to data types such as pixels, arrays, tables, numbers, and characters. The latter are referred to as Structure Information; Semantic Information, in turn, supplements this. Semantic Information examples include the language expressed in the Structure Information, which kinds of operations may be performed on each data type, their interrelationships, etc. Representation Information may also reference other Representation Information; for example, “Representation Information expressed in ASCII needs the additional Representation Information for ASCII, which might be a physical document giving the ASCII Standard” (CCSDS, 2002).

Representation Rendering Software and Access software are two special types of Representation Information. The latter provides a method for some or all of the content of an Information Object to be in a form understandable to systems or a human. The former displays the Representation Information in an understandable form, such as a file and directory structure (CCSDS, 2002).

The CCSDS defined four types of Information Objects: Content, Preservation Description, Packaging, and Descriptive. The Content Information Object is “the set of information that is the original target of preservation by the OAIS” and it may be either a physical or digital object (CCSDS, 2002). In order to determine clearly what must be preserved, an administrator of an archive must determine which part of a Content Information Object is the Content Data Object and which part is the Representation Information.

The CCSDS defined Preservation Description Information as “information that will allow the understanding of the Content Information over an indefinite period of time” (CCSDS, 2002). This descriptive information focuses on ensuring the authenticity and provenance of the Information Objects. The authors of the Reference Model described four parts to the Preservation Description Information: reference (unique identifier(s)), context (why it was created and how it relates to other Information Objects), provenance (the history, origin, and source), and fixity (data integrity checks or validation/verification keys).

As stated previously, the Packaging Information logically binds the pieces of the package onto a specific media via an identifiable entity. Finally, Descriptive Information provides a method for the Designated Community to locate, analyze, retrieve, or order the desired information via some type of Access Aid, which is generally an application interface or document (CCSDS, 2002).

II. The Logical Model of Information in an Open Archival Information System

The authors of the Reference Model described three types of Information Packages that are based on the four types of Information Objects. That is, the Content, Preservation Description, Packaging, and Descriptive Information Objects may be used to create one of three types of Information Packages: the Submission Information Package (SIP), the Archival Information Package (AIP), and the Dissemination Information Package (DIP). The SIP is the data that is sent to an archive by an internal or external Producer. The form and content of the SIP(s) may or may not meet the requirements of the archive ingesting it, and the archive manager may require some additional information to be added prior to ingest, such as a unique ID, checksum validation, virus checks, file name standardization, or additional Representation Information (metadata).

The CCSDS defined the AIP as the Information Package that is stored for the indefinite long-term. The requirements for the Representation Information for an AIP are more stringent than for other types of Information Packages, because this is the actual information that is the focus of preservation. The Information Objects and the Representation Information that comprise an AIP are stored in an archive as one logical unit (Lavoie, 2004).

The authors described two subsets of the AIP, the Archival Information Unit (AIU) and the Archive Information Collection (AIC). The former “represents the type used for the preservation function of a single content atomic object”, while the latter “organizes a set of AIPs (AIUs and other AICs) along a thematic hierarchy….” (CCSDS, 2002). The CCSDS described the Collection Description as a subtype…”that has added structures to better handle the complex content information of an AIC” (CCSDS, 2002). The archive manager may use Collection Description to describe the entire collection or zero or more individual units within the collection. One benefit of Collection Description is the ability to generate new virtual collections based, for example, either on access or theme.

The Dissemination Information Package (DIP) is the Information Package ordered by or provided to the Designated Community. The CCSDS intended for the DIP to be a version of the AIP, but it is entirely possible for the AIP and the DIP to be exactly the same Information Package. Lavoie (2004) described possible variations between an AIP and a DIP. The Designated Community member accessing the archive may receive a different format, for example, a .jpeg instead of a .tiff. The DIP may contain less metadata than is available with the AIP, or even less content, since a DIP may correspond to one or more or even part of an AIP.

III. Data Management Information

Last, the CCSDS included Data Management Information as one part of the Logical Model of Information in an OAIS. That is, the authors of the Reference Model required that information needed for the operation of the archive be stored in the archive databases as persistent data classes. The type of information required includes: statistical information, such as access numbers; customer profile information; accounting information; preservation process history; event-based order information; policy information, including pricing; security information; and transaction-tracking information (CCSDS, 2002). Other data management information may be added to the archive at the discretion of the archive managers or as requested by the Designated Community. However, Beedham, et al. (2005) concluded that the information categories in the Information Model are “too broad, functionally organised…and do not reflect the way metadata are packaged and used across particular archival practice”.

Information Package Transformations

The CCSDS members created the Functional Model to describe the architecture of an OAIS, and the Information Model to describe the content held by the OAIS. The authors also described the lifecycle of the Information Package and any associated objects, as well as its logical and physical transformations.

In short, when a Producer agrees to submit data to an OAIS, a Submission Agreement is created and approved with the OAIS administrator. The Producer then submits data in the form of a SIP to an OAIS, where the OAIS administrator stores it in a staging area. In the staging area, the OAIS manager will perform any necessary transformations to the SIP so it will meet the standards of the OAIS, and the criteria of the Submission Agreement. The OAIS manager will create AIPs from the SIP. This mapping may not be one-to-one. One SIP may produce one AIP or many AIPs, many SIPs may produce one AIP, many SIPs may produce many AIPs, and one SIP may produce no AIPs (CCSDS, 2002). The CCSDS described this process in more detail in the Producer-Archive Interface Methodology Abstract Standard (CCSDS, 2004).
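The non-one-to-one SIP-to-AIP mapping can be illustrated with a toy repackaging policy. The bounded-size rule below is an invented example, not something the standard specifies; it simply shows how many SIPs' contents can be merged and then split into a different number of AIPs:

```python
# Toy illustration of a many-to-many SIP-to-AIP mapping: pool the items from
# all submitted SIPs, then repackage them into AIPs of bounded size.
# The two-items-per-AIP policy is an arbitrary example.
def map_sips_to_aips(sips: list, max_items_per_aip: int = 2) -> list:
    """Merge submitted items, then repackage them into bounded-size AIPs."""
    items = [item for sip in sips for item in sip["items"]]
    return [
        {"aip_items": items[i:i + max_items_per_aip]}
        for i in range(0, len(items), max_items_per_aip)
    ]

sips = [
    {"items": ["fieldnotes.txt"]},
    {"items": ["photo1.tif", "photo2.tif", "photo3.tif"]},
]
aips = map_sips_to_aips(sips)  # 2 SIPs, 4 items -> 2 AIPs of 2 items each
```

Under a different policy the same two SIPs could just as well yield one AIP, four AIPs, or none at all, which is exactly the flexibility the Reference Model describes.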

At the same time as the SIPs are transformed into AIPs and stored in the OAIS, the Data Management functional entity augments the existing Collection Descriptions to include the contents of the Package Descriptions. When a Consumer, i.e., a member of the Designated Community, wishes to access the information contained in an OAIS, the member will do so via the Access functional area. Once the consumer has located the desired information via some type of finding aid, the information is provided to the Consumer in the form of a DIP. The authors of the Reference Model designed the DIP and AIP mapping to be similar to that between SIPs and AIPs. That is, the mapping may or may not be 1:1, depending on whether or not a transformation is performed.

Figure 7 – High-Level Data Flows in an OAIS (CCSDS, 2002).

Based on the Information Package Transformation in Figure 7, above, the authors of the OAIS Reference Model assumed that the OAIS would create a DIP from an AIP on demand from a Consumer. Beedham, et al. (2005) wrote that “this approach has serious drawbacks”. These data repository managers initially created DIPs from an AIP upon demand by a Consumer, but the data is often 5-10 years old at the time of ingest into the archive, and often years older than that when accessed by a Consumer. This often meant that the DIP was not independently understandable by the Consumer, and the researchers who created the data either were no longer available, or could not answer queries regarding the data because too much time had passed. They therefore determined that by creating the DIP at the time of Ingest, they could ensure that the records accessed by the Consumer are in a “technically usable state” (Beedham, et al., 2005).

Beedham, et al. (2005) discovered that by creating the DIP at the time of Ingest, they were able to eliminate many errors in the digital records while they still had co-operation from the Producer. This also improves the understanding and “preservability” of the AIP itself. As well, standard archival practice is to store the original version and provide only a copy to users. In that sense, storing the AIP and creating a DIP at Ingest that is an exact replica of the AIP follows this practice, although “copy” does not have the same meaning in the digital world as it does in the physical. The OAIS Reference Model does not preclude this practice, but neither does it explicitly condone it.

The OAIS Reference Model: Preservation Perspectives

The members of the CCSDS used the Functional Model and the Information Model just described and applied them to information preservation and access service preservation. The former refers to the migration of digital information and the latter to the preservation of the services used to access the digital information.

Information Preservation

The CCSDS defined migration as “the transfer of digital information, while intending to preserve it, within the OAIS” (CCSDS, 2002). The authors distinguished migration from transfers based on three characteristics: the focus is on the preservation of the full information content; the new archival implementation is a replacement for the old; the responsibility for and full control of the transfer reside within the OAIS. The CCSDS (2002) members described three primary drivers for migration: the media on which the information resides is decaying; technology changes; and, the improved cost-effectiveness of newer technology over older or obsolete technology.

The committee members defined four types of migration: refreshment, replication, repackaging, and transformation. They determined that Refreshment refers to the replacement of a media instance with a similar piece of media, such that the bits comprising the AIP are simply copied over. An example of this would be replacing a computer disk. The authors defined Replication as a bit transfer to the same or a new media type, where there is “no change to the PDI, the Content Information, and the Packaging Information” (CCSDS, 2002). An example of replication would be a full backup of the contents of an OAIS. The CCSDS described Repackaging as a change to the Packaging Information during transfer. If files from a CD-ROM are moved to new files on another media type, with a new file implementation and directory, then the files have been Repackaged.

Last, the CCSDS (2002) defined Transformation as “some change in the Content Information or PDI bits while attempting to preserve the full information content”. If an AIP undergoes Transformation, then the new AIP is considered a new Version of the previous AIP. For example, a file in the .doc format may be transformed to a .pdf for preservation purposes. Some transformations are Reversible, while others are Non-reversible. The CCSDS members state that only when an AIP is migrated using Transformation is the resulting AIP considered a new version; the AIP version is independent of Refreshment, Replication, and Repackaging.
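The four migration types and their effect on AIP versioning can be summarized in a short sketch. The function and dictionary names are hypothetical; the versioning rule follows CCSDS (2002), under which only Transformation produces a new Version of an AIP.

```python
# Hypothetical sketch of the four OAIS migration types (CCSDS, 2002).
# Only Transformation changes Content Information or PDI bits, and only
# Transformation results in a new Version of the AIP.

MIGRATION_TYPES = {
    "refreshment":    "copy bits to a similar media instance (e.g., a new disk)",
    "replication":    "copy bits to the same or a new media type (e.g., a full backup)",
    "repackaging":    "change Packaging Information (e.g., new file/directory layout)",
    "transformation": "change Content Information or PDI (e.g., .doc to .pdf)",
}

def migrate(aip_version: int, migration_type: str) -> int:
    """Return the AIP's Version number after the given migration."""
    if migration_type not in MIGRATION_TYPES:
        raise ValueError(f"unknown migration type: {migration_type}")
    # Only Transformation yields a new Version of the AIP; the Version
    # is independent of Refreshment, Replication, and Repackaging.
    return aip_version + 1 if migration_type == "transformation" else aip_version
```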

Access Service Preservation

As part of examining preservation perspectives, the members of the CCSDS briefly addressed how to continue to provide Consumers access services as technology changes. A method archive managers use to maintain access is to develop Application Programming Interfaces (APIs) to provide access to AIPs. Another method they incorporate is to use emulation or provide the original source code to provide access to a set of AIUs while maintaining the same “look and feel” as the original access method.

The OAIS Reference Model: Archive Interoperability

A community of users and managers of digital repositories may wish to share data or cooperate with other archives. The reasons for this may vary; in some cases, the repository managers may wish to provide mutual back up and replication services with a similar archive, in order to prevent data loss and reduce costs. In another instance, a user community may prefer one point of entry to search for required information across multiple digital archives. Regardless of the motivations of an archive owner for interoperating with another archive, the interactions may be defined by two categories, technical and managerial.

The CCSDS defined four types of interoperating archives: independent, cooperating, federated, and shared resources. They described an independent archive as one that does not interact with other archives. There is no technical or management interaction between this type of archive and other archives. The authors defined cooperating archives as those archives that do not have a common finding aid, but otherwise share common dissemination standards, submission standards, and producers.

The members of CCSDS (2002) wrote that a federated archive consists of two communities, Local and Global, and those archives “provide access to their holdings via one or more common finding aids”. They note that Global dissemination and Ingest are optional, and that the needs of the Local community tend to take precedence over the Global community. Furthermore, they described three levels of functionality for a Federated archive: Central Site (i.e., one point of entry to all archive content via metadata harvested by the central site), Distributed Finding Aid (i.e., federated searching of all archives), and Distributed Access Aid (i.e., a “standard ordering and dissemination mechanism”) (CCSDS, 2002). They wrote that federated archives tend to have similar policy and technology issues, such as authentication and access management, preservation of federation access to AIPs, duplicate AIPs, and providing unique AIPs.

Last, the authors described “shared resources”, where archives enter into agreements to share resources for their mutual benefit, often to reduce costs. They wrote that this type of agreement does not alter the view of the archives by their respective Designated Communities; it merely requires the implementation of a variety of standards internal to the archive, such as ingest-storage and access-storage interface standards (CCSDS, 2002).

The CCSDS described the primary management issue related to archive interoperability in one word: autonomy. The members of the CCSDS (2002) characterized three primary autonomy levels: no association because there are no interactions; an association member’s autonomy with regards to the federation is maintained; and association members are bound to the federation by a contract.

The OAIS Reference Model: Compliance

What does it mean to be “OAIS Compliant”? The members of the CCSDS stated that if a repository “supports the OAIS information model”, commits to “fulfilling the responsibilities listed in chapter 3.1 of the reference model”, and uses the OAIS terminology and concepts appropriately, then the archive is compliant (CCSDS, 2002; Beedham, et al., 2005). When the members of the CCSDS wrote the Reference Model documentation, they did not recommend any particular concrete implementation of hardware, software, etc., as the authors deliberately designed it to be a conceptual framework. How, then, may an archive owner, manager, or member of a Designated Community “prove” that the archive of interest is, in fact, OAIS-compliant?

One method to audit OAIS-compliance is to create a set of standards that define the attributes of a trusted digital repository. The Research Libraries Group (RLG) and the Online Computer Library Center (OCLC) funded the development of the attributes of a “trusted digital repository” in March 2000. The two groups produced a report that defined the attributes and responsibilities of a trusted digital repository in 2002 (Research Libraries Group, 2002). Beedham, et al. (2005) note that the authors of the report put compliance with the OAIS Reference Model first on the list of attributes of a trustworthy repository.

Based on this report, RLG, OCLC, the Center for Research Libraries (CRL), and the National Archives and Records Administration (NARA) produced a “criteria and checklist” in 2005 called “Trustworthy Repositories Audit & Certification: Criteria and Checklist” (Research Libraries Group, 2005). The authors designed it so that archive managers could use it for audit and certification of the archive. Experts in the field merged the RLG and OCLC report from 2002 and the “Criteria and Checklist” from 2005 to develop a Recommended Practice under the auspices of the CCSDS. They called the document the “Audit and Certification of Trustworthy Digital Repositories Recommended Practice”, and the CCSDS released it in September 2011 to provide a basis for the audit and certification of the trustworthiness of a digital repository by providing detailed criteria by which an archive shall be audited (CCSDS, 2011). These documents will be discussed in detail in a separate literature review.

One criticism of the OAIS is that it is challenging to develop a from-scratch repository using the Reference Model. Egger (2006) conducted a use case analysis as part of a standard software development process, and determined that he must “develop additional specifications which fill the gap between the OAIS model and software development”. He wrote that it was difficult to map OAIS functions as use case scenarios, because the descriptions contain different levels of detail. For example, he states that some functions are written as general guidelines, while others are “specified nearly at the implementation level” (Egger, 2006). He also criticizes the authors for mixing technical functionality with management functionality, because in order to develop a technical system, the management functions must be removed. Egger (2006) recommends creating additional specifications that would “define system architectures and designs that conform to the OAIS model”, although he notes that the OAIS Reference Model is not a technical guideline.

Beedham, et al. (2005) wrote that as repository managers, they have to consider other legislation, standards, guidelines, and regulations when determining the archive’s OAIS compliance. For example, they must provide web access to the disabled as part of their charter as national archives, and they have specific responsibilities to the data depositor (the Producer) with regards to Intellectual Property and statistical disclosure. The authors of the Reference Model did not discuss how to comply with such legislation and regulations when doing so would make the archive in question “not OAIS-compliant”, if audited.

The OAIS Reference Model: Example Deployments

Ball (2006) examined the OAIS Reference Model to determine its applicability to engineering repositories. Two common generic repository systems that use the OAIS Reference Model are DSpace and Fedora. The creators of DSpace designed it primarily for Institutional Repositories, while the researchers behind Fedora designed it to be a digital library that stores multimedia collections. Ball found five custom repositories that claim to be OAIS-compliant: the Centre de Données de la Physique des Plasmas (CDPP), MathArc, the European Space Agency (ESA) Multi-Mission Facility Infrastructure (MMFI), the National Oceanic and Atmospheric Administration (NOAA) Comprehensive Large Array-data Stewardship System (CLASS), and the National Space Science Data Center (NSSDC). While Ball did discuss the efforts of RLG, OCLC, CRL, and NARA to provide a method for audit and certification, he did not note whether or not the creators and owners of DSpace, Fedora, or any of the custom systems, or their users, had formally audited any of the repository software for OAIS compliance.

Vardigan & Whiteman (2007) did apply the OAIS Reference Model to the social science data archive for the Inter-university Consortium for Political and Social Research (ICPSR). The authors wished to determine their repository’s conformance to the OAIS Reference Model. After an extensive audit, they realized that the ICPSR digital repository did fulfill many of the key responsibilities of an OAIS archive, with two exceptions. First, they needed to publish a preservation policy, and second, they discovered that their Preservation Description Information was not always clearly labeled and was often incomplete (Vardigan & Whiteman, 2007).

Data grids are an example of a general systems deployment of the OAIS Reference Model. A grid administrator may map the policies and procedures that govern the data flow of the data grid to specific OAIS components. For example, if the grid administrator would like to create authentic copies, then s/he will implement access policies that govern the generation of DIPs. The grid administrator may implement replication and integrity checking by implementing storage policies; and may implement the processing of SIPs and the creation of AIPs by implementing ingest policies (Reagan Moore, personal communication, December 22, 2011). Other specific OAIS components may be mapped to the data grid’s policies and procedures data flow as needed; these are but a few examples.
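The policy-to-component mapping just described might be sketched as a simple lookup. This is a minimal sketch assuming illustrative policy and function names; none are drawn from any particular data grid implementation.

```python
# Illustrative mapping of data-grid policy types to the OAIS functions
# they implement, following the examples in the text. Names are hypothetical.

POLICY_TO_OAIS = {
    "ingest":  ["SIP processing", "AIP creation"],
    "storage": ["replication", "integrity checking"],
    "access":  ["DIP generation", "authentic copies"],
}

def oais_components(policy_type):
    """Return the OAIS functions a given grid policy type governs."""
    return POLICY_TO_OAIS.get(policy_type, [])
```

A grid administrator could extend such a table with further policy types as additional OAIS components are mapped to the grid’s data flow.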

The OAIS Reference Model: Other Criticisms

Higgins and Semple (2006) compiled a list of recommendations for updates to the OAIS Reference Model in preparation for the CCSDS’ review of the recommendation at the five-year mark. The authors compiled the list of recommendations on behalf of the Digital Curation Centre and the Digital Preservation Coalition. Among the general recommendations, the authors listed: supplementary documents such as OAIS-lite for managers, a self-testing manual, an implementation checklist, and a best practice guide. The authors requested more concrete and up-to-date examples for implementers.

Higgins and Semple noted the CCSDS’ tendency to be very prescriptive and detailed in some sections, and overly general in others. They re-iterated that the CCSDS should create a better description of minimal requirements, as not everything must be implemented. The authors requested a review of the terminology clashes between the OAIS Reference Model, PREMIS, and other standards, and asked the CCSDS to resolve these differences. Higgins and Semple requested terminology and clarification updates by chapter, including updates to words such as “repository”, “preservation”, “security”, etc. They also identified a variety of outdated material.

The members of the CCSDS Data Archiving and Ingest Working Group did respond to this list of recommendations. They adopted some of the recommendations and made changes to the text of the OAIS Reference Model, but they refused to make other requested changes. Higgins and Boyle (2008) compiled a response to the CCSDS, again on behalf of the Digital Curation Centre and the Digital Preservation Coalition. Their concerns related to the changes rejected by the CCSDS Data Archiving and Ingest Working Group. Higgins and Boyle (2008) wanted “to ensure that the revised standard” would:

  • remain up-to-date until the next planned review;
  • remain applicable to the current heterogeneous user base;
  • be easier to understand through a structure which clearly delimits normative text, use cases and examples;
  • contain guidelines on how to achieve an implementation;
  • follow ISO practice by clearly referencing other applicable standards; and,
  • clarify its applicability to digital material (Higgins & Boyle, 2008).

It will be interesting to note which, if any, of these recommendations the members of the CCSDS include in the next revision of the OAIS Reference Model.

Conclusions and Future Work

Practitioners note that one benefit of the OAIS Reference Model has been “the utility of the OAIS language as a means of communication” between partnering repository administrators, who often had different terminology (Beedham, et al., 2005). The authors recommend that current archives should adopt the OAIS language in lieu of their own terminology, and new archive administrators should adopt it from the inception of the archive. Allinson (2006) writes that the OAIS Reference Model “ensures good practice”, as it “draws attention to the important role of preservation repositories” by providing a standard model so that preservation is considered part and parcel of other archive functions and activities. When the CCSDS outlined an archive manager’s Mandatory Responsibilities, the authors asked only that an archive’s “preservation has been planned for and a strategy identified”, as most repository managers already fulfill those tasks as a de facto part of the repository’s functioning (Allinson, 2006).

One area of future work may be to create an “OAIS lite” for smaller archives, which do not have the personnel or need for such a bureaucratic model (Beedham, et al., 2005). Another area for future work is to de-homogenize the definition of Designated Community, as not every repository has a narrow audience of users. The CCSDS might consider recommending other metadata documentation to supplement the Reference Model, or create a separate recommendation, similar to the way the Producer-Archive Interface Methodology Abstract Standard (CCSDS, 2004) supplements the Ingest entity. This documentation would describe how the different information packages break down or how to apply metadata schemas (Beedham, et al., 2005; Allinson, 2006).

Egger (2006), Allinson (2006), and Beedham, et al. (2005), among others, complained that the authors of the OAIS Reference Model are inconsistent in the specifications, as some specifications are very general, while others are very detailed. Therefore, one area for future work is for the CCSDS to create consistency within the Reference Model document with regards to specificity. Finally, Beedham, et al. (2005) concluded that the authors of the Reference Model may want to re-word the recommendation to take into account that a SIP, AIP, and DIP may all be one and the same, rather than assume that each of these is a different type of Information Package.

In spite of the various criticisms, the overall conclusion from a variety of experienced repository managers is that the authors of the OAIS Reference Model created flexible concepts and common terminology that any repository administrator or manager may use and apply, regardless of content, size, or domain (e.g., academia, private industry, and government).

References

Allinson, J. (2006). OAIS as a reference model for repositories: an evaluation. Bath, England: UKOLN. Retrieved December 19, 2011, from http://www.ukoln.ac.uk/repositories/publications/oais-evaluation-200607/Drs-OAIS-evaluation-0.5.pdf

Ball, A. (2006). Briefing paper: the OAIS Reference Model. Bath, England: UKOLN. Retrieved December 19, 2011, from http://homes.ukoln.ac.uk/~ab318/docs/ball2006oais/

Beedham, H., Missen, J., Palmer, M. & Ruusalepp, R. (2005). Assessment of UKDA and TNA compliance with OAIS and METS standards. UK Data Archive and The National Archives. Retrieved December 20, 2011, from http://www.jisc.ac.uk/uploaded_documents/oaismets.pdf

CCSDS. (2002). Reference model for an Open Archival Information System (OAIS) (CCSDS 650.0-B-1). Washington, DC: National Aeronautics and Space Administration (NASA). Retrieved April 3, 2007, from http://nost.gsfc.nasa.gov/isoas/

CCSDS. (2004). Producer-archive interface methodology abstract standard (CCSDS 651.0-B-1). Washington, DC: National Aeronautics and Space Administration (NASA). Retrieved August 18, 2007, from http://public.ccsds.org/publications/archive/651x0b1.pdf

CCSDS. (2011). Audit and Certification of Trustworthy Digital Repositories (CCSDS 652.0-M-1). Magenta Book, September 2011. Washington, DC: National Aeronautics and Space Administration (NASA).

Egger, A. (2006). Shortcomings of the Reference Model for an Open Archival Information System (OAIS). IEEE TCDL Bulletin, 2(2). Retrieved October 23, 2009, from http://www.ieee-tcdl.org/Bulletin/v2n2/egger/egger.html

Fedora and the Preservation of University Records Project. (2006). 2.1 Ingest Guide, Version 1.0 (tufts:central:dca:UA069:UA069.004.001.00006). Retrieved April 16, 2009, from the Tufts University, Digital Collections and Archives, Tufts Digital Library Web site: http://repository01.lib.tufts.edu:8080/fedora/get/tufts:UA069.004.001.00006/bdef:TuftsPDF/getPDF

Galloway, P. (2004). Preservation of digital objects. In B. Cronin (Ed.), Annual Review of Information Science and Technology, 38(1), (pp. 549-590).

Higgins, S. & Boyle, F. (2008). Responses to CCSDS’ comments on the ‘OAIS five-year review: recommendations for update 2006’. London: Digital Curation Center and Digital Preservation Coalition.

Higgins, S. & Semple, N. (2006). OAIS five-year review: recommendations for update. London: Digital Curation Center and Digital Preservation Coalition.

Lavoie, B. (2004). The open archival information system reference model: introductory guide. Technology Watch Report. Dublin, OH: Digital Preservation Coalition. Retrieved March 6, 2007, from http://www.dpconline.org/docs/lavoie_OAIS.pdf

Lee, C. (2010). Open archival information system (OAIS) reference model. In Encyclopedia of Library and Information Sciences, Third Edition. London: Taylor & Francis.

Research Libraries Group. (2002). Trusted digital repositories: attributes and responsibilities an RLG-OCLC report. Mountain View, CA: Research Libraries Group. Retrieved September 11, 2007, from http://www.oclc.org/programs/ourwork/past/trustedrep/repositories.pdf

Research Libraries Group. (2005). An audit checklist for the certification of trusted digital repositories, draft for public comment. Mountain View, CA: Research Libraries Group. Retrieved April 14, 2009, from http://worldcat.org/arcviewer/1/OCC/2007/08/08/0000070511/viewer/file2416.pdf

Science and Technology Council. (2007). The digital dilemma strategic issues in archiving and accessing digital motion picture materials. The Science and Technology Council of the Academy of Motion Picture Arts and Sciences. Hollywood, CA: Academy of Motion Picture Arts and Sciences.

Vardigan, M. & Whiteman, C. (2007). ICPSR meets OAIS: applying the OAIS reference model to the social science archive context. Archival Science, 7(1). Netherlands: Springer. Retrieved February 20, 2008, from http://www.springerlink.com/content/50746212r6g21326/

OAIS Reference Model & Preservation Design Summary



Manage Data: Preservation Standards & Management


Abstract

Archivists, librarians, computer scientists and other researchers and scientists have been concerned about the long-term survivability of data for decades. This data may be in the form of actual data sets, or data that represents and describes published works, art, video, audio, or other file formats. This literature review describes the emergence of digital curation and digital preservation standards in the context of managing data. Standards for digital curation and digital preservation augment the ability of data owners and users to ensure the survivability of their data, but these standards do not directly “cause” the long-term preservation of the data itself. The conclusion is that the survivability of data depends on the will and desire of the data owners and users, and on the availability of the financial resources to preserve it.

Citation

Ward, J.H. (2012). Managing Data: the Emergence & Development of Digital Curation & Preservation Standards. Unpublished manuscript, University of North Carolina at Chapel Hill. (pdf)

Creative Commons License
This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.


Table of Contents

Abstract

Introduction

Why Preserve and Curate Data?

Basic Definitions

Motivating Factors for the Development of Digital Curation and Digital Preservation Standards

Persistence

Overview of the OAIS Reference Model and the Audit and Certification of Trustworthy Digital Repositories Recommended Practice

Applications of the Audit and Certification of Trustworthy Digital Repositories Recommended Practice and the OAIS Reference Model

Other Technical Issues

General Digital Repository Management

Funding

Current Status and Future Challenges/Further Work

Conclusion

References


Table of Figures

Figure 1 – the Digital Curation Centre Curation Lifecycle Model (Higgins, 2007).

Figure 2 – Scientific and Technical Information Lost Over Time (Nelson, 2000).

Figure 3 – Core data grid capabilities and functions for implementing a persistent archive (Moore & Merzky, 2002).

Figure 4 – the OAIS Reference Model “Functional Model” (CCSDS, 2002).


Introduction

Librarians and archivists have spent centuries wringing their hands and having multiple heated discussions about the best method for recording information for the purposes of transmitting it to succeeding generations. It could be, perhaps, that librarians and archivists of ancient Egypt were in agreement that clay tablets were the best form to transmit information, but one imagines that even within that framework there was much discussion with many agreements and disagreements. Which clay would last? Who was the best potter to fire it? How do we know that potter is actually selling us the quality of clay that he promised? “There must be standards!”…and so on and so forth. Regardless, those ancient librarians and archivists chose well, as 5,000 years later, those clay tablets have stood the test of time and are still readable now (Krasner-Khait, 2001).

Then “someone” determined that papyrus was better than clay as an information transmission form. After all, it was lighter, it couldn’t break, and it was much easier to carry over long distances. The material required less storage space, as well, which would reduce overall costs. One can imagine the “old school” librarians and archivists with their clay fetish, snubbing the new papyrus advocates. However, the papyrus advocates eventually won, and the rolls of papyrus replaced clay tablets as the information medium of choice. Papyrus remained the primary information storage method of choice for around 3,000 years, until the development of the codex by the Romans in the first century A.D. (Zen College Life, 2011).

One can only imagine the consternation old-school papyrus librarians and archivists faced with the invention of the codex. Should they change all of their holdings of clay tablets and papyrus rolls to codices? Should they leave this information in the old technologies and only store new information in the codex format? How many resources of time, money, and personnel would it take to migrate information from the old formats to the new? By 300 A.D., the codex was as popular as the papyrus scroll, and it remains the first and current format used for the Christian Bible. These debates (and one can be sure there were debates) were not purely academic. There were then, as now, practical reasons to be concerned with the transmission of historical, cultural, political, and literary information to succeeding generations. By the time Gutenberg invented the movable type press in the 15th Century, the codex had evolved into the book, and another information revolution occurred. Books became more prevalent, and no doubt librarians and archivists of Western Europe, Asia and the Middle East felt an information deluge of their own as they figured out how to organize, lend, copy, store, and find these books as libraries and archives grew and evolved from the middle ages to the 20th Century (Zen College Life, 2011).

The mid-20th Century brought the computer, and then networked computers that share and store information as bits and bytes. The formats these bits are stored in evolve every few years, as do the software to run the formats, and the hardware that runs the software. Format changes now occur every few years, and make the 3,000 year reign of clay tablets as the information transmission form of choice unimaginable. Yet, one is certain that current librarians and archivists are solving the same problems their counterparts faced 5,000 years ago. How do you select, preserve, maintain, collect and archive information in order to make it available to succeeding generations? This is the essence of curation, whether digital or physical. However, the focus of this paper is to discuss the curation and preservation of binary data; therefore, curation methods as applied to physical artifacts are out of the scope of this discussion.

Why Preserve and Curate Data?

There are many, many motivations for preserving data, regardless of the content. It would be challenging to cover every possible reason why some person or organization might want to curate and preserve their data. A few themes are common, though. In some instances, preservation is motivated by the human desire to preserve the current record (in a general sense) for future generations to access and use. Other motivations may be more base: to help a particular company or organization comply with legal requirements, or to provide a source of revenue. In some cases, cultural heritage concerns may overlap with financial incentives, such as with digital movies. For example, executives at movie companies have a huge financial incentive to ensure that their libraries are accessible in the future as formats change, so that they may sell and re-sell their titles for public consumption (Science and Technology Council, 2007). These films also represent the cultural heritage of humanity, whether the film in question is “Harold & Kumar Escape from Guantanamo Bay” or “Citizen Kane”. In other organizations such as the National Archives, federal legal requirements overlap with a professional desire and charge to preserve the United States’ materials “for the life of the republic” (Thibodeau, 2007). Individuals’ health records must be available for the life of the person. Most of us would like our photographs to be accessible by our descendants and relatives, and not lost to a hard drive crash. These are but a few examples of “what” and “why” data are deemed preservation-worthy.

Basic Definitions

“Archive”, “digital archive”, “data”, “information”, “knowledge”, “wisdom”, “digital preservation”, “digital curation”, “reliable”, “authentic”, “integrity”, and “trustworthy”.

Tibbo (2003) writes that computer scientists tend to use “archive” simply as a term to describe the storage and backup of digital data in an offline electronic environment, while archivists see the process of archiving data as part of a complex process that encompasses the entire lifecycle of a digital object (Waters & Garrett, 1996; Higgins, 2007). One may also view “archive” in the computer science sense as simply storing data, whereas an “archive” to an archivist is an entire information system lifecycle that encompasses data, information, knowledge, and, perhaps, wisdom that will be made accessible for the indefinite long-term.

As well, practitioners who work with digital libraries and digital archives often use “digital library” to mean a “digital archive”, and vice versa. What, then, is a digital archive?

Waters and Garrett (1996) defined

digital archives strictly in functional terms as repositories of digital information that are collectively responsible for ensuring, through the exercise of various migration strategies, the integrity and long-term accessibility of the nation’s social, economic, cultural and intellectual heritage instantiated in digital form. Digital archives are distinct from digital libraries in the sense that digital libraries are repositories that collect and provide access to digital information, but may or may not provide for the long-term storage and access of that information. Digital libraries thus may or may not be, in functional terms, digital archives and, in fact, much of the recent work on digital libraries is notably silent on the archival issues of ensuring long-term storage and access….Conversely, digital archives necessarily embrace digital library functions to the extent that they must select, obtain, store, and provide access to digital information. Many of the functional requirements for digital archives defined in this report thus overlap those for digital libraries.

The Society of American Archivists (1999) defines the core curation functions of any archive as appraisal, accession, arrangement, description, preservation, access and use. The basic archival principles remain the same whether an archive contains physical artifacts or data (Hedstrom, 1995). How an archivist applies these concepts may vary depending on the digital objects or physical artifacts to be preserved. Within the limitations of digital data, however, most applications of a data archive as of this writing use the Open Archival Information System (OAIS) (Consultative Committee for Space Data Systems, 2002) as a reference model. This model will be discussed briefly in a later section. However, the Consultative Committee for Space Data Systems (2002) notes that an “OAIS Archive” is distinguished from other uses of the term “archive” because it consists of an “organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community”. The archive must meet the set of responsibilities outlined in the OAIS Reference Model to be considered an “OAIS archive” (Consultative Committee for Space Data Systems, 2002). Otherwise, it is merely an “archive”.

Data are “any information in digital form” (Higgins, 2007) that “correspond to the bits (zeroes and ones) that comprise a digital entity” (Moore, 2002). Data include both simple and complex objects, as well as structured collections. A simple object may be a text file or image; a complex object may comprise an entire web site; and a database is an example of a structured collection (Higgins, 2007). Furthermore, Galloway (2004) notes that to be digital the objects must “require a computer to support their existence and display”.

Moore (2002) writes from a computer science perspective that information “corresponds to any tags associated with bits”, while Buckland (1991) defines information through the lens of Information Science. He describes “information-as-process”, “information-as-knowledge”, and “information-as-thing”. According to Buckland, “information-as-process” is the act of informing, while “information-as-knowledge” is the actual knowledge communicated during “information-as-process”. He defines “information-as-thing” by objects such as text and data, for example, because they impart and communicate knowledge; and he notes that knowledge may be contained in text, etc., that describes these information objects. Ackoff (1989) takes a management science approach and posits that information is contained in answers to questions posed with “who”, “what”, “where”, and “when”.

Knowledge “corresponds to any relationship that is defined between information attributes” (Moore, 2002); it is the application of data and information. Knowledge refines information and makes “possible the transformation of information into instructions” by answering the “how” questions (Ackoff, 1989). Wisdom is at the pinnacle of Ackoff’s hierarchy as an ideal state that evaluates the long-term consequences of an act. One might argue that repositories with audit mechanisms to ensure “authenticity” and “trust” apply wisdom in the form of policies to curate data, information, and knowledge “as things”.

The phrases “digital curation” and “digital preservation” are often used interchangeably, but they have slightly different meanings. The term “digital preservation” refers to a “series of managed activities necessary to ensure continued access to digital materials for as long as necessary” (Digital Preservation, 2009). Members of the Digital Preservation Coalition made this definition deliberately broad in order to refer “to all of the actions required to maintain access to digital materials beyond the limits of media failure or technological change” (Digital Preservation, 2009). As part of that definition, these members framed digital preservation as short-term, medium-term, and long-term. They defined “short-term” as access to the materials for the foreseeable future or for a defined period of time; “medium-term” as providing access for the near-term but not indefinitely; and, “long-term” as providing continued access to the materials for the indefinite future (e.g., as long as possible). Hedstrom (1995) writes that

preservation of an electronic record entails retaining its content; maintaining the ability to reproduce its structure; and providing linkages between an archival document and related records, its creator and recipient, the function or activity that it derived from, and its place in a larger body of documentary evidence.

Researchers and practitioners at the Digital Curation Centre (DCC) have defined digital curation as involving “maintaining, preserving and adding value to digital research data throughout its lifecycle” (Digital Curation Centre, 2010). An archivist, librarian or other data manager begins curation at the time the collection is assembled or acquired. He or she actively manages the collection in order to “mitigate the risk of digital obsolescence” and “to reduce threats to [the data’s] long-term research value” (Digital Curation, 2010). According to DCC researchers and practitioners, curation serves two other primary purposes that include providing a means to share data and reducing duplication of effort in data creation.

Higgins (2007) conceptualized an ideal model of digital curation as a lifecycle with three primary areas: full lifecycle actions, sequential actions, and occasional actions. These actions may be applied across the entire digital lifecycle or sequentially through it (Higgins, 2007). She defines “full lifecycle actions” as encompassing preservation planning; description and representation information; and curation and preservation. Higgins models sequential actions as: conceptualization; creation or reception; access, use, and re-use; appraisal and selection; ingestion; storage; preservation action; and transformation.

Figure 1 – the Digital Curation Centre Curation Lifecycle Model (Higgins, 2007).

The significance of the model is that it provides a visual tool and summary from which a repository manager may plan the curation tasks appropriate for the collection and the repository at any stage in the curation lifecycle.

Duranti (1995) defined the terms “reliability” and “authenticity” based on diplomatic concepts. A record is “reliable” when the degree of completeness of its form and the degree of control over its procedure of creation meet the requirements of the socio-juridical system in which it is created. A reliable record is a “fact in itself, that is, as the entity of which it is evidence” (Duranti, 1995). If a document is what it claims to be, then the document is considered authentic. However, just because a document is authentic does not make it reliable. If a record is authentic, then it “does not result from any manipulation, substitution, or falsification occurring after the completion of its procedure of creation” (Duranti, 1995). Reliability takes precedence over authenticity.

The way to guarantee both reliability and authenticity is to have a standard for record completeness, a controlled procedure for creation, and a procedure to control the transmission and storage of the records. For example, a birth certificate will be considered reliable and authentic if all fields required by law have entries; the person providing the information has the authority to do so (i.e., is the attending physician or midwife) and draws on knowledgeable sources (i.e., one or both parents as well as his or her own attendance at the birth); the authorized person enters the information correctly; the parents provide correct information to begin with; and the birth certificate is stored in a government repository with access controls over its records. If a parent or physician provides false information on the birth certificate and the government stores it, then subsequent copies of the birth certificate may be authentic, but they will not be reliable.

In order to provide reliable, authentic records in a digital environment, the keepers of the data objects must be able to maintain the objects’ integrity and provide evidence that the repository itself is trustworthy. The primary evidence of an object’s integrity relates to its content, fixity, reference, provenance, and context (Waters & Garrett, 1996). Integrity builds upon, and to some degree is concerned with, authenticity, but it is not security (Lynch, 1994). Some examples of integrity violations include bit flipping, data corruption, disk errors, and malicious intrusions (Sivathanu, Wright, and Zadok, 2005).
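In practice, repositories commonly document fixity by recording a cryptographic checksum at ingest and recomputing it on a schedule; a mismatch signals bit flipping or other silent corruption of the kind Sivathanu, Wright, and Zadok describe. A minimal sketch of such a check (the function names and the choice of SHA-256 are illustrative assumptions, not drawn from the sources cited above):

```python
import hashlib

def fixity_digest(path, algorithm="sha256"):
    """Compute a fixity checksum for a file, reading it in chunks
    so that large archival objects do not have to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_fixity(path, recorded_digest, algorithm="sha256"):
    """Return True if the file still matches the digest recorded at ingest."""
    return fixity_digest(path, algorithm) == recorded_digest
```

A repository would store the digest alongside the object's metadata at ingest and re-run `verify_fixity` during periodic audits; note that a checksum detects corruption but, as Lynch observes of integrity generally, does not by itself provide security.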

At a minimum, for a repository to be trustworthy, it must begin with “‘a mission to provide reliable, long-term access to managed digital resources to its Designated Community, now and into the future'” (Consultative Committee for Space Data Systems, 2011). Both Waters & Garrett (1996) and the Consultative Committee for Space Data Systems (2011) prefer that repository managers conduct transparent audits of the system itself in order to assure “trustworthiness” to both internal and external stakeholders.

Motivating Factors for the Development of Digital Curation and Digital Preservation Standards

The movement to set standards for preservation and curation developed to bring order to chaos and to provide the information necessary so that individuals and organizations may make informed decisions about which data objects are reliable and authentic, and which repositories are trustworthy and mindful of data object integrity. That is, practitioners need to be able to determine whether the people running a repository are actually doing so in a way that will preserve the objects for the specified time required, and in such a way that those objects can be found. More importantly, practitioners and users also must be certain that the objects preserved are both authentic and reliable. One way to ensure the reliability, authenticity, integrity, and trustworthiness of data objects and the repositories that house them is for the stakeholders to come together and agree on procedures and definitions for these qualities, and, in the process, create standards for digital curation and digital preservation.

Previously, different industries worked within their own domains to develop standards for preservation and curation. Book publishers worked within book publishing, filmmakers within filmmaking, and so on (Science and Technology Council, 2007). The mass use of digital data has created the need for broad standards that cross all industries. Unlike with physical media, where knowledge of how to preserve one format (e.g., paper) does little to help preserve another (e.g., film), a digital file is a digital file, whether it resides in a repository at the Library of Congress or on a graphic designer’s personal laptop. All industries face similar problems; a short list includes format obsolescence, physical media changes, hardware and software migrations, personnel costs, and the costs of storing all of this data in perpetuity and making it accessible.

The latter — cost — ranks among the highest concerns. For example, the cost of storing a 4K digital master of a movie is 1,100 times higher than storing the same master as film (Science and Technology Council, 2007). A collection may be deemed worthy of saving in perpetuity by a consensus of experts, but without any resources to make that happen, the most one can hope for is that the machine the data is stored on will be turned off and put in a temperature-controlled closet unless and until “someone” finds it and migrates the data to a new resource. (This preservation method assumes the data can be migrated and that there has not been any physical deterioration of the machine or disks, etc., during the time it was in storage.)

What is the best way to reduce long-term preservation costs? According to the members of the Science and Technology Council of the American Academy of Motion Picture Arts and Sciences (2007), it is to collaborate within and across industries and domains to develop and use standards, leveraging organizations such as the National Digital Information Infrastructure & Preservation Program (NDIIPP) for this purpose. Here “standard” covers, but is not limited to, file formats, filenames, metadata, metadata registries, distribution, and archiving. Galloway (2004) also concluded that the costs of preserving digital materials are exacerbated by the proliferation of proprietary formats, and that the format problem must be solved in order to limit cost.

Persistence

As stated earlier, digital curation and preservation standards grew out of established practices for the preservation of the human record, whether the purpose is research, legal requirements, cultural heritage, etc. One idea behind the development of standards, best practices, reference models, audit criteria, and a lifecycle model, etc., is to create a body of knowledge such that any person charged with preserving and curating a digital collection may readily find the information needed to accomplish their task.

Waters & Garrett (1996) were part of the Task Force on Archiving of Digital Information that examined the “state of the state” of digital preservation in the mid-1990s. Many of the task force’s recommendations contributed to the development of the final versions of the OAIS (Consultative Committee for Space Data Systems, 2002) and the standards for the Audit and Certification of Trustworthy Digital Repositories (Consultative Committee for Space Data Systems, 2011). Other recommendations from the 1996 task force include: creators, providers and owners of digital information are responsible for the preservation of the information; deep digital infrastructure must be developed to support a distributed preservation system; and, trustworthy, certified archives must be prepared and able to aggressively rescue data from repositories that are failing (Waters & Garrett, 1996).

While many large datasets have been preserved for decades without any formal standards for preservation and curation, it helps to have best practices with which to build a preservation program. For example, the Inter-University Consortium for Political and Social Research (ICPSR) has been migrating data since at least the early 1960s with few formal preservation criteria or curation standards to reference (Galloway, 2004). ICPSR personnel, partners, and users were committed to the longevity of the data, so it has been migrated repeatedly. Over the past few years, ICPSR has formalized their repository design to comply fully with the OAIS reference model, for example, because data managers believe this will further ensure the long-term availability of the social science data in the repository and lead to a “federated system of social science repositories” (Vardigan & Whiteman, 2007).

This year, Paul Ginsparg, physicists, mathematicians, computer scientists, and other scientists celebrated the 20th anniversary of arXiv, a pre-print archive (Ginsparg, 2011). Ginsparg began arXiv as an electronic bulletin board to continue physicists’ tradition of sharing research via mail and email. The bulletin board grew into a digital repository, and has survived a variety of funding sources, media, hardware, and software changes. The creators of arXiv and affiliated researchers have used it as a test bed from which to create a variety of standards that have aided in repository architecture design and interoperability such as the Dienst Protocol and the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) (Davis & Lagoze, 2000; Lagoze & Van de Sompel, 2001). Thus, practitioners had not yet created preservation and curation standards for repositories at the time of arXiv’s birth, yet it has survived for 20 years because the community that uses it wants to keep using it.

Although there are many technical problems associated with digital preservation that have yet to be solved, including the rapid obsolescence of software and hardware due to technology cycles (Thibodeau, 2002; Rothenberg, 1999), the primary problems associated with digital preservation and the curation of data are not technical but societal. Galloway (2004) notes that whether data are preserved has more to do with whether a given community chooses to preserve its own record; intellectual and social capital are the issue. Although we are in the midst of a data deluge that is not going to shrink any time soon, if ever, there are adequate systems and designs to support it. There must be an institutional commitment to support the preservation of a particular set of data, and this commitment must include an expenditure of resources, not just will or desire for digital preservation (Consultative Committee for Space Data Systems, 2002). Galloway (2004) lists organizations that have consistently migrated data due to institutional will and personnel commitment; these include scientific organizations (data sets), data warehouses, publishers and authors (text files), and government agencies (e.g., the National Archives, the Library of Congress, and other federal and state agencies).

Plenty of data has been lost over the years by those same organizations. Rothenberg (1999) listed several cases of possible loss by U.S. government agencies; one of the more famous examples is the census data for 1960 (although Waters and Garrett (1996) note that the data loss was not as extensive as some think). He points out that computer scientists are notorious for accepting data loss as part of the price one pays to move to the next generation of hardware and software. He also writes that in 1990, a Congressional report “cited a number of cases of significant digital records that had already been lost or were in serious jeopardy of being lost”. To put this in perspective on a smaller scale, Nelson (2000) wrote that in a typical project at NASA c. 2000, the published research paper went to a library, the software to an FTP site, the raw data were thrown away, and the images were stored in a filing cabinet.

Figure 2 – Scientific and Technical Information Lost Over Time (Nelson, 2000).

In theory, all digital data could be preserved, but then the question becomes, “Should it be preserved?” If not, how does one cull that much data? Or is it better to keep it all? Culling data takes personnel time, but storing data that should have been deleted costs money, too.

The idea of how permanent or impermanent an archive’s collections should be is not a new one. O’Toole (1989) wrote that archivists had evolved their attitudes towards the “permanence” of artifacts and had begun to view permanence as an “unrealistic and unattainable” ideal. This is echoed in the digital realm. The members of the InterPARES (2001) project determined that it is acceptable to preserve a version of a record, so long as the integrity of the information is maintained. In other words, if a file format must be migrated from one form to another (.doc to .pdf, for example) in order to preserve it, an archivist does not have to preserve the original bits for the information itself to be considered authentic. Thibodeau (2002) also noted that it is more important to preserve the essential characteristics of an object — its look, feel, and content, for example — than it is to preserve the digital encoding of the object per se. Not all preservationists share this view, however. As late as 1999, Rothenberg expected documents to be preserved in their original bit form.

The members of the Science and Technology Council (2007) reached a conclusion similar to Thibodeau and the InterPARES members regarding film masters versus digital movie masters. The practice for the past 100 years or so has been to “save everything” when archiving a film. Thus, a director may go back 20 or 30 years later and create a new version of a movie, or film buffs with access to the film archive may study other aspects of the movie itself. The council members concluded that “save everything” is not feasible with digital movies, due both to the number and size of the files that make up a digital movie and to the cost of storing that much data over time. Digital movies will have to be migrated from the original file format, software, and hardware to be compatible with new file formats, software, and hardware. Each new version will supersede the old version of a movie, thus changing the idea of what is the actual canonical copy of a film. Therefore, the idea that the objects in a digital collection are ephemeral — both in terms of whether data will be kept in the first place, and in that the canonical digital version itself will evolve over time — has gained ground as digital curation and preservation have developed over the past decades.

However, in spite of the idea that data are ephemeral either in terms of their lifespan or bits and bytes, another notion developed: that of “persistence”. An archivist or computer scientist may not want to keep an object long, or he or she may wish to migrate the format, but he or she wants to be able to find that data and do what is needed to the object, whether that means deleting it, migrating it, or some other task.

One of the first tasks upon ingesting an object into a repository is to assign it a unique identifier that is not shared by any other object in the archive, and, preferably, by any object in any archive. A full discussion of unique identifiers is beyond the scope of this paper, much less a discussion of the pros and cons of the various identifiers available to use with data. Some unique identifiers are one-of-a-kind to the archive or archive owner only. Some are part of a larger standard, such as Digital Object Identifiers (DOI), which are persistent names linked to redirection (Paskin, 2003). Some identifiers work only with URIs and can only be used via the World Wide Web (WWW), such as ARK (Archival Resource Key) (Kunze, 2003).

Most identifiers used with digital data may be used as URLs/URNs (Uniform Resource Locator/Uniform Resource Name). These are web-based and run over the Internet. URLs are equivalent to a person’s address (e.g., http://sils.unc.edu/), and URNs are the equivalent of a person’s name, but the latter may be combined with existing non-Web identifiers to create a one-off, web-based identifier such as “urn:isbn:n-nn-nnnnnn-n” (URI Planning Interest Group, 2001). Once a unique identifier is assigned, it is considered a best practice never to change that identifier, resource name, or resource locator (Berners-Lee, 1998). If it is necessary to do so for administrative or policy reasons, then within the system itself a “redirect” should be in place, so that the old location identifier points the system or user to the new location of the data.
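The practice above separates two things that are often conflated: the name of an object (which never changes) and its location (which may). One minimal way to sketch that separation is a resolver that mints identifiers and updates only the location they point to; the class and method names here are illustrative assumptions, not part of any cited identifier scheme:

```python
import uuid

class Resolver:
    """Toy identifier resolver: identifiers are minted once and never
    reused or changed; when an object moves, only the location that the
    identifier resolves to is updated (the "redirect" best practice)."""

    def __init__(self):
        self.locations = {}  # identifier -> current location (URL or path)

    def mint(self, location):
        # uuid4 gives an identifier unlikely to collide across archives.
        identifier = str(uuid.uuid4())
        self.locations[identifier] = location
        return identifier

    def move(self, identifier, new_location):
        # The identifier itself never changes; only where it points.
        self.locations[identifier] = new_location

    def resolve(self, identifier):
        return self.locations[identifier]
```

A production system would instead register names in a scheme such as DOI or ARK and serve HTTP redirects, but the invariant is the same: the published identifier outlives any particular storage location.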

As part of establishing persistent identifiers and locators for networked-based identifiers, researchers began to identify the features of a persistent (digital) archive, a persistent collection, and a persistent object. Moore & Merzky (2002) developed concepts for a persistent archive. They combined the functionality of a data grid with traditional archival processes (e.g., appraisal, accession, arrangement, and description) to create a matrix of core capabilities and functions.

Figure 3 – Core data grid capabilities and functions for implementing a persistent archive (Moore & Merzky, 2002).

The authors proposed that this set of core capabilities would minimize the human labor involved in “implementing, managing, and evolving a persistent archive”. More importantly, they noted that these capabilities already exist in (then) current implementations of data grids.

Moore (2005) evolved these ideas to include the concept of a “persistent collection”. He defines a persistent collection as a “combination of digital libraries for the publication of digital entities, data grids for the sharing of digital entities, and persistent archives for the preservation of digital entities”. Moore concluded that while persistent collections are built on top of data grids, and data grids have been used successfully for data sharing, publication, and preservation, in order to use data grids for persistent collections, additional capabilities “to simplify the integration of new services and support the federation of independent data grid federations” must be added.

Brody (2000) and Carr (1999) “mined” the life of an ePrint archive and discovered that authors still made corrections to the papers and metadata after submitting them to the University of Southampton ePrint archive. (Neither Brody nor Carr provided an average end date for when authors stopped committing changes to either the paper or the metadata.) Thus, even Thibodeau’s “essential characteristics” are subject to change, although a repository’s owners could control this by creating a policy that allows or prohibits changes post-publication in the repository.

Another aspect of object persistence is whether the Web site that contains the object or data currently exists (as opposed to available but not accessible). Koehler (1999) examined the persistence of Web pages, Web sites, and server-level domains beginning in 1996. He reported that after 6 months, 20.5% of the Web pages and 12.2% of the Web sites monitored for the study failed to respond. After 12 months, those figures rose to 31.8% and 17.7%, respectively. He inferred from this that the half-life of a Web page is about 1.6 years, and that of a Web site, 2.9 years. Koehler identified three kinds of Web persistence: permanence (it is not going anywhere); intermittence (sometimes it is there, sometimes it is not); and disappearance (it is gone forever). He discovered that 99% of Web sites had changed after 12 months. Koehler (1999) concluded that if the World Wide Web is the equivalent of H.G. Wells’ (1938) “world brain”, then two things may be said of it: the world brain has a short memory, and when it does remember, it changes its mind a lot — how much and where depends.
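Koehler does not spell out how he fit his half-life figures, but a simple exponential-decay assumption (an assumption of this sketch, not Koehler's stated method) turns his observed failure rates into half-life estimates that bracket the reported 1.6- and 2.9-year values:

```python
import math

def half_life_years(surviving_fraction, elapsed_months):
    # Under exponential decay, N(t) = N0 * exp(-k*t), so the decay
    # rate is k = -ln(surviving_fraction) / t and the half-life is
    # ln(2) / k; divide by 12 to convert months to years.
    k = -math.log(surviving_fraction) / elapsed_months
    return math.log(2) / k / 12.0

# Koehler's observations: (fraction failed, months elapsed).
pages = [half_life_years(1 - f, t) for f, t in [(0.205, 6), (0.318, 12)]]
sites = [half_life_years(1 - f, t) for f, t in [(0.122, 6), (0.177, 12)]]
# pages: roughly 1.5 and 1.8 years, straddling the reported 1.6
# sites: roughly 2.7 and 3.6 years, near the reported 2.9
```

The spread between the 6- and 12-month estimates also hints at why a single half-life number is only a rough summary: the failure rate was not constant over the study.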

Koehler (2004) revisited his study five years later. He reports that static collections — similar to the ePrints archive mentioned earlier in this paper — tend to stabilize after they have “aged”. As part of this paper, he reviewed the growing body of literature related to persistence — also referred to as “linkrot” — and found that the stability of collection-oriented Web sites (e.g., legal, academic, citation-based) varies by domain specialty. Nelson and Allen (2002) examined 1,000 digital objects in a variety of digital libraries over the course of a year. They discovered that 3% of all objects were no longer available after 12 months, but the resource half-life is about 2.5 years. Koehler writes that for other resource types, such as scholarly article citations, legal citations, biological science education resources, computer science citations, and random Web pages, the half-life of the resources ranges between 1.4 and 4.6 years. While the set of URLs in Koehler’s studies stabilized for two years after losing two-thirds of its members over the first four years, his overall conclusion was that the Web provides no guarantee of longevity for data, collections, or repositories.

McCown, Chan, Nelson, & Bollen (2005) revisited the Nelson and Allen (2002) study of D-Lib Magazine Web persistence and expanded upon it by examining outlinks — the URLs cited in D-Lib Magazine articles. They extracted 4,387 unique URLs referenced in 453 articles published between July 1995 and August 2004. They discovered that approximately 30% of the URLs failed to resolve, although only 16% of the registered content showed more than a 1 KB change during the testing period. The researchers concluded that the half-life of a URL referenced in a D-Lib Magazine article is around 10 years. To state the obvious, even scholarly articles in a respected journal in the Information and Library Science field — where linkrot is a known problem — cannot maintain stable references.

These studies represent but a small proportion of the literature documenting the ephemeral nature of data (“digital objects”), Web sites (“archives”), and Web pages. By the late 1990s to early 2000s, it had become apparent in all fields that in order to rely on digital resources, some objects need to be static; the repository that contains the objects needs to remain accessible; there need to be audit mechanisms to prove that the objects in the repository are what they say they are; and the repository must be capable of persisting over time even as its content is migrated to newer software and hardware. In other words, “someone” needed to develop a standard model for archiving objects for some period, whether short- or long-term. As well, “someone” needed to create audit mechanisms to determine that a repository is “trustworthy” and that the repository’s contents are “authentic” and “reliable” and have maintained their “integrity”. “Someone” had been doing just that: the CCSDS finalized the “Reference Model for an Open Archival Information System” (OAIS) as a standard in 2002, and released the “Audit and Certification of Trustworthy Digital Repositories” as a Recommended Practice (Magenta Book) in September 2011.

Overview of the OAIS Reference Model and the Audit and Certification of Trustworthy Digital Repositories Recommended Practice

The Consultative Committee for Space Data Systems (CCSDS) convened an international workshop in 1995 with the purpose of advancing a proposal “to develop a reference model for an open archival information system” (Lavoie, 2004). The CCSDS had determined previously that there was no widely accepted model or framework that could serve as a standard for the long-term storage of space mission digital data. The members of the CCSDS recognized that fundamental questions related to digital preservation cut across all domains; therefore, the development scope of the model included stakeholders from a variety of domains, including government, private industry, and academia (Lee, 2010). The committee determined that the purpose of creating a reference model was to “address fundamental questions regarding the long-term preservation of digital material” (Lavoie, 2004). This model would define an archival system and outline the essential conditions a repository owner must meet in order to be considered a preservation archive.

The CCSDS defines an OAIS as “an archive, consisting of an organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community. It meets a set of such responsibilities as defined in this document, and this allows an OAIS archive to be distinguished from other uses of the term ‘archive’” (CCSDS, 2002). The word “Open” notes that the CCSDS (2002) developed the recommendation in open forums, and will continue to do so for any future iterations of the model. It does not imply that access to the repository itself must be unrestricted in order to meet the requirements of the OAIS model (Lee, 2010).

The committee described four categories of archives: independent, cooperating, federated, and shared resources. The owners of an independent archive do not interact with any other archive owners with regards to technical or management issues. The possessor of a cooperating archive does not have a “common” finding aid with other archive possessors, but otherwise shares common producers, submission standards, and dissemination standards. The owners of a federated archive serve both a global and local Designated Community with interests in these related archives, and these owners provide access to their holdings to the Designated Community via one or more shared finding aids. The holders of archives with shared resources have agreed to “share resources” with each other, generally to reduce cost. This type of arrangement requires the use of standards internal to the archive, such as for ingest and access, that do not “alter the user community’s view of the archive” (CCSDS, 2002; Lee, 2010).

The CCSDS divided the reference model into two “sub-models” – a Functional Model and an Information Model. Simply put, the Functional Model defines what an archive must do, and the Information Model defines what a repository must have in its collections (Lee, 2010). The former describes seven main functional entities and how they interface with each other: Common Services, Preservation Planning, Data Management, Ingest, Administration, Access, and Archival Storage.

Figure 4 – the OAIS Reference Model “Functional Model” (CCSDS, 2002).

The Information Model describes and defines the information beyond the content. The members of the CCSDS included this section because the long-term preservation of digital material will require more than simply the content itself. A few examples of the information described and defined within the Information Model include: representation, fixity, provenance, content, and preservation description.

In summary, if a repository is an OAIS-type archive, then the archive managers will implement each area of the Functional Model in order to preserve information as an information package via the Information Model, for a Designated Community (Lavoie, 2004). The CCSDS designed the OAIS to be a reference model – it is NOT an implementation. The committee members deliberately left it up to an archive’s owners to determine the technical details of the archival system. Egger (2006) writes that this is a disappointing aspect of the reference model, because it mixes technical and management functionality, rather than keeping them separate per standard software engineering practices. Vardigan and Whiteman (2007) successfully applied the OAIS reference model to the Inter-university Consortium for Political and Social Research (ICPSR) social science repository. The managers of the Online Computer Library Center (OCLC) Digital Archive based their service on the OAIS reference model while drawing data and metadata from a “wide array of OCLC organizational units” (Lavoie, 2004).

Another application of the conceptual work of the CCSDS (2002) with the OAIS reference model, and of Waters and Garrett’s (1996) work with the Task Force on Archiving of Digital Information, is the CCSDS’ development of a “recommended practice” for the “audit and certification of trustworthy digital repositories” (CCSDS, 2011). This work also builds on the requirement, defined in a previous section, that a repository be “reliable”, “authentic”, and “trustworthy”, and that it maintain its “integrity”.

Lavoie (2004) writes that OCLC and the Research Libraries Group (RLG) sponsored an initiative in March 2000 to address the “attributes of a trusted digital repository”. The working group’s charge was “to reach consensus on the characteristics and responsibilities of trusted digital repositories for large-scale, heterogeneous collections held by cultural organizations” (Research Libraries Group, 2002). The purpose of determining these characteristics is to ensure that an OAIS Designated Community will be able to audit a repository and determine whether or not the repository owners have designed it, and are managing it, in such a way that the repository will actually preserve the Designated Community’s data for the indefinite long-term and make it accessible. The RLG/OCLC working group issued their report in 2002. Among the recommendations, the working group specified that a process needed to be developed to certify a digital repository (Research Libraries Group, 2002). Waters and Garrett (1996) had also made this recommendation.

What is a “trusted digital repository”? The working group of the Research Libraries Group (2002) defined it as a repository with “a mission to provide reliable, long-term access to managed digital resources to its designated community, now and into the future”. The NESTOR Working Group on Trusted Repositories Certification (2006) determined that the entire system must be examined in order to determine whether or not a Designated Community should trust that it will last for the indefinite long-term. This includes its governance; procedures and policies; financial sustainability and fitness; organizational management, including employees; legal liabilities, contracts and licenses under which it operates; plus the trustworthiness of any organization or person who might inherit the data (NESTOR Working Group on Trusted Repositories Certification, 2006; Online Computer Library Center, Inc. & Center for Research Libraries, 2007).

A repository manager must also assess internal and external risks to the repository. Among the many potential risks to a repository’s long-term availability, Rosenthal, et al (2005) include internal and external attacks; natural disasters; hardware, software, and media obsolescence; hardware, software, media, network service, organizational, and economic failure; and simple communication errors. Regular audits and re-certification (i.e., transparency) are the keys to the long-term survivability of a repository (Online Computer Library Center, Inc. & Center for Research Libraries, 2007).

Researchers and practitioners then set about developing the criteria and checklists for audit and certification. OCLC and the Center for Research Libraries (CRL) developed the “Trustworthy Repositories Audit & Certification: Criteria and Checklist” (2007). The creators called this document “TRAC”, and provided a spreadsheet for practitioners to use that covered the requirements for “organizational infrastructure”, “digital object management”, and “technologies, technical infrastructure, & security”. The researchers with nestor (Network of Expertise in long-term STORage) also created guidelines around these three areas (Dobratz, Schoger, & Strathmann, 2006).

TRAC covered the following policy areas for audit and certification: governance & organizational viability; organizational structure & staffing; procedural accountability & policy framework; financial sustainability; contracts, licenses, & liabilities; ingest, including the creation of the archival package and acquisition of content; preservation planning; archival storage & the preservation and maintenance of AIPs; information and access management; system infrastructure; appropriate technologies; and, security (Online Computer Library Center, Inc. & Center for Research Libraries, 2007).

Ross and McHugh (2006) applied TRAC to examine mechanisms to provide audit and certification services for United Kingdom digital repositories. As part of this work, the researchers developed a toolkit, “Digital Repository Audit Method Based on Risk Assessment” (DRAMBORA). This toolkit is available online so that practitioners may “facilitate internal audit by providing repository administrators with a means to assess their capabilities, identify their weaknesses, and recognize their strengths” (Digital Curation Centre & Digital Preservation Europe, 2007). It is a self-audit that follows the workflow and criteria an external auditor would apply, so that a repository may self-assess prior to going through an external audit and certification. The toolkit provides a methodology by which a digital archivist might assess any risks to the repository she or he manages. While TRAC, DRAMBORA, and nestor are very similar, DRAMBORA provides a “documented understanding of the risks…expressed in terms of probability and impact” and provides “quantifiable insight into the severity of risks faced by repositories” along with a means to document those risks (Digital Curation Centre, 2011). In other words, TRAC is a more informal audit process that provides qualitative output, while DRAMBORA is a more detailed, formal, audit method that provides quantifiable results. The policies and risks covered by the DRAMBORA risk assessment are similar to the ones stated above for nestor and TRAC. The difference, to reiterate, is that the DRAMBORA method provides quantifiable output.

The next logical step in the development of an overall standard for the audit and certification of repositories was to merge the concepts and ideas behind TRAC, nestor, and DRAMBORA. Thus, representatives of The Digital Curation Centre (U.K.), DigitalPreservationEurope, NESTOR (Germany), and the Center for Research Libraries (North America) convened at the Chicago, IL offices of the Center for Research Libraries “to seek consensus on core criteria for digital preservation repositories, to guide further international efforts on auditing and certifying repositories” (Center for Research Libraries, 2007). Dale (2007) compared and contrasted the different methods, and created a matrix that displayed similarities and differences. Based on this matrix, and internal discussions, the attendees identified 10 core characteristics of a preservation repository:

  • The repository commits to continuing maintenance of digital objects for identified community/communities.
  • Demonstrates organizational fitness (including financial, staffing, and processes) to fulfill its commitment.
  • Acquires and maintains requisite contractual and legal rights and fulfills responsibilities.
  • Has an effective and efficient policy framework.
  • Acquires and ingests digital objects based upon stated criteria that correspond to its commitments and capabilities.
  • Maintains/ensures the integrity, authenticity and usability of digital objects it holds over time.
  • Creates and maintains requisite metadata about actions taken on digital objects during preservation as well as about the relevant production, access support, and usage process contexts before preservation.
  • Fulfills requisite dissemination requirements.
  • Has a strategic program for preservation planning and action.
  • Has technical infrastructure adequate to continuing maintenance and security of its digital objects (Center for Research Libraries, 2007).

A key idea to come out of this meeting is that preservation activities must scale to the “needs and means of the defined community or set of communities” (Center for Research Libraries, 2007). In other words, some repositories may need to implement more preservation activities, and some may need to implement fewer.

The Consultative Committee for Space Data Systems released the “Magenta Book” version of the “Audit and Certification of Trustworthy Digital Repositories Recommended Practice” in September 2011. This recommendation is the culmination of years of best practice work by researchers and practitioners, beginning with the development of TRAC, DRAMBORA, and nestor, among other projects. The CCSDS then began the process of evolving these audit and certification methods into an ISO standard, based primarily on TRAC. The precursor to this ISO standard is the “Recommended Practice” for the “Audit and Certification of Trustworthy Digital Repositories” that is currently in release as the “Magenta Book”.

A “Recommended Practice” is not binding to any agency. The purpose of a “Recommended Practice” is to “provide general guidance about how to approach a particular problem associated with space mission support” and to provide a basis on which a community that has a stake in a digital repository may assess the trustworthiness of the repository (Consultative Committee for Space Data Systems, 2011). The CCSDS’ recommendations are aimed at any and all digital repositories. Another way to think of the purpose and scope of the “Recommended Practice” is that it establishes a method for a Designated Community to determine whether or not a repository of interest is actually OAIS-compliant. The following is a summary of this Recommended Practice.

The Recommended Practice covers audit and certification criteria, including defining a “trustworthy digital repository”, an evidence metric (e.g., “examples”) in support of a particular requirement, and related relevant standards, best practices, and controls. The policies required to be trustworthy fall under three primary categories: “Organizational Infrastructure”, “Digital Object Management”, and “Infrastructure and Security Risk Management”. The authors designed the document so that each of those sections follows a similar design.

First, the policy is stated. Second, the “Supporting Text” is presented; this is the “so what?” section. Third, the document provides “Examples of the Ways the Repository Can Demonstrate It Is Meeting This Requirement”. Finally, the authors provide a “Discussion” section that explains the previous three sections in order to remove any possible ambiguity.
So, for example, in section “3 Organizational Infrastructure”, “3.1 Governance and Organizational Viability”, section 3.1.1 states:

3.1.1 The repository shall have a mission statement that reflects a commitment to the preservation of, long-term retention of, management of, and access to digital information.

Supporting Text
This is necessary in order to ensure commitment to preservation, retention, management and access at the repository’s highest administrative level.

Examples of Ways the Repository Can Demonstrate It Is Meeting This Requirement
Mission statement or charter of the repository or its parent organization that specifically addresses or implicitly calls for the preservation of information and/or other resources under its purview; a legal, statutory, or government regulatory mandate applicable to the repository that specifically addresses or implicitly requires the preservation, retention, management and access to information and/or other resources under its purview.

Discussion
The repository’s or its parent organization’s mission statement should explicitly address preservation. If preservation is not among the primary purposes of an organization that houses a digital repository then preservation may not be essential to the organization’s mission. In some instances a repository pursues its preservation mission as an outgrowth of the larger goals of an organization in which it is housed, such as a university or a government agency, and its narrower mission may be formalized through policies explicitly adopted and approved by the larger organization. Government agencies and other organizations may have legal mandates that require they preserve materials, in which case these mandates can be substituted for mission statements, as they define the purpose of the organization. Mission statements should be kept up to date and continue to reflect the common goals and practices for preservation (CCSDS, 2011).

The policy areas covered by the Recommended Practice include: governance and organizational viability; organizational structure and staffing; procedural accountability and preservation policy framework; financial sustainability; contracts, licenses, and liabilities; ingest, including acquisition of content and creation of the AIP; preservation planning; AIP preservation; information management; access management; and, risk management, including technical infrastructure and security. These areas almost exactly replicate the original audit and certification criteria of the TRAC checklist, and they also closely match the criteria used in nestor and DRAMBORA.

Applications of the Audit and Certification of Trustworthy Digital Repositories Recommended Practice and the OAIS Reference Model

A complete listing of all projects, repository designs, and organizations that have applied the OAIS reference model and some version of TRAC, DRAMBORA, or nestor is beyond the scope of this literature review. Instead, this section discusses a few example applications of both TRAC and the OAIS Reference Model.

When Steinhart, Dietrich, and Green (2009) applied the TRAC checklist to a “data staging repository”, they made several observations and conclusions. First, the TRAC checklist is applicable “to the pilot phase of a staging repository”, which is a “transitory curation environment” (Steinhart, Dietrich, & Green, 2009). This meant that TRAC had practical applications beyond digital preservation audit and certification.

For example, the TRAC checklist may be used as an evaluation tool when repository owners want to purchase new repository software. The TRAC checklist may also be used as a standard from which to create machine-actionable rules, per Smith and Moore’s (2006) work on the PLEDGE project. By implementing TRAC policies at the machine-level, the amount of human effort required to enforce a policy is reduced because policy enforcement is built into the system itself (Moore & Smith, 2007).
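The idea of enforcing a policy at the machine level can be sketched outside any particular system. The following is a minimal, hypothetical Python check; the threshold, object identifiers, and function name are illustrative and are not taken from PLEDGE, TRAC, or iRODS:

```python
# Hypothetical sketch of a machine-actionable policy check. The policy
# threshold and the inventory contents are illustrative only.
MIN_REPLICAS = 2  # policy: every object must exist in at least two locations

inventory = {
    "obj-001": ["siteA/obj-001", "siteB/obj-001"],
    "obj-002": ["siteA/obj-002"],  # falls below the required replica count
}

def audit_replica_policy(inventory: dict[str, list[str]]) -> list[str]:
    """Return identifiers of objects that violate the replica policy."""
    return [oid for oid, copies in inventory.items()
            if len(copies) < MIN_REPLICAS]

print(audit_replica_policy(inventory))  # -> ['obj-002']
```

Because the rule runs inside the system rather than in a procedures manual, a violation is caught automatically at audit time rather than depending on a human remembering to check.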

Steinhart, Dietrich, and Green (2009) noted that there seemed to be two applications of TRAC: an audit of the system to satisfy auditors, or an audit of the system to satisfy users of the system (i.e., “the Community of Practice”). Implied in this observation is the idea that few audits seem to be conducted purely for an organization’s internal erudition. Regardless of the purpose for conducting the audit, however, TRAC has provided a method for repository owners to identify gaps in an organization’s workflows and policies, and provides the mechanisms (e.g., “knowledge”) for those owners to fill those gaps.

Another example of the application of TRAC to a repository is the audit of the MetaArchive repository. Contractor Matt Schultz conducted an audit of the MetaArchive Cooperative and made the results public. The author reported that the MetaArchive “conforms to all 84 criteria specified by TRAC” and “has undertaken 15 reviews and/or improvements to its documentation and operations as a result of its self-assessment findings” (Educopia Institute, 2010). The organization made the actual spreadsheet available that contained the results of the audit and certification of the MetaArchive.

A quick skim of titles containing the word “TRAC” in journals such as D-Lib Magazine, JASIST, and other related journals indicates that TRAC has often been used as an assessment tool. What is missing, however, are papers with negative assessments of TRAC, or any negative results from applying TRAC. Also missing is a formal assessment of whether a top-down approach (e.g., formally established standards) is the most feasible, or even the only, approach. Perhaps a bottom-up approach, analyzing which policies practitioners actually implement versus what is recommended, would be useful.

Perhaps the positive reviews of the application of TRAC, of which the two above are only a small portion, indicate that as a Recommended Practice it is, indeed, comprehensive and covers all required bases. Or the positive reviews of applying TRAC may reflect researchers’ and publishers’ biases towards not publishing negative results, otherwise known as the “file-drawer effect” (Fanelli, 2011). The likely answer to the lack of published critical reviews of TRAC and the Recommended Practice is that not enough time has gone by to evaluate whether or not following the recommended policies will make any difference in the longevity of the repository.

As stated previously, ICPSR employees have been migrating their social science archive forward since the 1960s, with no standards such as TRAC or the OAIS to follow (Vardigan & Whiteman, 2007). Other repositories disappeared or lost information. Would having international standards in place both for repository design and audit and certification policies really have prevented that kind of loss of information? It is hard to say, as even the authors of the OAIS Reference Model state that the long-term survival of a repository depends on the will and resources of the repository owners and the community of practice (CCSDS, 2002).

Archivists and librarians at Tufts and Yale applied the OAIS Reference Model to electronic records. Specifically, they created an ingest guide to aid in moving electronic records from a “recordkeeping system to a preservation system”. The practitioners designed the guide to describe the actions needed for a “trustworthy” ingest process. The authors used both the OAIS Reference Model and the “Producer-archive Interface Methodology Abstract Standard” (Consultative Committee for Space Data Systems, 2004) as the basis for the guide. According to the archivists and librarians who worked on the project, following the guide should allow “a reasonable person to presume that a record has maintained its level of authenticity during ingest” (Fedora and the Preservation of University Records Project, 2006).

The authors divided the ingest guide into two main sections: “negotiate submission agreement” and “transfer and validation”. The former section covers establishing a relationship with the collection owner, defining the project, assessing the records themselves, and finalizing the submission agreement. The latter section includes creating and transferring Submission Information Packages (SIPs), validation, transformation, metadata, formulating and assessing Archival Information Packages (AIPs), and formal accession. Each section contains an overview, an image of the flow of steps involved in that particular process, and a step-by-step written narrative for each step in the flow. The purpose of the document is not to provide “a detailed manual of procedures”, but to provide “a prescriptive guide for a trustworthy ingest process” (Fedora and the Preservation of University Records Project, 2006).

A different kind of application of both TRAC and the OAIS is to build or use a “trusted digital repository” to create “persistent collections” in a “persistent archive” (Moore, 2004). Some of these solutions are based on digital library systems such as DSpace and Fedora; other solutions include data grids such as the Storage Resource Broker (SRB) and the integrated Rule-Oriented Data System (iRODS) (Moore, 2005; Moore, Rajasekar, & Marciano, 2007). One unique aspect of iRODS is that preservation policies outlined in TRAC may be implemented at the machine level, in the code, via the use of rules. Rajasekar, et al (2006) call this “policy virtualization”.

For example, the following “human language example” regarding “Chain of Custody” from the Audit and Certification of Trustworthy Digital Repositories Recommended Practice (CCSDS, 2011):

5.1.2 The repository shall manage the number and location of copies of all digital objects.
This is necessary in order to assert that the repository is providing an authentic copy of a particular digital object.

may be written in machine language in iRODS v.3.0 as:

myTestRule {
  # Input parameters are:
  #   Object identifier
  #   Buffer for results
  # Output parameter is:
  #   Status
  msiSplitPath(*Path, *Coll, *File);
  msiExecStrCondQuery("SELECT DATA_ID where COLL_NAME = '*Coll' and DATA_NAME = '*File'", *QOut);
  foreach(*QOut) {
    msiGetValByKey(*QOut, "DATA_ID", *Objid);
    msiGetAuditTrailInfoByObjectID(*Objid, *Buf, *Status);
    writeBytesBuf("stdout", *Buf);
  }
}
INPUT *Path="/tempZone/home/rods/sub1/foo1"
OUTPUT ruleExecOut

This type of policy virtualization is the method by which the researchers who created iRODS implemented the OAIS Reference Model recommendations within the system architecture itself (Ward, de Torcy, Chua, & Crabtree, 2009).

Other Technical Issues

The other end of the digital curation spectrum from the OAIS Reference Model and the Audit and Certification of Trustworthy Repositories is bit-level preservation. Moore (2002) wrote, “the challenge in digital archiving and preservation is not the management of the bits comprising the digital entities, but the maintenance of the infrastructure required to manipulate and display the image of reality that the digital entity represents”. Lynch (2000) also writes that infrastructure is key. However, because preserving the bits entails preserving the media that contains them, which in turn requires preserving the software and hardware on which the media runs, which in turn requires networked infrastructure, bit management must be addressed.

A bit is a “binary digit”: either a one or a zero in a binary system of notation (Binary digit, 2011). A group of bits makes up a byte. Rothenberg (1999) writes that bytes may be of any length, but an 8-bit byte provides considerably more freedom to represent upper- and lower-case characters, punctuation, digits, control characters, and graphical elements. In very simple terms, to read a bit stream, the computer hardware must retrieve it from the media on which it is stored (e.g., flash drive, CD, DVD, computer hard drive, etc.) and interpret it via software designed to render bits stored in that format (e.g., .pdf, .doc, .jpg, etc.).
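The bit-to-byte-to-character chain can be illustrated in a few lines of Python; the bit string here is an arbitrary example, interpreting each 8-bit byte as an ASCII character:

```python
# A minimal illustration of interpreting a bit stream: each 8-bit byte is
# converted to its integer value and rendered as an ASCII character.
bit_stream = "01001000 01101001"  # two 8-bit bytes
decoded = "".join(chr(int(byte, 2)) for byte in bit_stream.split())
print(decoded)  # -> "Hi"
```

The same bytes rendered by software expecting a different format (an image decoder, say) would be meaningless, which is why knowing the format is as essential as preserving the bits.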

If the bits become corrupted, the content is unrenderable. If the media storage device deteriorates, the content is unrenderable. If the software and hardware needed to read and render the file format are unavailable, the content is unrenderable. If the file format is unknown, then the content is unrenderable by any machine or available software. Thus, when a repository owner designs a preservation system to provide access to content for the indefinite long-term, a decision must be made regarding migrating, refreshing, replicating, and emulating the file format, software, and hardware used to store and render the contents of a digital object.

Waters and Garrett (1996) describe migration as the transfer of data to a new operating system, programming code, or file format. The advantage of this method is that it keeps the data current with technological changes. The disadvantage is that the rendering of the content may change, so that the representation differs in some way from the original (Rothenberg, 1999). In most instances this is not likely to matter, but in some instances it could be important. One way around this is to save the original files, migrate copies of those files to the new format/operating system/programming language, and then store the originals with the copies. The disadvantage, however, is that one must also save the hardware and software needed to read the original files, which negates the advantages inherent in migration. Preservationists prefer migration to refreshing because it better retains the ability to retrieve, display, and otherwise access the data (Research Libraries Group, 1996).
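The "migrate copies but keep the originals" strategy can be sketched as follows. This is a hypothetical illustration: the encoding change stands in for a real format migration, and the function name and provenance layout are invented for the example, not drawn from any standard:

```python
# Hypothetical sketch of "migrate but keep the original": store the untouched
# original, a migrated copy, and a provenance note side by side. The
# latin-1 -> utf-8 re-encoding is a toy stand-in for a real format migration.
import hashlib
import json
import pathlib
import shutil

def migrate_with_original(src: pathlib.Path, archive_dir: pathlib.Path) -> pathlib.Path:
    """Archive the original alongside a migrated copy plus a provenance record."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    original = archive_dir / src.name
    shutil.copy2(src, original)                        # keep the original as-is
    migrated = archive_dir / (src.stem + ".utf8.txt")  # toy "new format"
    migrated.write_text(src.read_text(encoding="latin-1"), encoding="utf-8")
    note = {
        "original": original.name,
        "migrated": migrated.name,
        "original_sha256": hashlib.sha256(original.read_bytes()).hexdigest(),
        "action": "re-encoded latin-1 -> utf-8",
    }
    (archive_dir / "provenance.json").write_text(json.dumps(note, indent=2))
    return migrated
```

Recording a checksum of the original at migration time lets a later auditor verify that the retained original has not silently changed since the migration was performed.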

Archivists “refresh” data by copying data from old media onto new media, in an effort to stave off the effects of media deterioration. However, this preservation method only works so long as the data and information are “encoded in a format that is independent of the particular hardware and software needed to use it and as long as there exists software to manipulate the format in current use” (Waters & Garrett, 1996). That is, the software and hardware used to read the information on the media must be backwards compatible and interoperable with different file formats, hardware, and software.

Rothenberg (1999) proposes emulation as the best solution to preservation. He defines emulation as a new system that replicates the functionality of a now-obsolete system, providing the user with the data, information, and functionality of the original system. Rothenberg writes that emulators may be built for hardware platforms, applications, and/or operating systems. However, emulation is expensive: replicating the original system and actually providing all of its functionality requires a great deal of human, financial, and time resources. Oltmans (2005) compared migration and emulation and concluded that emulation is more cost effective because it preserves the collection in its entirety. One could argue that preservationists would be better off simply maintaining the original system in the first place; however, few organizations or people have the resources to maintain that amount of hardware and software. Video game aficionados prefer to use emulators; otherwise, migration has been the method of choice for curators and preservationists of data.

Repository owners use replication to back up data in multiple locations, preferably not in the same geographic or physical space. This prevents the accidental and permanent loss of data: if there is a fire, a flood, or a malicious act by some person to destroy the data, replication ensures that copies of the data remain, stored in a format such that a full restore is possible. Generally, repository managers create two replicas of the data. Often this can be done on a shared basis, so that one repository owner stores backup data for another organization, and vice versa. One challenge to replication is ensuring that additions, deletions, updates, etc. are synced across all locations, so that the data in one location matches the data stored in the others. Repository systems administrators must check the data on a regular basis to ensure its continued integrity via tools such as fixity checks, access controls, and other data integrity techniques and mechanisms (Sivathanu, Wright, & Zadok, 2005). Software such as LOCKSS (“Lots of Copies Keep Stuff Safe”) and data grid “middleware” such as iRODS provide repository owners with proven technology to aid in the replication of their data (Moore & Merzky, 2003; Moore, 2004; Moore, 2006). Organizations such as Data-PASS (“The Data Preservation Alliance for the Social Sciences”) help their members replicate and preserve social science data by creating a common technical mechanism for data sharing/replication.
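The fixity checks mentioned above amount to comparing checksums across replicas. The following is a minimal sketch, assuming replicas are ordinary files on locally mounted paths; production systems such as LOCKSS and iRODS perform this with their own auditing machinery:

```python
# A minimal fixity-check sketch: replicas are in sync when every copy
# produces an identical SHA-256 digest.
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so large objects need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def replicas_in_sync(replicas: list[Path]) -> bool:
    """True when every replica carries an identical checksum."""
    return len({sha256_of(p) for p in replicas}) == 1
```

Run on a schedule, a check like this flags a corrupted or tampered replica so that it can be restored from one of the intact copies before the damage propagates.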

General Digital Repository Management

How does an archivist, librarian, or other technologist manage a preservation digital repository? The same way personnel manage a non-preservation digital repository (Lavoie & Dempsey, 2004). Material must be selected and ingested or digitized if it has not been born digital. Metadata must be created, or the quality of the metadata must be checked prior to ingest and possibly augmented if it does not meet the repository owner’s quality standards (Lavoie & Gartner, 2005; Shreeves, et al, 2005; Jackson, et al, 2008; Ward, 2004). The digitization and/or ingest project must be managed, and risks to the repository must be identified and solutions created (Lawrence, et al, 2000). Intellectual property and copyright to the data must be established and enforced internally and with the Community of Interest (National Initiative for a Networked Cultural Heritage, 2002). Lee, Tibbo, & Schaefer (2007) note that the manager of the repository also must hire trained personnel with the appropriate skill sets to create, manage, preserve, and curate the repository.

Funding

Last, but not least, the repository manager and the Community of Interest must ensure funding is available to maintain the repository over the indefinite long term. Should this funding fall short, or should the “owning” organization cease to exist, the repository manager must ensure that a backup organization is prepared to take over management of the repository (Waters & Garrett, 1996).

Both the members of AMPAS (2007) and Waters & Garrett (1996) examined the cost factors of preserving digital information over time. The AMPAS members estimated the costs of preserving digital video versus film masters, and Waters & Garrett examined the preservation and storage of digital books versus paper books. Both groups reached the same conclusion: curating and preserving digital material is far more expensive than maintaining film or paper books over time. The AMPAS committee determined that it would cost 1,100% more to store digital movie masters for 100 years than to store film masters for the same time period. Waters & Garrett’s (1996) cost model indicated that “storage costs…are 12 times higher for a digital archives composed of texts in image form, and the access costs are 50% higher” than for the same material as books. Chapman (2003) pondered the storage affordability question and concluded that the final costs are variable: the true costs depend on the services provided around the repository, the type and amount of content, the choice of repository software, and the type of storage chosen (“dark archive”, publicly accessible, etc.).
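The multipliers above can be made concrete with back-of-the-envelope arithmetic. The $1,000 annual baseline is purely hypothetical; only the ratios come from the cited studies.

```python
# Back-of-the-envelope only: the $1,000/year baseline is hypothetical;
# the multipliers come from the studies cited in the text.
analog_cost_per_year = 1_000.0  # hypothetical film or paper baseline

# AMPAS (2007): storing digital movie masters costs 1,100% more than
# storing film masters, i.e., 12 times the analog cost.
digital_master_cost = analog_cost_per_year * (1 + 11.0)

# Waters & Garrett (1996): storage costs are 12 times higher for a digital
# archive of page images than for the same texts stored as books.
digital_image_storage_cost = analog_cost_per_year * 12
```

Under this toy baseline, both multipliers work out to $12,000 per year, twelve times the analog cost.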

Regardless, the final conclusion is that digital curation and preservation is not cheap. The members of the Blue Ribbon Task Force on Sustainable Digital Preservation and Access (2008) noted that “there is no general agreement” as to “who is responsible…and who should pay for the access to, and preservation of, valuable present and future digital information”.

Current Status and Future Challenges/Further Work

Librarians, archivists, computer scientists, and other researchers are currently immersed in figuring out the “data deluge”. How big is this deluge? It is hard to estimate, but IDC projects that data created and stored will grow at a compound annual rate of almost 60%, up from the 180 exabytes that existed in 2006 (Mearian, 2008). By this estimate, as of 2011 “there will be 1,800 exabytes of electronic data in existence”. If those numbers are correct, Mearian (2008) writes, then as of 2011 the number of bits stored exceeds the number of stars in the sky.
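Those figures can be sanity-checked with a quick compound-growth calculation, taking the 180-exabyte 2006 baseline and the roughly 60% annual growth rate from Mearian's summary of the IDC estimate.

```python
# Compound growth from the 180 exabytes reported for 2006 at ~60%/year.
exabytes_2006 = 180
annual_growth = 0.60        # IDC's estimated compound annual growth rate
years = 2011 - 2006         # five years of growth

projected_2011 = exabytes_2006 * (1 + annual_growth) ** years
```

The projection works out to roughly 1,887 exabytes, in line with IDC’s 1,800-exabyte estimate for 2011.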

That is a lot of data.

Digital preservationists and domain scientists are now focusing their attention on access to research data, specifically, the preservation of research data sets from which research conclusions are drawn. The Committee on Ensuring the Utility and Integrity of Research Data in the Digital Age (2009), the National Science Foundation (2005), and other individuals and groups have drawn attention to the need to steward data for use and re-use by other researchers. As one part of this, members of these two organizations have recommended creating formal standards and strategies for data stewardship.

The editors of the journal Nature have participated in this effort by drawing attention to the perils and advantages of data sharing (Butler, 2007; Butler, 2007; Nelson, 2009) and data neglect (Editor, 2009). The editors of Science have followed suit, examining data sharing and data restoration (Curry, 2011; Hanson, Sugden, & Alberts, 2011). Over the course of the past year, both the National Science Foundation and the National Institutes of Health have required grant applicants to provide data management plans as part of the application process. One can only wonder how well researchers’ data management plans conform to established best practice recommendations for the preservation of data, such as the OAIS Reference Model and the Audit and Certification of Trustworthy Digital Repositories Recommended Practice.

The logic behind the interest in preserving, accessing, and sharing data sets is twofold: to ensure that science can be replicated (which is impossible if the original data set is lost or unavailable), and to ensure that taxpayers receive the full benefit of their investment in research by allowing other researchers access to data generated with taxpayer money. If stakeholders wish to share data, then the data must be stewarded from the moment they are gathered, through the initial research, and beyond the dissemination of any results. The data set(s) must also be stored for the indefinite long term, should a future researcher wish to access them.

Practitioners’ initial research into this area indicates that some kind of institutional support, in the form of data centers where researchers may store and share their data, may be required in some instances (Beagrie, Beagrie, & Rowlands, 2009; Research Information Network, 2011). Walters & Skinner (2011) advocate a new role for librarians and archivists — that of data curator. Their recommendation is that academic and research librarians should provide curatorial guidance with regard to digital content, going to the researchers rather than waiting for the researchers to come to them for advice. Many academic and research libraries and archives now offer research data management advice, including a “data curation toolkit” to aid in interviewing researchers about their data curation requirements (Witt, Carlson, & Brandt, 2009).

Conclusion

The problem of preserving data, information, knowledge, and wisdom is not a new one. Whether the medium is clay tablets, papyrus, books, digital data, or some other format, the people interested in preserving the cultural, research, and other heritage of our world have always faced challenges of one sort or another. Some data have been preserved for centuries; other data have been unnecessarily lost. War, weather, politics, fire, and other factors have destroyed valuable information objects in every century. The value of the data to one or more individuals is a major factor in its curation and long-term survivability. The ability of its owners and users to fund its preservation is equally important.

The standards that librarians, archivists, and computer scientists have established for digital preservation and curation aid the survivability of this data, but do not “cause” it. What has changed over time is the type of data preserved and the methods for doing so. What has not changed over the millennia is that the preservation and curation of objects is neither guaranteed nor cheap.


References

Ackoff, R.L. (1989). From data to wisdom. Journal of Applied Systems Analysis, 16(1), 3-9.

Beagrie, N., Beagrie, R., & Rowlands, I. (2009). Research data preservation and access: the views of researchers. Ariadne, 60. Retrieved August 18, 2009, from http://www.ariadne.ac.uk/issue60/beagrie-et-al/

Berners-Lee, T. (1998). Cool URIs don’t change. W3C. Retrieved July 15, 2008, from http://www.w3.org/Provider/Style/URI.html

Binary digit. (2011). Google.com. Retrieved December 13, 2011, from http://www.google.com/search?client=safari&rls=en&q=define:+binary+digit&ie=UTF-8&oe=UTF-8

Blue Ribbon Task Force on Sustainable Digital Preservation and Access. (2008, December). Sustaining the digital investment: issues and challenges of economically sustainable digital preservation. San Diego, CA: San Diego Supercomputer Center. Retrieved January 24, 2009, from http://brtf.sdsc.edu/biblio/BRTF_Interim_Report.pdf

Brody, T. (2000). Mining the social life of an ePrint archive. Retrieved September 16, 2001, from the University of Southampton, OpCit Project Web site: http://opcit.eprints.org/tdb198/opcit/q2/

Buckland, M.K. (1991). Information as thing. Journal of the American Society for Information Science, 42(5), 351-360.

Butler, D. (2007). Agencies join forces to share data. Nature, 446, 354.

Butler, D. (2007). Data sharing: the next generation. Nature, 446, 10-11.

Carr, L. (1999). Metadata changes to XXX papers in a three month period. Retrieved October 13, 2001, from the University of Southampton, Electronics and Computer Science Web site: http://users.ecs.soton.ac.uk/lac/XXXmetadatadeltas.html

Center for Research Libraries (2007). Ten principles. Retrieved December 8, 2011, from http://www.crl.edu/archiving-preservation/digital-archives/metrics-assessing-and-certifying/core-re

Committee on Ensuring the Utility and Integrity of Research Data in the Digital Age; National Academy of Sciences. (2009). Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age. Executive Summary. Washington, DC: the National Academies Press. Retrieved January 7, 2009, from http://www.nap.edu/catalog.php?record_id=12615

CCSDS. (2002). Reference model for an Open Archival Information System (OAIS) (CCSDS 650.0-B-1). Washington, DC: National Aeronautics and Space Administration (NASA). Retrieved April 3, 2007, from http://nost.gsfc.nasa.gov/isoas/

CCSDS. (2004). Producer-archive interface methodology abstract standard (CCSDS 651.0-B-1). Washington, DC: National Aeronautics and Space Administration (NASA). Retrieved August 18, 2007, from http://public.ccsds.org/publications/archive/651x0b1.pdf

CCSDS. (2011). Audit and Certification of Trustworthy Digital Repositories (CCSDS 652.0-M-1). Magenta Book, September 2011. Washington, DC: National Aeronautics and Space Administration (NASA).

Curry, A. (2011). Rescue of Old Data Offers Lesson for Particle Physicists. Science, 331, 694-695.

Dale, R. (2007). Mapping of audit & certification criteria for CRL meeting (15-16 January 2007). Retrieved September 11, 2007, from http://wiki.digitalrepositoryauditandcertification.org/pub/Main/ReferenceInputDocuments/TRAC-Nestor-DCC-criteria_mapping.doc

Davis, J. R. & Lagoze, C. (2000). NCSTRL: design and deployment of a globally distributed digital library. Journal of the American Society for Information Science, 51(3), 273-280.

Digital Curation Centre. (2011). DRAMBORA. Retrieved December 9, 2011, from http://www.dcc.ac.uk/resources/tools-and-applications/drambora

Digital Curation Centre. (2010). What is digital curation? Retrieved November 6, 2011, from http://www.dcc.ac.uk/digital-curation/what-digital-curation

Digital Curation Centre & Digital Preservation Europe. (2007). DCC and DPE digital repository audit method based on risk assessment (DRAMBORA). Retrieved August 1, 2007, from http://www.repositoryaudit.eu/download

Digital Preservation. (2009). Introduction – definitions and concepts. Digital Preservation Coalition. Retrieved November 6, 2011, from http://dpconline.org/advice/preservationhandbook/introduction/definitions-and-concepts

Dobratz, S., Schoger, A., & Strathmann, S. (2006). The nestor Catalogue of Criteria for Trusted Digital Repository Evaluation and Certification. Paper presented at the workshop on “digital curation & trusted repositories: seeking success”, held in conjunction with the ACM/IEEE Joint Conference on Digital Libraries, June 11-15, 2006, Chapel Hill, NC, USA. Retrieved December 1, 2011, from http://www.ils.unc.edu/tibbo/JCDL2006/Dobratz-JCDLWorkshop2006.pdf

Duranti, L. (1995). Reliability and authenticity: the concepts and their implications. Archivaria, 39 (Spring), 5-10.

Editor. (2009). Data’s shameful neglect. Nature, 461, 145.

Educopia Institute. (2010, April). Metaarchive cooperative TRAC audit checklist. Prepared by M. Schultz. Atlanta, GA: Educopia Institute. Retrieved December 10, 2010 from http://www.metaarchive.org/sites/default/files/MetaArchive_TRAC_Checklist.pdf

Egger, A. (2006). Shortcomings of the Reference Model for an Open Archival Information System (OAIS). IEEE TCDL Bulletin, 2(2). Retrieved October 23, 2009, from http://www.ieee-tcdl.org/Bulletin/v2n2/egger/egger.html

Fedora and the Preservation of University Records Project. (2006). 2.1 Ingest Guide, Version 1.0 (tufts:central:dca:UA069:UA069.004.001.00006). Retrieved April 16, 2009, from the Tufts University, Digital Collections and Archives, Tufts Digital Library Web site: http://repository01.lib.tufts.edu:8080/fedora/get/tufts:UA069.004.001.00006/bdef:TuftsPDF/getPDF

Fanelli, D. (2011). Negative results are disappearing from most disciplines and countries. Scientometrics, September 2011, 1-14.

Galloway, P. (2004). Preservation of digital objects. In B. Cronin (Ed.), Annual Review of Information Science and Technology, 38(1), (pp. 549-590).

Ginsparg, P. (2011). ArXiv at 20. Nature, 476, 145-147.

Hanson, B., Sugden, A., & Alberts, B. (2011). Making Data Maximally Available. Science, 331, 649.

Hedstrom, M. (1995). Electronic archives: integrity and access in the network environment. American Archivist, 58(3), 312-324.

Higgins, S. (2007). Draft DCC curation lifecycle model. International Journal of Digital Curation, 2(2). Retrieved March 22, 2008, from http://www.ijdc.net/index.php/ijdc/article/view/46

InterPARES. (2001). The long-term preservation of authentic electronic records: findings of the InterPARES project. Retrieved October 5, 2007, from http://www.interpares.org/ip1/ip1_index.cfm

Jackson, A. S., Han, M., Groetsch, K., Mustafoff, M., & Cole, T. W. (2008). Dublin Core metadata harvested through the OAI-PMH (pre-print). Journal of Library Metadata, 8(1).

Koehler, W. (1999). An analysis of web page and web site constancy and permanence. Journal of the American Society for Information Science, 50(2), 162-180.

Koehler, W. (2004). A longitudinal study of web pages continued: a consideration of document persistence. Information Research, 9(2).

Krasner-Khait, B. (2001). Survivor: the history of the library. History Magazine, October/November 2001. Retrieved August 30, 2011, from http://www.history-magazine.com/libraries.html

Kunze, J. (2003). Towards electronic persistence using ARK identifiers. Retrieved July 10, 2008, from the University of California, California Digital Library, Inside CDL Web site: http://www.cdlib.org/inside/diglib/ark/arkcdl.pdf

Lagoze, C. and Van de Sompel, H. (2001). The Open Archives Initiative: building a low-barrier interoperability framework. In Proceedings of the First ACM/IEEE-CS Joint Conference on Digital Libraries, June 24-28, 2001, Roanoke, VA. pp. 54-62.

Lavoie, B. (2004). The open archival information system reference model: introductory guide. Technology Watch Report. Dublin, OH: Digital Preservation Coalition. Retrieved March 6, 2007, from http://www.dpconline.org/docs/lavoie_OAIS.pdf

Lavoie, B. & Dempsey, L. (2004). Thirteen ways of looking at…digital preservation. D-Lib Magazine, 10(7/8). Retrieved May 7, 2007, from http://www.dlib.org/dlib/july04/lavoie/07lavoie.html

Lavoie, B. & Gartner, R. (2005). Preservation metadata. Technology Watch Report. Dublin, OH: Digital Preservation Coalition. Retrieved June 20, 2009, http://www.dpconline.org/docs/reports/dpctw05-01.pdf

Lawrence, G.W., Kehoe, W.R., Rieger, O.Y., Walters, W.H., & Kenney, A.R. (2000). Risk management of digital information: a file format investigation. Washington, DC: Council on Library and Information Resources. Retrieved October 22, 2007, from http://www.clir.org/pubs/reports/pub93/contents.html

Lee, C. (2010). Open archival information system (OAIS) reference model. In Encyclopedia of Library and Information Sciences, Third Edition. London: Taylor & Francis.

Lee, C., Tibbo, H.R., & Schaefer, J.C. (2007). Defining what digital curators do and what they need to know: The DigCCurr Project. In Proceedings of the 2007 ACM/IEEE Joint Conference on Digital Libraries, 49-50.

Lynch, C. A. (1994). The integrity of digital information: mechanics and definitional issues. Journal of the American Society for Information Science, 45(10), 737-744.

Lynch, C. (2000). Authenticity and integrity in the digital environment: an exploratory analysis of the central role of trust. Authenticity in a digital environment. Washington, DC: Council in Library and Information Resources. Retrieved April 14, 2009, from http://www.clir.org/pubs/reports/pub92/pub92.pdf

McCown, F., Chan, S., Nelson, M.L., & Bollen, J. (2005). The availability and persistence of Web references in D-Lib Magazine. Paper presented at the 5th International Web Archiving Workshop (IWAW05), Vienna, Austria. Retrieved July 14, 2008, from http://arxiv.org/abs/cs.OH/0511077

Mearian, L. (2008). Study: digital universe and its impact bigger than we thought. Computerworld, March 11, 2008. Retrieved March 14, 2008, from http://www.computerworld.com/action/article.do?command=viewArticleBasic&articleId=9067639

Moore, R. (2002). The preservation of data, information, and knowledge. In Proceedings of the World Library Summit, April 24-26, 2002, Singapore. Retrieved April 1, 2009, from http://www.sdsc.edu/NARA/Publications/Web/moore-rw.doc

Moore, R. (2004). Evolution of data grid concepts. Paper presented at the workshop on “data” at the 10th Global Grid Forum, Berlin, Germany, March 9-13, 2004. Retrieved March 23, 2009, from http://www.npaci.edu/DICE/Pubs/Grid-evolution.doc

Moore, R.W. (2004). Preservation Environments. In Proceedings of the NASA/IEEE MSST 2004 Twelfth NASA Goddard Conference on Mass Storage Systems and Technologies in cooperation with the Twenty-First IEEE Conference on Mass Storage Systems and Technologies (MSST 2004), April 13-16, 2004, Adelphi, Maryland, USA. Retrieved September 26, 2010, from http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/20040121020_2004117345.pdf

Moore, R. (2005). Persistent collections. In S.H. Kostow & S. Subramaniam (Eds.), Databasing the brain: from data to knowledge (neuroinformatics) (pp. 69-82). Hoboken, NJ: John Wiley and Sons.

Moore, R. (2006). Building preservation environments with data grid technology. American Archivist, 69(1), 139-158.

Moore, R. & Merzky, A. (2003). Persistent archive concepts. Paper presented at the 7th Global Grid Forum, Tokyo, Japan, March 4-7, 2003. Retrieved March 4, 2009, from http://www.npaci.edu/DICE/Pubs/Data-PAWG-PA.doc

Moore, R., Rajasekar, A., & Marciano, R. (2007). Implementing Trusted Digital Repositories. In Proceedings of the DigCCurr2007 International Symposium in Digital Curation, University of North Carolina – Chapel Hill, Chapel Hill, NC USA, 2007. Retrieved September 24, 2010, from http://www.ils.unc.edu/digccurr2007/papers/moore_paper_6-4.pdf

Moore, R. & Smith, M. (2007). Automated Validation of Trusted Digital Repository Assessment Criteria. Journal of Digital Information, 8(2). Retrieved March 2, 2010, from http://journals.tdl.org/jodi/article/view/198/181

National Initiative for a Networked Cultural Heritage. (2002). Rights management. In the NINCH guide to good practice in the digital representation and management of cultural heritage materials, v.1.0. Glasgow: University of Glasgow (HATII) & NINCH. Retrieved April 17, 2009, from http://www.nyu.edu/its/humanities/ninchguide/IV/

National Science Foundation. (2005). Long-lived digital data collections enabling research and education in the 21st century (NSB-05-40). Arlington, VA: National Science Foundation. Retrieved May 5, 2008, from http://www.nsf.gov/pubs/2005/nsb0540/

Nelson, B. (2009). Data sharing: empty archives. Nature, 461, 160-163.

Nelson, M.L. (2000). Buckets: Smart Objects for Digital Libraries (Doctoral Dissertation). Retrieved December 14, 2011, from http://www.cs.odu.edu/~mln/phd/

Nelson, M.L., & Allen, B.D. (2002). Object persistence and availability in digital libraries. D-Lib Magazine, 8(1). Retrieved July 18, 2007, from http://www.dlib.org/dlib/january02/nelson/01nelson.html

NESTOR Working Group on Trusted Repository — Certification. (2006). Catalog of criteria for trusted digital repositories version 1 draft for public comment (urn:nbn:de:0008-2006060703). Berlin: nestor Working Group — Certification. Retrieved April 14, 2009, http://edoc.hu-berlin.de/series/nestor-materialien/8en/PDF/8en.pdf

Oltmans, E. & Kol, N. (2005). A comparison between migration and emulation in terms of costs. RLG DigiNews 9(2). Retrieved September 10, 2007, from http://worldcat.org/arcviewer/1/OCC/2007/08/08/0000070519/viewer/file959.html

Online Computer Library Center, Inc. & Center for Research Libraries. (2007). Trustworthy repositories audit & certification: criteria and checklist version 1.0. Dublin, OH & Chicago, IL: OCLC & CRL. Retrieved September 11, 2007, from http://www.crl.edu/sites/default/files/attachments/pages/trac_0.pdf

O’Toole, J.M. (1989). On the idea of permanence. American Archivist, 52, 10-25.

Paskin, N. (2003). DOI: A 2003 progress report. D-Lib Magazine, 9(6). Retrieved July 9, 2008, from http://www.dlib.org/dlib/june03/paskin/06paskin.html

Rajasekar, A., Wan, M., Moore, R., & Schroeder, W. (2006). A prototype rule-based distributed data management system. Paper presented at a workshop on “next generation distributed data management” at the High Performance Distributed Computing Conference, June 19-23, 2006, Paris, France.

Research Information Network. (2011). Data centres: their use, value, and impact. A Research Information Network report. London, UK: JISC, September 2011.

Research Libraries Group. (1996). Preserving digital information report of the task force on archiving of digital information. Final report of the Task Force on Archiving of Digital Information commissioned by the Commission on Preservation and Access and the Research Libraries Group. Retrieved September 24, 2007, from http://www.eric.ed.gov/ERICWebPortal/custom/portlets/recordDetails/detailmini.jsp?nfpb=true&&ERICExtSearch_SearchValue_0=ED395602&ERICExtSearch_SearchType_0=eric_accno&accno=ED395602

Research Libraries Group. (2002). Trusted digital repositories: attributes and responsibilities an RLG-OCLC report. Mountain View, CA: Research Libraries Group. Retrieved September 11, 2007, from http://www.oclc.org/programs/ourwork/past/trustedrep/repositories.pdf

Rosenthal, D.S.H., Robertson, T., Lipkis, T., Reich, V., Morabito, S. (2005). Requirements for digital preservation systems a bottom-up approach. D-Lib Magazine, 11(11). Retrieved August 11, 2007, from http://www.dlib.org/dlib/november05/rosenthal/11rosenthal.html

Ross, S. & McHugh, A. (2006). The role of evidence in establishing trust in repositories. D-Lib Magazine 12(7/8). Retrieved May 6, 2007, from http://www.dlib.org/dlib/july06/ross/07ross.html

Rothenberg, J. (1999). Avoiding technological quicksand: finding a viable technical foundation for digital preservation (pub 77). A report to the Council on Library and Information Resources. Washington, DC: Council on Library and Information Resources. Retrieved April 16, 2009, from http://www.clir.org/pubs/reports/rothenberg/pub77.pdf

Rothenberg, J. (1999). Ensuring the longevity of digital information. Washington, DC: Council on Library and Information Resources. Retrieved April 16, 2009, from http://www.clir.org/pubs/archives/ensuring.pdf

Society of American Archivists. (1999). Core Archival Functions. Guidelines for College and University Archives. Prepared by the College and University Archives Section of the Society of American Archivists (SAA). Retrieved May 26, 2010, from http://www.archivists.org/governance/guidelines/cu_guidelines4.asp

Science and Technology Council. (2007). The digital dilemma strategic issues in archiving and accessing digital motion picture materials. The Science and Technology Council of the Academy of Motion Picture Arts and Sciences. Hollywood, CA: Academy of Motion Picture Arts and Sciences.

Shreeves, S. L., Knutson, E. M., Stvilia, B., Palmer, C. L., Twidale, M. B., & Cole, T. W. (2005). Is ‘quality’ metadata ‘shareable’ metadata? The implications of local metadata practices for federated collections. In Proceedings of the Twelfth National Conference of the Association of College and Research Libraries, April 7-10 2005, Minneapolis, MN, 223-237.

Sivathanu, G., Wright, C.P., & Zadok, E. (2005). Ensuring data integrity in storage: techniques and applications. In Proceedings of the first ACM International Workshop on Storage Security and Survivability (StorageSS 05), held in conjunction with the 12th ACM Conference on Computer and Communications Security (CCS 2005), November 7-11, 2005, Alexandria, VA. Retrieved October 4, 2007, from http://www.fsl.cs.sunysb.edu/docs/integrity-storagess05/integrity.html

Smith, M. & Moore, R. (2006). Digital Archive Policies and Trusted Digital Repositories. Paper presented at the 2nd International Digital Curation Conference, November 21 – 22, 2006, Glasgow, Scotland. Retrieved November 2, 2009, from http://pledge.mit.edu/images/6/6f/Smith-Moore-DCC-Nov-2006.pdf

Steinhart, G., Dietrich, D., & Green, A. (2009). Establishing trust in a chain of preservation the TRAC checklist applied to a data staging repository (DataStaR). D-Lib Magazine 15(9/10). Retrieved October 13, 2009 from http://www.dlib.org/dlib/september09/steinhart/09steinhart.html

Thibodeau, K. (2002). Overview of technological approaches to digital preservation and challenges in coming years. In Proceedings of the State of Digital Preservation: An International Perspective, at the Institutes for Information Science, April 24-25, 2002, Washington, DC. Retrieved September 26, 2007 from http://www.clir.org/pubs/reports/pub107/thibodeau.html

Thibodeau, K. (2007). The Electronic Records Archives Program at the National Archives and Records Administration. First Monday, 12(7). Retrieved January 15, 2009 from http://firstmonday.org/issues/issue12_7/thibodeau/index.html

Tibbo, H.R. (2003). On the nature and importance of archiving in the digital age. Advances in Computers, 57, 1-67.

URI Planning Interest Group. (2001). URIs, URLs, and URNs: Clarifications and Recommendations 1.0. Report from the joint W3C/IETF URI Planning Interest Group, W3C Note, 21 September 2001. Retrieved November, 8, 2011, from http://www.w3.org/TR/uri-clarification/

Vardigan, M. & Whiteman, C. (2007). ICPSR meets OAIS: applying the OAIS reference model to the social science archive context. Archival Science, 7(1). Netherlands: Springer. Retrieved February 20, 2008, from http://www.springerlink.com/content/50746212r6g21326/

Walters, T. & Skinner, K. (2011). New roles for new times: digital curation for preservation. Report prepared for the Association of Research Libraries. Washington, D.C.: Association of Research Libraries. Retrieved April 2, 2011, from http://www.arl.org/bm~doc/nrnt_digital_curation17mar11.pdf.

Ward, J. (2004). Unqualified Dublin Core usage in OAI-PMH Data Providers. OCLC Systems and Services, 20(1), 40-47.

Ward, J.H., de Torcy, A., Chua, M., and Crabtree, J. (2009). Extracting and Ingesting DDI Metadata and Digital Objects from a Data Archive into the iRODS extension of the NARA TPAP using the OAI-PMH. In Proceedings of the 5th IEEE International Conference on e-Science, Oxford, UK, December 9-11, 2009.

Waters, D. and Garrett, J. (1996). Preserving Digital Information. Report of the Task Force on Archiving of Digital Information. Washington, DC: CLIR, May 1996.

Wells, H.G. (1938). World brain. Garden City, NY: Doubleday, Doran and Co.

Witt, M., Carlson, J., & Brandt, D.S. (2009). Constructing data curation profiles. International Journal of Digital Libraries, 3(4), 93-103.

Zen College Life. (2011). The history of libraries through the ages. Retrieved August 30, 2011, from http://www.zencollegelife.com/the-history-of-libraries-through-the-ages/

Manage Data: Preservation Standards & Management
