Trusted Digital Repository Development Lit Review


Abstract

Computer scientists who work with digital data that has long-term preservation value, archivists and librarians whose responsibilities include preserving digital materials, and other stakeholders in digital preservation have long called for the development and adoption of open standards in support of long-term digital preservation. Over the past fifteen years, preservation experts have defined “trust” and a “trustworthy” digital repository; defined the attributes and responsibilities of a trustworthy digital repository; defined the criteria and created a checklist for the audit and certification of a trustworthy digital repository; evolved these criteria into a standard; and defined a standard for bodies that wish to provide audit and certification to candidate trustworthy digital repositories. This literature review discusses the development of standards for the audit and certification of a trustworthy digital repository.

Citation

Ward, J.H. (2012). Managing Data: Preservation Standards & Audit & Certification Mechanisms (i.e., “policies”). Unpublished Manuscript, University of North Carolina at Chapel Hill.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.


Table of Contents

Abstract

Introduction

“Trust”

The Types of Audit and Certification

Trusted Digital Repositories: Attributes and Responsibilities

Trusted Digital Repositories
Attributes of a Trusted Digital Repository
Responsibilities of a Trusted Digital Repository
Certification of a Trusted Digital Repository
Summary

Trusted Digital Repositories: Audit and Certification

Trustworthy Repositories Audit & Certification: Criteria and Checklist
Organizational Infrastructure
Digital Object Management
Technologies, Technical Infrastructure, and Security
Audit and Certification of Trustworthy Digital Repositories Recommended Practice

Trusted Digital Repositories: Requirements for Certifiers

ISO/IEC 17021 Conformity Assessment
Requirements for Bodies Providing Audit and Certification of Candidate Trustworthy Digital Repositories Recommended Practice

Trusted Digital Repositories: Criticisms

Summary

References


Table of Figures

Figure 1 – TRAC, A1.1 (OCLC & CRL, 2007).

Figure 2 – Audit and Certification of Trustworthy Digital Repositories Recommended Practice, 3.1.1 (CCSDS, 2011).


Introduction

Computer scientists who work with digital data that has long-term preservation value, archivists and librarians whose responsibilities include preserving digital materials, and other stakeholders in digital preservation have long called for the development and adoption of open standards in support of long-term digital preservation (Lee, 2010; Science and Technology Council, 2007; Waters & Garrett, 1996). However, Hedstrom (1995) cautions that standards will provide a high-level solution to some of the obstacles that may prevent the preservation of digital materials only if they provide the conditions for the archive to conform to standard archival practices, software and hardware designers comply with the standards, and producers and users select and use the standards. The development of standards for the audit and certification of digital repositories as “trustworthy” is a major step towards ensuring that digital data will be curated and preserved for the indefinite long-term, because such standards provide the conditions under which all three of Hedstrom’s criteria may be met.

In 1996, the Commission on Preservation and Access and the Research Libraries Group released the now-seminal report, “Preserving Digital Information” (Waters & Garrett, 1996). The Research Libraries Group (RLG) (2002) noted three key points that led to the interest in developing standards for the “attributes and responsibilities” of a “trusted digital repository”: the requirement for ‘a deep infrastructure capable of supporting a distributed system of digital archives’; ‘the existence of a sufficient number of trusted organizations capable of storing, migrating, and providing access to digital collections’; and, ‘a process of certification is needed to create an overall climate of trust about the prospects of preserving digital information’. A few years later, the Consultative Committee on Space Data Systems (CCSDS) released the “Reference Model for an Open Archival Information System (OAIS)” (CCSDS, 2002). This document defined a set of common terms, components, and concepts for a digital archive. It provided not just a technical reference but also outlined the organization of people and systems required to preserve information for the indefinite long-term and make it accessible (RLG, 2002).

However, experts and other stakeholders with an interest in preserving information for the long-term recognized that as part of defining an archival system, they also needed to form a consensus on the responsibilities and characteristics of a sustainable digital repository. In other words, they needed a method to “prove” (i.e., “trust”) that an organization’s systems were, in fact, OAIS-compliant. First, they would have to define the attributes and responsibilities of a “trusted” digital repository. Next, they would have to develop a method to audit and certify that a repository may be “trusted”. And, finally, they would have to create an infrastructure to certify and train the auditors.

The essay “Managing Data: the Emergence & Development of Digital Curation & Digital Preservation Standards” contains sections that provide the motivations for the development of standards, as well as an overview of, and example applications of, the “Audit and Certification of Trustworthy Digital Repositories Recommended Practice” (CCSDS, 2011). That essay also covers the definitions of “reliable”, “authentic”, “integrity”, and “trustworthy”, et al. A very short discussion of this Recommended Practice and a detailed discussion of the OAIS Reference Model are available in the essay, “Managing Data: Preservation Repository Design (the OAIS Reference Model)“.

This essay on “preservation standards and audit and certification mechanisms” is an overview of “trust”; the types of audit and certification available generally; the development of standards for the audit and certification of a repository as “trustworthy”; a brief overview of the standards themselves; and, a very brief overview of the requirements for the certification of bodies that certify the auditors of said trusted digital repositories. Thus, the scope of this particular literature review is deliberately narrow to avoid the duplication of previously discussed topics.

“Trust”

Jøsang and Knapskog (1998) discussed “trust” as a “subjective belief” when they described a metric for a “trusted system”, while Lynch (2000) described “trust” as an elusive and subjective probability. Both the former and the latter wrote that a user trusts the evaluation of the certifier, not the actual system component. Jøsang and Knapskog drew attention to the fact that an evaluator only certifies that a system has been checked against a particular set of criteria; whether or not a user should or will trust those criteria is another matter. The two researchers pointed out that most end users of a certified system do not have the necessary expertise to evaluate the appropriateness and quality of the criteria used to audit the system. They must trust that the people who established the criteria chose relevant components, and that the evaluator had the skill and knowledge to assess the system.

This is similar to Lynch (2001), who wrote that users tend to assume digital system designers and content creators have users’ best interests at heart, which is not always the case; yet the idea of creating a formal system of trust “is complex and alien to most people”. Ross & McHugh (2006) posited that “trust” may be established with the various stakeholders affiliated with a repository by providing quantifiable “evidence”, such as annual financial reports, business plans, policy documents, procedure manuals, mission statements, etc., so that a system’s “trustworthiness” is believable. Jøsang and Knapskog’s (1998) and Ross and McHugh’s (2006) shared research goal was to provide a methodical evaluation of system components in order to define “trust” in a system that is, in and of itself, trustworthy (RLG, 2002).

Finally, Merriam-Webster (Trust, 2011) defines “trust” as “one in which confidence is placed”; “a charge or duty imposed in faith or confidence or as a condition of some relationship”; and, “something committed or entrusted to one to be used or cared for in the interest of another”.

The Types of Audit and Certification

Jøsang and Knapskog (1998) described four types of roles generally assigned to “government driven evaluation schemes”: accreditor, certifier, evaluator, and sponsor. They defined the accreditor as the body that accredits the evaluator and the certifier and that sometimes evaluates the system itself. They noted that the certifier is accredited based on “documented competence level, skill, and resources”. They stipulated that the certifier might also be a “government body issuing…certificates based on the evaluation reports from the evaluators”. They defined the evaluator as “yet another government agency” that is “accredited by the accreditor”, and “the quality of the evaluator’s work will be supervised by the certifier”. They described the sponsor as the party interested in having their system evaluated (Jøsang & Knapskog, 1998). In other words, the authors wrote that a party who would like their system audited and certified against a particular set of evaluation criteria (“the sponsor”) engages an auditor (“the evaluator”), whose work is supervised by a certifying body (“the certifier”) that issues the certificate and that has, in turn, been accredited by an accrediting agency (“the accreditor”).

RLG (2002) defined four approaches to certification: individual, program, process, and data. They described “individual” as personnel certification. This is also called professional certification or accreditation, and it is often given to an individual when they meet some combination of work experience, education, and professional competencies. RLG noted that at the time of writing, there were no professional certifications for digital repository management or electronic archiving. They cited “program” as a type of certification for an institution or a program achieved through a combination of site visits and “self-evaluation using standardized checklists and criteria”.

RLG explained that the assessment areas included access, outreach, collection preservation and development, staff, facilities, governing and legal authority, and financial resources. They provided examples of this type of certification that included museums, schools and programs within a university, etc. They defined “process” as “quantitative or qualitative guidelines…to internal and external requirements” that use various methods and procedures, such as the ISO 9000 family of standards (RLG, 2002).

Finally, the authors designated the “data” approach to certification as addressing “the persistence or reliability of data over time and data security”. They wrote that this certification requires adherence to procedure manuals and international standards, such as ISO, that ensure both external and internal quality control. They noted that certification will require the managers of a repository to document migration processes, maintain and create metadata, authenticate new copies, and update the data or files (RLG, 2002).

Trusted Digital Repositories: Attributes and Responsibilities

RLG (2002) defined a “trusted digital repository” as “one whose mission is to provide reliable, long-term access to managed digital resources to its designated community, now and in the future”. They described the “critical component” as “the ability to prove reliability and trustworthiness over time”. The authors’ stated goal for the report was to create a framework for large and small institutions that could cover different responsibilities, architectures, materials, and situations yet still provide a foundation with which to build a sustainable “trusted repository” (RLG, 2002).

Trusted Digital Repositories

The authors of the RLG document noted that repositories may be contracted to a third party or locally designed and maintained; regardless, the expectations for trust are that a digital repository must:

  • “Accept responsibility for the long-term maintenance of digital resources on behalf of its depositors and for the benefit of current and future users;
  • Have an organizational system that supports not only long-term viability of the repository, but also the digital information for which it has responsibility;
  • Demonstrate fiscal responsibility and sustainability;
  • Design its system(s) in accordance with commonly accepted conventions and standards to ensure the ongoing management, access, and security of materials deposited within it;
  • Establish methodologies for system evaluation that meet community expectations of trustworthiness;
  • Be depended upon to carry out its long-term responsibilities to depositors and users openly and explicitly;
  • Have policies, practices, and performance that can be audited and measured; and
  • Meet the responsibilities detailed in Section 3 [sic] of this paper” (RLG, 2002).

Per the OAIS Reference Model (CCSDS, 2002), they noted that the repository’s “designated community” will be the primary determining factor in how the content is accessed and disseminated, how it is managed and preserved, and what is deposited, including its content and format. The authors of the report discussed and defined “trust”, noting, “most cultural institutions are already trusted”. Regardless, they outlined three levels of trust that administrators of a repository must consider in order to be a “trusted repository”: the trust a cultural institution must earn from its designated community; the trust cultural institutions must have in third-party providers; and the trust users of the repository must have in the digital objects provided to them by the repository owner via the repository software.

The report authors wrote that archives, libraries, and museums must simply keep doing what they have been doing for centuries in order to maintain the trust of their user community; they do not need to develop that trust because, as institutions, they have already earned it. RLG (2002) explained that while librarians, archivists, etc., are loath to use third-party providers who have not proven their reliability, the establishment of a certification program with periodic re-audits may overcome their reluctance. Finally, the authors stated that users must be able to trust that the digital items they receive from a repository are both authentic and reliable. In other words, the objects the users access must be unaltered and they must be what they purport to be (Bearman & Trant, 1998).

They established that this can be accomplished by the use of checksums and other forms of validation that are common in the Computer Science and digital security communities, although security does not equal integrity (Lynch, 1994). Waters & Garrett (1996) put forth that the “central goal” of an archival repository must be “to preserve information integrity”; this includes content, fixity, reference, provenance, and context.
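
To make this concrete, the following is a minimal sketch, in Python, of the kind of checksum-based fixity check the report alludes to: a digest is recorded when an object is deposited and recomputed later to confirm that the bits are unaltered. The function names, the choice of SHA-256, and the usage shown are illustrative assumptions rather than anything prescribed by RLG (2002) or Waters & Garrett (1996).

    import hashlib
    from pathlib import Path

    def compute_fixity(path: Path, algorithm: str = "sha256") -> str:
        """Compute a checksum ("fixity value") for a file, reading it in chunks."""
        digest = hashlib.new(algorithm)
        with path.open("rb") as handle:
            for chunk in iter(lambda: handle.read(65536), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def verify_fixity(path: Path, recorded_value: str, algorithm: str = "sha256") -> bool:
        """Return True if the file's current checksum matches the value recorded at deposit."""
        return compute_fixity(path, algorithm) == recorded_value

    # Hypothetical usage: record the value at deposit, verify it during a later audit.
    # recorded = compute_fixity(Path("deposits/report.pdf"))
    # assert verify_fixity(Path("archive/report.pdf"), recorded)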

For a discussion on “reliable”, “authentic”, “integrity”, and “trustworthy”, please see the essay, “Managing Data: the Emergence & Development of Digital Curation & Digital Preservation Standards“.

Attributes of a Trusted Digital Repository

RLG (2002) identified seven primary attributes of a trusted digital repository. They were and are: compliance with the OAIS Reference Model; administrative responsibility; organizational viability; financial sustainability; technological and procedural suitability; system security; and procedural accountability.

The authors defined “compliance with the OAIS” as the repository owners/administrators ensuring that the “overall repository system conforms” to the OAIS Reference Model. They described “administrative responsibility” as the repository administrators adhering to “community-agreed” best practices and standards, particularly with regards to sustainability and long-term viability. RLG (2002) explained “organizational viability” as creating and maintaining an organization and structure that is capable of curating the objects in the repository and providing access to them for the indefinite long-term. As part of this, they included maintaining trained staff, legal status, transparent business practices, succession plans, and relevant policies and procedures.

RLG (2002) designated “financial sustainability” as maintaining financial fitness, engaging in financial planning, etc., with an ongoing commitment to remain financially viable over the long-term. The authors outlined “technological and procedural suitability” as the repository owners/administrators keeping the archive’s software and hardware up to date, as well as complying with applicable best practices and standards for technical digital preservation. They outlined “system security” by describing the minimal requirements a repository must follow regarding best practices for risk management, including written policies and procedures for disaster preparedness, redundancy, firewalls, backups, authentication, data loss and corruption, etc.

Finally, RLG (2002) defined “procedural accountability” as the repository owners/administrators being accountable for all of the above. That is, the authors wrote that maintaining a trusted digital repository is a complex set of “interrelated tasks and functions”; the maintainer of the repository is responsible for ensuring that all required functions, tasks, and components are carried out (RLG, 2002).

Responsibilities of a Trusted Digital Repository

RLG (2002) described two primary responsibilities for the owners and administrators of a trusted digital repository: high-level organizational and curatorial responsibilities, and operational responsibilities. They subdivided organizational and curatorial responsibilities into three levels. The authors noted that organizations must understand their local requirements, recognize which other organizations may have similar requirements, and understand how these responsibilities may be shared.

The authors of the report summarized five primary areas in support of those three levels: the scope of the collections, preservation and lifecycle management, the wide range of stakeholders, the ownership of material and other legal issues, and, cost implications (RLG, 2002).

  1. The scope of the collections: the repository owners and administrators must know exactly what they have in their digital collection, and how to adequately preserve the integrity and authenticity of the properties and characteristics of the individual items.
  2. Preservation and lifecycle management: the repository owners and administrators must commit to proactive planning with regards to preserving and curating the items in the repository.
  3. The wide range of stakeholders: the repository owners and administrators must take into account the interests of all stakeholders when planning for long-term access to the materials. In some instances, they will have to act in spite of their stakeholders’ wishes, as some stakeholders tend to have short-term views, and they will not care about the long-term preservation of, and access to, the materials. Other stakeholders will have a differing point of view, and they will want the material preserved in the long-term. The repository owners and administrators will have to balance these competing interests.
  4. The ownership of material and other legal issues: digital librarians and archivists will have to take a proactive role with content producers. They must seek to preserve materials by curating the data early in its life cycle, while being cognizant of the copyright and intellectual property concerns of the content producers and owners.
  5. Cost implications: repository owners and administrators must commit financial resources to maintaining the content over the indefinite long-term, while bearing in mind that the true costs of doing so are variable.

In sum, RLG (2002) recommended incorporating preservation planning into the everyday management of the preservation repository.

Next, the authors of this RLG report defined operational responsibilities in more detail than the organizational and curatorial responsibilities, above. They wrote the operational responsibilities based on the OAIS Reference Model, and added to that the “critical role” of a repository in the “promotion of standards” (RLG, 2002). They defined these areas as:

  1. Negotiates for and accepts appropriate information from information producers and rights holders: this responsibility covers the submission agreement between a content Producer and the OAIS Archive. These responsibilities include preservation metadata, record keeping, authenticity checks, and legal issues. As part of fulfilling this role, a repository will have policies and procedures in place to cover collection development, copyright and intellectual property rights concerns, metadata standards, provenance and authenticity, appropriate archival assessment, and, records of all transactions with the Producer.
  2. Obtains sufficient control of the information provided to support long-term preservation: this responsibility refers to the “staging” process, where submitted content is stored after receipt from a Producer and before the material is ingested into the archive. The responsibilities of a repository administrator at this point encompass best practices for the ingest of materials, which includes an analysis of the digital content itself, including its “significant properties”; what requirements must be fulfilled to provide continuous access to the material; a metadata check against the repository’s standards (including adding metadata to bring the current metadata up to par); the assignment of a persistent and unique identifier; integrity/fixity/authentication checks; the creation of an OAIS Archival Information Package (AIP); and, storage into the OAIS Archive (a minimal sketch of this staging-to-AIP step follows this list).
  3. Determines, either by itself of [sic] with others, the users that make up its designated community, which should be able to understand the information provided: the repository administrators and owners must determine who their user base is so that they may understand how best to serve their Designated Community.
  4. Ensures that the information to be preserved is “independently understandable” to the designated community; that is, the community can understand the information without needing the assistance of experts: the repository owner and administrator must make the information available using generic tools that are available to the Designated Community. For example, documents might be made available via .pdf or .rtf because the software to render these documents is available for free to most users. A repository owner and/or administrator may not wish to preserve documents in the .pages file format, as this Apple file format is not commonly used and the software to render it is not free beyond a limited trial period.
  5. Follows documented policies and procedures that ensure the information is preserved against all reasonable contingencies and enables the information to be disseminated as authenticated copies of the original or as traceable to the original: the repository owners and administrators will document any unwritten policies and procedures, and follow best practice recommendations and standards where possible. These policies must include policies to define the Designated Community and its knowledge base; policies for material storage, including service-level agreements; policies for authentication and access control; a collection development policy, including preservation planning; a policy to keep policies updated with current recommendations, standards, and best practices; and, finally, links between procedures and policies, to ensure compliance across all collections in the repository.
  6. Makes the preserved information available to the designated community: the repository owners and administrators must comply with legal responsibilities such as licensing, copyright, and intellectual property regarding access to the content in the repository. Within that framework, however, they should plan to provide user support, record keeping, pricing (where applicable), authentication, and, most importantly, a method for resource discovery.
  7. Works closely with the repository’s designated community to advocate the use of good and (where possible) standard practice in the creation of digital resources; this may include an outreach program for potential depositors: the repository owners and administrators should work with all stakeholders to advocate the use of standards and recommended best practices (RLG, 2002). As the Science and Technology Council (2007) noted, using standards will reduce costs for all parties involved and better ensure the longevity of the material.
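
To ground the ingest-related responsibilities in items 1 and 2 above, the following minimal Python sketch shows one hypothetical staging-to-AIP step: it assigns a persistent identifier, records a fixity value, bundles the content with its descriptive metadata, and writes the first entry of an audit trail. The field names, package layout, and helper function are illustrative assumptions, not structures prescribed by RLG (2002) or the OAIS Reference Model.

    import hashlib
    import json
    import uuid
    from datetime import datetime, timezone
    from pathlib import Path

    def build_aip(submission_path: Path, descriptive_metadata: dict, staging_dir: Path) -> dict:
        """Sketch of packaging one submitted file (part of a SIP) into a simple AIP record."""
        content = submission_path.read_bytes()
        aip = {
            "identifier": f"urn:uuid:{uuid.uuid4()}",       # persistent, unique identifier
            "source_filename": submission_path.name,
            "descriptive_metadata": descriptive_metadata,    # checked against local standards elsewhere
            "fixity": {"algorithm": "sha256",
                       "value": hashlib.sha256(content).hexdigest()},
            "ingest_events": [{"event": "ingest",            # start of the audit trail for this AIP
                               "timestamp": datetime.now(timezone.utc).isoformat()}],
        }
        staging_dir.mkdir(parents=True, exist_ok=True)
        (staging_dir / submission_path.name).write_bytes(content)        # the content itself
        (staging_dir / f"{submission_path.stem}_aip.json").write_text(
            json.dumps(aip, indent=2))                                    # its packaging metadata
        return aip

In a production repository, the same information would normally be carried in a standard packaging format and preservation metadata schema rather than an ad hoc JSON file; the sketch only illustrates the sequence of steps.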

In conclusion, the OAIS Reference Model has provided a useful framework “for identifying the responsibilities of a trusted digital repository” (RLG, 2002).

Certification of a Trusted Digital Repository

As part of the certification framework, the authors of the RLG report intended to support Waters & Garrett’s (1996) assertion that archival repositories “must be able to prove that they are who they say they are by meeting or exceeding the standards and criteria of an independently-administered program for archival certification”.

RLG (2002) described two types of certification then in use within the libraries and archives community: the standards model and the audit model. The “standards” model is an informal process. They stated that standards are created when best practices and guidelines are established by the consensus of the expert community and then “certified” by other practitioners’ acceptance and/or use of the “standard”. In other words, librarians, archivists, and computer scientists who work with libraries decide what constitutes a “standard”; only rarely does a standard become formalized via ISO or another international organization. The authors described the audit model as an output of legislation or policies and procedures established by national agencies, such as the U.S. Department of Defense. That is, a governing body passes laws or policies, and the information repository’s policies must conform to the governing body’s requirements (RLG, 2002).

For a discussion of other approaches to certification, please see the earlier section, “The Types of Audit and Certification”.

Summary

RLG (2002) described a framework for a trusted digital repository’s responsibilities and attributes. They noted that these apply to repositories both large and small that hold a wide variety of content. The authors summarized their work above with several recommendations.

  • Recommendation 1: Develop a framework and process to support the certification of digital repositories.
  • Recommendation 2: Research and create tools to identify the attributes of digital materials that must be preserved.
  • Recommendation 3: Research and develop models for cooperative repository networks and services.
  • Recommendation 4: Design and develop systems for the unique, persistent identification of digital objects that expressly support long-term preservation.
  • Recommendation 5: Investigate and disseminate information about the complex relationship between digital preservation and intellectual property rights.
  • Recommendation 6: Investigate and determine which technical strategies best provide for continuing access to digital resources.
  • Recommendation 7: Investigate and define the minimal-level metadata required to manage digital information for the long term. Develop tools to automatically generate and/or extract as much of the required metadata as possible (RLG, 2002).

The remainder of this essay focuses on the results of Recommendation 1, above, regarding the development of certification standards for digital repositories.

Trusted Digital Repositories: Audit and Certification

Several researchers have addressed the problem of audit and certification. For example, Ross & McHugh (2006) created the Digital Repository Audit Method Based On Risk Assessment (DRAMBORA) to provide repository administrators with a self-audit method that yields quantifiable results (Digital Curation Centre, 2011). Dobratz, Schoger, and Strathmann (2006) developed the nestor Catalogue of Criteria for Trusted Digital Repository Evaluation and Certification within nestor, the Network of Expertise in Long-Term Storage of Digital Resources. Other researchers, such as Becker, et al. (2009), described a decision-making procedure for preservation planning that provides a means for repository administrators to consider various alternatives.

This section will examine the audit and certification method known as the “Trustworthy Repositories Audit & Certification (TRAC): Criteria and Checklist” and its follow-up document, the “Audit and Certification of Trustworthy Repositories Recommended Practice”. Researchers and practitioners across the globe, including Ross, McHugh, Dobratz, et al., combined their efforts and expertise to develop TRAC from a draft into a final version (Research Libraries Group, 2005; Dale, 2007). Their efforts have led to the development and refinement of TRAC into a CCSDS “Recommended Practice”; this may eventually become an ISO standard.

The essay, “Managing Data: the Emergence & Development of Digital Curation & Digital Preservation Standards” describes some of the related work in this area not covered below.

Trustworthy Repositories Audit & Certification: Criteria and Checklist

The authors of TRAC created it as part of a larger international effort to define an audit and certification process to ensure the longevity of digital objects. They defined a checklist that any repository manager could use to assess the trustworthiness of the repository. The checklist provided examples of the required evidence, but the list is illustrative rather than prescriptive; the authors did not try to list every possible type of example. It contained three sections: “organizational infrastructure”, “digital object management”, and “technologies, technical infrastructure, and security”.

The authors provided a spreadsheet-style “audit checklist” called “Criteria for Measuring Trustworthiness of Digital Repositories and Archives”. They noted that the criteria are applicable to any kind of repository and are assessed in terms of documentation (evidence), transparency (both internal and external), adequacy (individual context), and measurability (i.e., objective controls). The authors stated that a full certification process must include not just an external audit, but also tools to allow for self-examination and planning prior to an audit (OCLC & CRL, 2007). The terminology in the audit checklist conformed to the OAIS Reference Model.
A typical policy in TRAC followed the model of statement, explanation, and evidence (see Figure 1, below).


Figure 1 – TRAC, A1.1 (OCLC & CRL, 2007).
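
The statement/explanation/evidence pattern shown in Figure 1 maps naturally onto a simple self-assessment record, which is consistent with TRAC’s expectation that repositories examine themselves before an external audit. The following Python sketch is one hypothetical way a repository team might track such entries; the field names, status values, and example entry are assumptions, not part of the checklist itself.

    from dataclasses import dataclass, field

    @dataclass
    class ChecklistItem:
        """One TRAC-style criterion tracked during a self-assessment."""
        criterion_id: str                              # e.g., "A1.1"
        statement: str                                 # the requirement as written in the checklist
        explanation: str                               # local notes on how the requirement is interpreted
        evidence: list = field(default_factory=list)   # documents offered as evidence
        status: str = "not assessed"                   # e.g., "met", "partially met", "not met"

    # Hypothetical entry, paraphrasing the kind of content shown in Figure 1.
    item = ChecklistItem(
        criterion_id="A1.1",
        statement="The repository has a mission statement reflecting a commitment to preservation.",
        explanation="Assessed against the published mission statement and strategic plan.",
        evidence=["mission_statement.pdf", "strategic_plan.pdf"],
        status="partially met",
    )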

I. Organizational Infrastructure

The authors of TRAC considered the organizational infrastructure to be as critical a component as the technical infrastructure (OCLC & CRL, 2007). This reflected the view of the authors of the OAIS Reference Model, who consider an OAIS to be “an archive, consisting of an organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community” (CCSDS, 2002). OCLC & CRL (2007) considered “organizational attributes” to be a characteristic of a trusted digital repository, and these characteristics are reflected in RLG’s (2002) grouping of financial sustainability, organizational viability, procedural accountability, and administrative responsibility as four of the seven attributes of a trusted digital repository.

The authors of TRAC considered the following ten elements to be part of organizational infrastructure, but they did not limit it to only these elements.

  1. Governance
  2. Organizational structure
  3. Mandate or purpose
  4. Scope
  5. Roles and responsibilities
  6. Policy framework
  7. Funding system
  8. Financial issues, including assets
  9. Contracts, licenses, and liabilities
  10. Transparency (OCLC & CRL, 2007).

In addition, they grouped the above elements into five areas:

  1. Governance and organizational viability: the owners and managers of a repository must commit to established best practices and standards for the long term. This includes mission statements, and succession/contingency plans.
  2. Organizational structure and staffing: the repository owners and managers must commit to hiring an appropriate number of qualified staff that receives regular ongoing professional development.
  3. Procedural accountability and policy framework: the repository owners and managers must provide transparency with regards to documentation related to the long-term preservation and access of the archival data. This requirement provides evidence to stakeholders of the repository’s trustworthiness. This documentation may define the Designated Community, what policies and procedures are in place, legal requirements and obligations, reviews, feedback, self-assessment, provenance and integrity, and operations and management.
  4. Financial sustainability: the repository owners and administrators must follow solid business practices that provide for the long-term sustainability of the organization and the digital archive. This includes business plans, annual reviews, financial audits, risk management, and possible funding gaps.
  5. Contracts, licenses, and liabilities: the repository owners and administrators must make contracts and licenses “available for audits so that liabilities and risks may be evaluated”. This requirement includes deposit agreements, licenses, preservation rights, collection maintenance agreements, intellectual property and copyright, and, ingest (OCLC & CRL, 2007).

II. Digital Object Management

The authors described this section as a combination of technical and organizational aspects. They organized the requirements for this section to align with six of the seven OAIS Functional Entities: Ingest, Archival Storage, Preservation Planning, Data Management, Administration, and Access (OCLC & CRL, 2007; CCSDS, 2002). The authors of the TRAC audit & checklist defined these six sections as follows.

  1. The initial phase of ingest that addresses acquisition of digital content.
  2. The final phase of ingest that places the acquired digital content into the forms, often referred to as Archival Information Packages (AIPs), used by the repository for long-term preservation.
  3. Current, sound, and documented preservation strategies along with mechanisms to keep them up to date in the face of changing technical environments.
  4. Minimal conditions for performing long-term preservation of AIPs.
  5. Minimal-level metadata to allow digital objects to be located and managed within the system.
  6. The repository’s ability to produce and disseminate accurate, authentic versions of the digital objects (OCLC & CRL, 2007).

The authors further elucidated the above areas as follows.

  1. Ingest: acquisition of content

    This section covered the process required to acquire content; this generally falls under the realm of a Submission Agreement between the Producer and the repository. The Producer may be external or internal to the repository’s governing organization. The authors recommended considering the object’s properties; any information that needs to be associated with the submitted object(s); mechanisms to authenticate the materials and verify the integrity of each ingested object; maintaining control of the bits so that none may be altered at any time; regular contact with the Producer as appropriate; a formal acceptance process with the Producer for all content; and an audit trail of the Ingest process.

  2. Ingest: creation of the archival package

    The actions in this section covered the creation of an AIP. These actions involved documentation: of each AIP preserved by the repository; that each AIP created is actually adequate for preservation purposes; of the process of constructing an AIP from a SIP; of the actions performed on each SIP (deletion or creation as an AIP); of the use of persistent and unique naming schemas/identifiers, else, of the preservation of the existing unique naming schema; of the context for each AIP; of an audit trail of the metadata records ingested; of associated preservation metadata; of testing the ability of current tools to render the information content; of the verification of completeness of each AIP; of an integrity audit mechanism for the content; and, of any actions and process related to AIP creation.

  3. Preservation planning

    The authors recommended four simple actions a repository administrator may take to keep the archive current. The administrator must document current preservation strategies; monitor formats and other technologies for obsolescence; adjust the preservation plan if or when conditions change; and provide evidence that the preservation plan used is actually effective.

  4. Archival storage & preservation/maintenance of AIPs

    The actions in this section covered what is required to ensure that an AIP is actually being preserved. This involved examining multiple aspects of object maintenance, including, but not limited to, storage, tracking, checksums, migration, transformations, and copies/replicas. The repository administrator must be able to demonstrate the use of standard preservation strategies; that the repository actually implements these strategies; that the Content Information is preserved; that the integrity of the AIP is audited; and that there is an audit trail of any actions performed on an AIP.

  5. Information management

    This section addressed the requirements related to descriptive metadata. The repository owner must identify the minimal metadata required for retrieval by the Designated Community; create a minimal amount of descriptive metadata and attach it to the described object; and prove there is referential integrity between each AIP and its associated metadata, both when the metadata is created and as it is maintained.

  6. Access management

    The authors designed this section to address methods for providing access to the content (i.e., DIPs) in the repository to the Designated Community; they wrote that the degree of sophistication of this would vary based on the context of the repository itself and the requirements of the Designated Community. They further subdivided this section into four areas: access conditions and actions, access security, access functionality, and provenance. In order to fulfill the requirements presented in this section, a repository owner must: provide information to the Designated Community as to what access and delivery options are actually available; require an audit of all access actions; only provide access to particular Designated Community members as agreed to with the Producer; ensure access policies are documented and comply with deposit agreements; fully implement the stated access policy; log all access failures; demonstrate the DIP generated is what the user requested; prove that access success or failure is made known to the user within a reasonable length of time; and demonstrate that all DIPs generated can be traced to an authentic original and are themselves authentic (OCLC & CRL, 2007). A minimal sketch of these access-management requirements follows this list.
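
As referenced in item 6 above, the following minimal Python sketch illustrates the access-management ideas: a request is checked against a simple access policy, every success or failure is logged, and the DIP that is returned records the AIP from which it was derived so that it can be traced to an authentic original. The policy representation, logger, and field names are illustrative assumptions rather than TRAC requirements.

    import logging
    from datetime import datetime, timezone
    from typing import Optional

    logging.basicConfig(level=logging.INFO)
    access_log = logging.getLogger("repository.access")

    # Hypothetical access policy: which user groups may receive DIPs from which collections.
    ACCESS_POLICY = {"public-reports": {"public", "staff"},
                     "restricted-deposits": {"staff"}}

    def request_dip(user_group: str, collection: str, aip_id: str) -> Optional[dict]:
        """Return a minimal DIP record if the policy allows access; log every attempt."""
        timestamp = datetime.now(timezone.utc).isoformat()
        if user_group not in ACCESS_POLICY.get(collection, set()):
            access_log.warning("DENIED %s %s %s %s", timestamp, user_group, collection, aip_id)
            return None
        access_log.info("DELIVERED %s %s %s %s", timestamp, user_group, collection, aip_id)
        # The DIP carries a pointer back to its AIP so it can be traced to an authentic original.
        return {"dip_generated": timestamp, "derived_from_aip": aip_id, "collection": collection}

    # Hypothetical usage: a member of the public requesting restricted material is logged and refused.
    # request_dip("public", "restricted-deposits", "urn:uuid:1234")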

In summary, OCLC & CRL (2007) designed this section to make it mandatory for a trustworthy digital repository to be able to produce a DIP, “however primitive”.

III. Technologies, Technical Infrastructure, and Security

The authors of TRAC did not want to specify particular software and hardware requirements, as many of these fall under standard computer science best practices and are covered by other standards. Therefore, they addressed general information technology areas as they relate to digital preservation. These areas fall under one of three categories: system infrastructure, appropriate technologies, and security (OCLC & CRL, 2007).

  1. System infrastructure

    This section addressed the basic infrastructure required to ensure the trustworthiness of any actions performed on an AIP. This meant that the repository administrator must be able to demonstrate that the operating systems and other core software are maintained and updated; the software and hardware are adequate to provide backups; the number and location of all digital objects, including duplicates, are managed; all known copies are synched; audit mechanisms are in place to discover bit-level changes; any such bit-level changes are reported to management, including the steps taken to prevent further loss and to repair or replace the corrupted or lost content; processes are in place for hardware and software changes (e.g., migration); a change management process is in place to mitigate changes to critical processes; there is a process for testing the effect of critical changes prior to an actual implementation; and software security updates are implemented with an awareness of the risks versus benefits of doing so. (A minimal replica-audit sketch follows this list.)

  2. Appropriate technologies

    The authors recommended that a repository administrator should look to the Designated Community for relevant standards and strategies. They proposed that the hardware and software technologies in place should be appropriate for the Designated Community, and that monitoring should be in place so that hardware and software can be updated as needed.

  3. Security

    This section addressed non-IT security, as well as IT security. The authors recommended that a repository administrator conduct a regular risk assessment of internal and external threats; ensure controls are in place to address any assessed threats; decide which staff members are authorized to do what and when; and have an appropriate disaster preparedness plan in place, including off-site copies of the recovery plan (OCLC & CRL, 2007).
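
As referenced in item 1 above, the following minimal Python sketch shows one way a replica audit might work: each stored copy of an object is re-hashed, compared against the recorded checksum, and any mismatch is collected into a report for management so that repair from a verified copy can be initiated. The report structure and function names are illustrative assumptions, not TRAC requirements.

    import hashlib
    from pathlib import Path
    from typing import Iterable

    def sha256_of(path: Path) -> str:
        """Recompute the SHA-256 checksum of a stored copy."""
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            for chunk in iter(lambda: handle.read(65536), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def audit_replicas(object_id: str, recorded_value: str, replica_paths: Iterable[Path]) -> list:
        """Compare every stored copy of an object against its recorded checksum.

        Returns a list of discrepancy reports suitable for forwarding to management."""
        discrepancies = []
        for replica in replica_paths:
            current = sha256_of(replica)
            if current != recorded_value:
                discrepancies.append({"object": object_id,
                                      "replica": str(replica),
                                      "expected": recorded_value,
                                      "found": current,
                                      "action": "repair from a verified copy"})
        return discrepancies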

In conclusion, the archivists, librarians, computer scientists, and other experts who contributed to the development of TRAC created a document that encompassed the minimum requirements for an OAIS Archive to be considered “trustworthy”.

Audit and Certification of Trustworthy Digital Repositories Recommended Practice

The CCSDS released the “Audit and Certification of Trustworthy Digital Repositories Recommended Practice” (v. CCSDS 652.0-M-1, the “Magenta Book”) in September 2011 (CCSDS, 2011). This section will discuss the Recommended Practice only with regards to major differences with TRAC (OCLC & CRL, 2007), above. This is because the two documents are similar enough that to repeat a description of each of the sections would be gratuitous.

The CCSDS described the purpose of the Recommended Practice as that of providing the documentation “on which to base an audit and certification process for assessing the trustworthiness of digital repositories” (CCSDS, 2011). The essay “Managing Data: the Emergence & Development of Digital Curation & Digital Preservation Standards” contains an overview of this Recommended Practice. This section will cover areas not covered by the overview in that essay or earlier in this document.

The three major sections of the Recommended Practice are the same as for TRAC, except that the last section has been renamed. Instead of “organizational infrastructure”, “digital object management”, and “technologies, technical infrastructure, & security”, the authors of the Recommended Practice renamed the last section “infrastructure and security risk management”. Within that section, the sub-sections were reduced from three to two: instead of “system infrastructure”, “appropriate technologies”, and “security”, the Recommended Practice contains sub-sections on “technical infrastructure risk management” and “security risk management”. The sub-sections for “organizational infrastructure” and “digital object management” remained the same. The CCSDS re-worded, re-organized, and expanded the content of the sub-sections, but the general ideas behind each section stayed in place. For example, Figure 2, below, shows the Recommended Practice version of the same content shown for TRAC in Figure 1, above.


Figure 2 – Audit and Certification of Trustworthy Digital Repositories Recommended Practice, 3.1.1 (CCSDS, 2011).

In short, the members of the CCSDS evolved and expanded the original TRAC checklist into the Recommended Practice, but overall, the ideas in the original version held up well during the four-year transition.

Trusted Digital Repositories: Requirements for Certifiers

Both Waters & Garrett (1996) and RLG (2002) recommended the creation of a certification program for trusted digital repositories. As a result, librarians, archivists, computer scientists and other experts and stakeholders in digital preservation created the “Trustworthy repositories audit & certification: criteria and checklist” in order to create a common set of standards and terminology by which a repository may be certified. These experts and others then took TRAC, via the CCSDS, and created the “Audit and Certification of Trustworthy Digital Repositories (CCSDS 652.0-M-1) Recommended Practice”. As part of the process of creating this Recommended Practice, these experts also determined the requirements for bodies that will provide the audit and certification of “candidate” trustworthy digital repositories.

They created a second Recommended Practice, “Requirements for bodies providing audit and certification of candidate trustworthy digital repositories CCSDS 652.1-M-1”. This Recommended Practice for bodies providing audit and certification is a supplement to an existing ISO Standard that outlines the requirements for a body performing audit and certification, “Conformity assessment — Requirements for bodies providing audit and certification of management systems” (ISO/IEC 17021, 2011).

ISO/IEC 17021 Conformity Assessment

The authors of this standard covered seven primary areas: principles, general requirements, structural requirements, resource requirements, information requirements, process requirements, and management system requirements for certification bodies. They defined “principles” as covering impartiality, competence, responsibility, openness, confidentiality, and responsiveness to complaints. They described “general requirements” as covering legal and contractual matters, management of impartiality, and liability and financing. They kept “structural requirements” simple, covering the organizational structure and top management, along with a committee for safeguarding impartiality.

The authors detailed “resource requirements” as covering the competence of management and personnel, the personnel involved in the certification activities, the use of individual auditors and external technical experts, personnel records, and outsourcing. They outlined “information requirements” as publicly accessible information, certification documents, directory of certified clients, reference to certification and use of marks, confidentiality, and the information exchange between a certification body and its clients. The authors delineated “process requirements” as covering general requirements, audit and certification, surveillance activities, recertification, special audits, suspending, withdrawing or reducing the scope of certification, appeals, complaints, and, the records of applicants and clients.

Finally, the authors provided three options for “management system requirements for certification bodies”, which include general management system requirements and management system requirements in accordance with ISO 9001. In the document’s appendices, the authors discussed the knowledge and skills required of an auditor, the possible types of evaluation methods, desired personal behaviors, the requirements for a third-party audit and certification process, and considerations for the audit programme, scope, or plan; they also provided an example of a process flow for determining and maintaining competence (ISO/IEC 17021, 2011).

Requirements for Bodies Providing Audit and Certification of Candidate Trustworthy Digital Repositories Recommended Practice

This section of this essay will address the areas in which the Recommended Practice for bodies providing audit and certification differs from “ISO/IEC 17021 Conformity Assessment”.

The CCSDS created the Recommended Practice, “Requirements for bodies providing audit and certification of candidate trustworthy digital repositories”, as a supplement to “Conformity assessment — Requirements for bodies providing audit and certification of management systems” (ISO/IEC 17021, 2011). They created the document to provide additional information on which an organization that is assessing a digital repository for certification as trustworthy may base its operations for the issuance of such certification (CCSDS, 2011). In other words, the CCSDS (2011) created the document to support the accreditation of bodies providing certification. They created the document with a secondary purpose of providing repository owners with documentation by which they may understand the processes involved in achieving certification. They wrote the document using terminology from the OAIS Reference Model.

The authors defined a “Primary Trustworthy Digital Repository Authorisation Body” (PTAB) as an organization that accredits training courses for auditors, accredits other certification bodies, and provides audit and certification of candidate trustworthy digital repositories. The membership consists of “internationally recognized experts in digital preservation” (CCSDS, 2011). They defined the primary tasks of the organization as: accrediting other trustworthy digital repository certification bodies; certifying auditors; making certification decisions; accrediting auditor qualifications; undertaking audits; and, last, having a mechanism to add new experts to PTAB as needed. They noted that PTAB will also be accredited by ISO and will become a member of the International Accreditation Forum (IAF). The authors also designated two activities that are not considered conflicts of interest for those members who act as certifiers: lecturing, including in training courses, and identifying areas for improvement during the course of an audit (CCSDS, 2011).

The CCSDS outlined the criteria for the training of audit team members. This training must include: understanding digital preservation, including the technical aspects related to the audited activity; understanding of knowledge management systems; a general knowledge of the regulatory requirements related to trustworthy digital repositories; an understanding of the basic principles related to auditing, per ISO standards; an understanding of risk management and risk assessment with regards to digitally encoded information; and, finally, an understanding of the Recommended Practice, “Audit and Certification of Trustworthy Digital Repositories (CCSDS 652.0-M-1)”.

Furthermore, the authors specified that the audit team should have or find members with appropriate technical knowledge for the scope of the digital repository certification, the necessary comprehension of any applicable regulatory requirements for that repository, and knowledge of the repository owner’s organization, such that an appropriate audit may be conducted. The CCSDS wrote that the audit team might be supplemented with the necessary technical expertise, as needed. As well, the authors charged PTAB with assessing the conduct of auditors and experts and monitoring their performance, as well as selecting these experts and auditors based on appropriate experience, competence, training, and qualifications (CCSDS, 2011).

The CCSDS outlined the required levels of work experience for a trusted digital repository auditor. They required these auditors to have completed five days of training via PTAB or an accredited agency; to have some prior experience assessing trustworthiness, including participation in two certification audits for a total of 20 days; to have four years of workplace experience focused on digital preservation; to have current experience; to have remained current with digital preservation best practices and standards; and to have received certification from PTAB. The authors stipulated three additional requirements for audit team leaders. They must be able to communicate effectively in writing and orally; have previously served as an auditor for two completed trustworthy digital repository audits; and have the capability and knowledge to manage an audit certification process (CCSDS, 2011).

The authors outlined additional recommendations, including a requirement that the auditor must have access to the client organization’s records; if these records cannot be accessed, it may not be possible to perform the audit. The CCSDS defined the criteria against which an audit is performed as those defined in the Recommended Practice, “Audit and Certification of Trustworthy Digital Repositories (CCSDS 652.0-M-1)”. They required two auditors to be present on site; other auditors may work remotely. The authors note in an appendix on security that all auditors must maintain confidentiality with respect to an organization’s systems, content, structure, data, etc., as required (CCSDS, 2011).

In conclusion, the CCSDS has created a method for a larger umbrella organization — PTAB — to certify the certifiers of a trusted digital repository by creating a “Recommended Practice for bodies providing audit and certification” as a supplement to the existing ISO/IEC standard for “Conformity assessment — Requirements for bodies providing audit and certification of management systems”. By creating both a certification program and the criteria for certification of trustworthiness, these experts believe they have ensured the availability of digital information over the indefinite long-term.

Trusted Digital Repositories: Criticisms

Gladney (2005; 2004) has been a vocal critic of the repository-centric approach to digital preservation, which he considers “unworkable”. He has proposed, instead, the creation of durable digital objects that encode all required preservation information within the digital object itself. R. Moore has reservations about the “top-down” approach, in which standards are handed down from a body of experts to be used by practitioners. He would like to know what policies preservation data grid administrators are actually implementing at the machine level (Ward, 2011).

Similar to R. Moore’s concerns, Thibodeau (2007) supports the development of standards for digital preservation, but he believes these standards should be supplemented by empirical data regarding the purpose of each repository. For example, practitioners should not assess a repository based solely on whether or not the repository is OAIS-compliant. He writes that practitioners should consider the purpose of the repository, its mission, and its user base, and whether or not the repository owners are fulfilling those requirements. Thibodeau (2007) defined a five-point framework for repository evaluation that considers service, collaboration, “state”, orientation, and coverage. He believes that this broader context, along with the OAIS Reference Model and the Recommended Practice for the Audit and Certification of Trustworthy Digital Repositories, provides a more realistic determination of a repository’s “success” or “failure”.

Summary

Archivists, librarians, computer scientists and other stakeholders and experts in digital preservation wanted to create certification standards for trustworthy digital repositories, and they voiced this desire in a 1996 report, “Preserving Digital Information” (Waters & Garrett, 1996). As one part of this enthusiasm for standards, the CCSDS released the OAIS Reference Model (CCSDS, 2002). Experts recognized that a technical framework was only part of a preservation repository, and so they worked to define the attributes and responsibilities of a trusted digital repository (RLG, 2002). They created an audit and certification checklist based on these attributes and responsibilities, called TRAC (OCLC & CRL, 2007). After receiving feedback from the preservation community, the CCSDS evolved TRAC into the Recommended Practice for the Audit and Certification of Trustworthy Digital Repositories (2011), and released the Recommended Practice for Requirements for Bodies Providing Audit and Certification of Candidate Trustworthy Digital Repositories (2011).

Thus, after many years of work, stakeholders with an interest in the preservation of digital material now have criteria against which to judge whether or not a repository and its contents are likely to last for the indefinite long-term, as well as an umbrella organization that will provide certified and trained auditors. To reiterate these accomplishments, over the past fifteen years, preservation experts have defined “trust” and a “trustworthy” digital repository; defined the attributes and responsibilities of a trustworthy digital repository; defined the criteria and created a checklist for the audit and certification of a trustworthy digital repository; evolved this criteria into a standard; and defined a standard for bodies who wish to provide audit and certification to candidate trustworthy digital repositories.

The significance of these accomplishments cannot be overstated — at stake in the concerns over the preservation of digital objects and information are the cultural and scientific heritage, and personal information, of humanity.

References


Bearman, D. & Trant, J. (1998). Authenticity of digital resources: towards a statement of requirements in the research process. D-Lib Magazine. Retrieved April 14, 2009, from http://www.dlib.org/dlib/june98/06bearman.html

Becker, C., Kulovits, H., Guttenbrunner, M., Strodl, S., Rauber, A., & Hofman, H. (2009). Systematic planning for digital preservation: evaluating potential strategies and building preservation plans. International Journal of Digital Libraries, 10(4), 133-157.

CCSDS. (2011). Requirements for bodies providing audit and certification of candidate trustworthy digital repositories recommended practice (CCSDS 652.1-M-1). Magenta Book, November 2011. Washington, DC: National Aeronautics and Space Administration (NASA).

CCSDS. (2011). Audit and certification of trustworthy digital repositories recommended practice (CCSDS 652.0-M-1). Magenta Book, September 2011. Washington, DC: National Aeronautics and Space Administration (NASA).

CCSDS. (2002). Reference model for an Open Archival Information System (OAIS) (CCSDS 650.0-B-1). Washington, DC: National Aeronautics and Space Administration (NASA). Retrieved April 3, 2007, from http://nost.gsfc.nasa.gov/isoas/

Dale, R. (2007). Mapping of audit & certification criteria for CRL meeting (15-16 January 2007). Retrieved September 11, 2007, from http://wiki.digitalrepositoryauditandcertification.org/pub/Main/ReferenceInputDocuments/TRAC-Nestor-DCC-criteria_mapping.doc

Digital Curation Centre. (2011). DRAMBORA. Retrieved December 9, 2011, from http://www.dcc.ac.uk/resources/tools-and-applications/drambora

Dobratz, S., Schoger, A., & Strathmann, S. (2006). The nestor Catalogue of Criteria for Trusted Digital Repository Evaluation and Certification. Paper presented at the workshop on “digital curation & trusted repositories: seeking success”, held in conjunction with the ACM/IEEE Joint Conference on Digital Libraries, June 11-15, 2006, Chapel Hill, NC, USA. Retrieved December 1, 2011, from http://www.ils.unc.edu/tibbo/JCDL2006/Dobratz-JCDLWorkshop2006.pdf

Gladney, H.M. & Lorie, R.A. (2005). Trustworthy 100-Year digital objects: durable encoding for when it is too late to ask. ACM Transactions on Information Systems, 23(3), 229-324. Retrieved December 29, 2011, from http://eprints.erpanet.org/7/

Gladney, H.M. (2004). Trustworthy 100-Year digital objects: evidence after every witness is dead. ACM Transactions on Information Systems, 22(3), 406-436. Retrieved July 12, 2008, from http://doi.acm.org/10.1145/1010614.1010617

Hedstrom, M. (1995). Electronic archives: integrity and access in the network environment. American Archivist, 58(3), 312-324.

ISO/IEC 17021. (2011). Conformity assessment — Requirements for bodies providing audit and certification of management systems. Retrieved December 30, 2011, from http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=56676

Jøsang, A. & Knapskog, S.J. (1998). A metric for trusted systems. In Proceedings of the 21st National Information Systems Security Conference (NISSC), October 6-9, 1998, Crystal City, Virginia. Retrieved December 27, 2011, from http://csrc.nist.gov/nissc/1998/proceedings/paperA2.pdf

Lee, C. (2010). Open archival information system (OAIS) reference model. In Encyclopedia of Library and Information Sciences, Third Edition. London: Taylor & Francis.

Lynch, C. (2001). When documents deceive: trust and provenance as new factors for information retrieval in a tangled web. Journal of the American Society for Information Science and Technology, 52(1), 12-17.

Lynch, C. (2000). Authenticity and integrity in the digital environment: an exploratory analysis of the central role of trust. Authenticity in a digital environment. Washington, DC: Council on Library and Information Resources. Retrieved April 14, 2009, from http://www.clir.org/pubs/reports/pub92/pub92.pdf

Lynch, C. A. (1994). The integrity of digital information: mechanics and definitional issues. Journal of the American Society for Information Science, 45(10), 737-744.

OCLC & CRL. (2007). Trustworthy repositories audit & certification: criteria and checklist version 1.0. Dublin, OH & Chicago, IL: OCLC & CRL. Retrieved September 11, 2007, from http://www.crl.edu/sites/default/files/attachments/pages/trac_0.pdf

Research Libraries Group. (2005). An audit checklist for the certification of trusted digital repositories, draft for public comment. Mountain View, CA: Research Libraries Group. Retrieved April 14, 2009, from http://worldcat.org/arcviewer/1/OCC/2007/08/08/0000070511/viewer/file2416.pdf

Research Libraries Group. (2002). Trusted digital repositories: attributes and responsibilities an RLG-OCLC report. Mountain View, CA: Research Libraries Group. Retrieved September 11, 2007, from http://www.oclc.org/programs/ourwork/past/trustedrep/repositories.pdf

Ross, S. & McHugh, A. (2006). The role of evidence in establishing trust in repositories. D-Lib Magazine 12(7/8). Retrieved May 6, 2007, from http://www.dlib.org/dlib/july06/ross/07ross.html

Science and Technology Council. (2007). The digital dilemma strategic issues in archiving and accessing digital motion picture materials. The Science and Technology Council of the Academy of Motion Picture Arts and Sciences. Hollywood, CA: Academy of Motion Picture Arts and Sciences.

Thibodeau, K. (2007). If you build it, will it fly? Criteria for success in a digital repository. Journal of Digital Information, 8(2). Retrieved December 27, 2011, from http://journals.tdl.org/jodi/article/view/197/174

Trust. (2011). Merriam-Webster.com. Encyclopaedia Britannica Company. Retrieved December 30, 2011, from http://www.merriam-webster.com/dictionary/trust

Ward, J.H. (2011). Classifying Implemented Policies and Identifying Factors in Machine-Level Policy Sharing within the integrated Rule-Oriented Data System (iRODS). In Proceedings of the iRODS User Group Meeting 2011, February 17-18, 2011, Chapel Hill, NC.

Waters, D. & Garrett, J. (1996). Preserving Digital Information. Report of the Task Force on Archiving of Digital Information. Washington, DC: CLIR, May 1996.


OAIS Reference Model & Preservation Design Summary

OAIS Reference Model | Literature Review and Comprehensive Exams

Abstract

In 1995, the Consultative Committee for Space Data Systems (CCSDS) began to coordinate the development of standard terminology and concepts for the long-term archival storage of various types of data. Under the auspices of the CCSDS, experts and stakeholders from academia, government, and research contributed their knowledge to the development of what is now called the Open Archival Information System (OAIS) Reference Model. The conclusion from a variety of experienced repository managers is that the authors of the OAIS Reference Model created flexible concepts and common terminology that any repository administrator or manager may use and apply, regardless of content, size, or domain. This literature review summarizes the standard attributes of a preservation repository using the OAIS Reference Model, including criticisms of the current version.

Citation

Ward, J.H. (2012). Managing Data: Preservation Repository Design (the OAIS Reference Model). Unpublished manuscript, University of North Carolina at Chapel Hill. (pdf)

Creative Commons License
This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.


Table of Contents

Abstract

Introduction

The OAIS Reference Model: Definition

The OAIS Reference Model: Key Concepts

The OAIS Reference Model: Key Responsibilities

The OAIS Reference Model: Key Models

The OAIS Functional Model
The OAIS Information Model
I. The Logical Model for Archival Information
II. The Logical Model of Information in an Open Archival Information System
III. Data Management Information
Information Package Transformations

The OAIS Reference Model: Preservation Perspectives

Information Preservation
Access Service Preservation

The OAIS Reference Model: Archive Interoperability

The OAIS Reference Model: Compliance

The OAIS Reference Model: Example Deployments

The OAIS Reference Model: Other Criticisms

Conclusions and Future Work

References


Table of Figures

Figure 1 – Environment Model of an OAIS (CCSDS, 2002).

Figure 2 – Obtaining Information from Data (CCSDS, 2002).

Figure 3 – Information Package Concepts and Relationships (CCSDS, 2002).

Figure 4 – OAIS Archive External Data (CCSDS, 2002).

Figure 5 – OAIS Functional Entities (CCSDS, 2002).

Figure 6 – Composite of Functional Entities (CCSDS, 2002).

Figure 7 – High-Level Data Flows in an OAIS (CCSDS, 2002).


Introduction

Various organizations and the individuals who work for those organizations have a vested interest in keeping information accessible over time, although there may be reasons to delete or destroy some data and information once a certain amount of time has passed. The reasons for this interest are varied. Librarians and archivists have a professional expectation that they will do their best to curate and preserve cultural heritage data, scientific data, and other types of information for future generations of scholars and laymen. Some interest may be personal — most people would like to be able to view their children’s baby pictures, and their descendants may wish to know how their ancestors looked.

Regardless of the motivation for keeping this information available over time, most practitioners and laymen will agree that standards are one way to ensure this happens. Standards provide a common terminology that aid in discussions of repository infrastructure and needs (Beedham, et al., 2005; Lee, 2010). According to the members of the Science and Technology Council of the Academy of Motion Picture Arts and Sciences (2007), when preservationists and curators collaborate among and between industries and domains to create and use standards, the resulting economy of scale should reduce costs for all involved. For example, Galloway (2004) wrote that the proliferation of file formats increased costs, and that this problem must be solved in order to reduce preservation costs.

If costs are reduced, then the likelihood of a community having the resources to preserve and curate the material increases, or, by the same token, the amount of information that can be saved for the same price increases. This is true across the board, as standards beget other standards. If practitioners and researchers develop a standard terminology for a preservation repository, then common standards for metadata, file formats, filenames, metadata registries, and archiving and distribution are likely either to follow or to have preceded the preservation repository standard. In other words, standards development is an iterative process.

In 1995, the Consultative Committee for Space Data Systems (CCSDS) convened to coordinate “the development of archive standards for the long-term storage of archival data” (Beedham, et al., 2005). As part of this task, the members of the CCSDS determined that there was no common model or foundation from which to build an archive standard. Lavoie (2004) describes how the members realized they would have to create terminology and concepts for preservation; characterizations of the functions of a digital archiving system; and determine the attributes of the digital objects to be preserved. Thus, the members agreed to create a reference model that would describe the minimum requirements of an archival system, including terminology, concepts, and system components. The members of the CCSDS recognized from the beginning that the application of a common model extended beyond the space data system, and they involved practitioners and researchers from across a broad spectrum in academia, private industry, and government (Lavoie, 2004; Lee, 2010).

This essay summarizes the standard attributes of a preservation repository as defined by the CCSDS with the Open Archival Information System (OAIS) Reference Model, and addresses some of the weaknesses of the model.

The OAIS Reference Model: Definition

An Open Archival Information System (OAIS) is an electronic archive that is maintained by a group or association of people and/or organizations as a system. This member organization has accepted the responsibility of providing access to information for the stakeholders of the electronic archive. These stakeholders are referred to as the Designated Community. The owners and maintainers of the electronic archive have either implicitly or explicitly agreed to preserve the information in the electronic archive and make it available to the Designated Community for the indefinite long-term (CCSDS, 2002).

The CCSDS created the document for the OAIS Reference Model to outline the responsibilities of the owners and maintainers of the electronic archive. If they meet those responsibilities, then the electronic archive may be referred to as an “OAIS archive”. When the CCSDS members used the word “Open” as part of the name of the Reference Model, they referred to the fact that the standard was developed and continues to be developed in open forums. They are clear that the use of the word “open” does not mean that access to the OAIS system itself or its contents is unrestricted (CCSDS, 2002).

The OAIS Reference Model: Key Concepts

The members of the CCSDS created three OAIS concepts. They called these the “OAIS Environment”, the “OAIS Information”, and the “OAIS High-level External Interactions”.

The “OAIS Environment” consists of the “Producers”, “Consumers”, and “Management” in the environment that surrounds an OAIS archive. The “Producer” is a system or people who provide the information (data) that is ingested into the archive to be preserved. The “Consumer” is a system or people who use the archive to access the preserved information. “Management” is a role played by people who are not involved in the day-to-day functioning of the archive, but who set overall OAIS policy. Other OAIS or non-OAIS compliant archives may interact with the OAIS archive as either a “Producer” or a “Consumer” (CCSDS, 2002). The CCSDS represented these concepts in Figure 1, below.

Figure 1 – Environment Model of an OAIS (CCSDS, 2002).

The CCSDS defined the “OAIS Information” concept as consisting of the “information definition”, the “information package definition”, and the “information package variants”.

First, the CCSDS defined “information”. Information is “any type of knowledge that can be exchanged, and this information is always expressed (i.e., represented) by some kind of data” (CCSDS, 2002). A person or system’s Knowledge Base allows them to understand the received information (see Figure 2, below). Thus, the statement that “data interpreted using its Representation Information yields Information” means, in practice, that ASCII characters (the data), interpreted through knowledge of a language such as English or French (the Knowledge Base, or Representation Information), provide Information to the person. Therefore, in order for Information to be represented with any meaning to a Designated Community, the appropriate Representation Information for a Data Object must also be preserved.

Figure 2 – Obtaining Information from Data (CCSDS, 2002).
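
To make this relationship concrete, consider a minimal Python sketch (the byte string and the encoding are illustrative choices, not examples drawn from the Reference Model) in which the same Data Object yields Information only when paired with appropriate Representation Information:

    # A Data Object is just bits; Representation Information tells us how to
    # interpret those bits (here, a character encoding plus a natural language).
    data_object = b"Bonjour tout le monde"          # the raw data (bits/bytes)
    representation_information = {
        "structure": "ASCII-encoded character string",   # Structure Information
        "semantics": "French prose",                      # Semantic Information
    }

    # Interpreting the data using its Representation Information yields Information.
    information = data_object.decode("ascii")
    print(information)  # prints "Bonjour tout le monde"; meaningful only given the context above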

Second, whether data is disseminated to a Designated Community member, or ingested via a Producer, the information must be packaged. The CCSDS described an Information Package as consisting of the Packaging Information, the Content Information (the information to be preserved and its representation information), and the Preservation Description Information (provenance, context, reference, and fixity). Provenance describes the source of the information; context provides any related information about the object; reference is the unique identifier or set of identifiers for the content; and fixity assures that the content has not been altered, either intentionally or unintentionally. The Packaging Information binds the Content Information and Preservation Description Information, per Figure 3, below.

Figure 3 – Information Package Concepts and Relationships (CCSDS, 2002).
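
A minimal sketch of these relationships, using hypothetical Python dataclasses whose field names are illustrative rather than prescribed by the Reference Model, might look like this:

    from dataclasses import dataclass, field

    @dataclass
    class ContentInformation:
        """The information to be preserved plus its Representation Information."""
        data_object: bytes
        representation_information: str

    @dataclass
    class PreservationDescriptionInformation:
        """Provenance, context, reference, and fixity for the Content Information."""
        provenance: str       # source and history of the information
        context: str          # how the object relates to other objects
        reference: str        # unique identifier(s) for the content
        fixity: str           # e.g., a checksum showing the content is unaltered

    @dataclass
    class InformationPackage:
        """Packaging Information binds the Content Information and the PDI."""
        content: ContentInformation
        pdi: PreservationDescriptionInformation
        packaging_information: dict = field(default_factory=dict)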

Third, the CCSDS defined three variants of the Information Package: the Submission Information Package (SIP), the Archival Information Package (AIP), and the Dissemination Information Package (DIP). These three versions may be the same, but they may also be different. For example, a Producer may submit a SIP to an OAIS archive that is then augmented by the archive managers to meet their policies and standards. Once ingested, the AIP the repository owner stores may or may not be the same as the DIP accessed by the Consumer. Beedham, et al. (2005) criticize the developers of the OAIS Reference Model for assuming that all OAIS archives will have three different versions of an Information Package. The authors note that this concept is not practical for data archives, for example, because all relevant information about a data set must be gathered at the time of submission, and it is impractical to store different versions of an information object within an archive. Thus, a consumer may receive a DIP that is an exact copy of the AIP and the original SIP.

Finally, the CCSDS documented the concepts of the “OAIS High-level External Interactions”, in Figure 4, below. In short, they described the external data flows between and among the actors in an “OAIS Environment”: management, producer, and consumer. The CCSDS provided example interactions for Management, such as: funding, reviews, pricing policies, and “conflict resolution involving Producers, Consumers, and OAIS internal administration” (CCSDS, 2002).

Figure 4 – OAIS Archive External Data (CCSDS, 2002).

The members of the CCSDS described “Producer Interaction” as involving the initial contact, the establishment of a Submission Agreement (which lays out what is to be submitted, how, and other expectations of the two parties), and the Data Submission Session(s) (in which the SIPs are submitted to the OAIS). The authors of the Reference Model conceded that there might be many types of Consumer Interactions with the OAIS managers. They described a variety of interactions, which include catalog searches, orders, help, etc. Beedham, et al. (2005) again criticized the CCSDS for assuming that all OAIS archives will provide order functions to their Designated Communities. The authors point out that some repository owners’ policies require that data be available for free, particularly when the owner of the archive is a national government agency and the Designated Community consists of taxpayers.

The OAIS Reference Model: Key Responsibilities

The CCSDS established the minimal responsibilities required for a repository to be considered an OAIS archive. The OAIS must:

  • Negotiate for and accept appropriate information from information Producers.
  • Obtain sufficient control of the information provided to the level needed to ensure Long-Term Preservation.
  • Determine, either by itself or in conjunction with other parties, which communities should become the Designated Community and, therefore, should be able to understand the information provided.
  • Ensure that the information to be preserved is Independently Understandable to the Designated Community. In other words, the community should be able to understand the information without needing the assistance of the experts who produced the information.
  • Follow documented policies and procedures which ensure that the information is preserved against all reasonable contingencies, and which enable the information to be disseminated as authenticated copies of the original, or as traceable to the original.
  • Make the preserved information available to the Designated Community (CCSDS, 2002).

Beedham, et al. (2005) wrote that the authors of the OAIS created an “inbuilt limitation” because they assume “both an identifiable and relatively homogeneous consumer (user) community”. They note that this is not the case for national archives and libraries; their Consumers hold a wide variety of skills, educational levels, and knowledge.

The OAIS Reference Model: Key Models

The members of the CCSDS described the OAIS through three models: the “Functional Model”, the “Information Model”, and “Information Package Transformations”. The authors of the Reference Model included this section to provide a common set of preservation system terminology, and to provide a model from which future systems designers may work.

The OAIS Functional Model

The functional model of the OAIS consists of “six functional entities and related interfaces” (CCSDS, 2002). The six functional entities are Ingest, Archival Storage, Data Management, Administration, Preservation Planning, and Access. The seventh entity, “Common Services”, is described in the document, but it is not included in the image of the OAIS Functional Entities (see Figure 5, below) because “it is so pervasive”.

Figure 5 – OAIS Functional Entities (CCSDS, 2002).

1. INGEST: Functions of the Ingest entity include accepting SIPs from internal or external Producers and then preparing the SIP(s) for management and storage within the repository. As part of preparing the SIP for storage within the repository, the repository employee in charge of ingest will check the quality of the SIP(s), create an AIP that complies with the standards of the repository and with the Submission Agreement, extract any Descriptive Information, and sync updates between Ingest and Archival Storage/Data Management.

Practitioners such as Beedham, et al. (2005) criticized the lack of detail available for the Ingest process; the authors of the Reference Model made it appear to be a very simple function, when, in fact, it can be a very complex process. As a result of this criticism, the CCSDS wrote a more detailed description of the Ingest Process in the Producer-Archive Interface Methodology Abstract Standard (CCSDS, 2004). However, many practitioners are clear that “pre-ingest functions are…essential for efficient and effective archiving” and the authors of the OAIS would serve the preservation repository community better by expanding the Ingest section of the OAIS Reference Model documentation, rather than creating a separate model and documentation (Beedham, et al., 2005).

Partially due to the lack of detail about Ingest generally, and the Ingest of records in particular, archivists and records managers at Tufts University and Yale University applied the OAIS Reference Model and developed an Ingest Guide to aid practitioners in preserving university records (Fedora and the Preservation of University Records Project, 2006). (This project was discussed in a previous literature review on digital curation and preservation.)
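
A highly simplified sketch of the Ingest tasks described above, in Python, might look like the following; the helper names and the use of SHA-256 for the quality check are assumptions made for illustration, not requirements of the Reference Model:

    import hashlib
    import uuid

    def ingest_sip(sip_payload: bytes, descriptive_metadata: dict) -> dict:
        """Accept a SIP, perform a basic quality check, and build a candidate AIP.

        This sketches the Ingest entity's tasks: quality assurance, AIP creation,
        extraction of Descriptive Information, and hand-off to Archival Storage
        and Data Management.
        """
        # Quality assurance: record a fixity value for the submitted content.
        checksum = hashlib.sha256(sip_payload).hexdigest()

        # Build an Archival Information Package that complies with the
        # (hypothetical) repository standards and the Submission Agreement.
        aip = {
            "aip_id": str(uuid.uuid4()),         # reference (unique identifier)
            "content": sip_payload,              # Content Information
            "fixity": {"sha256": checksum},      # part of the PDI
            "descriptive_information": descriptive_metadata,
        }
        return aip

    # Example: ingesting a toy submission.
    candidate_aip = ingest_sip(b"example dataset", {"title": "Example data set"})
    print(candidate_aip["aip_id"], candidate_aip["fixity"]["sha256"][:12])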

2. ARCHIVAL STORAGE: Functions of archival storage include maintaining the integrity of the digital files, including the bits. Thus, the functions of this entity include not only receiving AIPs from Ingest and providing them to Access, but also refreshing and migrating the media and file formats on and in which the data is stored. Other tasks of this entity include error checking and disaster recovery.
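
The error-checking task of Archival Storage can be illustrated with a short fixity-verification sketch; again, SHA-256 and the dictionary layout are assumptions of this illustration rather than part of the standard:

    import hashlib

    def verify_fixity(stored_bytes: bytes, recorded_checksum: str) -> bool:
        """Recompute the checksum of stored content and compare it with the
        fixity value recorded in the AIP's Preservation Description Information."""
        return hashlib.sha256(stored_bytes).hexdigest() == recorded_checksum

    # If verification fails, the Archival Storage entity would fall back on
    # error-recovery or disaster-recovery procedures (e.g., restore a replica).
    ok = verify_fixity(b"example dataset",
                       hashlib.sha256(b"example dataset").hexdigest())
    print("fixity intact:", ok)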

3. DATA MANAGEMENT: The data management entity provides the functions and services for accessing, maintaining, and populating administrative data and Descriptive Information. These include generating reports from result sets which are based on queries on the data management data; updating the database; and maintaining and administering archive database functions, such as referential integrity and view/schema definitions.

Beedham, et al. (2005) concluded that this entity is a simple idea that is messy in practice. When they mapped the different data management entities, the result was an “explosion” of connections to all the different archival systems and processes.

4. ADMINISTRATION: The functions of this entity involve the overall management of the archive. This includes setting policies and standards; supporting and aiding the Designated Community; migrating and refreshing the archive contents, software, and hardware; soliciting, negotiating, and auditing Submission Agreements with both internal and external producers; and any other administrative duties as required.

These functions are designed for large organizations with automated processes; the authors of the Reference Model did not design this entity for small-scale digital repositories (Beedham, et al., 2005). However, most of these functions are an organic part of many archives’ functioning, even if the roles are all performed by one or two people. Beedham, et al. (2005) wrote that the functions of this entity are sufficient for most archives, but the listed tasks do not stand on their own, as each archive has its own set of responsibilities, requirements, procedures, and policies.

5. PRESERVATION PLANNING: The preservation planning entity is related to the Administration entity, but it focuses purely on the preservation aspects of maintaining the archive for the indefinite long-term and ensuring the content is available to the Designated Community. The functions of the entity primarily involve monitoring the internal and external environments of the archive to ensure that hardware and software are up to date, that the archive follows best practices with regards to the preservation of digital content, and that plans are in place to enable Administration goals, such as migration.

Repository managers criticized this entity because “real” archives do not operate as cleanly as the OAIS Reference Model authors envision; not all decisions and processes can or should be made proactively. Beedham, et al. (2005) concluded that the OAIS is at times overly bureaucratic and formalized.

6. ACCESS: This function provides the Designated Community with a method to obtain the desired information from the archive, assuming such access is not restricted and that the user in question is, in fact, allowed to access this particular information from this particular archive. The services and functions provided by the Access entity allow the Designated Community to determine the existence, location, availability, and description of the stored information. This function provides the information to the Designated Community as a DIP.

7. COMMON SERVICES: The “common services” functional entity refers to supporting services common in a distributed computing environment. These services involve operating systems, network services, and security services. Operating system services include the core services required to administer and operate an application platform, and provide an interface. These include: system management, operating system security services, real-time extension, commands and utilities, and kernel operations. Network services provide the means for the archive to operate in a distributed network environment, including: remote procedure calls, network security services, interoperability with other systems, file access, and data communication. Security services protect the content in the archive from external and internal threats by providing the following capabilities and mechanisms: non-repudiation services (i.e., the sender and receiver log copies of the transmission and receipt of the information), data confidentiality and integrity services, access control services, and authentication (CCSDS, 2002).

A detailed mapping of the Ingest, Archival Storage, Data Management, Administration, Preservation Planning, and Access functional entities is included in Figure 6, below.

Figure 6 – Composite of Functional Entities (CCSDS, 2002).

Again, Common Services is not included because it is a supporting service of distributed computing (CCSDS, 2002).

The OAIS Information Model

The Information Model “defines the specific Information Objects that are used within the OAIS to preserve and access the information entrusted to the archive” (CCSDS, 2002). The CCSDS intended for this section to be conceptual, and it is written for an Information Architect to use when designing an OAIS-compliant system. The authors divided the Information Model into three sections: the logical model for archival information, the logical model of information in an open archival information system (OAIS), and data management information.

I. The Logical Model for Archival Information

The CCSDS defined information as a combination of data and representation information. The Information Object itself is either a physical or digital Data Object with Representation Information that “allows for the full interpretation of data into meaningful information” (CCSDS, 2002). The Representation Information provides a method for the data to be mapped to data types such as pixels, arrays, tables, numbers, and characters. This mapping is referred to as the Structure Information, and the Semantic Information, in turn, supplements it. Semantic Information examples include the language expressed in the Structure Information, which kinds of operations may be performed on each data type, their interrelationships, etc. Representation Information may also reference other Representation Information; for example, “Representation Information expressed in ASCII needs the additional Representation Information for ASCII, which might be a physical document giving the ASCII Standard” (CCSDS, 2002).

Representation Rendering Software and Access software are two special types of Representation Information. The latter provides a method for some or all of the content of an Information Object to be in a form understandable to systems or a human. The former displays the Representation Information in an understandable form, such as a file and directory structure (CCSDS, 2002).

The CCSDS defined four types of Information Objects: Content, Preservation Description, Packaging, and Descriptive. The Content Information Object is “the set of information that is the original target of preservation by the OAIS” and it may be either a physical or digital object (CCSDS, 2002). In order to determine clearly what must be preserved, an administrator of an archive must determine which part of a Content Information Object is the Content Data Object and which part is the Representation Information.

The CCSDS defined Preservation Description Information as “information that will allow the understanding of the Content Information over an indefinite period of time” (CCSDS, 2002). This descriptive information focuses on ensuring the authenticity and provenance of the Information Objects. The authors of the Reference Model described four parts to the Preservation Description Information: reference (unique identifier(s)), context (why it was created and how it relates to other Information Objects), provenance (the history, origin, and source), and fixity (data integrity checks or validation/verification keys).

As stated previously, the Packaging Information logically binds the pieces of the package onto a specific media via an identifiable entity. Finally, Descriptive Information provides a method for the Designated Community to locate, analyze, retrieve, or order the desired information via some type of Access Aid, which is generally an application interface or document (CCSDS, 2002).

II. The Logical Model of Information in an Open Archival Information System

The authors of the Reference Model described three types of Information Packages that are based on the four types of Information Objects. That is, the Content, Preservation Description, Packaging, and Descriptive Information Objects may be used to create one of three types of Information Packages: the Submission Information Package (SIP), the Archival Information Package (AIP), and the Dissemination Information Package (DIP). The SIP is the data that is sent to an archive by an internal or external Producer. The form and content of the SIP(s) may or may not meet the requirements of the archive ingesting it, and the archive manager may require some additional information to be added prior to ingest, such as a unique ID, checksum validation, virus checks, file name standardization, or additional Representation Information (metadata).

The CCSDS defined the AIP as the Information Package that is stored for the indefinite long-term. The requirements for the Representation Information for an AIP are more stringent than for other types of Information Packages, because this is the actual information that is the focus of preservation. The Information Objects and the Representation Information that comprise an AIP are stored in an archive as one logical unit (Lavoie, 2004).

The authors described two subsets of the AIP, the Archival Information Unit (AIU) and the Archive Information Collection (AIC). The former “represents the type used for the preservation function of a single content atomic object”, while the latter “organizes a set of AIPs (AIUs and other AICs) along a thematic hierarchy….” (CCSDS, 2002). The CCSDS described the Collection Description as a subtype…”that has added structures to better handle the complex content information of an AIC” (CCSDS, 2002). The archive manager may use Collection Description to describe the entire collection or zero or more individual units within the collection. One benefit of Collection Description is the ability to generate new virtual collections based, for example, either on access or theme.
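
A minimal sketch of this hierarchy, using hypothetical Python dataclasses (the field names and the theme-based selection are illustrative assumptions, not part of the Reference Model), might look like the following:

    from dataclasses import dataclass, field
    from typing import List, Union

    @dataclass
    class AIU:
        """Archival Information Unit: preserves a single atomic content object."""
        aip_id: str
        theme: str

    @dataclass
    class AIC:
        """Archive Information Collection: organizes AIUs and other AICs along a
        thematic hierarchy, described by a Collection Description."""
        collection_description: str
        members: List[Union["AIU", "AIC"]] = field(default_factory=list)

    # A virtual collection can be generated by selecting members by theme.
    units = [AIU("aip-1", "maps"), AIU("aip-2", "census"), AIU("aip-3", "maps")]
    maps_collection = AIC("All map holdings",
                          [u for u in units if u.theme == "maps"])
    print([m.aip_id for m in maps_collection.members])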

The Dissemination Information Package (DIP) is the Information Package ordered by or provided to the Designated Community. The CCSDS intended for the DIP to be a version of the AIP, but it is entirely possible for the AIP and the DIP to be exactly the same Information Package. Lavoie (2004) described possible variations between an AIP and a DIP. The Designated Community member accessing the archive may receive a different format, for example, a .jpeg instead of a .tiff. The DIP may contain less metadata than is available with the AIP, or even less content, since a DIP may correspond to one or more or even part of an AIP.
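
Lavoie’s (2004) point that a DIP may carry a different format and less metadata than its AIP can be sketched roughly as below; the function and field names are hypothetical, and the format conversion is only a placeholder rather than a real transcoder:

    def derive_dip(aip: dict, public_fields=("title", "creator")) -> dict:
        """Build a Dissemination Information Package from an AIP.

        The DIP may carry a different (access-friendly) format and only a
        subset of the AIP's metadata, per the OAIS Information Model.
        """
        # Placeholder for a real format migration, e.g., TIFF -> JPEG.
        access_copy = aip["content"]

        dip_metadata = {k: v for k, v in aip["descriptive_information"].items()
                        if k in public_fields}
        return {"content": access_copy, "descriptive_information": dip_metadata}

    aip = {"content": b"...image bytes...",
           "descriptive_information": {"title": "Survey photo",
                                       "creator": "J. Doe",
                                       "internal_note": "not for dissemination"}}
    print(derive_dip(aip))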

III. Data Management Information

Last, the CCSDS included Data Management Information as one part of the Logical Model of Information in an OAIS. That is, the authors of the Reference Model made the requirement that information needed for the operation of the archive is to be stored in the archive databases as persistent data classes. The type of information required includes: statistical information, such as access numbers; customer profile information; accounting information; preservation process history; event based order information; policy information, including pricing; security information; and, transaction tracking information (CCSDS, 2002). Other data management information may be added to the archive at the discretion of the archive managers or as requested by the Designated Community. However, Beedham, et al. (2005) concluded that the information categories in the Information Model are “too broad, functionally organised…and do not reflect the way metadata are packaged and used across particular archival practice”.
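
A minimal illustration of holding some of this data management information as persistent data, using SQLite from the Python standard library, appears below; the table and column names are invented for this sketch and are not prescribed by the Reference Model:

    import sqlite3

    conn = sqlite3.connect(":memory:")  # a real archive would use a persistent database
    conn.executescript("""
        CREATE TABLE access_statistics (
            aip_id TEXT,
            accessed_at TEXT,
            consumer_id TEXT
        );
        CREATE TABLE preservation_history (
            aip_id TEXT,
            event TEXT,          -- e.g., 'ingest', 'migration', 'fixity check'
            event_at TEXT
        );
    """)
    conn.execute("INSERT INTO preservation_history VALUES (?, ?, ?)",
                 ("aip-0001", "ingest", "2011-12-01"))

    # A report generated from a query on data management data:
    for row in conn.execute("SELECT event, COUNT(*) FROM preservation_history GROUP BY event"):
        print(row)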

Information Package Transformations

The CCSDS members created the Functional Model to describe the architecture of an OAIS, and the Information Model to describe the content held by the OAIS. The authors also described the lifecycle of the Information Package and any associated objects, as well as its logical and physical transformations.

In short, when a Producer agrees to submit data to an OAIS, a Submission Agreement is created and approved with the OAIS administrator. The Producer then submits data in the form of a SIP to an OAIS, where the OAIS administrator stores it in a staging area. In the staging area, the OAIS manager will perform any necessary transformations to the SIP so it will meet the standards of the OAIS, and the criteria of the Submission Agreement. The OAIS manager will create AIPs from the SIP. This mapping may not be one-to-one. One SIP may produce one AIP or many AIPs, many SIPs may produce one AIP, many SIPs may produce many AIPs, and one SIP may produce no AIPs (CCSDS, 2002). The CCSDS described this process in more detail in the Producer-Archive Interface Methodology Abstract Standard (CCSDS, 2004).

At the same time as the SIPs are transformed into AIPs and stored in the OAIS, the Data Management functional entity augments the existing Collection Descriptions to include the contents of the Package Descriptions. When a Consumer, i.e., a member of the Designated Community, wishes to access the information contained in an OAIS, the member will do so via the Access functional area. Once the consumer has located the desired information via some type of finding aid, the information is provided to the Consumer in the form of a DIP. The authors of the Reference Model designed the DIP and AIP mapping to be similar to that between SIPs and AIPs. That is, the mapping may or may not be 1:1, depending on whether or not a transformation is performed.

Figure 7 – High-Level Data Flows in an OAIS (CCSDS, 2002).

Based on the Information Package Transformation in Figure 7, above, the authors of the OAIS Reference Model assumed that a DIP would be created from an AIP on demand for the Consumer. Beedham, et al. (2005) wrote, “this approach has serious drawbacks”. These data repository managers determined that by creating the DIP at the time of Ingest, they could ensure that the records accessed by the Consumer are in a “technically usable state” (Beedham, et al., 2005). They initially created DIPs from an AIP upon demand by a Consumer, but oftentimes the data is 5-10 years old at the time of ingest into the archive, and the data is often years older than that when accessed by a Consumer. This often meant that the DIP was not independently understandable by the Consumer, and the researchers who created the data either were no longer available, or could not answer queries regarding the data because too much time had passed.

Beedham, et al., (2005) discovered that by creating the DIP at the time of Ingest, they were able to eliminate many errors in the digital records while they still had co-operation from the Producer. This also improves the understanding and “preservability” of the AIP itself. As well, standard archival practice is to store the original version, and provide only a copy to users. In that sense, storing the AIP and creating a DIP at Ingest that is an exact replica of the AIP follows this practice, although “copy” does not have the same meaning in the digital world as it does in the physical. The OAIS Reference Model does not preclude this practice, but neither does it explicitly condone it.

The OAIS Reference Model: Preservation Perspectives

The members of the CCSDS used the Functional Model and the Information Model just described and applied them to information preservation and access service preservation. The former refers to the migration of digital information and the latter to the preservation of the services used to access the digital information.

Information Preservation

The CCSDS defined migration as “the transfer of digital information, while intending to preserve it, within the OAIS” (CCSDS, 2002). The authors distinguished migration from transfers based on three characteristics: the focus is on the preservation of the full information content; the new archival implementation is a replacement for the old; the responsibility for and full control of the transfer reside within the OAIS. The CCSDS (2002) members described three primary drivers for migration: the media on which the information resides is decaying; technology changes; and, the improved cost-effectiveness of newer technology over older or obsolete technology.

The committee members defined four types of migration: refreshment, replication, repackaging, and transformation. They determined that Refreshment refers to the replacement of a media instance with a similar piece of media, such that the bits comprising the AIP are simply copied over. An example of this would be replacing a computer disk. The authors defined Replication as a bit transfer to the same or new media type, where there is “no change to the PDI, the Content Information, and the Packaging Information”. An example of replication would be a full backup of the contents of an OAIS. The CCSDS described Repackaging as a change to the Packaging Information during transfer. If files from a CD-ROM are moved to new files on another media type, with a new file implementation and directory, then the files have been Repackaged.

Last, the CCSDS (2002) defined Transformation as “some change in the Content Information or PDI bits while attempting to preserve the full information content”. If an AIP undergoes Transformation, then the new AIP is considered a new Version of the previous AIP. For example, a file in the .doc format may be transformed to a .pdf for preservation purposes. Some transformations are Reversible, while others are Non-reversible. The CCSDS members state that only when an AIP is migrated using Transformation is the resulting AIP considered a new version; the AIP version is independent of Refreshment, Replication, and Repackaging.
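
The distinction between Transformation and the other three migration types can be sketched as follows; the version and provenance bookkeeping shown here is an assumption about how an implementer might record a new AIP Version, not something the Reference Model specifies:

    def transform_aip(aip: dict, convert, target_format: str) -> dict:
        """Apply a Transformation: the Content Information bits change, so the
        result is a new Version of the AIP, with provenance noting the change.

        Refreshment, Replication, and Repackaging, by contrast, copy or re-house
        the same bits and therefore do not create a new AIP version.
        """
        new_aip = dict(aip)
        new_aip["content"] = convert(aip["content"])        # e.g., .doc -> .pdf
        new_aip["version"] = aip.get("version", 1) + 1
        new_aip["provenance"] = aip.get("provenance", []) + [
            f"transformed to {target_format}"]
        return new_aip

    original = {"content": b"doc bytes", "version": 1, "provenance": ["ingested"]}
    migrated = transform_aip(original, convert=lambda b: b"pdf bytes", target_format="PDF")
    print(migrated["version"], migrated["provenance"])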

Access Service Preservation

As part of examining preservation perspectives, the members of the CCSDS briefly addressed how to continue to provide Consumers access services as technology changes. A method archive managers use to maintain access is to develop Application Programming Interfaces (APIs) to provide access to AIPs. Another method they incorporate is to use emulation or provide the original source code to provide access to a set of AIUs while maintaining the same “look and feel” as the original access method.

The OAIS Reference Model: Archive Interoperability

A community of users and managers of digital repositories may wish to share data or cooperate with other archives. The reasons for this may vary; in some cases, the repository managers may wish to provide mutual back up and replication services with a similar archive, in order to prevent data loss and reduce costs. In another instance, a user community may prefer one point of entry to search for required information across multiple digital archives. Regardless of the motivations of an archive owner for interoperating with another archive, the interactions may be defined by two categories, technical and managerial.

The CCSDS defined four types of interoperating archives: independent, cooperating, federated, and shared resources. They described an independent archive as one that does not interact with other archives. There is no technical or management interaction between this type of archive and other archives. The authors defined cooperating archives as those archives that do not have a common finding aid, but otherwise share common dissemination standards, submission standards, and producers.

The members of CCSDS (2002) wrote that a federated archive consists of two communities, Local and Global, and those archives “provide access to their holdings via one or more common finding aids”. They note that Global dissemination and Ingest are optional, and that the needs of the Local community tend to take precedence over the Global community. Furthermore, they described three levels of functionality for a Federated archive: Central Site (i.e., one point of entry to all archive content via metadata harvested by the central site), Distributed Finding Aid (i.e., federated searching of all archives), and Distributed Access Aid (i.e., a “standard ordering and dissemination mechanism”) (CCSDS, 2002). They wrote that federated archives tend to have similar policy and technology issues, such as authentication and access management, preservation of federation access to AIPs, duplicate AIPs, and providing unique AIPs.
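
The Distributed Finding Aid level of functionality, in which a single query fans out across the federation’s member archives, might be sketched as below; representing each archive’s finding aid as a simple search function is an assumption of this illustration, not part of the standard:

    def federated_search(query: str, archives: dict) -> list:
        """Send one query to each member archive's finding aid and merge the
        results, tagging each hit with the archive it came from."""
        results = []
        for archive_name, search in archives.items():
            for hit in search(query):
                results.append({"archive": archive_name, "hit": hit})
        return results

    # Two toy member archives, each exposing a local finding-aid function.
    archives = {
        "archive_a": lambda q: [f"{q} record 1"],
        "archive_b": lambda q: [f"{q} record 2", f"{q} record 3"],
    }
    print(federated_search("census", archives))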

Last, the authors described “shared resources”, where archives enter into agreements to share resources for their mutual benefit, often to reduce costs. They wrote that this type of agreement does not alter the view of the archives by their respective Designated Communities; it merely requires the implementation of a variety of standards internal to the archive, such as ingest-storage and access-storage interface standards (CCSDS, 2002).

The CCSDS described the primary management issue related to archive interoperability in one word: autonomy. The members of the CCSDS (2002) characterized three primary autonomy levels: no association because there are no interactions; an association member’s autonomy with regards to the federation is maintained; and association members are bound to the federation by a contract.

The OAIS Reference Model: Compliance

What does it mean to be “OAIS Compliant”? The members of the CCSDS stated that if a repository “supports the OAIS information model”, commits to “fulfilling the responsibilities listed in chapter 3.1 of the reference model”, and uses the OAIS terminology and concepts appropriately, then the archive is compliant (CCSDS, 2002; Beedham, et al., 2005). When the members of the CCSDS wrote the Reference Model documentation, they did not recommend any particular concrete implementation of hardware, software, etc., as the authors deliberately designed it to be a conceptual framework. How, then, may an archive owner, manager, or member of a Designated Community “prove” that the archive of interest is, in fact, OAIS-compliant?

One method to audit OAIS-compliance is to create a set of standards that define the attributes of a trusted digital repository. The Research Libraries Group (RLG) and the Online Computer Library Center (OCLC) funded the development of the attributes of a “trusted digital repository” in March 2000. The two groups produced a report that defined the attributes and responsibilities of a trusted digital repository in 2002 (Research Libraries Group, 2002). Beedham, et al. (2005) note that the authors of the report put compliance with the OAIS Reference Model first on the list of attributes of a trustworthy repository.

Based on this report, RLG, OCLC, the Center for Research Libraries (CRL), and the National Archives and Records Administration (NARA) produced a “criteria and checklist” in 2005 called “Trustworthy Repositories Audit & Certification: Criteria and Checklist” (Research Libraries Group, 2005). The authors designed it so that archive managers could use it for audit and certification of the archive. Experts in the field merged the RLG and OCLC report from 2002 and the “Criteria and Checklist” from 2005 to develop a Recommended Practice under the auspices of the CCSDS. They called the document the “Audit and Certification of Trustworthy Digital Repositories Recommended Practice”, which the CCSDS released in September 2011 to provide a basis for the audit and certification of the trustworthiness of a digital repository, with detailed criteria by which an archive shall be audited (CCSDS, 2011). These documents will be discussed in detail in a separate literature review.

One criticism of the OAIS is that it is challenging to develop a from-scratch repository using the Reference Model. Egger (2006) conducted a use case analysis as part of a standard software development process, and determined that he must “develop additional specifications which fill the gap between the OAIS model and software development”. He wrote that it was difficult to map OAIS functions as use case scenarios, because the descriptions contain different levels of detail. For example, he states that some functions are written as general guidelines, while others are “specified nearly at the implementation level” (Egger, 2006). He also criticizes the authors for mixing technical functionality with management functionality, because in order to develop a technical system, the management functions must be removed. Egger (2006) recommends creating additional specifications that would “define system architectures and designs that conform to the OAIS model”, although he notes that the OAIS Reference Model is not a technical guideline.

Beedham, et al. (2005) wrote that as repository managers, they have to consider other legislation, standards, guidelines, and regulations when determining the archive’s OAIS compliance. For example, they must provide web access to the disabled as part of their charter as national archives, and they have specific responsibilities to the data depositor (the Producer) with regards to Intellectual Property and statistical disclosure. The authors of the Reference Model did not discuss how to comply with legislation, et al., when to do so would make the archive in question “not OAIS-compliant”, if audited.

The OAIS Reference Model: Example Deployments

Ball (2006) examined the OAIS Reference Model to determine its application to engineering repositories. Two common generic repository systems that use the OAIS Reference Model are DSpace and Fedora. The creators of DSpace designed it primarily for Institutional Repositories, while the researchers behind Fedora designed it to be a digital library that stores multimedia collections. Ball found five custom repositories that claim to be OAIS-compliant: the Centre de Données de la Physique des Plasmas (CDPP), MathArc, the European Space Agency (ESA) Multi-Mission Facility Infrastructure (MMFI), the National Oceanic and Atmospheric Administration (NOAA) Comprehensive Large Array-data Stewardship System (CLASS), and the National Space Science Data Center (NSSDC). While Ball did discuss the efforts of RLG, OCLC, CRL, and NARA to provide a method for audit and certification, he did not note whether or not the creators and owners of DSpace, Fedora, or any of the custom systems, or their users, had formally audited any of the repository software for OAIS compliance.

Vardigan & Whiteman (2007) did apply the OAIS Reference Model to the social science data archive of the Inter-university Consortium for Political and Social Research (ICPSR). The authors wished to determine their repository’s conformance to the OAIS Reference Model. After an extensive audit, they realized that the ICPSR digital repository did fulfill many of the key responsibilities of an OAIS archive, with two exceptions: first, they needed to publish a preservation policy, and second, they discovered that their Preservation Description Information was not always clearly labeled and was often incomplete (Vardigan & Whiteman, 2007).

Data grids are an example of a general systems deployment of the OAIS Reference Model. A grid administrator may map the policies and procedures that govern the data flow of the data grid to specific OAIS components. For example, if the grid administrator would like to create authentic copies, then s/he will implement access policies that govern the generation of DIPs. The grid administrator may implement replication and integrity checking by implementing storage policies; and may implement the processing of SIPs and the creation of AIPs by implementing ingest policies (Reagan Moore, personal communication, December 22, 2011). Other specific OAIS components may be mapped to the data grid’s policies and procedures data flow as needed; these are but a few examples.
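
A rough sketch of this kind of mapping, written as generic Python rather than in any particular data grid’s rule language, might register hypothetical policy hooks against OAIS functional areas:

    # Hypothetical mapping of OAIS components to machine-actionable policies.
    # In a real data grid these would be rules enforced by the grid middleware.
    def on_ingest(sip):            # ingest policies: process SIPs, create AIPs
        return {"aip": sip, "event": "aip created"}

    def on_storage(aip):           # storage policies: replication, integrity checks
        return {"replicas": 2, "fixity_checked": True}

    def on_access(aip):            # access policies: govern the generation of DIPs
        return {"dip": aip, "authentic_copy": True}

    oais_policy_map = {
        "Ingest": on_ingest,
        "Archival Storage": on_storage,
        "Access": on_access,
    }
    print(sorted(oais_policy_map))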

The OAIS Reference Model: Other Criticisms

Higgins and Semple (2006) compiled a list of recommendations for updates to the OAIS Reference Model in preparation for the CCSDS’ review of the recommendation at the five-year mark. The authors compiled the list of recommendations on behalf of the Digital Curation Centre and the Digital Preservation Coalition. Among the general recommendations, the authors listed: supplementary documents such as OAIS-lite for managers, a self-testing manual, an implementation checklist, and a best practice guide. The authors requested more concrete and up-to-date examples for implementers.

Higgins and Semple noted the CCSDS’ tendency to be very prescriptive and detailed in some sections, and overly general in others. They re-iterated that the CCSDS should create a better description of minimal requirements, as not everything must be implemented. The authors requested a review of the terminology clashes between the OAIS Reference Model, PREMIS, and other standards, and asked the CCSDS to resolve these differences. Higgins and Semple requested terminology and clarification updates by chapter, including updates to words such as “repository”, “preservation”, “security”, etc. They also identified a variety of outdated material.

The members of the CCSDS Data Archiving and Ingest Working Group did respond to this list of recommendations. They adopted some of the recommendations and made changes to the text of the OAIS Reference Model, but they refused to make other requested changes. Higgins and Boyle (2008) compiled a response to the CCSDS, again on behalf of the Digital Curation Centre and the Digital Preservation Coalition. Their concerns related to the changes rejected by the CCSDS Data Archiving and Ingest Working Group. Higgins and Boyle (2008) wanted “to ensure that the revised standard” would:

  • remain up-to-date until the next planned review;
  • remain applicable to the current heterogeneous user base;
  • be easier to understand through a structure which clearly delimits normative text, use cases and examples;
  • contain guidelines on how to achieve an implementation;
  • follow ISO practice by clearly referencing other applicable standards; and,
  • clarify its applicability to digital material (Higgins & Boyle, 2008).

It will be interesting to note which, if any, of these recommendations the members of the CCSDS include in the next revision of the OAIS Reference Model.

Conclusions and Future Work

Practitioners note that one benefit of the OAIS Reference Model has been “the utility of the OAIS language as a means of communication” between partnering repository administrators, who often had different terminology (Beedham, et al., 2005). The authors recommend that current archives should adopt the OAIS language in lieu of their own terminology, and new archive administrators should adopt it from the inception of the archive. Allinson (2006) writes that the OAIS Reference Model “ensures good practice”, as it “draws attention to the important role of preservation repositories” by providing a standard model so that preservation is considered part and parcel of other archive functions and activities. When the CCSDS outlined an archive manager’s Mandatory Responsibilities, the authors asked only that an archive’s “preservation has been planned for and a strategy identified”, as most repository managers already fulfill those tasks as a de facto part of the repository’s functioning (Allinson, 2006).

One area of future work may be to create an “OAIS lite” for smaller archives, which do not have the personnel or the need for such a bureaucratic model (Beedham, et al., 2005). Another area for future work is to de-homogenize the definition of Designated Community, as not every repository has a narrow audience of users. The CCSDS might consider recommending other metadata documentation to supplement the Reference Model, or create a separate recommendation, similar to the way the Producer-Archive Interface Methodology Abstract Standard (CCSDS, 2004) supplements the Ingest entity. This documentation would describe how the different information packages break down or how to apply metadata schemas (Beedham, et al., 2005; Allinson, 2006).

Egger (2006), Allinson (2006), and Beedham, et al. (2005), among others, complained that the authors of the OAIS Reference Model are inconsistent in the specifications, as some specifications are very general while others are very detailed. Therefore, one area for future work is for the CCSDS to create consistency within the Reference Model document with regard to specificity. Finally, Beedham, et al., concluded that the authors of the Reference Model may want to re-word the recommendation to take into account that a SIP, AIP, and DIP may all be one and the same, rather than assume that each is a different type of Information Package.

In spite of the various criticisms, the overall conclusion from a variety of experienced repository managers is that the authors of the OAIS Reference Model created flexible concepts and common terminology that any repository administrator or manager may use and apply, regardless of content, size, or domain (e.g., academia, private industry, and government).

References

Allinson, J. (2006). OAIS as a reference model for repositories an evaluation. Bath, England: UKOLN. Retrieved December 19, 2011, from http://www.ukoln.ac.uk/repositories/publications/oais-evaluation-200607/Drs-OAIS-evaluation-0.5.pdf

Ball, A. (2006). Briefing paper: the OAIS Reference Model. Bath, England: UKOLN. Retrieved December 19, 2011, from http://homes.ukoln.ac.uk/~ab318/docs/ball2006oais/

Beedham, H., Missen, J., Palmer, M. & Ruusalepp, R. (2005). Assessment of UKDA and TNA compliance with OAIS and METS standards. UK Data Archive and The National Archives, 2005. Retrieved: December 20, 2011, from: http://www.jisc.ac.uk/uploaded_documents/oaismets.pdf

CCSDS. (2002). Reference model for an Open Archival Information System (OAIS) (CCSDS 650.0-B-1). Washington, DC: National Aeronautics and Space Administration (NASA). Retrieved April 3, 2007, from http://nost.gsfc.nasa.gov/isoas/

CCSDS. (2004). Producer-archive interface methodology abstract standard (CCSDS 651.0-B-1). Washington, DC: National Aeronautics and Space Administration (NASA). Retrieved August 18, 2007, from http://public.ccsds.org/publications/archive/651x0b1.pdf

CCSDS. (2011). Audit and Certification of Trustworthy Digital Repositories (CCSDS 652.0-M-1). Magenta Book, September 2011. Washington, DC: National Aeronautics and Space Administration (NASA).

Egger, A. (2006). Shortcomings of the Reference Model for an Open Archival Information System (OAIS). IEEE TCDL Bulletin, 2(2). Retrieved October 23, 2009, from http://www.ieee-tcdl.org/Bulletin/v2n2/egger/egger.html

Fedora and the Preservation of University Records Project. (2006). 2.1 Ingest Guide, Version 1.0 (tufts:central:dca:UA069:UA069.004.001.00006). Retrieved April 16, 2009, from the Tufts University, Digital Collections and Archives, Tufts Digital Library Web site: http://repository01.lib.tufts.edu:8080/fedora/get/tufts:UA069.004.001.00006/bdef:TuftsPDF/getPDF

Galloway, P. (2004). Preservation of digital objects. In B. Cronin (Ed.), Annual Review of Information Science and Technology, 38(1), (pp. 549-590).

Higgins, S. & Boyle, F. (2008). Responses to CCSDS’ comments on the ‘OAIS five-year review: recommendations for update 2006’. London: Digital Curation Centre and Digital Preservation Coalition.

Higgins, S. & Semple, N. (2006). OAIS five-year review: recommendations for update. London: Digital Curation Centre and Digital Preservation Coalition.

Lavoie, B. (2004). The open archival information system reference model: introductory guide. Technology Watch Report. Dublin, OH: Digital Preservation Coalition. Retrieved March 6, 2007, from http://www.dpconline.org/docs/lavoie_OAIS.pdf

Lee, C. (2010). Open archival information system (OAIS) reference model. In Encyclopedia of Library and Information Sciences, Third Edition. London: Taylor & Francis.

Research Libraries Group. (2002). Trusted digital repositories: attributes and responsibilities an RLG-OCLC report. Mountain View, CA: Research Libraries Group. Retrieved September 11, 2007, from http://www.oclc.org/programs/ourwork/past/trustedrep/repositories.pdf

Research Libraries Group. (2005). An audit checklist for the certification of trusted digital repositories, draft for public comment. Mountain View, CA: Research Libraries Group. Retrieved April 14, 2009, from http://worldcat.org/arcviewer/1/OCC/2007/08/08/0000070511/viewer/file2416.pdf

Science and Technology Council. (2007). The digital dilemma strategic issues in archiving and accessing digital motion picture materials. The Science and Technology Council of the Academy of Motion Picture Arts and Sciences. Hollywood, CA: Academy of Motion Picture Arts and Sciences.

Vardigan, M. & Whiteman, C. (2007). ICPSR meets OAIS: applying the OAIS reference model to the social science archive context. Archival Science, 7(1). Netherlands: Springer. Retrieved February 20, 2008, from http://www.springerlink.com/content/50746212r6g21326/

OAIS Reference Model & Preservation Design Summary

Manage Data: Preservation Standards & Management

Abstract

Archivists, librarians, computer scientists and other researchers and scientists have been concerned about the long-term survivability of data for decades. This data may be in the form of actual data sets, or data that represents and describes published works, art, video, audio, or other file formats. This literature review describes the emergence of digital curation and digital preservation standards in the context of managing data. Standards for digital curation and digital preservation augment the ability of data owners and users to ensure the survivability of their data, but these standards do not directly “cause” the long-term preservation of the data itself. The conclusion is that the survivability of data depends on the will and desire of the data owners and users, and the availability of financial resources to do so.

Citation

Ward, J.H. (2012). Managing Data: the Emergence & Development of Digital Curation & Preservation Standards. Unpublished manuscript, University of North Carolina at Chapel Hill. (pdf)

Creative Commons License
This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.


Table of Contents

Abstract

Introduction

Why Preserve and Curate Data?

Basic Definitions

Motivating Factors for the Development of Digital Curation and Digital Preservation Standards

Persistence

Overview of the OAIS Reference Model and the Audit and Certification of Trustworthy Digital Repositories Recommended Practice

Applications of the Audit and Certification of Trustworthy Digital Repositories Recommended Practice and the OAIS Reference Model

Other Technical Issues

General Digital Repository Management

Funding

Current Status and Future Challenges/Further Work

Conclusion

References


Table of Figures

Figure 1 – the Digital Curation Centre Curation Lifecycle Model (Higgens, 2007).

Figure 2 – Scientific and Technical Information Lost Over Time (Nelson, 2000).

Figure 3 – Core data grid capabilities and functions for implementing a persistent archive (Moore & Merzky, 2002).

Figure 4 – the OAIS Reference Model “Functional Model” (CCSDS, 2002).


Introduction

Librarians and archivists have spent centuries wringing their hands and holding heated discussions about the best method for recording information for the purpose of transmitting it to succeeding generations. Perhaps the librarians and archivists of ancient Egypt agreed that clay tablets were the best form in which to transmit information, but one imagines that even within that framework there was much discussion, with many agreements and disagreements. Which clay would last? Who was the best potter to fire it? How do we know that potter is actually selling us the quality of clay that he promised? “There must be standards!”…and so on and so forth. Regardless, those ancient librarians and archivists chose well: 5,000 years later, those clay tablets have stood the test of time and remain readable today (Krasner-Khait, 2001).

Then “someone” determined that papyrus was better than clay as an information transmission form. After all, it was lighter, it couldn’t break, and it was much easier to carry over long distances. The material required less storage space, as well, which would reduce overall costs. One can imagine the “old school” librarians and archivists with their clay fetish, snubbing the new papyrus advocates. However, the papyrus advocates eventually won, and the rolls of papyrus replaced clay tablets as the information medium of choice. Papyrus remained the primary information storage method of choice for around 3,000 years, until the development of the codex by the Romans in the first century A.D. (Zen College Life, 2011).

One can only imagine the consternation old-school papyrus librarians and archivists felt at the invention of the codex. Should they convert all of their holdings of clay tablets and papyrus rolls to codices? Should they leave this information in the old technologies and only store new information in the codex format? How many resources of time, money, and personnel would it take to migrate information from the old formats to the new? By 300 A.D., the codex was as popular as the papyrus scroll, and it became, and remains, the format used for the Christian Bible. These debates, and one can be sure there were many, were not purely academic. There were then, as now, practical reasons to be concerned with the transmission of historical, cultural, political, and literary information to succeeding generations. By the time Gutenberg invented the movable type press in the 15th Century, the codex had evolved into the book, and another information revolution occurred. Books became more prevalent, and no doubt librarians and archivists of Western Europe, Asia and the Middle East felt an information deluge of their own as they figured out how to organize, lend, copy, store, and find these books as libraries and archives grew and evolved from the Middle Ages to the 20th Century (Zen College Life, 2011).

The mid-20th Century brought the computer, and then networked computers that share and store information as bits and bytes. The formats these bits are stored in evolve every few years, as does the software that reads those formats and the hardware that runs the software. This pace of change makes the 3,000-year reign of clay tablets as the information medium of choice seem unimaginable. Yet one can be certain that current librarians and archivists are solving the same problems their counterparts faced 5,000 years ago. How do you select, preserve, maintain, collect and archive information in order to make it available to succeeding generations? This is the essence of curation, whether digital or physical. However, the focus of this paper is the curation and preservation of binary data; therefore, curation methods as applied to physical artifacts are out of the scope of this discussion.

Why Preserve and Curate Data?

There are many, many motivations for preserving data, regardless of the content. It would be challenging to cover every possible reason why some person or organization might want to curate and preserve their data. A few themes are common, though. In some instances, preservation is motivated by the human desire to preserve the current record (in a general sense) for future generations to access and use. Other motivations may be more base: helping a particular company or organization comply with legal requirements, or providing a source of revenue. In some cases, cultural heritage concerns may overlap with financial incentives, as with digital movies. For example, executives at movie companies have a huge financial incentive to ensure that their libraries remain accessible as formats change, so that they may sell and re-sell their titles for public consumption (Science and Technology Council, 2007). These films also represent the cultural heritage of humanity, whether the film in question is “Harold & Kumar Escape from Guantanamo Bay” or “Citizen Kane”. In other organizations, such as the National Archives, federal legal requirements overlap with a professional desire and charge to preserve the United States’ materials “for the life of the republic” (Thibodeau, 2007). Individuals’ health records must be available for the life of the person. Most of us would like our photographs to be accessible to our descendants and relatives, not forgotten on an old hard drive or lost in a hard drive crash. These are but a few examples of “what” and “why” data are deemed preservation-worthy.

Basic Definitions

“Archive”, “digital archive”, “data”, “information”, “knowledge”, “wisdom”, “digital preservation”, “digital curation”, “reliable”, “authentic”, “integrity”, and “trustworthy”.

Tibbo (2003) writes that computer scientists tend to use “archive” simply as a term to describe the storage and backup of digital data in an offline electronic environment, while archivists see archiving as part of a complex process that encompasses the entire lifecycle of a digital object (Waters & Garrett, 1996; Higgens, 2007). Put another way, an “archive” in the computer science sense simply stores data, whereas an “archive” to an archivist is an entire information system lifecycle that encompasses data, information, knowledge, and, perhaps, wisdom that will be made accessible for the indefinite long-term.

As well, practitioners who work with digital libraries and digital archives often use “digital library” to mean a “digital archive”, and vice versa. What then, is a digital archive?

Waters and Garrett (1996) defined

digital archives strictly in functional terms as repositories of digital information that are collectively responsible for ensuring, through the exercise of various migration strategies, the integrity and long-term accessibility of the nation’s social, economic, cultural and intellectual heritage instantiated in digital form. Digital archives are distinct from digital libraries in the sense that digital libraries are repositories that collect and provide access to digital information, but may or may not provide for the long-term storage and access of that information. Digital libraries thus may or may not be, in functional terms, digital archives and, in fact, much of the recent work on digital libraries is notably silent on the archival issues of ensuring long-term storage and access….Conversely, digital archives necessarily embrace digital library functions to the extent that they must select, obtain, store, and provide access to digital information. Many of the functional requirements for digital archives defined in this report thus overlap those for digital libraries.

The Society of American Archivists (1999) defines the core curation functions of any archive as appraisal, accession, arrangement, description, preservation, access and use. The basic archival principles remain the same whether an archive contains physical artifacts or data (Hedstrom, 1995). How an archivist applies these concepts may vary depending on the digital objects or physical artifacts to be preserved. Within the limitations of digital data, however, most applications of a data archive as of this writing use the Open Archival Information System (OAIS) (Consultative Committee for Space Data Systems, 2002) as a reference model. This model will be discussed briefly in a later section. However, the Consultative Committee for Space Data Systems (2002) notes that an “OAIS Archive” is distinguished from other uses of the term “archive” because it consists of an “organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community”. The archive must meet the set of responsibilities outlined in the OAIS Reference Model to be considered an “OAIS archive” (Consultative Committee for Space Data Systems, 2002). Otherwise, it is merely an “archive”.

Data are “any information in digital form” (Higgens, 2007) that “correspond to the bits (zeroes and ones) that comprise a digital entity” (Moore, 2002). Data include both simple and complex objects, as well as structured collections. A simple object may be a text file or image; a complex file may comprise an entire web site; and a database is an example of a structured collection (Higgens, 2007). Furthermore, Galloway (2004) notes that to be digital the objects must “require a computer to support their existence and display”.

Moore (2002) writes from a computer science perspective that information “corresponds to any tags associated with bits”, while Buckland (1991) defines information via the lens of Information Science. He describes “information-as-process”, “information-as-knowledge”, and “information-as-thing”. According to Buckland, “information-as-process” is the act of informing, while “information-as-knowledge” is the actual knowledge communicated during “information-as-process”. He defines “information-as-thing” by objects such as text and data, for example, because they impart and communicate knowledge; and he notes that knowledge may be contained in text, etc., that describes these information objects. Ackoff (1989) takes a management science approach and posits that information is contained in answers to questions posed with “who”, “what”, “where”, and “when”.

Knowledge “corresponds to any relationship that is defined between information attributes” (Moore, 2002); it is the application of data and information. Knowledge refines information and makes “possible the transformation of information into instructions” by answering the “how” questions (Ackoff, 1989). Wisdom is at the pinnacle of Ackoff’s hierarchy as an ideal state that evaluates the long-term consequences of an act. One might argue that repositories with audit mechanisms to ensure “authenticity” and “trust” apply wisdom in the form of policies to curate data, information, and knowledge “as things”.

The phrases “digital curation” and “digital preservation” are often used interchangeably, but they have slightly different meanings. The term “digital preservation” refers to a “series of managed activities necessary to ensure continued access to digital materials for as long as necessary” (Digital Preservation, 2009). Members of the Digital Preservation Coalition made this definition deliberately broad in order to refer “to all of the actions required to maintain access to digital materials beyond the limits of media failure or technological change” (Digital Preservation, 2009). As part of that definition, these members framed digital preservation as short-term, medium-term, and long-term. They defined “short-term” as access to the materials for the foreseeable future or for a defined period of time; “medium-term” as providing access for the near-term but not indefinitely; and, “long-term” as providing continued access to the materials for the indefinite future (e.g., as long as possible). Hedstrom (1995) writes that

preservation of an electronic record entails retaining its content; maintaining the ability to reproduce its structure; and providing linkages between an archival document and related records, its creator and recipient, the function or activity that it derived from, and its place in a larger body of documentary evidence.

Researchers and practitioners at the Digital Curation Centre (DCC) have defined digital curation as involving “maintaining, preserving and adding value to digital research data throughout its lifecycle” (Digital Curation Centre, 2010). An archivist, librarian or other data manager begins curation at the time the collection is assembled or acquired. He or she actively manages the collection in order to “mitigate the risk of digital obsolescence” and “to reduce threats to [the data’s] long-term research value” (Digital Curation Centre, 2010). According to DCC researchers and practitioners, curation serves two other primary purposes: providing a means to share data and reducing duplication of effort in data creation.

Higgens (2007) conceptualized an ideal model of digital curation as a lifecycle with three primary areas: full lifecycle actions, sequential actions, and occasional actions. These actions may be applied across the entire digital lifecycle or sequentially through it (Higgens, 2007). She defines “full lifecycle actions” as encompassing preservation planning; description and representation information; and, curation and preservation. Higgens models sequential actions as: conceptualization; creation or reception; access, use, and re-use; appraisal and selection; ingestion; storage; preservation action; and, transformation.

Figure 1 – the Digital Curation Centre Curation Lifecycle Model (Higgens, 2007).

The significance of the model is that it provides a visual tool and summary from which a repository manager may plan the curation tasks appropriate for the collection and the repository at any stage in the curation lifecycle.

Duranti (1995) defined the terms “reliability” and “authenticity” based on diplomatic concepts. A record is “reliable” when the degree of completeness of the form and the degree of control of the procedure of creation meet the requirements of the socio-juridical system in which it is created. A reliable record is a “fact in itself, that is, as the entity of which it is evidence” (Duranti, 1995). If a document is what it claims to be, then the document is considered authentic. However, just because a document is authentic does not make it reliable. If a record is authentic, then it “does not result from any manipulation, substitution, or falsification occurring after the completion of its procedure of creation” (Duranti, 1995). Reliability takes precedence over authenticity.

The way to guarantee both reliability and authenticity is to have a standard for record completeness along with a controlled procedure for creation, as well as a procedure to control the transmission and storage of the records. For example, a birth certificate will be considered reliable and authentic if all fields required by law have entries; the person recording the information has the authority to do so (i.e., is the attending physician or midwife) and obtains it from a knowledgeable source (i.e., one or both parents, as well as his or her own attendance at the birth); the authorized person enters the information provided correctly; the parents provide correct information to begin with; and the birth certificate is stored in a government repository with access controls over the repository records. If a parent or physician provides false information on the birth certificate and the government stores it, then subsequent copies obtained of the birth certificate may be authentic, but they will not be reliable.

In order to provide reliable, authentic records in a digital environment, the keepers of the data objects must be able to maintain the objects’ integrity and provide evidence that the repository itself is trustworthy. The primary evidence of an object’s integrity relates to its content, fixity, reference, provenance, and context (Waters & Garrett, 1996). Integrity builds upon, and to some degree is concerned with, authenticity, but it is not security (Lynch, 1994). Some examples of integrity violations include bit flipping, data corruption, disk errors, and malicious intrusions (Sivathanu, Wright, and Zadok, 2005).
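
To make the fixity idea concrete, a repository might record a cryptographic checksum for each object at ingest and periodically recompute it to detect bit-level corruption. The sketch below is a minimal illustration in Python of that general approach, not a mechanism prescribed by any of the sources cited here; the file path, digest algorithm, and stored values are assumptions made for the example.

```python
import hashlib
from pathlib import Path

def compute_fixity(path: Path, algorithm: str = "sha256") -> str:
    """Return the hex digest of a file, read in chunks so large objects fit in memory."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as stream:
        for chunk in iter(lambda: stream.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_fixity(path: Path, recorded_digest: str, algorithm: str = "sha256") -> bool:
    """Compare a freshly computed digest against the value recorded at ingest.

    A mismatch signals an integrity violation (e.g., bit flipping or a disk error)
    and would typically trigger repair from a replica plus an audit log entry.
    """
    return compute_fixity(path, algorithm) == recorded_digest

# Illustrative usage (the path is hypothetical):
# stored = compute_fixity(Path("storage/object-0001.tiff"))          # at ingest
# intact = verify_fixity(Path("storage/object-0001.tiff"), stored)   # during a later audit
```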

At a minimum, for a repository to be trustworthy, it must begin with “‘a mission to provide reliable, long-term access to managed digital resources to its Designated Community, now and into the future'” (Consultative Committee for Space Data Systems, 2011). Both Waters & Garrett (1996) and the Consultative Committee for Space Data Systems (2011) prefer that repository managers conduct transparent audits of the system itself in order to assure “trustworthiness” to both internal and external stakeholders.

Motivating Factors for the Development of Digital Curation and Digital Preservation Standards

The movement to set standards for preservation and curation developed to bring order to chaos and to provide the information necessary so that individuals and organizations may make informed decisions about which data objects are reliable and authentic, and which repositories are trustworthy and mindful of data object integrity. That is, practitioners need to be able to determine whether the people running a repository are actually doing so in a way that will preserve the objects for the specified time required and in such a way that those objects can be found. More importantly, practitioners and users also must be certain that the objects preserved are both authentic and reliable. One way to ensure the reliability, authenticity, integrity, and trustworthiness of data objects and the repositories that house them is for the stakeholders to come together and agree on the procedures and definitions for these qualities, and, in the process, create standards for digital curation and digital preservation.

Previously, different industries worked within their domain to develop standards for preservation and curation. Book publishers worked within book publishing; filmmakers within filmmaking; and so on (Science and Technology Council, 2007). The mass use of digital data has created the need for broad standards that cross all industries. This is not a situation in which knowledge about preserving one kind of format (e.g., paper) excludes knowledge about preserving another (e.g., film). A digital file is a digital file, whether it resides in a repository at the Library of Congress or on a graphic designer’s personal laptop. All industries are facing similar problems; a short list of these problems includes format obsolescence, physical media changes, hardware and software migrations, personnel costs, and the costs of storing all of this data in perpetuity and making it accessible.

The last of these, cost, ranks among the highest concerns. For example, the cost of storing a 4K digital master of a movie is about 1,100% higher than the cost of storing the same master as film, roughly eleven times as much (Science and Technology Council, 2007). A collection may be deemed worthy of saving into perpetuity by a consensus of experts, but without any resources to make that happen, the most one can hope for is that the machine the data is stored on will be turned off and put in a temperature-controlled closet until and unless “someone” finds it and migrates the data to a new resource. (This preservation method assumes the data can be migrated and that there has not been any physical deterioration of the machine or disks, etc., during the time it was in storage.)

What is the best way to reduce long-term preservation costs? According to the members of the Science and Technology Council of the American Academy of Motion Picture Arts and Sciences (2007), the best way to reduce costs is to collaborate within and across industries and domains to develop and use standards, leveraging organizations such as the National Digital Information Infrastructure & Preservation Program (NDIIPP) for this purpose. The word “standard” here includes, but is not limited to, file formats, filenames, metadata, metadata registries, distribution, and archiving. Galloway (2004) also concluded that the costs of preserving digital materials are exacerbated by the proliferation of proprietary formats, and that the format problem must be solved in order to limit cost.

Persistence

As stated earlier, digital curation and preservation standards grew out of established practices for the preservation of the human record, whether the purpose is research, legal requirements, cultural heritage, etc. One idea behind the development of standards, best practices, reference models, audit criteria, and a lifecycle model, etc., is to create a body of knowledge such that any person charged with preserving and curating a digital collection may readily find the information needed to accomplish their task.

Waters & Garrett (1996) were part of the Task Force on Archiving of Digital Information that examined the “state of the state” of digital preservation in the mid-1990s. Many of the task force’s recommendations contributed to the development of the final versions of the OAIS (Consultative Committee for Space Data Systems, 2002) and the standards for the Audit and Certification of Trustworthy Digital Repositories (Consultative Committee for Space Data Systems, 2011). Other recommendations from the 1996 task force include: creators, providers and owners of digital information are responsible for the preservation of the information; deep digital infrastructure must be developed to support a distributed preservation system; and, trustworthy, certified archives must be prepared and able to aggressively rescue data from repositories that are failing (Waters & Garrett, 1996).

While many large datasets have been preserved for decades without any formal standards for preservation and curation, it helps to have best practices with which to build a preservation program. For example, the Inter-University Consortium for Political and Social Research (ICPSR) has been migrating data since at least the early 1960s with few formal preservation criteria or curation standards to reference (Galloway, 2004). ICPSR personnel, partners, and users were committed to the longevity of the data, so it has been migrated repeatedly. Over the past few years, ICPSR has formalized their repository design to comply fully with the OAIS reference model, for example, because data managers believe this will further ensure the long-term availability of the social science data in the repository and lead to a “federated system of social science repositories” (Vardigan & Whiteman, 2007).

This year, Paul Ginsparg, physicists, mathematicians, computer scientists, and other scientists celebrated the 20th anniversary of arXiv, a pre-print archive (Ginsparg, 2011). Ginsparg began arXiv as an electronic bulletin board to continue physicists’ tradition of sharing research via mail and email. The bulletin board grew into a digital repository, and has survived a variety of funding sources, media, hardware, and software changes. The creators of arXiv and affiliated researchers have used it as a test bed from which to create a variety of standards that have aided in repository architecture design and interoperability such as the Dienst Protocol and the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) (Davis & Lagoze, 2000; Lagoze & Van de Sompel, 2001). Thus, practitioners had not yet created preservation and curation standards for repositories at the time of arXiv’s birth, yet it has survived for 20 years because the community that uses it wants to keep using it.

Although there are many technical problems associated with digital preservation that have yet to be solved, including the rapid obsolescence of software and hardware due to technology cycles (Thibodeau, 2002; Rothenberg, 1999), the primary problems associated with digital preservation and the curation of data are not technical; they are societal. Galloway (2004) notes that whether or not data are preserved has more to do with whether or not a given community chooses to preserve its own record; intellectual and social capital are the issue. Although we are in the midst of a data deluge that is not going to grow smaller any time soon, if ever, there are adequate systems and designs to support it. There must be an institutional commitment to support the preservation of a particular set of data, and this commitment must include an expenditure of resources, not just of will or desire for digital preservation (Consultative Committee for Space Data Systems, 2002). Galloway (2004) lists organizations that have consistently migrated data due to institutional will and personnel commitment; these include the sciences (for data sets), data warehouses, publishers and authors (text files), and government agencies (e.g., the National Archives, the Library of Congress, and other federal and state agencies).

Plenty of data has been lost over the years, as well, by those same organizations. Rothenberg (1999) listed several cases of possible loss by U.S. government agencies; one of the more famous examples is the census data for 1960 (although Waters and Garrett (1996) note that the data loss was not as extensive as some think). He points out that computer scientists are notorious for accepting data loss as part of the price one pays to move to the next generation of hardware and software. He also writes that in 1990, a Congressional report “cited a number of cases of significant digital records that had already been lost or were in serious jeopardy of being lost”. To put this in perspective on a smaller scale, Nelson (2000) wrote that in a typical project at NASA c. 2000, the published research paper went to a library, the software to an FTP site, raw data was thrown away, and images were stored in a filing cabinet.

Figure 2 – Scientific and Technical Information Lost Over Time (Nelson, 2000).

In theory, all digital data could be preserved, but then the question becomes, “Should it be preserved?” If not, how do you cull that much data? Maybe it is better to keep it all? It takes personnel time to cull data, but it also costs money to store data that could otherwise be deleted.

The idea of how permanent or impermanent an archive’s collections should be is not a new one. O’Toole (1989) wrote that archivists had evolved their attitudes towards the “permanence” of artifacts and had begun to view permanence as an “unrealistic and unattainable” ideal. This is echoed in the digital realm. The members of the InterPARES (2001) project determined that it is acceptable to preserve a version of a record, so long as the integrity of the information is maintained. In other words, if a file format must be migrated from one form to another (.doc to .pdf, for example) in order to preserve it, an archivist does not have to preserve the original bits for the information itself to be considered authentic. Thibodeau (2002) also noted that it is more important to preserve the essential characteristics of an object — its look, feel, and content, for example — than it is to preserve the digital encoding of the object per se. Not all preservationists share this view, however. As late as 1999, Rothenberg (1999) expected documents to be preserved in their original bit form.

The members of the Science and Technology Council (2007) reached a conclusion similar to Thibodeau and the InterPARES members regarding film masters versus digital movie masters. The practice for the past 100 years or so has been to “save everything” when archiving a film. Thus, a director may go back 20 or 30 years later and create a new version of a movie, or film buffs with access to the film archive may study other aspects of the movie itself. The council members concluded that “save everything” is not feasible with digital movies, due both to the number and size of the files that make up a digital movie and to the cost of storing that much data over time. The digital movies will have to be migrated from the original file format, software, and hardware, to be compatible with new file formats, software, and hardware. This new version will supersede the old version of a movie, thus changing the idea of what is the actual canonical copy of a film. Therefore, the idea that the objects in a digital collection are ephemeral, both in terms of whether data will be kept in the first place and in terms of the canonical digital version evolving over time, has gained ground as digital curation and preservation have developed over the past decades.

However, in spite of the idea that data are ephemeral either in terms of their lifespan or bits and bytes, another notion developed: that of “persistence”. An archivist or computer scientist may not want to keep an object long, or he or she may wish to migrate the format, but he or she wants to be able to find that data and do what is needed to the object, whether that means deleting it, migrating it, or some other task.

One of the first tasks upon ingesting an object into a repository is to assign it a unique identifier that is not shared by any other object in the archive, and, preferably, by any object in any archive. A full discussion of unique identifiers is beyond the scope of this paper, much less a discussion of the pros and cons of the various identifiers available to use with data. Some unique identifiers are one-of-a-kind to the archive or archive owner only. Some are part of a larger standard, such as Digital Object Identifiers (DOI), which are persistent names linked to redirection (Paskin, 2003). Some identifiers work only with URIs and can only be used via the World Wide Web (WWW), such as ARK (Archival Resource Key) (Kunze, 2003).

Most identifiers used with digital data may be used as URLs/URNs (Uniform Resource Locator/Uniform Resource Name). These are web-based, and run over the Internet. URLs are equivalent to a person’s address (e.g., http://sils.unc.edu/), and URNs are the equivalent of a person’s name, but the latter may be combined with existing non-Web identifiers to create a one-off, web-based identifier such as “urn:isbn:n-nn-nnnnnn-n” (URI Planning Interest Group, 2001). Once a unique identifier is assigned, it is considered a best practice never to change that identifier, resource name, or resource locator (Berners-Lee, 1998). If it is necessary to do so for administrative or policy reasons, then within the system itself a “redirect” should be in place, so that the old location identifier points the system or user to the new location of the data.
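
To illustrate the redirect idea, the sketch below shows one way, assumed purely for illustration, that a repository could maintain a table mapping persistent identifiers to current storage locations, so that the identifier itself never changes even when the object moves. The identifier scheme and URLs are hypothetical examples, not part of any cited standard.

```python
# Minimal sketch of an identifier-to-location table with redirects.
# The identifiers and locations below are hypothetical examples.
class IdentifierResolver:
    def __init__(self) -> None:
        self._locations: dict[str, str] = {}

    def register(self, identifier: str, location: str) -> None:
        """Bind a persistent identifier to its current storage location."""
        self._locations[identifier] = location

    def move(self, identifier: str, new_location: str) -> None:
        """Update the location without ever changing the identifier itself."""
        if identifier not in self._locations:
            raise KeyError(f"Unknown identifier: {identifier}")
        self._locations[identifier] = new_location

    def resolve(self, identifier: str) -> str:
        """Return the current location; callers store only the identifier."""
        return self._locations[identifier]

resolver = IdentifierResolver()
resolver.register("urn:example:repo:0001", "https://repo.example.org/storage-a/0001")
resolver.move("urn:example:repo:0001", "https://repo.example.org/storage-b/0001")
assert resolver.resolve("urn:example:repo:0001").endswith("storage-b/0001")
```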

As part of establishing persistent identifiers and locators for networked-based identifiers, researchers began to identify the features of a persistent (digital) archive, a persistent collection, and a persistent object. Moore & Merzky (2002) developed concepts for a persistent archive. They combined the functionality of a data grid with traditional archival processes (e.g., appraisal, accession, arrangement, and description) to create a matrix of core capabilities and functions.

Figure 3 – Core data grid capabilities and functions for implementing a persistent archive (Moore & Merzky, 2002).

The authors proposed that this set of core capabilities would minimize the human labor involved in “implementing, managing, and evolving a persistent archive”. More importantly, they noted that these capabilities already exist in (then) current implementations of data grids.

Moore (2005) evolved these ideas to include the concept of a “persistent collection”. He defines a persistent collection as a “combination of digital libraries for the publication of digital entities, data grids for the sharing of digital entities, and persistent archives for the preservation of digital entities”. Moore concluded that while persistent collections are built on top of data grids, and data grids have been used successfully for data sharing, publication, and preservation, in order to use data grids for persistent collections, additional capabilities “to simplify the integration of new services and support the federation of independent data grid federations” must be added.

Brody (2000) and Carr (1999) “mined” the life of an ePrint archive and discovered that authors still made corrections to the papers and metadata after the respective author or authors had submitted them to the University of Southampton ePrint archive. (Neither Brody nor Carr provided an average end date as to when authors stopped committing changes either to the paper or the metadata.) Thus, even Thibodeau’s “essential characteristics” are subject to change, although a repository’s owners could change this characteristic by creating a policy that allows or prohibits changes after publication in the repository.

Another aspect of object persistence is whether or not the Web site that contains the object or data currently exists (as opposed to available but not accessible). Koehler (1999) examined the persistence of Web pages, Web sites, and server-level domains beginning in 1996. He reported that after 6 months, 20.5% of Web pages and 12.2% of Web sites monitored for the study failed to respond. After 12 months, those figures changed to 31.8% and 17.7%, respectively. He inferred from this that the half-life of a Web page is about 1.6 years, and of a Web site, 2.9 years. Koehler identified three kinds of Web persistence: permanence (it is not going anywhere); intermittence (sometimes it is there, sometimes it is not); and disappearance (it is gone forever). He discovered that 99% of Web sites had changed after 12 months. Koehler (1999) concluded that if the World Wide Web is the equivalent of H.G. Wells’ (1938) “world brain”, then two things may be said of it: the world brain has a short memory, and when it does remember, it changes its mind a lot; how much and where varies.
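
Koehler does not show his calculation, but the order of magnitude of such half-life figures can be reproduced with a simple back-of-the-envelope model. The sketch below assumes that page availability decays exponentially, which is an assumption of this illustration rather than Koehler's published method; fed his 12-month attrition figure, it yields an estimate in the same range as his reported 1.6 years.

```python
import math

def half_life_years(fraction_unavailable: float, elapsed_years: float) -> float:
    """Estimate a half-life from an observed attrition rate.

    Assumes exponential decay, i.e., survival S(t) = exp(-lambda * t);
    this is a simplifying assumption, not Koehler's published method.
    """
    surviving = 1.0 - fraction_unavailable
    decay_rate = -math.log(surviving) / elapsed_years
    return math.log(2) / decay_rate

# 31.8% of Web pages failed to respond after 12 months (Koehler, 1999):
print(round(half_life_years(0.318, 1.0), 1))  # ~1.8 years, in the same range as Koehler's ~1.6
```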

Koehler (2004) revisited his study five years later. He reports that static collections — similar to the ePrints archive mentioned earlier in this paper — tend to stabilize after they have “aged”. As part of this paper, he reviewed the growing body of literature related to persistence — also referred to as “linkrot” — and found that the stability of collection-oriented Web sites (e.g., legal, academic, citation-based) varies based on the domain specialty. Nelson and Allen (2002) examined 1,000 digital objects in a variety of digital libraries over the course of a year. They discovered that 3% of all objects were no longer available after 12 months, but the resource half-life is about 2.5 years. Koehler writes that for other resource types, such as scholarly article citations, legal citations, biological science education resources, computer science citations, and random Web pages, the half-life of the resources ranges between 1.4 years and 4.6 years. While the set of URLs in Koehler’s studies stabilized for two years after two-thirds of them were lost during the first four years of the study, his overall conclusion was that the Web provides no guarantee of longevity for data, collections, or repositories.

McCown, Chan, Nelson, & Bollen (2005) revisited the Nelson and Allen (2002) study of D-Lib Magazine Web persistence and expanded upon it by examining outlinks — the URLs cited in D-Lib Magazine articles. They extracted 4,387 unique URLs referenced in 453 articles published between July 1995 and August 2004. They discovered that approximately 30% of the URLs failed to resolve, although only 16% of the registered content indicated more than a 1 KB change during the same testing period. The researchers concluded that the half-life of a URL referenced in a D-Lib Magazine article is around 10 years. To state the obvious, even scholarly articles referenced in a respected journal in the Information and Library Science field — where linkrot is a known problem — cannot maintain stable references.

The studies above represent but a small proportion of the literature documenting the ephemeral nature of data (“digital objects”), Web sites (“archives”), and Web pages. By the late 1990s to early 2000s, it had become apparent across fields that in order to rely on digital resources, some objects need to be static, the repository that contains the objects needs to remain accessible, audit mechanisms need to exist to prove that the objects in the repository are what they say they are, and the repository must be capable of persisting over time even as the content is migrated to newer software and hardware. In other words, “someone” needed to develop a standard model for archiving objects for some period, either short- or long-term. As well, “someone” needed to create audit mechanisms to determine that a repository is “trustworthy” and that the repository’s contents are “authentic” and “reliable” and have maintained their “integrity”. “Someone” had been doing just that: the CCSDS finalized the “Reference Model for an Open Archival Information System” (OAIS) as a standard in 2002. The CCSDS released the “Audit and Certification of Trustworthy Digital Repositories” as a Recommended Practice (Magenta Book) in September 2011.

Overview of the OAIS Reference Model and the Audit and Certification of Trustworthy Digital Repositories Recommended Practice

The Consultative Committee for Space Data Systems (CCSDS) convened an international workshop in 1995 with the purpose of advancing a proposal “to develop a reference model for an open archival information system” (Lavoie, 2004). The CCSDS had determined previously that there was no widely accepted model or framework that could serve as a standard for the long-term storage of space mission digital data. The members of the CCSDS recognized that fundamental questions related to digital preservation cut across all domains; therefore, the development scope of the model included stakeholders from a variety of domains, including government, private industry, and academia (Lee, 2010). The committee determined that the purpose of creating a reference model was to “address fundamental questions regarding the long-term preservation of digital material” (Lavoie, 2004). This model would define an archival system and outline the essential conditions a repository owner must meet in order to be considered a preservation archive.

The CCSDS defines an OAIS as “an archive, consisting of an organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community. It meets a set of such responsibilities as defined in this document, and this allows an OAIS archive to be distinguished from other uses of the term ‘archive'” (CCSDS, 2002). The word, “Open” is used to note that the CCSDS (2002) developed the recommendation in open forums, and will continue to do so for any future iterations of the model. The use of the word, “Open” does not imply that access to the repository itself must be unrestricted, in order to meet the requirements of the OAIS model (Lee, 2010).

The committee described four categories of archives: independent, cooperating, federated, and shared resources. The owners of an independent archive do not interact with any other archive owners with regards to technical or management issues. The possessor of a cooperating archive does not have a “common” finding aid with other archive possessors, but otherwise shares common producers, submission standards, and dissemination standards. The owners of a federated archive serve both a global and local Designated Community with interests in these related archives, and these owners provide access to their holdings to the Designated Community via one or more shared finding aids. The holders of archives with shared resources have agreed to “share resources” with each other, generally to reduce cost. This type of arrangement requires the use of standards internal to the archive, such as for ingest and access, that do not “alter the user community’s view of the archive” (CCSDS, 2002; Lee, 2010).

The CCSDS divided the reference model into two “sub-models” – a Functional Model and an Information Model. Simply put, the Functional Model defines what an archive must do, and the Information Model defines what a repository must have in its collections (Lee, 2010). The former describes seven main functional entities and how they interface with each other. These entities are: Common Services, Preservation Planning, Data Management, Ingest, Administration, Access, and Archival Storage.

Figure 4 – the OAIS Reference Model “Functional Model” (CCSDS, 2002).

The Information Model describes and defines the information beyond the content. The members of the CCSDS included this section because the long-term preservation of digital material will require more than simply the content itself. A few examples of the information described and defined within the Information Model include: representation, fixity, provenance, content, and preservation description.
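
As a rough illustration only, and not something the OAIS documents themselves specify in code, an archival information package could be modeled as content plus representation information plus the categories of preservation description information named above. The class and field names below are assumptions made for the example, as are the sample values.

```python
from dataclasses import dataclass, field

@dataclass
class PreservationDescription:
    """Information kept alongside the content itself (fixity, provenance, reference, context)."""
    fixity: str                                           # e.g., a checksum recorded at ingest
    provenance: list[str] = field(default_factory=list)   # custody and processing history
    reference: str = ""                                    # the object's persistent identifier
    context: str = ""                                      # how the object relates to other holdings

@dataclass
class ArchivalInformationPackage:
    content_data: bytes                  # the preserved bits themselves
    representation_info: str             # what is needed to interpret the bits (format, schema)
    preservation_description: PreservationDescription

# Hypothetical example package:
aip = ArchivalInformationPackage(
    content_data=b"...",
    representation_info="TIFF 6.0 baseline image",
    preservation_description=PreservationDescription(
        fixity="sha256:0f1e...",
        reference="urn:example:repo:0001",
        provenance=["received from producer 2011-09-01", "migrated from JPEG 2012-01-15"],
    ),
)
```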

In summary, if a repository is an OAIS-type archive, then the archive managers will implement each area of the Functional Model in order to preserve information as an information package via the Information Model, for a Designated Community (Lavoie, 2004). The CCSDS designed the OAIS to be a reference model – it is NOT an implementation. The committee members deliberately left it up to an archive’s owners to determine the technical details of the archival system. Egger (2006) writes that this is a disappointing aspect of the reference model, because it mixes technical and management functionality, rather than keeping them separate per standard software engineering practices. Vardigan and Whiteman (2007) successfully applied the OAIS reference model to the Inter-university Consortium for Political and Social Research (ICPSR) social science repository. The managers of the Online Computer Library Center (OCLC) Digital Archive based their service on the OAIS reference model while drawing data and metadata from a “wide array of OCLC organizational units” (Lavoie, 2004).

Another application of the conceptual work of the CCSDS (2002) with the OAIS reference model, and Waters and Garrett’s (1996) work with the Task Force on Archiving of Digital Information, is the CCSDS’ development of a “recommended practice” for the “audit and certification of trustworthy digital repositories” (CCSDS, 2011). This work is also based on the development of the requirement for a repository to be “reliable”, “authentic”, have “integrity”, and, be “trustworthy”, as defined in a previous section.

Lavoie (2004) writes that OCLC and the Research Libraries Group (RLG) sponsored an initiative in March 2000 to address the “attributes of a trusted digital repository”. The working group’s charge was “to reach consensus on the characteristics and responsibilities of trusted digital repositories for large-scale, heterogeneous collections held by cultural organizations” (Research Libraries Group, 2002). The purpose of determining these characteristics is to ensure that an OAIS Designated Community will be able to audit a repository and determine whether or not the repository owners have designed it, and are managing it, in such a way that the repository will actually preserve the Designated Community’s data for the indefinite long-term and make it accessible. The RLG/OCLC working group issued their report in 2002. Among the recommendations, the working group specified that a process needed to be developed to certify a digital repository (Research Libraries Group, 2002). Waters and Garrett (1996) had also made this recommendation.

What is a “trusted digital repository”? The working group of the Research Libraries Group (2002) defined it as a repository with “a mission to provide reliable, long-term access to managed digital resources to its designated community, now and into the future”. The NESTOR Working Group on Trusted Repository — Certification (2006) determined that the entire system must be looked at in order to determine whether or not a Designated Community should trust that it will last for the indefinite long-term. This includes its governance; procedures and policies; financial sustainability and fitness; organizational management, including employees; legal liabilities, contracts and licenses under which it operates; plus the trustworthiness of any organization or person who might inherit the data (NESTOR Working Group on Trusted Repository — Certification, 2006; Online Computer Library Center, Inc. & Center for Research Libraries, 2007).

A repository manager must also assess internal and external risks to the repository. Among the many potential risks to a repository’s long-term availability, Rosenthal, et al. (2005) include internal and external attacks; natural disasters; hardware, software and media obsolescence; hardware, software, media, network services, organizational, and economic failure; as well as simple communication errors. Regular audits and re-certification — that is, transparency — are the keys to the long-term survivability of a repository (Online Computer Library Center, Inc. & Center for Research Libraries, 2007).

Researchers and practitioners then set about developing the criteria and checklists for audit and certification. OCLC and the Center for Research Libraries (CRL) developed the “Trustworthy Repositories Audit & Certification: Criteria and Checklist” (2007). The creators called this document “TRAC”, and provided a spreadsheet for practitioners to use that covered the requirements for “organizational infrastructure”, “digital object management”, and “technologies, technical infrastructure, & security”. The researchers with nestor (Network of Expertise in long-term STORage) also created guidelines around these three areas (Dobratz, Schoger, & Strathmann, 2006).

TRAC covered the following policy areas for audit and certification: governance & organizational viability; organizational structure & staffing; procedural accountability & policy framework; financial sustainability; contracts, licenses, & liabilities; ingest, including the creation of the archival package and acquisition of content; preservation planning; archival storage & the preservation and maintenance of AIPs; information and access management; system infrastructure; appropriate technologies; and, security (Online Computer Library Center, Inc. & Center for Research Libraries, 2007).

Ross and McHugh (2006) applied TRAC to examine mechanisms to provide audit and certification services for United Kingdom digital repositories. As part of this work, the researchers developed a toolkit, “Digital Repository Audit Method Based on Risk Assessment” (DRAMBORA). This toolkit is available online so that practitioners may “facilitate internal audit by providing repository administrators with a means to assess their capabilities, identify their weaknesses, and recognize their strengths” (Digital Curation Centre & Digital Preservation Europe, 2007). It is a self-audit that follows the workflow and criteria an external auditor would apply, so that a repository may self-assess prior to going through an external audit and certification. The toolkit provides a methodology by which a digital archivist might assess any risks to the repository she or he manages. While TRAC, DRAMBORA, and nestor are very similar, DRAMBORA provides a “documented understanding of the risks…expressed in terms of probability and impact” and provides “quantifiable insight into the severity of risks faced by repositories” along with a means to document those risks (Digital Curation Centre, 2011). In other words, TRAC is a more informal audit process that provides qualitative output, while DRAMBORA is a more detailed, formal, audit method that provides quantifiable results. The policies and risks covered by the DRAMBORA risk assessment are similar to the ones stated above for nestor and TRAC. The difference, to reiterate, is that the DRAMBORA method provides quantifiable output.
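
To give a sense of what quantifiable, probability-and-impact output might look like, the sketch below scores a small, hypothetical risk register. It is a simplified illustration loosely in the spirit of DRAMBORA, not the actual DRAMBORA worksheet; the scales, risks, and scores are assumptions for the example.

```python
from dataclasses import dataclass

@dataclass
class Risk:
    """One entry in a repository's self-audit risk register (illustrative only)."""
    name: str
    probability: int  # 1 (rare) to 5 (almost certain), an assumed scale
    impact: int       # 1 (negligible) to 5 (catastrophic), an assumed scale

    @property
    def severity(self) -> int:
        # Probability times impact gives a simple, comparable severity score.
        return self.probability * self.impact

risks = [
    Risk("Storage media obsolescence", probability=4, impact=3),
    Risk("Loss of key staff", probability=3, impact=4),
    Risk("Funding shortfall", probability=2, impact=5),
]

# Rank the register so the self-audit surfaces the most severe risks first.
for risk in sorted(risks, key=lambda r: r.severity, reverse=True):
    print(f"{risk.name}: severity {risk.severity}")
```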

The next logical step in the development of an overall standard for the audit and certification of repositories was to merge the concepts and ideas behind TRAC, nestor, and DRAMBORA. Thus, representatives of the Digital Curation Centre (U.K.), DigitalPreservationEurope, NESTOR (Germany), and the Center for Research Libraries (North America) convened at the Chicago, IL, offices of the Center for Research Libraries “to seek consensus on core criteria for digital preservation repositories, to guide further international efforts on auditing and certifying repositories” (Center for Research Libraries, 2007). Dale (2007) compared and contrasted the different methods, and created a matrix that displayed similarities and differences. Based on this matrix and internal discussions, the attendees identified 10 core characteristics of a preservation repository:

  • The repository commits to continuing maintenance of digital objects for identified community/communities.
  • Demonstrates organizational fitness (including financial, staffing, and processes) to fulfill its commitment.
  • Acquires and maintains requisite contractual and legal rights and fulfills responsibilities.
  • Has an effective and efficient policy framework.
  • Acquires and ingests digital objects based upon stated criteria that correspond to its commitments and capabilities.
  • Maintains/ensures the integrity, authenticity and usability of digital objects it holds over time.
  • Creates and maintains requisite metadata about actions taken on digital objects during preservation as well as about the relevant production, access support, and usage process contexts before preservation.
  • Fulfills requisite dissemination requirements.
  • Has a strategic program for preservation planning and action.
  • Has technical infrastructure adequate to continuing maintenance and security of its digital objects (Center for Research Libraries, 2007).

A key idea to come out of this meeting is that preservation activities must scale to the “needs and means of the defined community or set of communities” (Center for Research Libraries, 2007). In other words, some repositories may need to implement more preservation activities, and some may need to implement fewer.

The Consultative Committee for Space Data Systems released the “Magenta Book” version of the “Audit and Certification of Trustworthy Digital Repositories Recommended Practice” in September 2011. This recommendation is the culmination of years of best practice work by researchers and practitioners, beginning with the development of TRAC, DRAMBORA, and nestor, among other projects. The CCSDS then began the process of turning these audit and certification methods into an ISO standard, based primarily on TRAC. The precursor to this ISO standard is the “Recommended Practice” for the “Audit and Certification of Trustworthy Digital Repositories”, currently in release as the “Magenta Book”.

A “Recommended Practice” is not binding to any agency. The purpose of a “Recommended Practice” is to “provide general guidance about how to approach a particular problem associated with space mission support” and to provide a basis on which a community that has a stake in a digital repository may assess the trustworthiness of the repository (Consultative Committee for Space Data Systems, 2011). The CCSDS’ recommendations are aimed at any and all digital repositories. Another way to think of the purpose and scope of the “Recommended Practice” is that it establishes a method for a Designated Community to determine whether or not a repository of interest is actually OAIS-compliant. The following is a summary of this Recommended Practice.

The Recommended Practice covers audit and certification criteria, including defining a “trustworthy digital repository”, an evidence metric (e.g., “examples”) in support of a particular requirement, and related relevant standards, best practices, and controls. The policies required to be trustworthy fall under three primary categories: “Organizational Infrastructure”, “Digital Object Management”, and “Infrastructure and Security Risk Management”. The authors designed the document so that each of those sections follows a similar design.

First, the policy is stated. Second, the “Supporting Text” is presented; this is the “so what?” section. Third, the document provides “Examples of the Ways the Repository Can Demonstrate It Is Meeting This Requirement”. Finally, the authors provide a “Discussion” section that explains the previous three sections in order to remove any possible ambiguity.
So, for example, in section “3 Organizational Infrastructure”, “3.1 Governance and Organizational Viability”, section 3.1.1 states:

3.1.1 The repository shall have a mission statement that reflects a commitment to the preservation of, long-term retention of, management of, and access to digital information.

Supporting Text
This is necessary in order to ensure commitment to preservation, retention, management and access at the repository’s highest administrative level.

Examples of Ways the Repository Can Demonstrate It Is Meeting This Requirement
Mission statement or charter of the repository or its parent organization that specifically addresses or implicitly calls for the preservation of information and/or other resources under its purview; a legal, statutory, or government regulatory mandate applicable to the repository that specifically addresses or implicitly requires the preservation, retention, management and access to information and/or other resources under its purview.

Discussion
The repository’s or its parent organization’s mission statement should explicitly address preservation. If preservation is not among the primary purposes of an organization that houses a digital repository then preservation may not be essential to the organization’s mission. In some instances a repository pursues its preservation mission as an outgrowth of the larger goals of an organization in which it is housed, such as a university or a government agency, and its narrower mission may be formalized through policies explicitly adopted and approved by the larger organization. Government agencies and other organizations may have legal mandates that require they preserve materials, in which case these mandates can be substituted for mission statements, as they define the purpose of the organization. Mission statements should be kept up to date and continue to reflect the common goals and practices for preservation (CCSDS, 2011).

The policy areas covered by the Recommended Practice include: governance and organizational viability; organizational structure and staffing; procedural accountability and preservation policy framework; financial sustainability; contracts, licenses, and liabilities; ingest, including acquisition of content and creation of the AIP; preservation planning; AIP preservation; information management; access management; and, risk management, including technical infrastructure and security. These areas almost exactly replicate the original audit and certification criteria of the TRAC checklist, and they also closely replicate the criteria used in nestor and DRAMBORA.

Applications of the Audit and Certification of Trustworthy Digital Repositories Recommended Practice and the OAIS Reference Model

A complete listing of all projects, repository designs, and organizations that have applied the OAIS reference model and some version of TRAC, DRAMBORA, or nestor is beyond the scope of this literature review. Instead, this section discusses a few examples of applying both TRAC and the OAIS Reference Model.

When Steinhart, Dietrich, and Green (2009) applied the TRAC checklist to a “data staging repository”, they made several observations and conclusions. First, the TRAC checklist is applicable “to the pilot phase of a staging repository”, which is a “transitory curation environment” (Steinhart, Dietrich, & Green, 2009). This meant that TRAC had practical applications beyond digital preservation audit and certification.

For example, the TRAC checklist may be used as an evaluation tool when repository owners want to purchase new repository software. The TRAC checklist may also be used as a standard from which to create machine-actionable rules, per Smith and Moore’s (2006) work on the PLEDGE project. By implementing TRAC policies at the machine-level, the amount of human effort required to enforce a policy is reduced because policy enforcement is built into the system itself (Moore & Smith, 2007).
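
To make the idea of machine-actionable policy concrete, the following is a minimal sketch in Python (not the rule language used by PLEDGE or iRODS; the catalog structure, field names, and two-copy threshold are illustrative assumptions) of how a TRAC-style requirement, such as “every preserved object has a recorded fixity value and a minimum number of copies”, might be expressed as an executable check rather than a written procedure:

# Minimal sketch: a TRAC-style preservation policy expressed as an executable check.
# The catalog structure, field names, and two-copy threshold are illustrative assumptions.

REQUIRED_COPIES = 2  # assumed policy: every object must exist in at least two locations

catalog = [
    {"id": "obj-001", "checksum": "9a0364b9e99bb480dd25e1f0284c8555", "copies": 3},
    {"id": "obj-002", "checksum": None, "copies": 1},
]

def policy_violations(record):
    """Return a list of ways a catalog record fails the stated policy."""
    violations = []
    if not record.get("checksum"):
        violations.append("no recorded fixity value")
    if record.get("copies", 0) < REQUIRED_COPIES:
        violations.append(f"only {record.get('copies', 0)} copies, policy requires {REQUIRED_COPIES}")
    return violations

for record in catalog:
    problems = policy_violations(record)
    status = "FAIL: " + "; ".join(problems) if problems else "OK"
    print(f"{record['id']}: {status}")

Run against a repository’s real catalog, a check of this kind lets the system itself report on compliance, which is the reduction in human effort that Moore and Smith (2007) describe.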

Steinhart, Dietrich, and Green (2009) noted that there seemed to be two applications of TRAC: an audit of the system to satisfy auditors, or an audit of the system to satisfy users of the system (i.e., “the Community of Practice”). Implied in this observation is the idea that few audits seem to be conducted purely for an organization’s internal erudition. Regardless of the purpose for conducting the audit, however, TRAC provides a method for repository owners to identify gaps in an organization’s workflows and policies, as well as the mechanisms (e.g., “knowledge”) for those owners to fill those gaps.

Another example of the application of TRAC to a repository is the audit of the MetaArchive repository. Contractor Matt Schultz conducted an audit of the MetaArchive Cooperative and made the results public. The author reported that the MetaArchive “conforms to all 84 criteria specified by TRAC” and “has undertaken 15 reviews and/or improvements to its documentation and operations as a result of its self-assessment findings” (Educopia Institute, 2010). The organization also made available the spreadsheet containing the results of the audit and certification of the MetaArchive.

A quick skim of titles containing the word “TRAC” in journals such as D-Lib Magazine, JASIST, and other related publications indicates that TRAC has often been used as an assessment tool. What is missing, however, are papers with negative assessments of TRAC, or any negative results from applying TRAC. What is also missing is a formal assessment of whether a top-down approach (e.g., formal established standards) is the most feasible approach, or even the only one. Perhaps a bottom-up approach, in which someone analyzes the policies practitioners actually implement versus those that are recommended, would be useful.

Perhaps the positive reviews of the application of TRAC, of which the two above are only a small sample, indicate that as a Recommended Practice it is, indeed, comprehensive and covers all required bases. Or the positive reviews of applying TRAC may reflect researchers’ and publishers’ biases against publishing negative results, otherwise known as the “file-drawer effect” (Fanelli, 2011). The most likely explanation for the lack of published critical reviews of TRAC and the Recommended Practice, however, is that not enough time has passed to evaluate whether or not following the recommended policies makes any difference to the longevity of a repository.

As stated previously, ICPSR employees have been migrating their social science archive forward since the 1960s, with no standards such as TRAC or the OAIS to follow (Vardigan & Whiteman, 2007). Other repositories disappeared or lost information. Would having international standards in place both for repository design and audit and certification policies really have prevented that kind of loss of information? It is hard to say, as even the authors of the OAIS Reference Model state that the long-term survival of a repository depends on the will and resources of the repository owners and the community of practice (CCSDS, 2002).

Archivists and librarians at Tufts and Yale applied the OAIS Reference Model to electronic records. Specifically, they created an ingest guide to aid in moving electronic records from a “recordkeeping system to a preservation system”. The practitioners designed the guide to describe the actions needed for a “trustworthy” ingest process. The authors used both the OAIS Reference Model and the “Producer-archive Interface Methodology Abstract Standard” (Consultative Committee for Space Data Systems, 2004) as the basis for the guide. According to the archivists and librarians who worked on the project, following the guide should allow “a reasonable person to presume that a record has maintained its level of authenticity during ingest” (Fedora and the Preservation of University Records Project, 2006).

The authors divided the ingest guide into two main sections: “negotiate submission agreement” and “transfer and validation”. The former section covers establishing a relationship with the collection owner, defining the project, assessing the records themselves, and finalizing the submission agreement. The latter section includes creating and transferring Submission Information Packages (SIPs), validation, transformation, metadata, formulating and assessing Archival Information Packages (AIPs), and formal accession. Each section contains an overview, an image of the flow of steps involved in that particular process, and a step-by-step written narrative for each step in the flow. The purpose of the document is not to provide “a detailed manual of procedures”, but to provide “a prescriptive guide for a trustworthy ingest process” (Fedora and the Preservation of University Records Project, 2006).

A different kind of application of both TRAC and the OAIS is to build or use a “trusted digital repository” to create “persistent collections” in a “persistent archive” (Moore, 2004). Some of these solutions are based on digital library systems such as DSpace and Fedora; other solutions include data grids such as the Storage Resource Broker (SRB) and the integrated Rule-Oriented Data System (iRODS) (Moore, 2005; Moore, Rajasekar, & Marciano, 2007). One unique aspect of iRODS is that preservation policies outlined in TRAC may be implemented at the machine level, in the code, via the use of rules. Rajasekar et al. (2006) call this “policy virtualization”.

For example, the following “human language example” regarding “Chain of Custody” from the Audit and Certification of Trustworthy Digital Repositories Recommended Practice (CCSDS, 2011):

5.1.2 The repository shall manage the number and location of copies of all digital objects.
This is necessary in order to assert that the repository is providing an authentic copy of a particular digital object.

may be written in machine language in iRODS v.3.0 as:

myTestRule {
  # Input parameters are:
  #   Object identifier
  #   Buffer for results
  # Output parameter is:
  #   Status
  msiSplitPath(*Path, *Coll, *File);   # split the logical path into collection and data object name
  msiExecStrCondQuery("SELECT DATA_ID where COLL_NAME = '*Coll' and DATA_NAME = '*File'", *QOut);
  foreach(*QOut) {
    msiGetValByKey(*QOut, "DATA_ID", *Objid);               # extract the object identifier from the query result
    msiGetAuditTrailInfoByObjectID(*Objid, *Buf, *Status);  # retrieve the audit trail for that object
    writeBytesBuf("stdout", *Buf);                          # write the audit trail to standard output
  }
}
INPUT *Path="/tempZone/home/rods/sub1/foo1"
OUTPUT ruleExecOut

This type of policy virtualization is the method by which the researchers who created iRODS implemented the OAIS Reference Model recommendations within the system architecture itself (Ward, de Torcy, Chua, & Crabtree, 2009).

Other Technical Issues

The other end of the digital curation spectrum from the OAIS Reference Model and the Audit and Certification of Trustworthy Repositories is bit-level preservation. Moore (2002) wrote, “the challenge in digital archiving and preservation is not the management of the bits comprising the digital entities, but the maintenance of the infrastructure required to manipulate and display the image of reality that the digital entity represents”. Lynch (2000) also writes that infrastructure is key. However, bit-level preservation underlies everything else: preserving the bits requires preserving the media that contain them, which requires preserving (or replacing) the software and hardware that read that media, which in turn requires networked infrastructure. Bit management must therefore be addressed.

A bit is a “binary digit”. A “binary digit” is either a one or a zero in a binary system of notation (Binary digit, 2011). Chunks of bits make up a byte. Rothenberg (1999) writes that bytes may be of any length, but an 8-bit byte provides considerably more freedom to encode upper and lower case characters, punctuation, digits, control characters, and graphical elements. In very simple terms, to read a bit stream, the computer hardware must retrieve it from the media it is stored on (e.g., flash drive, CD, DVD, computer hard drive, etc.) and interpret it via software that is designed to render bits stored in that format (e.g., .pdf, .doc, .jpg, etc.).
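
The following minimal sketch in Python illustrates this last step (the file path is a placeholder; the signatures shown are the standard published “magic numbers” for a few common formats): software inspects the first few bytes of a bit stream and uses them to decide how the rest should be interpreted.

# Minimal sketch: guess a file's format from its leading bytes ("magic numbers").
# The path is a placeholder; the signatures are the published ones for each format.

SIGNATURES = {
    b"%PDF": "PDF document",
    b"\xff\xd8\xff": "JPEG image",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"PK\x03\x04": "ZIP container (also used by .docx and .epub)",
}

def identify(path):
    """Read the first bytes of a file and match them against known signatures."""
    with open(path, "rb") as f:
        header = f.read(8)
    for signature, name in SIGNATURES.items():
        if header.startswith(signature):
            return name
    return "unknown format"

print(identify("example.bin"))  # placeholder path

If no signature matches and no documentation of the format survives, the bit stream is in exactly the “unrenderable” position described below.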

If the bits become corrupted, then the content is unrenderable. If the media storage device deteriorates, the content is unrenderable. If the software and hardware needed to read and render the file format are unavailable, the content is unrenderable. If the file format is unknown, then the content is unrenderable by any machine or available software. Thus, when a repository owner designs a preservation system to provide access to content for the indefinite long-term, a decision must be made regarding migrating, refreshing, replicating, or emulating the file format, software, and hardware used to store and render the contents of a digital object.

Waters and Garrett (1996) describe migration as the transfer of data to a new operating system, programming code, or file format. The advantage of this method is that it keeps the data current with technological changes. The disadvantage is that the rendering of the content may change, so that the representation differs in some way from the original (Rothenberg, 1999). In most instances this is likely not to matter, but in some instances it could be important. One way around this is to save the original files, migrate copies of those files to the new format/operating system/programming language, and then store the originals with the copies. The disadvantage, however, is that one must also save the hardware and software needed to read the original files, which negates the advantages inherent in migration. Preservationists prefer migration to refreshing because it better retains the ability to retrieve, display, and otherwise access the data (Research Libraries Group, 1996).
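
A minimal sketch of the “migrate a copy, keep the original” approach might look like the following (Python; the convert_format function is a hypothetical stand-in for whatever conversion tool a repository actually uses, and the file names are placeholders). The point of the sketch is that the original is never overwritten and that checksums of both versions are captured as simple provenance:

# Minimal sketch: migrate a copy of a file while retaining the original,
# recording checksums of both versions as simple provenance.
import hashlib
import shutil
from pathlib import Path

def sha256(path):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def convert_format(source, target):
    """Hypothetical stand-in for a real format converter; here it only copies the bytes."""
    shutil.copy(source, target)

def migrate(original, migrated):
    """Migrate a copy, keep the original, and return a simple provenance record."""
    record = {"original": str(original), "original_sha256": sha256(original)}
    convert_format(original, migrated)
    record["migrated"] = str(migrated)
    record["migrated_sha256"] = sha256(migrated)
    return record  # in practice, written to the object's preservation metadata

# Placeholder input so the sketch is self-contained.
original = Path("report_1998.wpd")
original.write_bytes(b"placeholder content standing in for a legacy file")
print(migrate(original, Path("report_1998.pdf")))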

Archivists “refresh” data by copying data from old media onto new media, in an effort to stave off the effects of media deterioration. However, this preservation method only works so long as the data and information are “encoded in a format that is independent of the particular hardware and software needed to use it and as long as there exists software to manipulate the format in current use” (Waters & Garrett, 1996). That is, the software and hardware used to read the information on the media must be backwards compatible and interoperable with different file formats, hardware, and software.

Rothenberg (1999) proposes emulation as the best solution to preservation. He defines emulation as a new system that replicates the functionality of a now-obsolete system, providing the user with the data, information, and functionality of the original system. Rothenberg writes that emulators may be built for hardware platforms, applications, and/or operating systems. However, emulation is expensive, as replicating the original system and actually providing all of its functionality requires a great deal of human, financial, and time resources. Oltmans and Kol (2005) compared migration and emulation and concluded that emulation is more cost effective because it preserves the collection in its entirety. One could argue that preservationists would be better off simply maintaining the original system in the first place; however, few organizations or people have the resources to maintain that amount of hardware and software. Video game aficionados prefer to use emulators; otherwise, migration has been the method of choice for curators and preservationists of data.

Repository owners use replication as a way to back up data in multiple locations, preferably not in the same geographic or physical space. This helps prevent the accidental and permanent loss of data. If there is a fire, a flood, or a malicious act by some person to destroy the data, replication ensures that there are still copies of the data stored in a format such that a full restore is possible. Generally, repository managers create two replicas of the data. Often, this is done on a reciprocal basis, so that one repository owner stores backup data for another organization, and vice versa. One challenge to replication is ensuring that all data stored in all locations are synchronized, so that additions, deletions, updates, and so forth propagate and the data in one location “matches” the data stored in the other locations. The repository systems administrators must check the data on a regular basis to ensure its continued integrity, via tools such as fixity checks, access controls, and other data integrity techniques and mechanisms (Sivathanu, Wright, & Zadok, 2005). Software such as LOCKSS (“Lots of Copies Keep Stuff Safe”) and data grid “middleware” such as iRODS provide repository owners with proven technology to aid in the replication of their data (Moore & Merzky, 2003; Moore, 2004; Moore, 2006). Organizations such as Data-PASS (“The Data Preservation Alliance for the Social Sciences”) help their members replicate and preserve social science data by creating a common technical mechanism for data sharing/replication.
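
As a minimal sketch of this kind of integrity checking (Python; the two directory paths stand in for separate storage locations, and a production repository would drive the comparison from its catalog rather than a directory walk), the checksums of each replica can be compared so that missing or silently altered copies are detected:

# Minimal sketch: verify that the copy of each file held in a second storage
# location still matches the primary copy, using SHA-256 checksums (a fixity check).
import hashlib
from pathlib import Path

def sha256(path):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def compare_replicas(primary_dir, replica_dir):
    """Report files whose replica is missing or whose checksum no longer matches."""
    primary, replica = Path(primary_dir), Path(replica_dir)
    for path in primary.rglob("*"):
        if not path.is_file():
            continue
        relative = path.relative_to(primary)
        copy = replica / relative
        if not copy.exists():
            print(f"{relative}: MISSING from replica location")
        elif sha256(path) != sha256(copy):
            print(f"{relative}: CHECKSUM MISMATCH (possible corruption)")
        else:
            print(f"{relative}: OK")

compare_replicas("/archive/site_a", "/archive/site_b")  # placeholder storage locations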

General Digital Repository Management

How does an archivist, librarian, or other technologist manage a preservation digital repository? The same way personnel manage a non-preservation digital repository (Lavoie & Dempsey, 2004). Material must be selected, and ingested or digitized if it was not born digital. Metadata must be created, or the quality of the metadata must be checked prior to ingest and possibly augmented if it does not meet the repository owner’s quality standards (Lavoie & Gartner, 2005; Shreeves, et al., 2005; Jackson, et al., 2008; Ward, 2004). The digitization and/or ingest project must be managed, and risks to the repository must be identified and solutions created (Lawrence, et al., 2000). Intellectual property and copyright to the data must be established and enforced, both internally and with the Community of Interest (National Initiative for a Networked Cultural Heritage, 2002). Lee, Tibbo, & Schaefer (2007) note that the manager of the repository also must hire trained personnel with the appropriate skill sets to create, manage, preserve, and curate the repository.
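
As an illustration of the metadata quality check mentioned above, the following minimal sketch in Python flags ingest candidates that are missing required descriptive fields (the field list and the record structure are illustrative assumptions, not a published application profile):

# Minimal sketch: flag ingest candidates whose descriptive metadata is missing
# required fields. The field list and record structure are illustrative assumptions.

REQUIRED_FIELDS = ["title", "creator", "date", "identifier"]

candidates = [
    {"title": "Field notes, 1972", "creator": "J. Smith", "date": "1972", "identifier": "uid-001"},
    {"title": "Survey images", "creator": "", "date": "2003"},
]

def missing_fields(record):
    """Return the required fields that are absent or empty in a metadata record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]

for record in candidates:
    gaps = missing_fields(record)
    label = record.get("identifier") or record.get("title", "unnamed record")
    if gaps:
        print(f"{label}: needs review, missing {', '.join(gaps)}")
    else:
        print(f"{label}: metadata complete, ready for ingest")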

Funding

Last, but not least, the repository manager and the Community of Interest must ensure funding is available to maintain the repository over the indefinite long-term. Should this funding fall short, the repository manager must ensure that a backup organization is prepared to take over management of the repository if the “owning” organization no longer exists (Waters & Garrett, 1996).

Both the members of the Academy of Motion Picture Arts and Sciences’ Science and Technology Council (2007) and Waters & Garrett (1996) examined the cost factors of preserving digital information over time. The Council members estimated the costs of preserving digital video versus film masters, and Waters & Garrett examined digital book versus paper book preservation and storage. Both groups reached the same conclusion: the curation and preservation of digital material is far more expensive than preserving and maintaining film or paper books over time. The Council determined that it would cost 1,100% more to store digital movie masters for 100 years than to store film masters for the same time period. Waters & Garrett’s (1996) cost model indicated that “storage costs…are 12 times higher for a digital archives composed of texts in image form, and the access costs are 50% higher” than for the same material as books. Chapman (2003) pondered the storage affordability question and concluded that the final costs are variable. He wrote that the true costs depend on the services provided around the repository, the type and amount of content, the choice of repository software, and the type of storage chosen (“dark archive”, publicly accessible, etc.).

Regardless, the final conclusion is that digital curation and preservation is not cheap. The members of the Blue Ribbon Task Force on Sustainable Digital Preservation and Access (2008) noted that “there is no general agreement” as to “who is responsible…and who should pay for the access to, and preservation of, valuable present and future digital information”.

Current Status and Future Challenges/Further Work

Librarians, archivists, computer scientists, and other researchers are currently immersed in figuring out the “data deluge”. How big is this deluge? It is hard to measure, but IDC estimates that the amount of data created and stored has been growing at a compound annual rate of almost 60%, up from the 180 exabytes that existed in 2006 (Mearian, 2008). IDC further estimates that by 2011, “there will be 1,800 exabytes of electronic data in existence”, a tenfold increase over five years that is consistent with a growth rate of roughly 60% per year. If those numbers are correct, Mearian (2008) writes that as of 2011 the number of bits stored exceeds the number of stars in the sky.

That is a lot of data.

Digital preservationists and domain scientists are now focusing their attention on access to research data, specifically, the preservation of research data sets from which research conclusions are drawn. The Committee on Ensuring the Utility and Integrity of Research Data in the Digital Age (2009), the National Science Foundation (2005), and other individuals and groups have drawn attention to the need to steward data for use and re-use by other researchers. As one part of this, members of these two organizations have recommended creating formal standards and strategies for data stewardship.

The editors of the journal Nature have participated in this effort by drawing attention to the perils and advantages of data sharing (Butler, 2007a; Butler, 2007b; Nelson, 2009) and data neglect (Editor, 2009). The editors of Science have also followed suit, examining data sharing and data restoration (Curry, 2011; Hanson, Sugden, & Alberts, 2011). Over the course of the past year, both the National Science Foundation and the National Institutes of Health have required grant applicants to provide data management plans as part of the application process. One can only wonder how well researchers’ data management plans conform to established best practice recommendations for the preservation of data, such as the OAIS Reference Model and the Audit and Certification of Trustworthy Digital Repositories Recommended Practice.

The logic behind the interest in preserving, accessing, and sharing data sets is twofold: to ensure that science can be replicated (it cannot be replicated if the original data set is lost or unavailable), and to ensure that taxpayers receive the full benefits of their investment in research by allowing other researchers access to data generated with taxpayer money. If stakeholders wish to share data, then the data must be stewarded from the moment they are gathered, through the initial research, and beyond the dissemination of any results, including storage of the data set(s). The data must also be stored for the indefinite long-term, should a future researcher wish to access the data set(s).

Practitioners’ initial research into this area indicates that some kind of institutional support, in the form of data centers where researchers may store and share their data, may be required in some instances (Beagrie, Beagrie, & Rowlands, 2009; Research Information Network, 2011). Walters and Skinner (2011) advocate a new role for librarians and archivists, that of data curator. Their recommendation is that academic and research librarians should provide curatorial guidance with regards to digital content. They write that librarians and archivists should go to the researchers, rather than wait for the researchers to come to them for advice. Many academic and research libraries and archives now offer research data management advice, and tools such as a “data curation toolkit” aid in interviewing researchers about their data curation requirements (Witt, Carlson, & Brandt, 2009).

Conclusion

The problem of preserving data, information, knowledge, and wisdom is not a new problem. Whether the medium is clay tablets, papyrus, books, digital data, or some other format, the people interested in preserving the cultural, research, and other heritage of our world have faced challenges of one sort or another. Some data have been preserved for centuries; other data have been unnecessarily lost. War, weather, politics, fire, and other factors have destroyed valuable information objects in every century. The value of the data to one or more individuals is a major factor that leads to its curation and long-term survivability. The ability of its owners and users to fund its preservation is equally important.

Librarians’, archivists’, and computer scientists’ establishment of standards for digital preservation and curation aids in the survivability of this data, but does not “cause” it. What has changed over time is the type of data preserved and the method for doing so. What has not changed over the millennia is that the preservation and curation of objects is neither guaranteed nor cheap.


References

Ackoff, R.L. (1989). From data to wisdom. Journal of Applied Systems Analysis, 16(1), 3-9.

Beagrie, N., Beagrie, R., & Rowlands, I. (2009). Research data preservation and access: the views of researchers. Ariadne, 60. Retrieved August 18, 2009, from http://www.ariadne.ac.uk/issue60/beagrie-et-al/

Berners-Lee, T. (1998). Cool URIs don’t change. W3C. Retrieved July 15, 2008, from http://www.w3.org/Provider/Style/URI.html

Binary digit. (2011). Google.com. Retrieved December 13, 2011, from http://www.google.com/search?client=safari&rls=en&q=define:+binary+digit&ie=UTF-8&oe=UTF-8

Blue Ribbon Task Force on Sustainable Digital Preservation and Access. (2008, December). Sustaining the digital investment: issues and challenges of economically sustainable digital preservation. San Diego, CA: San Diego Supercomputer Center. Retrieved January 24, 2009, from http://brtf.sdsc.edu/biblio/BRTF_Interim_Report.pdf

Brody, T. (2000). Mining the social life of an ePrint archive. Retrieved September 16, 2001, from the University of Southampton, OpCit Project Web site: http://opcit.eprints.org/tdb198/opcit/q2/

Buckland, M.K. (1991). Information as thing. Journal of the American Society for Information Science, 42(5), 351-360.

Butler, D. (2007a). Agencies join forces to share data. Nature, 446, 354.

Butler, D. (2007b). Data sharing: the next generation. Nature, 446, 10-11.

Carr, L. (1999). Metadata changes to XXX papers in a three month period. Retrieved October 13, 2001, from the University of Southampton, Electronics and Computer Science Web site: http://users.ecs.soton.ac.uk/lac/XXXmetadatadeltas.html

Center for Research Libraries (2007). Ten principles. Retrieved December 8, 2011, from http://www.crl.edu/archiving-preservation/digital-archives/metrics-assessing-and-certifying/core-re

Committee on Ensuring the Utility and Integrity of Research Data in the Digital Age; National Academy of Sciences. (2009). Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age. Executive Summary. Washington, DC: the National Academies Press. Retrieved January 7, 2009, from http://www.nap.edu/catalog.php?record_id=12615

CCSDS. (2002). Reference model for an Open Archival Information System (OAIS) (CCSDS 650.0-B-1). Washington, DC: National Aeronautics and Space Administration (NASA). Retrieved April 3, 2007, from http://nost.gsfc.nasa.gov/isoas/

CCSDS. (2004). Producer-archive interface methodology abstract standard (CCSDS 651.0-B-1). Washington, DC: National Aeronautics and Space Administration (NASA). Retrieved August 18, 2007, from http://public.ccsds.org/publications/archive/651x0b1.pdf

CCSDS. (2011). Audit and Certification of Trustworthy Digital Repositories (CCSDS 652.0-M-1). Magenta Book, September 2011. Washington, DC: National Aeronautics and Space Administration (NASA).

Curry, A. (2011). Rescue of Old Data Offers Lesson for Particle Physicists. Science, 331, 694-695.

Dale, R. (2007). Mapping of audit & certification criteria for CRL meeting (15-16 January 2007). Retrieved September 11, 2007, from http://wiki.digitalrepositoryauditandcertification.org/pub/Main/ReferenceInputDocuments/TRAC-Nestor-DCC-criteria_mapping.doc

Davis, J. R. & Lagoze, C. (2000). NCSTRL: design and deployment of a globally distributed digital library. Journal of the American Society for Information Science, 51(3), 273-280.

Digital Curation Centre. (2011). DRAMBORA. Retrieved December 9, 2011, from http://www.dcc.ac.uk/resources/tools-and-applications/drambora

Digital Curation Centre. (2010). What is digital curation? Retrieved November 6, 2011, from http://www.dcc.ac.uk/digital-curation/what-digital-curation

Digital Curation Centre & Digital Preservation Europe. (2007). DCC and DPE digital repository audit method based on risk assessment (DRAMBORA). Retrieved August 1, 2007, from http://www.repositoryaudit.eu/download

Digital Preservation. (2009). Introduction – definitions and concepts. Digital Preservation Coalition. Retrieved November 6, 2011, from http://dpconline.org/advice/preservationhandbook/introduction/definitions-and-concepts

Dobratz, S., Schoger, A., & Strathmann, S. (2006). The nestor Catalogue of Criteria for Trusted Digital Repository Evaluation and Certification. Paper presented at the workshop on “digital curation & trusted repositories: seeking success”, held in conjunction with the ACM/IEEE Joint Conference on Digital Libraries, June 11-15, 2006, Chapel Hill, NC, USA. Retrieved December 1, 2011, from http://www.ils.unc.edu/tibbo/JCDL2006/Dobratz-JCDLWorkshop2006.pdf

Duranti, L. (1995). Reliability and authenticity: the concepts and their implications. Archivaria, 39 (Spring), 5-10.

Editor. (2009). Data’s shameful neglect. Nature, 461, 145.

Educopia Institute. (2010, April). Metaarchive cooperative TRAC audit checklist. Prepared by M. Schultz. Atlanta, GA: Educopia Institute. Retrieved December 10, 2010 from http://www.metaarchive.org/sites/default/files/MetaArchive_TRAC_Checklist.pdf

Egger, A. (2006). Shortcomings of the Reference Model for an Open Archival Information System (OAIS). IEEE TCDL Bulletin, 2(2). Retrieved October 23, 2009, from http://www.ieee-tcdl.org/Bulletin/v2n2/egger/egger.html

Fedora and the Preservation of University Records Project. (2006). 2.1 Ingest Guide, Version 1.0 (tufts:central:dca:UA069:UA069.004.001.00006). Retrieved April 16, 2009, from the Tufts University, Digital Collections and Archives, Tufts Digital Library Web site: http://repository01.lib.tufts.edu:8080/fedora/get/tufts:UA069.004.001.00006/bdef:TuftsPDF/getPDF

Fanelli, D. (2011). Negative results are disappearing from most disciplines and countries. Scientometrics, September 2011, 1-14.

Galloway, P. (2004). Preservation of digital objects. In B. Cronin (Ed.), Annual Review of Information Science and Technology, 38(1), (pp. 549-590).

Ginsparg, P. (2011). ArXiv at 20. Nature, 476, 145-147.

Hanson, B., Sugden, A., & Alberts, B. (2011). Making Data Maximally Available. Science, 331, 649.

Hedstrom, M. (1995). Electronic archives: integrity and access in the network environment. American Archivist, 58(3), 312-324.

Higgens, S. (2007). Draft DCC curation lifecycle model. International Journal of Digital Curation, 2(2). Retrieved March 22, 2008, from http://www.ijdc.net/index.php/ijdc/article/view/46

InterPARES. (2001). The long-term preservation of authentic electronic records: findings of the InterPARES project. Retrieved October 5, 2007, from http://www.interpares.org/ip1/ip1_index.cfm

Jackson, A. S., Han, M., Groetsch, K., Mustafoff, M., & Cole, T. W. (2008). Dublin Core metadata harvested through the OAI-PMH (pre-print). Journal of Library Metadata, 8(1).

Koehler, W. (1999). An analysis of web page and web site constancy and permanence. Journal of the American Society for Information Science, 50(2), 162-180.

Koehler, W. (2004). A longitudinal study of web pages continued: a consideration of document persistence. Information Research, 9(2).

Krasner-Khait, B. (2001). Survivor: the history of the library. History Magazine, October/November 2011. Retrieved August 30, 2011, from http://www.history-magazine.com/libraries.html

Kunze, J. (2003). Towards electronic persistence using ARK identifiers. Retrieved July 10, 2008, from the University of California, California Digital Library, Inside CDL Web site: http://www.cdlib.org/inside/diglib/ark/arkcdl.pdf

Lagoze, C. and Van de Sompel, H. (2001). The Open Archives Initiative: building a low-barrier interoperability framework. In Proceedings of the First ACM/IEEE-CS Joint Conference on Digital Libraries, June 24-28, 2001, Roanoke, VA. pp. 54-62.

Lavoie, B. (2004). The open archival information system reference model: introductory guide. Technology Watch Report. Dublin, OH: Digital Preservation Coalition. Retrieved March 6, 2007, http://www.dpconline.org/docs/lavoie_OAIS.pdf

Lavoie, B. & Dempsey, L. (2004). Thirteen ways of looking at…digital preservation. D-Lib Magazine, 10(7/8). Retrieved May 7, 2007, from http://www.dlib.org/dlib/july04/lavoie/07lavoie.html

Lavoie, B. & Gartner, R. (2005). Preservation metadata. Technology Watch Report. Dublin, OH: Digital Preservation Coalition. Retrieved June 20, 2009, http://www.dpconline.org/docs/reports/dpctw05-01.pdf

Lawrence, G.W., Kehoe, W.R., Rieger, O.Y., Walters, W.H., & Kenney, A.R. (2000). Risk management of digital information: a file format investigation. Washington, DC: Council on Library and Information Resources. Retrieved October 22, 2007, from http://www.clir.org/pubs/reports/pub93/contents.html

Lee, C. (2010). Open archival information system (OAIS) reference model. In Encyclopedia of Library and Information Sciences, Third Edition. London: Taylor & Francis.

Lee, C., Tibbo, H.R., & Schaefer, J.C. (2007). Defining what digital curators do and what they need to know: The DigCCurr Project. In Proceedings of the 2007 ACM/IEEE Joint Conference on Digital Libraries, 49-50.

Lynch, C. A. (1994). The integrity of digital information: mechanics and definitional issues. Journal of the American Society for Information Science, 45(10), 737-744.

Lynch, C. (2000). Authenticity and integrity in the digital environment: an exploratory analysis of the central role of trust. Authenticity in a digital environment. Washington, DC: Council in Library and Information Resources. Retrieved April 14, 2009, from http://www.clir.org/pubs/reports/pub92/pub92.pdf

McCown, F., Chan, S., Nelson, M.L., & Bollen, J. (2005). The availability and persistence of Web references in D-Lib Magazine. Paper presented at the 5th International Web Archiving Workshop (IWAW05), Vienna, Austria. Retrieved July 14, 2008, from http://arxiv.org/abs/cs.OH/0511077

Mearian, L. (2008). Study: digital universe and its impact bigger than we thought. Computerworld, March 11, 2008. Retrieved March 14, 2008, from http://www.computerworld.com/action/article.do?command=viewArticleBasic&articleId=9067639

Moore, R. (2002). The preservation of data, information, and knowledge. In Proceedings of the World Library Summit, April 24-26, 2002, Singapore. Retrieved April 1, 2009, from http://www.sdsc.edu/NARA/Publications/Web/moore-rw.doc

Moore, R. (2004). Evolution of data grid concepts. Paper presented at the workshop on “data” at the 10th Global Grid Forum, Berlin, Germany, March 9-13, 2004. Retrieved March 23, 2009, from http://www.npaci.edu/DICE/Pubs/Grid-evolution.doc

Moore, R.W. (2004). Preservation Environments. In Proceedings of the NASA/IEEE MSST 2004 Twelfth NASA Goddard Conference on Mass Storage Systems and Technologies in cooperation with the Twenty-First IEEE Conference on Mass Storage Systems and Technologies (MSST 2004), April 13-16, 2004, Adelphi, Maryland, USA. Retrieved September 26, 2010, from http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/20040121020_2004117345.pdf

Moore, R. (2005). Persistent collections. In S.H. Kostow & S. Subramaniam (Eds.), Databasing the brain: from data to knowledge (neuroinformatics) (pp. 69-82). Hoboken, NJ: John Wiley and Sons.

Moore, R. (2006). Building preservation environments with data grid technology. American Archivist, 69(1), 139-158.

Moore, R. & Merzky, A. (2003). Persistent archive concepts. Paper presented at the 7th Global Grid Forum, Tokyo, Japan, March 4-7, 2003. Retrieved March 4, 2009, from http://www.npaci.edu/DICE/Pubs/Data-PAWG-PA.doc

Moore, R., Rajasekar, A., & Marciano, R. (2007). Implementing Trusted Digital Repositories. In Proceedings of the DigCCurr2007 International Symposium in Digital Curation, University of North Carolina – Chapel Hill, Chapel Hill, NC USA, 2007. Retrieved September 24, 2010, from http://www.ils.unc.edu/digccurr2007/papers/moore_paper_6-4.pdf

Moore, R. & Smith, M. (2007). Automated Validation of Trusted Digital Repository Assessment Criteria. Journal of Digital Information, 8(2). Retrieved March 2, 2010, from http://journals.tdl.org/jodi/article/view/198/181

National Initiative for a Networked Cultural Heritage. (2002). Rights management. In the NINCH guide to good practice in the digital representation and management of cultural heritage materials, v.1.0. Glasgow: University of Glasgow (HATII) & NINCH. Retrieved April 17, 2009, from http://www.nyu.edu/its/humanities/ninchguide/IV/

National Science Foundation. (2005). Long-lived digital data collections enabling research and education in the 21st century (NSB-05-40). Arlington, VA: National Science Foundation. Retrieved May 5, 2008, from http://www.nsf.gov/pubs/2005/nsb0540/

Nelson, B. (2009). Data sharing: empty archives. Nature, 461, 160-163.

Nelson, M.L. (2000). Buckets: Smart Objects for Digital Libraries (Doctoral Dissertation). Retrieved December 14, 2011, from http://www.cs.odu.edu/~mln/phd/

Nelson, M.L., & Allen, B.D. (2002). Object persistence and availability in digital libraries. D-Lib Magazine, 8(1). Retrieved July 18, 2007, from http://www.dlib.org/dlib/january02/nelson/01nelson.html

NESTOR Working Group on Trusted Repository — Certification. (2006). Catalog of criteria for trusted digital repositories version 1 draft for public comment (urn:nbn:de:0008-2006060703). Berlin: nestor Working Group — Certification. Retrieved April 14, 2009, http://edoc.hu-berlin.de/series/nestor-materialien/8en/PDF/8en.pdf

Oltmans, E. & Kol, N. (2005). A comparison between migration and emulation in terms of costs. RLG DigiNews 9(2). Retrieved September 10, 2007, from http://worldcat.org/arcviewer/1/OCC/2007/08/08/0000070519/viewer/file959.html

Online Computer Library Center, Inc. & Center for Research Libraries. (2007). Trustworthy repositories audit & certification: criteria and checklist version 1.0. Dublin, OH & Chicago, IL: OCLC & CRL. Retrieved September 11, 2007, from http://www.crl.edu/sites/default/files/attachments/pages/trac_0.pdf

O’Toole, J.M. (1989). On the idea of permanence. American Archivist, 52, 10-25.

Paskin, N. (2003). DOI: A 2003 progress report. D-Lib Magazine, 9(6). Retrieved July 9, 2008, from http://www.dlib.org/dlib/june03/paskin/06paskin.html

Rajasekar, A., Wan, M., Moore, R., & Schroeder, W. (2006). A prototype rule-based distributed data management system. Paper presented at a workshop on “next generation distributed data management” at the High Performance Distributed Computing Conference, June 19-23, 2006, Paris, France.

Research Information Network. (2011). Data centres: their use, value, and impact. A Research Information Network report. London, UK: JISC, September 2011.

Research Libraries Group. (1996). Preserving digital information report of the task force on archiving of digital information. Final report of the Task Force on Archiving of Digital Information commissioned by the Commission on Preservation and Access and the Research Libraries Group. Retrieved September 24, 2007, from http://www.eric.ed.gov/ERICWebPortal/custom/portlets/recordDetails/detailmini.jsp?nfpb=true&&ERICExtSearch_SearchValue_0=ED395602&ERICExtSearch_SearchType_0=eric_accno&accno=ED395602

Research Libraries Group. (2002). Trusted digital repositories: attributes and responsibilities an RLG-OCLC report. Mountain View, CA: Research Libraries Group. Retrieved September 11, 2007, from http://www.oclc.org/programs/ourwork/past/trustedrep/repositories.pdf

Rosenthal, D.S.H., Robertson, T., Lipkis, T., Reich, V., Morabito, S. (2005). Requirements for digital preservation systems a bottom-up approach. D-Lib Magazine, 11(11). Retrieved August 11, 2007, from http://www.dlib.org/dlib/november05/rosenthal/11rosenthal.html

Ross, S. & McHugh, A. (2006). The role of evidence in establishing trust in repositories. D-Lib Magazine 12(7/8). Retrieved May 6, 2007, from http://www.dlib.org/dlib/july06/ross/07ross.html

Rothenberg, J. (1999). Avoiding technological quicksand: finding a viable technical foundation for digital preservation (pub 77). A report to the Council on Library and Information Resources. Washington, DC: Council on Library and Information Resources. Retrieved April 16, 2009, from http://www.clir.org/pubs/reports/rothenberg/pub77.pdf

Rothenberg, J. (1999). Ensuring the longevity of digital information. Washington, DC: Council on Library and Information Resources. Retrieved April 16, 2009, from http://www.clir.org/pubs/archives/ensuring.pdf

Society of American Archivists. (1999). Core Archival Functions. Guidelines for College and University Archives. Prepared by the College and University Archives Section of the Society of American Archivists (SAA). Retrieved May 26, 2010, from http://www.archivists.org/governance/guidelines/cu_guidelines4.asp

Science and Technology Council. (2007). The digital dilemma strategic issues in archiving and accessing digital motion picture materials. The Science and Technology Council of the Academy of Motion Picture Arts and Sciences. Hollywood, CA: Academy of Motion Picture Arts and Sciences.

Shreeves, S. L., Knutson, E. M., Stvilia, B., Palmer, C. L., Twidale, M. B., & Cole, T. W. (2005). Is ‘quality’ metadata ‘shareable’ metadata? The implications of local metadata practices for federated collections. In Proceedings of the Twelfth National Conference of the Association of College and Research Libraries, April 7-10 2005, Minneapolis, MN, 223-237.

Sivathanu, G., Wright, C.P., & Zadok, E. (2005). Ensuring data integrity in storage: techniques and applications. In Proceedings of the first ACM International Workshop on Storage Security and Survivability (StorageSS 05), held in conjunction with the 12th ACM Conference on Computer and Communications Security (CCS 2005), November 7-11, 2005, Alexandria, VA. Retrieved October 4, 2007, from http://www.fsl.cs.sunysb.edu/docs/integrity-storagess05/integrity.html

Smith, M. & Moore, R. (2006). Digital Archive Policies and Trusted Digital Repositories. Paper presented at the 2nd International Digital Curation Conference, November 21 – 22, 2006, Glasgow, Scotland. Retrieved November 2, 2009, from http://pledge.mit.edu/images/6/6f/Smith-Moore-DCC-Nov-2006.pdf

Steinhart, G., Dietrich, D., & Green, A. (2009). Establishing trust in a chain of preservation the TRAC checklist applied to a data staging repository (DataStaR). D-Lib Magazine 15(9/10). Retrieved October 13, 2009 from http://www.dlib.org/dlib/september09/steinhart/09steinhart.html

Thibodeau, K. (2002). Overview of technological approaches to digital preservation and challenges in coming years. In Proceedings of the State of Digital Preservation: An International Perspective, at the Institutes for Information Science, April 24-25, 2002, Washington, DC. Retrieved September 26, 2007 from http://www.clir.org/pubs/reports/pub107/thibodeau.html

Thibodeau, K. (2007). The Electronic Records Archives Program at the National Archives and Records Administration. First Monday, 12(7). Retrieved January 15, 2009 from http://firstmonday.org/issues/issue12_7/thibodeau/index.html

Tibbo, H.R. (2003). On the nature and importance of archiving in the digital age. Advances in Computers, 57, 1-67.

URI Planning Interest Group. (2001). URIs, URLs, and URNs: Clarifications and Recommendations 1.0. Report from the joint W3C/IETF URI Planning Interest Group, W3C Note, 21 September 2001. Retrieved November, 8, 2011, from http://www.w3.org/TR/uri-clarification/

Vardigan, M. & Whiteman, C. (2007). ICPSR meets OAIS: applying the OAIS reference model to the social science archive context. Archival Science, 7(1). Netherlands: Springer. Retrieved February 20, 2008, from http://www.springerlink.com/content/50746212r6g21326/

Walters, T. & Skinner, K. (2011). New roles for new times: digital curation for preservation. Report prepared for the Association of Research Libraries. Washington, D.C.: Association of Research Libraries. Retrieved April 2, 2011, from http://www.arl.org/bm~doc/nrnt_digital_curation17mar11.pdf.

Ward, J. (2004). Unqualified Dublin Core usage in OAI-PMH Data Providers. OCLC Systems and Services, 20(1), 40-47.

Ward, J.H., de Torcy, A., Chua, M., and Crabtree, J. (2009). Extracting and Ingesting DDI Metadata and Digital Objects from a Data Archive into the iRODS extension of the NARA TPAP using the OAI-PMH. In Proceedings of the 5th IEEE International Conference on e-Science, Oxford, UK, December 9-11, 2009.

Waters, D. and Garrett, J. (1996). Preserving Digital Information. Report of the Task Force on Archiving of Digital Information. Washington, DC: CLIR, May 1996.

Wells, H.G. (1938). World brain. Garden City, NY: Doubleday, Doran and Co.

Witt, M., Carlson, J., & Brandt, D.S. (2009). Constructing data curation profiles. International Journal of Digital Libraries, 3(4), 93-103.

Zen College Life. (2011). The history of libraries through the ages. Retrieved August 30, 2011, from http://www.zencollegelife.com/the-history-of-libraries-through-the-ages/

 


Managing Data: the Data Deluge and the Implications for Data Stewardship

Abstract

Preservation standards for repositories do not exist in a void. They were created to address a particular issue, which is the long-term preservation of digital objects. Preservation repository and policy standards are designed to address long-term digital storage (i.e., digital curation and preservation) by defining “the what” (preservation repository design) and “the how” (preservation policies). This essay focuses primarily on the research data deluge and the implications for the long-term stewardship of data. The conclusion is that researchers want to focus on creating and analyzing data. Some researchers care about the long-term stewardship of their data, while others do not. Effective data stewardship requires not just technical and standards-based solutions, but also people, financial, and managerial solutions. It remains to be seen whether or not funders’ requirements for data sharing will impact how much data is actually made available for re-purposing, re-use, and preservation.

Citation

Ward, J.H. (2012). Managing Data: the Data Deluge and the Implications for Data Stewardship. Unpublished manuscript, University of North Carolina at Chapel Hill. (pdf)

Creative Commons License
This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.


Table of Contents

Abstract

Introduction

Definitions

Data, Metadata, and Ontologies
Types of Data Collections
Types of Research Data
The Research Data Deluge: What and How Big Is It?

Privacy versus Big Data

Why Share and Preserve Data?

Infrastructure and Data Centers

Roles and Responsibilities

Sustainability

Research Data Curation

Data Curation vs. Digital Curation
The Research Data Lifecycle
Data Repositories
Funders’ Requirements and Guidance
The National Institutes of Health
The National Science Foundation

The Application of Policies to Repositories and Data

The Automation of Preservation Management Policies
The Application of Policies to Data and Data Curation

Summary: the Implications for the Long-term Stewardship of Research Data

References

Appendix A


Table of Figures

Figure 1 – The National Aeronautics and Space Administration (NASA) Earth Observing System (EOS) Data Processing Levels (National Aeronautics and Space Administration, 2010; Ball, 2010).

Figure 2 – Space Science Board Committee on Data Management and Computation (CODMAC) Space Science Data Levels and Types (Ball, 2010).

Figure 3 – Big Data, MGI’s estimate of size (Manyika, et al., 2011).

Figure 4 – LIFE (Life Cycle Information for E-Literature) Project (Blue Ribbon Task Force on Sustainable Digital Preservation and Access, 2008).

Figure 5 – I2S2 Idealized Scientific Research Activity Lifecycle Model (Ball, 2012).

Figure 6 – Entities by Role, 1 of 3 (Interagency Working Group on Digital Data, 2009).

Figure 7 – Entities by Role, 2 of 3 (Interagency Working Group on Digital Data, 2009).

Figure 8 – Entities by Role, 3 of 3 (Interagency Working Group on Digital Data, 2009).

Figure 9 – Entities by Individuals (Interagency Working Group on Digital Data, 2009).

Figure 10 – Entities by Sector with footnotes (Interagency Working Group on Digital Data, 2009).

Figure 11 – Individuals by Role (Interagency Working Group on Digital Data, 2009).

Figure 12 – Individuals by Life Cycle Phase/Function (Interagency Working Group on Digital Data, 2009).

Figure 13 – Entities by Life Cycle Phase/Function (Interagency Working Group on Digital Data, 2009).


Introduction

Preservation standards for repositories do not exist in a void. They were created to address a particular issue, which is the long-term preservation of digital objects, i.e., “data”. Waters & Garrett (1996) wrote that these standards must be created in order for archives to demonstrate that “they are what they say they are” and that they can “meet or exceed the standards and criteria of an independently-administered program”. Preservation repository and policy standards are designed to address long-term digital storage (i.e., digital curation and preservation) by defining “the what” (preservation repository design) and “the how” (preservation policies). The next step is to examine what types of data are being curated and preserved, placed into an “OAIS Reference Model inside” repository, and managed according to the Audit and Certification of Trustworthy Digital Repositories Recommended Practice, as well as to examine any related issues and factors.

Hey and Trefethen (2003) defined the data deluge through an examination of eScience. The authors called for “new” types of digital libraries for science data that would provide data-specific services and management. While the data deluge cuts across all sectors (Manyika, et al., 2011; Science and Technology Council, 2007), this essay focuses primarily on the research data deluge. It defines research data and the types of research data and collections; attempts to determine how much data exists; and examines “big data” versus privacy. It also describes the reasons researchers do and do not share their data, the role of data curators, and provides an overview of infrastructure. Finally, this literature review describes research data curation; examines example applications of general data management policies to repositories and to the data itself; and discusses the implications for the long-term stewardship of research data based on the literature reviewed.

Definitions

What does it mean to “steward” data? The editors of Merriam-Webster (2012) defined stewardship as “the conducting, supervising, or managing of something; especially: the careful and responsible management of something entrusted to one’s care”. The authors of ForestInfo.org (2012) wrote that stewardship is “the concept of responsible caretaking; the concept is based on the premise that we do not own resources, but are managers of resources and are responsible to future generations for their condition”. Therefore, one may extrapolate that “data stewardship” is the “careful and responsible management of something entrusted to one’s care” so that future generations may access the data with full confidence that the data is what the provider says it is.

How does data stewardship differ from digital curation and digital preservation? Lazorchak (2011) wrote that he has used the terms interchangeably, but they are really three different processes. The detailed definitions for digital curation and digital preservation are available in the previous section, “Managing Data: the Emergence & Development of Digital Curation & Digital Preservation Standards”. However, in short, digital curation addresses the whole life cycle of digital preservation. Lazorchak (2011) stated that the concept of “digital stewardship…brings preservation and curation together…pulling in the lifecycle approach of curation along with research in digital libraries and electronic records archiving, broadening the emphasis from the e-science community on scientific data to address all digital materials, while continuing to emphasize digital preservation as a core component of action”.

Thus, one might say that digital preservation is the “what”; digital curation is the “how” for preserving the data; and digital or data stewardship is the “why” (to manage entrusted resources for future generations). Lynch (2008) wrote that the best data stewardship “will come from disciplinary engagement with preservation institutions”. That is, if scientists wish to manage their data so that it will be accessible for the indefinite long-term, then they will need to work with librarians, archivists, computer scientists, domain specialists, and other information professionals whose expertise lies in the curation and preservation of data.

Data, Metadata, and Ontologies

What are data, metadata, and ontologies in the context of science research data? The National Science Foundation Cyberinfrastructure Council (2007) defined these terms. They wrote that “data are any and all complex data entities from observations, experiments, simulations, models, and higher order assemblies, along with the associated documentation needed to describe and interpret the data”. Next, the authors described metadata as a subset of, and about, data. They wrote that “metadata summarize data content, context, structure, interrelationships, and provenance…. They add relevance and purpose to data, and enable the identification of similar data in different data collections” (National Science Foundation Cyberinfrastructure Council, 2007). Finally, the council members defined ontology as “the systematic description of a given phenomenon. It often includes a controlled vocabulary and relationships, captures nuances in meaning and enables knowledge sharing and reuse” (National Science Foundation Cyberinfrastructure Council, 2007).
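
The distinction the council draws between data and metadata can be made concrete with a small, hypothetical record. The field names and values below are purely illustrative and are not drawn from any cited standard; the sketch simply pairs an observation with the descriptive, contextual, and provenance information that the definitions above call metadata.

```python
# A hypothetical ocean-temperature observation, used only to illustrate the
# NSF Cyberinfrastructure Council's distinction between data and metadata.
data = {
    "station_id": "A-17",
    "timestamp": "2011-06-03T14:00:00Z",
    "sea_surface_temp_c": 18.4,   # the observation itself
}

# Metadata "summarize data content, context, structure, interrelationships,
# and provenance" -- information about the observation, not the observation.
metadata = {
    "content": "Hourly sea-surface temperature readings",
    "context": "Collected as part of a hypothetical coastal monitoring project",
    "structure": "One record per station per hour",
    "provenance": "Calibrated thermistor; raw voltages converted on ingest",
    "units": {"sea_surface_temp_c": "degrees Celsius"},
}

if __name__ == "__main__":
    print("Data:", data)
    print("Metadata:", metadata)
```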

Employees of the U.S. Office of Management and Budget defined research data as, “the recorded factual material commonly accepted in the scientific community as necessary to validate research findings, but not any of the following: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues” (National Science Board, 2011). As part of this definition, the authors also included metadata and the analyzed data. The former may include computational codes, apparatuses, input conditions, and so forth, while the latter may include published tables, digital images, and tables of numbers from which graphs and charts may be generated, among others. Furthermore, they differentiated “digital research data” from research data by including a separate definition. They wrote that digital research data is “any digital data, as well as the methods and techniques used in the creation and analysis of that data, that a researcher needs to verify results or extend scientific conclusions, including digital data associated with non-digital information, such as the metadata associated with physical samples” (National Science Board, 2011).

Last, the members of the National Science and Technology Council Interagency Working Group on Digital Data (2009) wrote that:

“digital scientific data” refers to born digital and digitized data produced by, in the custody of, or controlled by federal agencies, or as a result of research funded by those agencies, that are appropriate for use or repurposing for scientific or technical research and educational applications when used under conditions of proper protection and authorization and in accordance with all applicable legal and regulatory requirements. It refers to the full range of data types and formats relevant to all aspects of science and engineering research and education in local, regional, national, and global contexts with the corresponding breadth of potential scientific applications and uses (National Science and Technology Council Interagency Working Group on Digital Data, 2009).

Thus, while there is some variation between the definitions of research data, the general consensus is that it consists of the items or objects that scientists analyze, create, and use in the process of conducting research.

Types of Data Collections

When data are organized, they become collections. The National Science Foundation (2005) and the National Science Foundation Cyberinfrastructure Council (2007) defined three types of data collections: research, resource, and reference collections. The authors of the 2005 National Science Foundation report chose to speak of collections rather than databases because they wanted to encompass the individuals, infrastructure, and organizations indispensable to the management of the collection. Thus, the board members wrote that data collections fall under one of the three functional categories listed below.

  • Research Data Collections: these collections are created for a limited group, supported by a small budget, as part of one or more focused research projects, and may vary in size. The researchers who collect the data do not intend to preserve, curate, or process it, although this is often due to lack of funding. They may apply rudimentary standards for metadata structure, file formats, or content access policies. Often, there are no standards because the community-of-interest is very small. Some recent examples of these types of collections include Fluxes Over Snow Surfaces (FLOSS) and the Ares Lab Yeast Intron database.
  • Resource/Community Data Collections: these types of data collections are created and maintained to serve an engineering or science community. The budgets to maintain the collection(s) are provided directly by agency funding and are generally intermediate in size. This funding model can make it challenging to gauge how long a collection will be available, due to changes in budget priorities. However, the community does tend to apply standards for the maintenance of the collection, either by developing community standards or re-purposing existing standards. Two examples of these types of collections include The Canopy Database Project and PlasmoDB.
  • Reference Data Collections: Characteristic features of these types of collections are a diverse set of user communities that represent large segments of the education, research, and scientific community. Users of these data sets include students, educators, and scientists across a variety of institutional, geographical, and disciplinary domains. The managers of these data collections tend to follow or create comprehensive, well-established standards. The creators, users, and managers of these data collections intend to make them available for the indefinite long-term, and budgetary support tends to come from multiple sources over the long-term. The examples for these types of data collections include The Protein Data Bank, SIMBAD, and the National Space Science Data Center (NSSDC) (The National Science Foundation, 2005; National Science Foundation Cyberinfrastructure Council, 2007).

The type of data collection does not necessarily indicate its long-term value to future researchers, but the collection type does increase the odds of the collection being usable and accessible within one or more generations. A small, under-funded, poorly documented research data collection may prove to be of great value to a future researcher or researchers who can figure out what the data is and how to access it, while a large, well-funded, and well-documented data collection may have no users after the original research study closes.

Types of Research Data

The data researchers create fall into three primary categories: structured, unstructured, or semistructured [sic]. Members of the National Research Council (2010) described structured data as rigidly formatted, while unstructured data consists of free text. They gave personnel data, want ads, and so forth as examples of semi-structured data. Data in any of these categories may be generated by processes that generally fall into one of three areas: scientific experiments, models or simulations, and observations.

The data generated from a scientific experiment is intended to be reproducible, at least in theory; in practice, researchers often do not have the time and funding to reproduce many experiments (Lynch, 2008). With regard to model or simulation data, researchers have preferred to retain the model and related metadata rather than the output data. Scientists have considered observational data to be irreplaceable, as it is usually the result of data gathering at a specific location and time that may not be reproducible. They have gathered raw data in the course of observations and/or experiments, while derived data results from combining or processing raw data (Research Information Network, 2008).

The National Aeronautics and Space Administration (NASA), Earth Observing System (EOS), developed a set of terminology to describe the degree to which data has been processed (Ball, 2010; National Aeronautics and Space Administration, 2010). The authors designed it with data levels, each with subsets, ranging from Level 0, the least processed, to Level 4, the most processed (see Figure 1, below). Ball wrote that under this scheme, “data do not have significant scientific utility until they reach…Level 1”.

Figure 1 – The National Aeronautics and Space Administration (NASA) Earth Observing System (EOS) Data Processing Levels (National Aeronautics and Space Administration, 2010; Ball, 2010).

The author noted that Level 2 has the greatest long-term usefulness, and that most scientific applications require data processed to at least that level. He described Level 3 data as being the most “shareable”; those data contain smaller sets than Level 2 data, and are thus easier to combine with other data.
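
Ball’s summary of the level scheme can be paraphrased as a simple enumeration. The sketch below is a loose paraphrase of the levels as characterized in the surrounding text, not NASA’s official definitions; the helper function encodes only Ball’s remark about scientific utility beginning at Level 1.

```python
from enum import IntEnum

class EOSDataLevel(IntEnum):
    """Loose paraphrase of the NASA EOS processing levels discussed above."""
    LEVEL_0 = 0  # least processed: raw instrument data
    LEVEL_1 = 1  # per Ball, where data gain significant scientific utility
    LEVEL_2 = 2  # described as having the greatest long-term usefulness
    LEVEL_3 = 3  # smaller, more "shareable" sets, easier to combine
    LEVEL_4 = 4  # most processed

def has_scientific_utility(level: EOSDataLevel) -> bool:
    # Ball: "data do not have significant scientific utility until they reach... Level 1"
    return level >= EOSDataLevel.LEVEL_1

if __name__ == "__main__":
    print(has_scientific_utility(EOSDataLevel.LEVEL_0))  # False
    print(has_scientific_utility(EOSDataLevel.LEVEL_2))  # True
```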

Alternatively, members of the Space Science Board have developed specific definitions for space data levels and types that range from raw data to a user description. See Figure 2, below.

Figure 2 – Space Science Board Committee on Data Management and Computation (CODMAC) Space Science Data Levels and Types (Ball, 2010).

These board members considered that space data is not just the data itself, but also any related documentation needed to access, run, correlate, calibrate, or extract information from the data.

The authors of the Research Information Network (2008) paper on research data sharing wrote that researchers and curators further process this data, either by reduction, annotation, or curation. They noted that researchers often share derived or reduced data; they do not often share raw data. The authors described how — once data has gone through this last process — it might be made available to other users and researchers, depending on the implicit and explicit policies of a particular domain. However, they stated that the trade-off to using derived data is that reproducibility may be compromised because something may have been lost in the processing.

In addition, the authors noted that if a researcher adds metadata to describe the processing techniques used, then the original provenance might be compromised. They reiterated that most researchers prefer to work with raw data, but practical reasons often prohibit its use by anyone other than the originating researchers. They described how, when researchers cannot or will not share raw data, it is sometimes because the data are in a proprietary format that must be converted to a more common format, and “something” is lost in the conversion. In other cases, the raw data set may simply be too unwieldy to share, or the researcher(s) may not be willing to share it (Research Information Network, 2008).

The Research Data Deluge: What and How Big Is It?

Researchers and authors have found it challenging to determine how much data currently exists, much less how much exists within science, or how much will exist at any given point in the future in any field. In order to make an educated estimate, a researcher must determine what does and does not constitute data. Is it the actual data created by someone, or the information about them, such as metadata or someone’s digital exhaust? How do you de-duplicate data? Do you count a compressed file or folder, or an uncompressed one? A further question to consider is: how much is “a lot of data”?
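
These counting questions are not merely rhetorical. The toy sketch below, with fabricated file contents, shows how the same collection can yield three different “sizes” depending on whether one counts raw bytes, compressed bytes, or de-duplicated bytes.

```python
import hashlib
import zlib

# Three hypothetical "files"; note that the first two are identical copies.
files = {
    "survey_results_v1.csv": b"id,temp\n" + b"1,18.4\n" * 10_000,
    "survey_results_copy.csv": b"id,temp\n" + b"1,18.4\n" * 10_000,
    "notes.txt": b"Field notes from a hypothetical 2011 campaign.\n",
}

# Count 1: raw bytes of every file.
raw_size = sum(len(content) for content in files.values())

# Count 2: bytes after compression.
compressed_size = sum(len(zlib.compress(content)) for content in files.values())

# Count 3: de-duplicated bytes -- identical content counts only once.
unique_size = sum(
    len(content)
    for content in {hashlib.sha256(c).hexdigest(): c for c in files.values()}.values()
)

print(f"Raw bytes:           {raw_size:,}")
print(f"Compressed bytes:    {compressed_size:,}")
print(f"De-duplicated bytes: {unique_size:,}")
```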

Tony Hey and Anne Trefethen’s seminal paper (2003) brought attention to the imminent e-Science data deluge and attempted to quantify the amount of data by examining Astronomy, Bioinformatics, Environmental Science, Particle Physics, Medicine and Health, and Social Sciences. Lord and MacDonald (2003) also attempted to quantify the amount of research data by domain. However, at the time of this literature review, the figures in the two papers are around ten years out of date, so they will not be quoted here; the authors’ argument that a deluge of data has arrived, though, remains relevant. The point is that any researcher or author attempting to quantify and describe “the data deluge” must take into account the standards of the time, because what is considered “a lot of data” at one time may seem like “not much data” a generation later.

For example, technologists have often quoted Bill Gates as saying that “640K ought to be enough for anybody” in 1981 (Tickletux, 2007). (Various authors have written that he later denied making this statement, but whether or not Bill Gates made that statement, the point is that users tend to fill up whatever amount of digital storage is made available to them, and then they will complain that they need more.) Thus, in 1981, researchers used to measuring storage in kilobytes may have considered 10 gigabytes of data to be a “data deluge”. Researchers currently speak of data in terms of exa, zetta, and yottabytes; many, if not most, researchers will concede that “a lot of data” or a “data deluge” is a relative phrase. One imagines that ancient archivists managing clay tablets and papyri considered themselves in the midst of a “data deluge”! Or, a generation from now, future technologists will wonder why curators in the early 2000s considered exabytes “a lot of data”. However, whether the amount of data currently in existence is “a lot” or “not very much” data, analysts have attempted to quantify the current data deluge using a variety of methodologies.

More recently in the data-deluge estimation literature, Hilbert and Lopez (2011) examined “all information that has some relevance for an individual” and did not try to distinguish between duplicate or different information. They considered the computation of information, in addition to its transmission through time (storage) and space (communication). Their study spanned two decades (1986-2007) and 60 categories worldwide (39 digital and 21 analog). Their research indicated that as of 2007, “humankind was able to store 2.9 × 10^20 optimally compressed bytes, communicate almost 2 × 10^21 bytes, and carry out 6.4 × 10^18 instructions per second on general purpose computers” (Hilbert & Lopez, 2011).
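
To relate those figures to the exabyte and zettabyte vocabulary used above, a quick conversion helps: 2.9 × 10^20 bytes is roughly 290 exabytes, or about 0.29 zettabytes, under decimal (SI) prefixes. A minimal sketch of the arithmetic, assuming decimal rather than binary prefixes:

```python
# Convert Hilbert and Lopez's (2011) estimates into SI storage units.
# Decimal prefixes are assumed (1 EB = 10**18 bytes, 1 ZB = 10**21 bytes);
# binary (1024-based) prefixes would give slightly different numbers.
stored_bytes = 2.9e20        # optimally compressed bytes stored as of 2007
communicated_bytes = 2e21    # bytes communicated

EXABYTE = 10**18
ZETTABYTE = 10**21

print(f"Stored:       {stored_bytes / EXABYTE:.0f} EB "
      f"({stored_bytes / ZETTABYTE:.2f} ZB)")
print(f"Communicated: {communicated_bytes / ZETTABYTE:.1f} ZB")
```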

Beginning in 2007, the research firm IDC created an annual report dedicated to estimating the amount of new digital information generated and replicated. The study has been sponsored by EMC, an information-management company. IDC’s 2007 study concluded that the world’s capacity to store data had been exceeded by our ability to generate new data. Their projection was that the annual growth rate for data through 2020 would be 40% (Manyika, et al., 2011). In June 2011, the authors of the IDC report wrote, “the amount of information created and replicated will surpass 1.8 zettabytes…growing by a factor of 9 in just five years” (Gantz & Reinsel, 2011).

While researchers have examined the amount of data that individuals and organizations are generating, there is little insight into how much variation there is among the different sectors, such as education, industry, and government. However, Manyika, et al.’s (2011) research indicated that while the Library of Congress (LC) had collected 235 terabytes of data as of April 2011, fifteen out of seventeen sectors in the USA each store more data than the LC. For example, James Hamilton, a Vice President at Amazon, has noted that the amount of capacity Amazon ran on in all of 2001 is now added to its data centers daily (Gallagher, 2012). Hamilton’s comment reinforces the earlier point that “a lot of data” is a relative term; one imagines that Amazon’s employees considered that they processed and stored “a lot of data” in 2001, especially relative to their storage and processing capacities in the 1990s.

Figure 3 – Big Data, MGI’s estimate of size (Manyika, et al., 2011).

Regardless of whether or not the current data-intensive environment is a “deluge”, one must consider current technology and demands versus processing and storage requirements. Manyika, et al. (2011) determined that critical mass has been reached in every sector, but that the intensity of the data generated varies. They determined these aggregate results by examining four factors: utilization rate, duplication rate, average replacement cycle of storage, and annual storage capacities shipped by sector (please see Figure 3, above). The consultants found that in 2010, the amount of data stored in enterprise external disk storage for the year was 7.4 × 10^18 bytes (roughly 7.4 exabytes), including replicas. Their research indicated that for the same year, consumers generated 6.8 × 10^18 bytes. Furthermore, Gallagher (2012) wrote that “Google processes over 20 petabytes of data per day” on searches alone. One must concede that, given current technologies versus user demands and expectations, that is a lot of data.

Privacy versus Big Data

It is important to note that data is not just about the content that is created — it is also about the information around it. These sources include browsing histories, geographic locations, and other metadata and “digital exhaust” (Gantz & Reinsel, 2011). The two authors wrote that the amount of information being created about users of data is greater than the amount of data and information users are creating themselves. Evans and Foster (2011) stated that this “metaknowledge” — knowledge about knowledge — may include idioms particular to a domain or scientist, the status and history of researchers when included in a paper, as well as the focus and audience of a particular journal. The authors argued that studying metaknowledge could provide useful information about the spread of ideas within a research domain, particularly from teacher to student.

However, metaknowledge may also be considered digital exhaust. Evans and Foster (2011) described the former term as the explicit information about someone that is publicly available, such as a short biography submitted by an author as part of a paper. Burgess (2011) defined digital exhaust as the information all users leave behind when using digital technology. This exhaust ranges from something as innocuous as a name in the metadata of a Microsoft Word document, which may allow a researcher to determine who his or her anonymous reviewer is, to browsing history, to one’s physical location as inferred from proximity to a cell phone tower.
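
The Word-document example is easy to demonstrate. The sketch below relies on the third-party python-docx package and a hypothetical file name, neither of which appears in the literature reviewed here; it simply reads the author-related core properties embedded in a .docx file.

```python
# Requires the third-party python-docx package: pip install python-docx
# The file name is hypothetical; any .docx file with embedded core
# properties would behave the same way.
from docx import Document

doc = Document("anonymous_review.docx")
props = doc.core_properties

# Core properties travel with the file unless explicitly stripped.
print("Author:          ", props.author)
print("Last modified by:", props.last_modified_by)
print("Created:         ", props.created)
```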

Any single piece of data generated about an individual may be unimportant but, en masse, such data gives governments and corporations an incredible amount of information about individuals that was previously private. This information may include Tweets, photos, emails, Facebook posts, and so on. For example, Hough (2009) discussed a study in which 75% of Facebook users posted information indicating that they were out of town, thus putting themselves at risk of a break-in.

Sullivan (2012) wrote an article describing university and government agencies’ demands for athletes’ and job applicants’ Facebook account user names and passwords in order to better monitor each person’s personal habits and preferences. Some state legislators are banning the practice, citing the First Amendment. Solove (2007) argued that just because individuals may not have anything to hide does not mean that they must share their personal data, while Hough (2009) declared that individuals should not be so willing to give up their privacy as the price of using technology. Hough cited a study by Sweeney (2002) in which 87% of the population of the United States could be uniquely identified using only three attributes from 1990 census data: gender, date of birth, and five-digit ZIP code. Sweeney also showed that it is fairly easy to determine an individual’s Social Security Number, particularly if that individual was born after 1980, simply by knowing their date and place of birth.

Sweeney (2002) also provided one of the most famous examples of how easy it is to find individual information. The researcher correlated the information contained in a public data set provided by the primary state employee health care provider for Massachusetts with publicly available voter registration data. The voter rolls contained each individual’s name, birth date, address, gender, and ZIP code. The data set provided by the Massachusetts state employee health care provider contained each anonymized individual’s birth date, ZIP code, gender, and individual medical information, such as medications and procedures. Sweeney used this information to find then-Massachusetts Governor Weld’s medical records, and promptly mailed the Governor his own records! She found his medical records by matching shared attributes: Governor Weld then lived in Cambridge, Massachusetts. Based on the voter rolls, six people in Cambridge shared his birth date, three of them were men, and only one lived in Weld’s five-digit ZIP code.
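
Sweeney’s linkage technique amounts to joining two data sets on the quasi-identifiers they share. The sketch below, using entirely fabricated toy records, illustrates such a join on birth date, gender, and ZIP code; it is a simplified illustration of the general method, not a reproduction of her study.

```python
# Toy illustration of a linkage ("re-identification") attack in the style
# Sweeney describes: join an "anonymized" data set to a public one on the
# quasi-identifiers they share. All records below are fabricated.

voter_rolls = [  # public: names plus quasi-identifiers
    {"name": "Alice Example",    "dob": "1945-07-31", "gender": "F", "zip": "02138"},
    {"name": "Bob Sample",       "dob": "1945-07-31", "gender": "M", "zip": "02139"},
    {"name": "Carl Placeholder", "dob": "1945-07-31", "gender": "M", "zip": "02138"},
]

health_records = [  # "anonymized": quasi-identifiers plus sensitive data, no names
    {"dob": "1945-07-31", "gender": "M", "zip": "02138", "diagnosis": "hypertension"},
]

def reidentify(anon_records, public_records):
    """Return (name, sensitive record) pairs that match uniquely on quasi-identifiers."""
    matches = []
    for anon in anon_records:
        candidates = [
            p for p in public_records
            if (p["dob"], p["gender"], p["zip"]) == (anon["dob"], anon["gender"], anon["zip"])
        ]
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches.append((candidates[0]["name"], anon))
    return matches

for name, record in reidentify(health_records, voter_rolls):
    print(f"{name} -> {record['diagnosis']}")
```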

A few years later, in March 2010, Netflix cancelled a prize competition to develop better recommendation algorithms, due to privacy concerns. Narayanan and Shmatikov (2007) had correlated the supposedly anonymized user data Netflix provided to the contest’s participants with data from the Internet Movie Database. The researchers claimed they successfully identified the Netflix records of known users, thus revealing their inferred political views and other potentially sensitive information.

Thus, researchers must be careful about what data they release, how much, and to whom. Even supposedly anonymized data may provide enough detail to be dangerous when correlated with other publicly available data.

Why Share and Preserve Data?

The National Science Foundation and the National Institutes of Health in the United States, as well as major research funders in the United Kingdom, now require the researchers they fund to provide data management plans and be prepared to share the data generated from their research (National Science Foundation, 2010; National Institutes of Health, 2010; Jones, 2011). The policy arguments for sharing data rest primarily on two reasons: to ensure the reproducibility and replicability of science, and to make the results of taxpayer-funded research re-usable in order to maximize the returns on the high costs involved in gathering the data initially (Borgman, 2010; National Science Board, 2011).

As noted above, observational data is the most vulnerable with regard to reproducibility because it is tied to a specific time and place; experimental data and model data are replicable, at least in theory. However, if these data are curated in the appropriate formats with the required software, hardware, and any related scripts, then the research results should be replicable. Borgman (2010), Lynch (2008), Fry, et al. (2008), and Lord and MacDonald (2003) stated that the reasons librarians and libraries should curate the outputs of scientific research are straightforward: curation is not an end in itself, it is a way of supporting science by providing methods for access, use, and re-use, and a more complete and transparent record of science. However, the members of the National Science Board (2011) have made the point that a one-size-fits-all approach to data sharing is neither desirable nor feasible. Instead, the National Science Foundation (2010) has encouraged each domain to establish its own standards for data management.

Other policy reasons cited by the National Research Council (2010) and Borgman (2010) included the creation of new science based on new questions of existing data, such as finding patterns, and advancing research in general by creating a new set of data-intensive methods that move science beyond theory, simulation, and empiricism, i.e., “the 4th paradigm”. Wired Magazine’s Chris Anderson (2008) took the 4th Paradigm idea too far, however, when he declared that “the data deluge makes the scientific method obsolete”. As Borgman (2010) observed, “access to data does not a scientist make”, as rigorous data analysis requires a certain amount of expertise to accurately interpret often-complex information and associated metadata. Fry, et al. (2008) cited a study in which researchers expressed concern that public access to research data would only increase confusion, rather than transfer any useful knowledge to the general public.

Given the potential dangers of providing data to others for their use and re-use, as noted in a previous section, why should a researcher share their data with anyone? The reasons vary, but generally involve coercion (i.e., a funder requires it); a requirement for reciprocal data sharing; the value of collaboration; the reduction of costs by preventing duplicate data collection; and a desire to support the scientific method and ensure that studies are replicable (Borgman, 2010; Van den Eynden, et al., 2011). Researchers’ willingness to share their data varies by domain; for example, it is rare for climate scientists to share their data or to re-use another researcher’s model-run data, so climate scientists have little incentive to prepare their data for re-use.

However, for those researchers who work in a domain that shares data formally or informally, such as Astrophysics (Harley, et al., 2010), the Research Information Network (2008) study indicated that other incentives for sharing include paper co-authorship opportunities, greater collaboration opportunities, and greater visibility for the researcher’s institution and research group. Regardless of whether or not a particular domain encourages data sharing, Borgman (2010, 2008) wrote that publication is still the route to success and rewards, not data sharing, although research productivity is shown to increase with both informal and formal data sharing, especially with secondary publications (Pienta, Alter, & Lyle, 2010).

Borgman (2010, 2008) and Fry, et al. (2008) also noted that other disincentives to sharing data are the time and resources required to re-purpose the data; the researcher’s inability to control their intellectual property; and concerns that their research results will be “scooped” by another researcher if no embargo period on data sharing is required and enforced. In addition to these disincentives, Lynch (2008), Fry, et al. (2008) and Cragin, et al. (2010) listed legal and ethical constraints, lack of expertise in data management, a lack of time to handle data requests, and a lack of technical infrastructure in which to publicly archive the data.

Scholars prefer to perform research and write the publications rather than curate data for re-use and storage (Lynch, 2008; Harley, et al., 2010). However, Pienta, Alter and Lyle (2010) studied the use and re-use of Social Science primary research data, and their research indicates that while informal data sharing is the norm in the Social Sciences, the sharing of data via an archive “leads to many more times the publications than not sharing data”.

Publications such as Science and Nature have called upon the larger science communities to create the infrastructure to share and curate data for the indefinite long term (Hanson, Sugden & Alberts, 2011; Editor, 2009, 2005). The editors of Science, for example, require authors to submit not just a copy of the data itself, but any computer code required to read the data. The Toronto International Data Release Workshop Authors (2009) examined prepublication data sharing within genomics, and they recommended that it be extended to related domains. At the opposite end, Schofield, et al. (2009) discussed ways to promote data sharing among mouse researchers in an opinion piece. The authors concluded that a research commons must be created, but that data sharing would require an entire culture change for their field.

Curry (2011) provided an example of particle physicists who rescued an old data set from the 1990s; these physicists then wrote more than a dozen new high-impact papers from this same set. In spite of these examples, and the support of major publications, Nelson (2009) wrote that the power to “prod” researchers to share their data must come from the organizations that have real clout with researchers: the funding agencies, scientific societies, and journals. However, as Lynch (2008) noted, the best use of scientists’ time is to devote it to practicing science. He wrote that researchers are not the best at data management, and this area should be left to professional data stewards.

Thus, it appears that most managers of major funding agencies, librarians and archivists, scientists, and journal editors and authors have been encouraging or requiring data sharing among researchers. However, whether or not a researcher is willing to do so may depend on a variety of factors, including personal preference. So long as data analysis takes up the majority of researchers’ time, they may not have the resources to share data, even with the appropriate infrastructure and policies in place, given the amount of time it takes to prepare data for use, re-use, and long-term preservation (Research Information Network, Institute of Physics, Institute of Physics Publishing, & Royal Astronomical Society, 2011). Thus, in spite of funders’ requirements, how well and how often researchers will share their data, even when they are willing to do so, remains to be seen.

Infrastructure and Data Centers

Researchers may find more incentives to share their data as data-centric infrastructure becomes more common, even in domains in which sharing is not yet the norm. However, as Lynch (2008) concluded, one of the issues that must be clarified concerns which institution or domain is responsible for providing the underlying infrastructure and data stewardship. Some librarians think that it is the library’s responsibility to provide this infrastructure; others believe it is better for each domain to come together and create this infrastructure, given the proprietary nature of data formats, software, etc.; still others promote the concept of national data centers; and, finally, some data managers prefer institution-based infrastructure (Walters & Skinner, 2011; Research Information Network, 2011; UKRDS, 2008; Soehner, Steeves & Ward, 2010).

The members of the Association of Research Libraries (ARL) institutions have described four models of data infrastructure to support e-science: multi-institutional collaborations; a decentralized or unit-by-unit approach; a centralized or institution-wide response; or, a hybrid centralized and decentralized approach (Soehner, Steeves, & Ward, 2010). Lyon (2007) derived a “domain deposit model” and a “federation deposit model” from her study results. She described the domain deposit model as a “strong integrated community…with well-established common standards, policy and practice”, and defined the federation deposit model as a group of repositories which have come together “based on some agreed level of commonality” in a documented partnership. The author wrote that the “federation deposit model” might be built around an institution, regional geographic boundaries, format type, or software platform.

The debate over who will provide infrastructure, and what model that support service will follow, is similar to the problems that arose with the development of Institutional Repositories (IRs) in the 2000s. arXiv, while not an Institutional Repository per se, grew out of the Physics community’s culture of sharing research results immediately, and has expanded to encompass Computer Science, Astrophysics, and Mathematics, among others; that does not mean the arXiv model fits all e-print needs for all domains or institutions (Ginsparg, 2011). Researchers’ needs have been heterogeneous, as are each field’s communication styles and technical expertise (Kling & McKim, 2000; Borgman, 2008). Foster and Gibbons’ (2005) study showed that librarians eagerly built Institutional Repositories, only to find a lukewarm reception from faculty and researchers, which led to a lack of IR content.

The Research Information Network (2009) studied life sciences researchers and noted that one infrastructure and data sharing model will not fit all research domains, and that the information practices of life scientists do not match those of information practitioners and policy makers. Librarians may wish to grow data-sharing infrastructure more carefully than they did IRs, and to grow it based on need rather than the latest trend. So far, however, researchers have seemed to value data centers, stating that their existence has improved their ability to “undertake high-quality research” (Research Information Network, 2011). Whether one or more of the above-mentioned ARL models will prove to be the best choice remains in flux, largely because each institution and domain has different needs and requirements.

As regards other areas of big data infrastructure, such as preservation repository design and policies, those topics were covered in the previous sections, “Managing Data: Preservation Repository Design (the OAIS Reference Model),” and “Managing Data: Preservation Standards and Audit and Certification Mechanisms (i.e., ‘policies’)”. Other, more technical discussions, such as over-the-network and local data processing, data discoverability and indexing, physical networking infrastructure, interoperability, security, data center design, syncing, data replication, data backups, etc., are beyond the scope of this essay.

In conclusion, the results of the studies discussed in this essay have indicated that for data to be stewarded for the long term, research scientists will need technical, financial, and managerial support infrastructure.

Roles and Responsibilities

Lyon (2007) observed that there was a “dearth of skilled practitioners, and data curators play an important role in maintaining and enhancing the data collections that represent the foundations of our scientific heritage”. The author wrote that in time, “native data scientists” would emerge from within each domain’s curriculum as data management becomes integrated into graduate research training. Gray, Carozzi and Woan (2011) noted, “normal data management practice…corresponds to notably good practice in most other areas”. Their recommendation was for administrators to formalize data management planning in order to make it more auditable. One aspect of this formalization is to define the roles and responsibilities by individual, role, and sector.

The members of the National Science Board (National Science Foundation, 2005) defined the primary roles and responsibilities of institutions and individuals. They defined four primary individual roles: data authors, data managers, data scientists, and data users.

  • Data Author: this individual is involved in research that produces digital data. This person should receive credit for the production of the data, and ensure that it may be broadly disseminated, if appropriate. The data author must ensure that the metadata, data recording, context, and quality all conform to community standards.
  • Data Manager: this individual is responsible for the maintenance and operation of the database. This person must follow best practices for technical management such as replication, backups, fixity checks, security, enforcement of legal provisions, and implement and enforce community standards and preferences for data management. The data manager must provide appropriate contextual information for the data, and design and maintain a system that encourages data deposit by making it as simple and easy as possible.
  • Data Scientist: the individuals who are data scientists have a variety of roles. This person may be a librarian, archivist, computer or information scientist, software engineer, database manager or other disciplinary expert. His or her contributions involve advising on the implementation of technology and best practices to the data for long-term stewardship and ensuring that it is implemented properly, as well as enhancing the ability of domain scientists to conduct their research using digital data. This role involves creative inquiry, analysis, and outreach, as well as participating in research appropriate to the data scientist’s own domain, for the purposes of publication and contributing to research progress.
  • Data User: this individual is a member of the larger research and scientific community, and this person will benefit from having access to data sets that are well-defined, searchable, robust, and well-documented. The data user must credit the data author, adhere to copyright and other restrictions, and must notify the appropriate data managers or data authors of any data errors (National Science Foundation, 2005).

The National Science Foundation (2005) authors also defined the responsibilities of the funding agencies. They stated that these agencies must provide a science commons to enable data sharing, help to create a culture in which data sharing is rewarded, and enable access to data across research communities. The board members were adamant that the representatives of the funding agencies, the agencies themselves, the various individuals, and their respective institutions, all have a part to play in ensuring the long-term stewardship of data.

Swan and Brown (2008) examined the roles and career structures of data scientists and curators in order to provide recommendations for their career development. They defined and examined both the roles and the career trajectories of those who manage the data itself. First, the authors distinguished the following four roles based on interviews of practicing data scientists and curators.

  • Data creator: researchers with domain expertise who produce data. These people may have a high level of expertise in handling, manipulating and using data
  • Data scientist: people who work where the research is carried out – or, in the case of data centre personnel, in close collaboration with the creators of the data – and may be involved in creative enquiry and analysis, enabling others to work with digital data, and developments in data base technology
  • Data manager: computer scientists, information technologists or information scientists and who take responsibility for computing facilities, storage, continuing access and preservation of data
  • Data librarian: people originating from the library community, trained and specialising in the curation, preservation and archiving of data (Swan & Brown, 2008).

Next, the authors interviewed practitioners regarding their roles, responsibilities, and career satisfaction. They discovered that most data scientists moved into their roles “by accident rather than design”; that “there is no defined career structure”; and that they feel undervalued within their research groups due to the lack of professional training and/or a defined career path. Swan & Brown (2008) described three primary roles for libraries with respect to data stewardship. First, librarians must provide preservation and archiving services for data, particularly through Institutional Repositories. Second, they must provide consulting and training for data creators. Third, librarians must develop training curricula for data librarians.

The Interagency Working Group on Digital Data (2009) defined the various roles involved with “harnessing the power of digital data for science and society”. They described the entities by role, individual, sector, and life cycle phase/function, and the individuals by role and life cycle phase/function. They defined entities as research projects, data centers, libraries, archives, etc., and defined the role of each one and provided an existing example. For example, the authors provided eleven tasks under “role” for the entity “archives”, and provided the name of the National Archives and Records Administration as an example. They defined eleven different types of individual roles, including data scientist, librarian, and researcher, along with a corresponding definition for each role. Please go to Appendix A to view the complete set of tables as Figures 6-13.

In conclusion, the authors above have demonstrated that while one person may take on the multiple roles of data creator, data scientist, data manager, and data user, it ultimately takes an entire team and community to ensure the long-term survivability of research data.

Sustainability

General funding and sustainability estimates are covered in the section, “Managing Data: the Emergence & Development of Digital Curation & Digital Preservation Standards”. This section will focus on sustainability issues related to data sets, per se.

It is as challenging for information practitioners to determine the true cost of data stewardship as it is for them to measure the amount of digital data. Gray, Carozzi and Woan (2011) cited several studies and existing science archives, including one that had been built recently by an experienced archive staff. The authors wrote that staff costs, as well as acquisition and ingest costs, account for a substantial portion of preservation project funding, which echoed Lord and MacDonald’s (2003) earlier findings. They did not provide any hard numbers, though, and noted that those costs scaled only weakly as an archive grew larger. In other words, an archive’s initial size largely governs its costs; when an archive starts small and grows larger, the costs do not scale up proportionally.

Gray, Carozzi and Woan (2011) called for a costing model to be developed, as they found that there is a lack of consensus on the long-term costs related to the preservation of large-scale data. Lord and MacDonald (2003), Lyon (2007), Fry, Lockyer, and Oppenheim (2008), and Ball (2010) have all previously called for the development of a solid cost model as well, as they found it challenging to determine the “full costs of curating data”. One of the primary sources of confusion regarding how much data stewardship will cost is determining who is responsible (i.e., who will pay) for data stewardship and for the differing degrees of data curation (Blue Ribbon Task Force on Sustainable Digital Preservation and Access, 2008; Fry, Lockyer, and Oppenheim, 2008).

The problems the Blue Ribbon Task Force on Sustainable Digital Preservation and Access (2008) defined as barriers to developing an accurate cost model are systemic, rather than simply about finding and setting a price for the product. The problems they identified include: the idea that “current practices are good enough”; the fear of addressing adequate data stewardship because it is “too big”; inadequate incentives to support the group effort needed to create sustainable economic models; a lack of long-term thinking regarding funding models; and lack of clarity and alignment with regards to the various responsibilities and roles between data stakeholders.

The Task Force reviewed several models including the LIFE (Life Cycle Information for E-Literature) project and the model by Beagrie, Chruszcz, and Lavoie (2008). The members of the LIFE project aimed the model towards libraries, and one of their discoveries has been that “upfront (i.e., one-time) costs of a project are often distinct in structure from the recurring maintenance aspects of the same project” (Blue Ribbon Task Force on Sustainable Digital Preservation and Access, 2008).

Figure 4 – LIFE (Life Cycle Information for E-Literature) Project (Blue Ribbon Task Force on Sustainable Digital Preservation and Access, 2008).

For example, when the SHERPA-DP IR used the LIFE model to determine its full lifecycle costs, it found that, excluding interest rates and depreciation, its costs measured at the unit for which metadata is created (e.g., per-object cost for analogue, per-page cost for digital) were:

  • Year 1: 18.40 English pounds per year total cost
  • Year 5: 9.70 English pounds per year total cost
  • Year 10: 8.10 English pounds per year total cost (Blue Ribbon Task Force on Sustainable Digital Preservation and Access, 2008).
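
Taken at face value, the three reported figures already show the pattern the Task Force and the models discussed below describe: per-unit annual costs fall steeply at first and then flatten. The sketch below merely tabulates the reported SHERPA-DP figures and the implied percentage change between the reported years; it is not part of the original analysis.

```python
# Reported SHERPA-DP per-unit annual costs under the LIFE model (GBP),
# as summarized by the Blue Ribbon Task Force (2008).
costs = {1: 18.40, 5: 9.70, 10: 8.10}

years = sorted(costs)
for earlier, later in zip(years, years[1:]):
    drop = costs[earlier] - costs[later]
    pct = 100 * drop / costs[earlier]
    print(f"Year {earlier} -> Year {later}: "
          f"GBP {costs[earlier]:.2f} -> GBP {costs[later]:.2f} "
          f"({pct:.0f}% lower)")
```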

Beagrie, Chruszcz, and Lavoie (2008) developed a model to inform institutions of higher learning of their preservation costs. They built upon the work of the LIFE project team, and mapped it to the Trustworthy Repositories Audit & Certification: Criteria and Checklist and the OAIS Reference Model. The authors discovered upon the application and testing of the model “that the costs of preservation increase, but at a decreasing rate, as the retention period increases” (Blue Ribbon Task Force on Sustainable Digital Preservation and Access, 2008). The administrators of the Archaeology Data Service studied and re-adjusted their charging policy after applying Beagrie, Chruszcz, and Lavoie’s (2008) method to examine staff salaries, time, and days, and thereby reached a more realistic assessment of costs.

Beagrie and JISC (2010) summarized the model in a fact sheet that outlined recommendations to funders and institutions regarding what costs most (acquisition and ingest), the impact of fixed costs (they do not vary by the size of the collection and staff costs remain high), and the declining costs of preservation over time (they decline to minimal levels after 20 years). The authors outlined the benefits (direct, indirect, near- and long-term, private and public) to preserving research data; those benefits have been outlined throughout this paper. The authors discussed the various types of repositories and recommended a federated model with local storage at the departmental level, with additional back up at the institutional level. They also encouraged institutions to work with existing archives over creating new ones. Finally, they pointed out that research data are heterogeneous and are less likely to be stored in an Institutional Repository.

In conclusion, while Beagrie, Chruszcz, and Lavoie (2008) and the LIFE project, among others, have developed substantive cost models that provide very useful financial information for repository managers, these will need to be revised and updated over the long-term in order to determine the accuracy of the respective models.

Research Data Curation

General data curation is covered in another section, “Managing Data: the Emergence & Development of Digital Curation & Digital Preservation Standards”. Therefore, the remainder of this section will address only those areas related to research data curation that were not covered in the previous literature review on digital curation.

According to Ball (2010), the curation of research data is best understood in terms of the research data life cycle, data repositories, and funders’ requirements and guidance.

Data Curation vs. Digital Curation

What is data curation, and how does it differ from digital curation, if at all? First, it is important to note that the curation of scientific data goes back centuries, and data curation is an older term than “digital curation”. It has applied to journals, reports, or databases that were selected, annotated, normalized, and integrated to be used and re-used by other researchers or historians; these data were not, and are not, always in digital form. Data curation is a narrower concept than digital curation, and although the two phrases are often used synonymously, they are not interchangeable (Ball, 2010).

Second, to further clarify, Lord and MacDonald (2003) included the following tasks as part of data curation:

  • Selection of datasets to curate.
  • Bit-level preservation of the data.
  • Creation, collection and bit-level (or hard-copy) preservation of metadata to support contemporaneous and continuing use of the data: explanatory, technical, contextual, provenance, fixity, and rights information. Surveillance of the state of practice within the research community, and updating of metadata accordingly.
  • Storage of the data and metadata, with levels of security and accessibility appropriate to the content.
  • Provision of discovery services for the data; e.g. surfacing descriptive information about the data in local or third-party catalogues, enabling such information to be harvested by arbitrary third-party services.
  • Maintenance of linkages with published works, annotation services, and so on; e.g., ensuring data URLs continue to refer correctly, ensuring identifiers remain unique.
  • Identification and addition of potential new linkages to emerging data sources.
  • Updating of open datasets.
  • Provision of transformations/refinements of the data (by hand or automatically) to allow compatibility with previously unsupported workflows, processes and data models.
  • Repackaging of data and metadata to allow compatibility with new workflows, processes and (meta)data models (Ball, 2010).

The authors included curation tasks that are part of the broader concept of digital curation, such as bit-level preservation, metadata creation, and selection. They also provided for tasks specific to data curation when they included data transformation, refinement, and repackaging (e.g., data clean-up), all of which are tasks not normally associated with the curation of, say, digital objects consisting of e-prints or photographic images.
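
Several of the tasks above, notably bit-level preservation and the recording of fixity information, reduce in practice to computing and periodically re-verifying checksums. The sketch below is a minimal, generic illustration of that idea; the function names and the example file name are hypothetical and are not drawn from any repository system named in this review.

```python
import hashlib
from pathlib import Path

def fixity(path: Path, algorithm: str = "sha256") -> str:
    """Compute a checksum ("fixity value") for a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path: Path, recorded_value: str) -> bool:
    """Re-compute the checksum and compare it to the value recorded at ingest."""
    return fixity(path) == recorded_value

if __name__ == "__main__":
    # Hypothetical usage: record fixity at ingest, then verify it on a later audit.
    data_file = Path("dataset_2011.csv")
    if data_file.exists():
        recorded = fixity(data_file)
        print(data_file.name, "fixity OK" if verify(data_file, recorded) else "CHANGED")
```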

The Research Data Lifecycle

Ball (2012) wrote that lifecycle models help practitioners plan in advance for the various stages involved in the stewardship of digital data. There are several lifecycle models available for guidance. The author described the “I2S2 Idealized Scientific Research Activity Lifecycle Model” as a model produced from the researchers’ perspective, while the “DCC Curation Lifecycle Model” is produced from the perspective of information professionals. These two lifecycle models are representative of the information available in the various lifecycle models; time and space limitations prohibit a longer discussion of the pros and cons of them all.

Thus, this section will discuss the “I2S2 Idealized Scientific Research Activity Lifecycle Model”, and will attempt to describe the common themes across several available data management lifecycle models. The “DCC Curation Lifecycle Model” is covered in a previous essay, “Managing Data: the Emergence & Development of Digital Curation & Digital Preservation Standards”.

The members of the I2S2 project created the “I2S2 Idealized Scientific Research Activity Lifecycle Model” with the researcher’s perspective in mind, not the data manager’s perspective. Thus, Ball (2012) wrote that archiving is a very small part of the lifecycle. The goal of the project team members was to integrate, accelerate, and automate the research process, and so they created this lifecycle model in support of those goals. They designed the model to support research activity, not data management per se. Thus, they outlined the tasks involved throughout the lifecycle of a research project.

Figure 5 – I2S2 Idealized Scientific Research Activity Lifecycle Model (Ball, 2012).

The project team designed the model with four key elements: curation activity, research activity, publication activity, and administrative activity. They sketched out the curation activity as a task performed by a data archive or repository. They outlined the administrative activity as the process of applying for funding, providing for reports, and writing final reports. The authors of the model defined publication activity as those tasks involved with preparing the data for public use and the writing and publication of papers. And, finally, they defined the research activity as that part of the project that involves conducting the research itself.

The data management lifecycle models included in this section for purposes of describing themes common across all life cycles are: the Interagency Working Group on Digital Data (IWGDD) Digital Data Lifecycle Model (Interagency Working Group on Digital Data, 2009); the Data Documentation Initiative (DDI) Combined Life Cycle Model; the Australian National Data Service (ANDS) Data Sharing Verbs; the DataONE Data Lifecycle; the UK Data Archive Data Lifecycle; the Research360 Institutional Research Lifecycle; and the Capability Maturity Model for Scientific Data Management (Ball, 2012).

The themes common across all lifecycle models include planning the project; gathering, processing, analyzing, describing and storing the data; and, archiving the data for future use. It is interesting to note that only the “DCC Curation Lifecycle Model” provides for the deletion of data; an unstated assumption by the authors in the remaining models is that all data will be re-used and re-purposed.

Data Repositories

This section’s content is discussed in the previous essay, “Managing Data: Preservation Repository Design (the OAIS Reference Model)”.

Funders’ Requirements and Guidance

Administrators at both the National Institutes of Health and the National Science Foundation now require researchers to provide data management plans in their grant proposals. They have instituted policies that require researchers to make the resulting research data from the grant-funded project available for re-use within a reasonable length of time.

The National Institutes of Health

The authors of the National Institutes of Health (NIH) requirements (2003; 2010) have mandated that researchers share the final data set once the publication of the primary research findings has been accepted. They have made allowances for large studies: the data sets from large studies may be released in a series, as the results from each data set are published or as each data set becomes available.

The administrators at the NIH have required that all organizations and individuals receiving grants make the results of their research available to the public and to the larger research community. They have required a simple data management plan for any grant proposal requesting $500,000 or more in direct costs in a single year. If researchers cannot share the data, then they must provide a compelling reason to the NIH in the data management plan.

The grantors at the NIH have asked grantees to provide the following information in the data management plan: mode of data sharing; the need, if any, for a data sharing agreement; what analytical tools will be provided; what documentation will be provided; the format of the final data set; and, the schedule for sharing the data.

The following are three examples that employees of the NIH have provided to grant applicants as example data management plans.

  • Example 1: The proposed research will involve a small sample (less than 20 subjects) recruited from clinical facilities in the New York City area with Williams syndrome. This rare craniofacial disorder is associated with distinguishing facial features, as well as mental retardation. Even with the removal of all identifiers, we believe that it would be difficult if not impossible to protect the identities of subjects given the physical characteristics of subjects, the type of clinical data (including imaging) that we will be collecting, and the relatively restricted area from which we are recruiting subjects. Therefore, we are not planning to share the data.
  • Example 2: The proposed research will include data from approximately 500 subjects being screened for three bacterial sexually transmitted diseases (STDs) at an inner city STD clinic. The final dataset will include self-reported demographic and behavioral data from interviews with the subjects and laboratory data from urine specimens provided. Because the STDs being studied are reportable diseases, we will be collecting identifying information. Even though the final dataset will be stripped of identifiers prior to release for sharing, we believe that there remains the possibility of deductive disclosure of subjects with unusual characteristics. Thus, we will make the data and associated documentation available to users only under a data-sharing agreement that provides for: (1) a commitment to using the data only for research purposes and not to identify any individual participant; (2) a commitment to securing the data using appropriate computer technology; and (3) a commitment to destroying or returning the data after analyses are completed.
  • Example 3: This application requests support to collect public-use data from a survey of more than 22,000 Americans over the age of 50 every 2 years. Data products from this study will be made available without cost to researchers and analysts. https://ssl.isr.umich.edu/hrs/

User registration is required in order to access or download files. As part of the registration process, users must agree to the conditions of use governing access to the public release data, including restrictions against attempting to identify study participants, destruction of the data after analyses are completed, reporting responsibilities, restrictions on redistribution of the data to third parties, and proper acknowledgement of the data resource. Registered users will receive user support, as well as information related to errors in the data, future releases, workshops, and publication lists. The information provided to users will not be used for commercial purposes, and will not be redistributed to third parties. (National Institutes of Health, 2003)

The implementers of the NIH data management plans wanted to make them as simple as possible, as these plans are but one part of the NIH grant application. However, it is evident to most information professionals that these plans are not adequate for long-term data stewardship.

The National Science Foundation

The authors of the National Science Foundation (2011) policy on data management wanted to provide a way to share data within a community while recognizing intellectual property rights, allow for the preparation and submission of publications, and protect proprietary or confidential information. They have made it clear to grant recipients that they must facilitate and encourage data sharing.

The grant administrators at the NSF have required grant applicants to include a supplementary document of no more than two pages entitled “Data Management Plan”. The grant applicants must describe how any data resulting from the NSF-funded research will be disseminated and shared in accordance with NSF policies. The authors of the NSF’s data management plan (DMP) policy have recognized that each of the seven directorates has a different culture and different requirements for data sharing. Therefore, the administrators at the NSF have given each directorate leeway to determine the best data management practices for its domain, including whether or not researchers must deposit data in a public data archive (Hswe and Holt, 2010).

These policy makers have defined the following areas as items that may be included in a data management plan.

  • The types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project.
  • The standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies).
  • Policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements.
  • Policies and provisions for re-use, re-distribution, and the production of derivatives.
  • Plans for archiving data, samples, and other research products, and for preservation of access to them (National Science Foundation, 2011).

The authors of the data management plan policy have allowed for exceptions to the policy. They stated that grant applicants may include a data management plan that includes “the statement that no detailed plan is needed, as long as the statement is accompanied by a clear justification” (National Science Foundation, 2011).

Because grant administrators at the NSF have required data management plans from grant applicants only since January 2011, examples of data management plans from successful grant applications are not yet available, unlike with the NIH.

Librarians and archivists in the United States have drawn heavily upon the work performed by the employees and researchers of the Joint Information Systems Committee (JISC) and the Digital Curation Centre (DCC) in the United Kingdom. Most academic and research librarians at major research universities and related institutions have provided a plethora of online templates, tools, and resources for NSF grant applicants to use. While there is some variation in minor details, most of the data management plans created by information professionals contain the same elements. The Inter-university Consortium for Political and Social Research (ICPSR) (2012) and the California Digital Library (2012) are among those institutions and individuals that have developed extensive data management plan guidance for researchers.

Information professionals at ICPSR compiled their recommended elements for a data management plan that researchers may draw from when compiling a plan for either the NSF or NIH. They recommended that researchers include a description of the data; a survey of existing data; the existing formats of the data; any and all relevant metadata; data storage methods and backups; data security; the names of individuals responsible for the data; intellectual property rights; access and sharing; the intended audience; the selection and retention period; any procedures in place for archiving and preservation; ethics and privacy concerns; data preparation and archiving budget; data organization; quality assurance; and, legal requirements (ICPSR, 2011).
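
To make these recommended elements concrete, the following is a minimal sketch, in Python, of a machine-readable plan skeleton loosely following the ICPSR list above. The field names are illustrative assumptions rather than an official ICPSR schema, and an actual plan would, of course, be written as prose.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DataManagementPlan:
    # Field names loosely follow ICPSR's recommended elements; they are illustrative only.
    description: str
    existing_data_survey: Optional[str] = None
    formats: List[str] = field(default_factory=list)
    metadata: List[str] = field(default_factory=list)
    storage_and_backup: Optional[str] = None
    security: Optional[str] = None
    responsible_parties: List[str] = field(default_factory=list)
    intellectual_property: Optional[str] = None
    access_and_sharing: Optional[str] = None
    retention_period: Optional[str] = None
    archiving_and_preservation: Optional[str] = None
    ethics_and_privacy: Optional[str] = None
    budget: Optional[str] = None

    def missing_elements(self) -> List[str]:
        # Report which recommended elements the researcher has not yet completed.
        return [name for name, value in vars(self).items() if value in (None, "", [])]

# A plan with only a description still reports what remains to be documented.
plan = DataManagementPlan(description="Survey responses from 500 participants, stored as CSV files")
print(plan.missing_elements())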

Researchers and employees of the California Digital Library created the “Data Management Plan Tool” (2012) based on prior work by the Digital Curation Centre (2012) to allow researchers to quickly create a legible plan suitable to their particular funder’s requirements. For example, the authors of the tool took into account each NSF directorate’s requirements and created a separate template based on those requirements. They included funding agencies such as the Institute of Museum and Library Services (IMLS), the Gordon and Betty Moore Foundation, the National Endowment for the Humanities (NEH), and, of course, the NSF. They did not include a template for the NIH. The authors created the templates so that outputs in the final document created by the researcher may include information about data types, metadata and data standards, access and sharing policies, redistribution and re-use policies, and archiving and preservation policies. They designed the templates to output only the fields the researcher completes, so while there are standard templates based on requirements, the output may vary based on the information provided by the researcher (California Digital Library, 2012).
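
The “output only the fields the researcher completes” behavior can be illustrated with a short sketch; the template sections and answers below are invented, and this is not the DMPTool’s actual implementation.

# A simplified sketch of rendering only the completed sections of a funder template.
template = [
    ("Types of data produced", "data_types"),
    ("Metadata and data standards", "standards"),
    ("Policies for access and sharing", "access"),
    ("Policies for re-use and redistribution", "reuse"),
    ("Plans for archiving and preservation", "preservation"),
]

answers = {
    "data_types": "Sensor readings (netCDF) and scanned field notebooks (PDF).",
    "preservation": "Deposit in the institutional repository with a ten-year retention period.",
}

def render_plan(template, answers):
    """Emit only the sections the researcher has actually completed."""
    sections = []
    for heading, key in template:
        text = answers.get(key, "").strip()
        if text:  # empty sections are omitted from the output entirely
            sections.append(f"{heading}:\n  {text}")
    return "\n\n".join(sections)

print(render_plan(template, answers))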

Carlson (2012) created a data curation profile toolkit for librarians and archivists to use when interviewing researchers about their data. While Carlson did not create this toolkit in support of the NSF requirements, reference librarians may find it a useful resource for questions to draw upon when they collaborate with a scientist. The author designed the toolkit as a semi-structured interview to assist librarians in conducting a data curation assessment with a researcher. Carlson created a user guide, an interviewer’s manual, an interview worksheet, and a data curation profile template. He designed the questions to elicit the information required to curate data; most of the information required from the researcher maps to the recommended elements of the ICPSR Data Management Plan, above.

In conclusion, information professionals have been working hard to assist researchers in developing appropriate planning tools with which the researchers may steward the data. However, many researchers are unaware of these services, or consider them to be yet another bureaucratic hurdle (Research Information Network, 2008). It remains to be seen whether or not data creators will use the services information professionals have made available. It also remains to be seen whether or not the data management plans required and approved by the National Institutes of Health and the National Science Foundation will be adequate for long-term data stewardship, at least by the standards of information professionals.

The Application of Policies to Repositories and Data

This section briefly discusses the automation of preservation policies and the application of policies to data curation.

The Automation of Preservation Management Policies

How can information professionals tame the data deluge while stewarding data? One way is for these professionals to take human-readable data stewardship policies and implement them at the machine-level (Rajasekar, et al., 2006; Moore, 2005). This “policy virtualization” is discussed in a previous section, “Managing Data: the Emergence & Development of Digital Curation & Digital Preservation Standards”, and an example is presented.

Reagan W. Moore has stated that the challenge in virtualizing human-readable policies into machine-readable code is that most groups cannot prove that they are doing what they say they are doing (personal communication, January 6, 2012). This is a known problem; as Waters and Garrett (1996) stated in the Executive Summary of their seminal report, archives must be able to prove that “they are who they say they are by meeting or exceeding the standards and criteria of an independently-administered program”.

Moore & Smith (2007) automated the validation of Trusted Digital Repository (TDR) assessment criteria. They created four levels of assessment criteria mapped to TDR Assessment Criteria: Enterprise Level Rules, such as descriptive metadata; Archives Level Rules, such as consistency rules for persistent identifiers; Collection Level Rules, such as flags for service level agreements; and, Item Level Rules, such as periodic rule checks for format consistency. The authors implemented these rules using iRODS with DSpace.

The researchers successfully demonstrated that preservation policies could be implemented automatically at the machine level, and that an administrator could audit the system and prove that the TDR assessment criteria have been successfully implemented. In other words, Moore & Smith (2007) were able to prove that they are preserving what they have said they are preserving by virtualizing human-readable policies into machine-readable code. One application of Moore, et al.’s work is the SHAMAN (2011) project. These researchers have also successfully implemented an automated preservation system by virtualizing policies using iRODS (Moore, et al., 2007).
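
The following sketch illustrates, in simplified form, the kind of layered, machine-level policy checks described above. The rule names, record fields, and format policy are assumptions made for illustration; Moore & Smith expressed their rules in iRODS, not Python.

import hashlib
import pathlib

# An assumed collection-level format policy; a real archive would store this as policy metadata.
ALLOWED_FORMATS = {".pdf", ".tif", ".xml"}

def item_level_checks(path: pathlib.Path, recorded_sha256: str) -> dict:
    """Item Level Rules: periodic format-consistency and fixity checks against stored metadata."""
    actual = hashlib.sha256(path.read_bytes()).hexdigest()
    return {
        "format_ok": path.suffix.lower() in ALLOWED_FORMATS,
        "fixity_ok": actual == recorded_sha256,
    }

def collection_level_checks(collection: dict) -> dict:
    """Collection Level Rules: for example, a service-level agreement flag must be present."""
    return {"sla_present": bool(collection.get("service_level_agreement"))}

def audit(collection: dict) -> bool:
    """Roll item- and collection-level results into a single auditable pass/fail outcome."""
    results = [collection_level_checks(collection)]
    for item in collection.get("items", []):
        results.append(item_level_checks(pathlib.Path(item["path"]), item["sha256"]))
    return all(all(check.values()) for check in results)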

Another method is to encode all relevant metadata with the object itself. Gladney and Lorie (2005) and Gladney (2004) have proposed the creation of durable objects in which all relevant information is encoded with the object. This approach was briefly discussed in a previous essay, “Managing Data: Preservation Standards and Audit and Certification Mechanisms (i.e., “policies”)”.
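
As a minimal sketch of the self-describing object idea, the snippet below bundles content, its descriptive metadata, and a fixity value into a single package. It is an illustration only, not Gladney’s Trustworthy Digital Object design; the file names and metadata fields are assumptions.

import hashlib
import io
import json
import tarfile

def make_durable_package(content: bytes, metadata: dict, out_path: str) -> None:
    """Write one package that carries the object, its metadata, and its checksum together."""
    metadata = dict(metadata)
    metadata["sha256"] = hashlib.sha256(content).hexdigest()  # fixity stored alongside the content
    with tarfile.open(out_path, "w") as tar:
        for name, data in (("content.bin", content),
                           ("metadata.json", json.dumps(metadata, indent=2).encode("utf-8"))):
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))

make_durable_package(b"example payload",
                     {"title": "Example object", "creator": "J. Researcher"},
                     "durable_object.tar")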

The Application of Policies to Data and Data Curation

Beagrie, Semple, Williams, and Wright (2008) outlined a model of digital preservation policies and analyzed how those policies could underpin key strategies for United Kingdom (UK) Higher Education Institutions (HEI). They mapped digital preservation links to other key strategies for other higher education institutions, such as records management policies. They also examined current digital preservation policies and modeled a digital preservation policy. The authors proposed that funders use their study to evaluate the implementation of best practices within UK HEIs.

Similarly, Jones (2009) examined the range of policies required in HEIs for digital curation in order to support open access to research outputs. She argued that curation only begins once effective policies and strategies are in place. She mapped then-current curation policies to pinpoint the areas that needed further development and support so that open access to research outputs could be achieved. The author wrote that the implementation of curation policies in UK HEIs is patchy, although there have been some improvements. She concluded that for effective digital curation of open access research to occur, a robust infrastructure must be in place; financing and actual costs must be determined; and, the differing roles and responsibilities must be defined and put in place.

As noted earlier in this paper, research data has slightly different policy requirements than general digital library collections, such as ePrint archives. Green, MacDonald, and Rice (2009) addressed those policy differences and created a planning tool and decision-making guide for institutions with existing digital repositories that may add research data (sets) to their collections.

The authors based the guide on the OAIS Reference Model (CCSDS, 2002), the Trusted Digital Repository Assessment Criteria (CCSDS, 2011) and the OpenDOAR Policy Tool (Green, MacDonald, and Rice, 2009). They addressed policies related to datasets, primarily social science, but they included policies for content such as grey literature, video and audio files, images, and other non-traditional scholarly publications. The authors designed the guide with the idea of supporting sound data management practice, data sharing, and long-term access in a simplified format.

Thus, sound, strategically applied policies must underpin the efforts to steward data for the indefinite long-term, whether they are applied at the machine-level, or via human effort.

Summary: the Implications for the Long-term Stewardship of Research Data

Research data management is in flux, much like early digital libraries. In spite of all of this work to create standards, and various funder requirements, some data will be lost. The questions are: how much data will be lost; by whom; whether or not the data is replaceable; and, how valuable is having the actual data set itself, versus knowing the reported results of any published analysis of the lost data set(s)? It is also likely that some data sets will languish, unused but very carefully curated.

Having said that, much less data will be lost than if no repository and policy standards, and funder requirements, had been created and required in the first place. Standards and funder requirements can only do so much; the data creators themselves have to want to ensure the data is shareable and accessible for the long term, and the infrastructure must be in place for them to do so. This infrastructure includes not only the physical hardware and software, but also defined policies, standards, metadata, funding, and, roles and responsibilities, among others.

First among these infrastructure elements must be explicit incentives for researchers to take the time to annotate and clean up the data and any related software and scripts for re-use, or to take the time to ensure someone else does it for them. Information professionals must provide the data stewardship services, but it is up to the data creators to provide the data.

The final conclusion is that researchers want to focus on creating and analyzing data. Some researchers care about the long-term stewardship of their data, while others do not. It remains to be seen whether or not funders’ requirements for data sharing will impact how much data is actually made available for re-purposing, re-use, and, preservation.
Effective data stewardship requires not just technical and standards-based solutions, but also human, financial, and managerial ones. As the old proverb states, “You can lead a horse to water, but you cannot make him drink” (Speake & Simpson, 2008).


References

Anderson, C. (2008, October 23). The end of theory: the data deluge makes the scientific method obsolete. Wired, 16.07. Retrieved November 18, 2010, from http://www.wired.com/science/discoveries/magazine/16-07/pb_theory

Ball, A. (2012). Review of Data Management Lifecycle Models. Project Report. Bath, UK: University of Bath. Retrieved March 10, 2012, from http://opus.bath.ac.uk/28587/1/redm1rep120110ab10.pdf

Ball, A. (2010). Review of the State of the Art of the Digital Curation of Research Data. Project Report. Bath, UK: University of Bath, (ERIM Project Document erim1rep091103ab12). Retrieved January 25, 2012, from http://opus.bath.ac.uk/19022/

Beagrie, C. & JISC. (2010). Keeping Research Data Safe Factsheet Cost issues in digital preservation of research data. Charles Beagrie Ltd and JISC. Retrieved September 29, 2010 from http://www.beagrie.com/KRDS_Factsheet_0910.pdf

Beagrie, C., Chruszcz, J. & Lavoie, B. (2008). Keeping Research Data Safe. JISC. Retrieved September 9, 2009, from http://www.jisc.ac.uk/publications/publications/keepingresearchdatasafe.aspx

Beagrie, N., Chruszcz, J. & Lavoie, B. (2008). Executive summary. In Keeping research data safe. JISC. Retrieved January 24, 2009, from http://www.jisc.ac.uk/publications/publications/keepingresearchdatasafe.aspx

Beagrie, N., Semple, N., Williams, P. & Wright, R. (2008). Digital Preservation Policies Study Part 1: Final Report October 2008. Salisbury, UK: Charles Beagrie, Limited. Retrieved January 24, 2012 from http://www.jisc.ac.uk/media/documents/programmes/preservation/jiscpolicy_p1finalreport.pdf

Blue Ribbon Task Force on Sustainable Digital Preservation and Access. (2008, December). Sustaining the digital investment: issues and challenges of economically sustainable digital preservation. San Diego, CA: San Diego Supercomputer Center. Retrieved January 24, 2009, from http://brtf.sdsc.edu/biblio/BRTF_Interim_Report.pdf

Borgman, C.L. (2008). Data, disciplines, and scholarly publishing. Learned Publishing, 21, 29-38. Retrieved January 25, 2012, from http://www.ingentaconnect.com/content/alpsp/lp/2008/00000021/00000001/art00005

Borgman, C.L. (2010). Research Data: Who will share what, with whom, when, and why? Fifth China-North America Library Conference, September 8-12, 2010, Beijing, China. Retrieved December 15, 2010, from http://works.bepress.com/borgman/238

Burgess, C. (2011, January 31). Your Name, Your Privacy, Your Digital Exhaust. Infosec Island. Retrieved March 7, 2011, from http://infosecisland.com/blogview/11450-Your-Name-Your-Privacy-Your-Digital-Exhaust.html

California Digital Library. (2012). DMPTool. Retrieved February 12, 2012, from https://dmp.cdlib.org/

Carlson, J. (2012). Demystifying the data interview: developing a foundation for reference librarians to talk with researchers about their data. Reference Services Review, 40(1), 7-23. Retrieved February 9, 2012, from http://dx.doi.org/10.1108/00907321211203603

CCSDS. (2011). Audit and certification of trustworthy digital repositories recommended practice (CCSDS 652.0-M-1). Magenta Book, September 2011. Washington, DC: National Aeronautics and Space Administration (NASA).

CCSDS. (2002). Reference model for an Open Archival Information System (OAIS) (CCSDS 650.0-B-1). Washington, DC: National Aeronautics and Space Administration (NASA). Retrieved April 3, 2007, from http://nost.gsfc.nasa.gov/isoas/

Cragin, M.H., Palmer, C.L., Carlson, J.R. & Witt, M. (2010). Data sharing, small science, and institutional repositories. Philosophical Transactions of the Royal Society, 368, 4023-4038.

Curry, A. (2011). Rescue of Old Data Offers Lesson for Particle Physicists. Science, 331, 694-695.

Digital Curation Centre. (2012). DMPOnline. Retrieved February 12, 2012, from http://www.dcc.ac.uk/dmponline

Editor. (2009). Data’s shameful neglect. Nature, 461, 145.

Editor. (2005). Let data speak to data. Nature, 438, 531.

Evans, J.A. & Foster, J.G. (2011). Metaknowledge. Science, 331, 721-725.

Foster, N.F. & Gibbons, S. (2005). Understanding Faculty to Improve Content Recruitment for Institutional Repositories. D-Lib Magazine, 11(1). Retrieved March 8, 2012, from http://www.dlib.org/dlib/january05/foster/01foster.html

Fry, J., Lockyer, S., Oppenheim, C., Houghton, J., & Rasmussen, B. (2008). Identifying benefits arising from the curation and open sharing of research data produced by UK Higher Education and research institutes (Final report). London: JISC. Retrieved January 25, 2012, from http://ie-repository.jisc.ac.uk/279/

Gallagher, S. (2012, January). The Great Disk Drive in the Sky: How Web giants store big—and we mean big—data. Ars Technica. Retrieved March 7, 2012, from http://arstechnica.com/business/news/2012/01/the-big-disk-drive-in-the-sky-how-the-giants-of-the-web-store-big-data.ars

Gantz, J. & Reinsel, D. (2011). Extracting Value from Chaos. IDC #1142. Retrieved February 21, 2012, from http://idcdocserv.com/1142

Ginsparg, P. (2011). ArXiv at 20. Nature, 476, 145-147.

Gladney, H.M. & Lorie, R.A. (2005). Trustworthy 100-Year digital objects: durable encoding for when it is too late to ask. ACM Transactions on Information Systems, 23(3), 229-324. Retrieved December 17, 2011, from http://eprints.erpanet.org/7/

Gladney, H.M. (2004). Trustworthy 100-Year digital objects: evidence after every witness is dead. ACM Transactions on Information Systems, 22(3), 406-436. Retrieved July 12, 2008, from http://doi.acm.org/10.1145/1010614.1010617

Gray, N., Carozzi, T., & Woan, G. (2011). Managing Research Data — Gravitational Waves. Draft final report to the Joint Information Systems Committee (JISC). University of Glasgow: Research Data Management Planning (RDMP). Retrieved March 3, 2011, from https://dcc.ligo.org/public/0021/P1000188/006/report.pdf

Green, A., Macdonald, S., & Rice, R. (2009). Policy-making for research data in repositories: a guide. Edinburgh, UK: University of Edinburgh.

Hanson, B., Sugden, A., & Alberts, B. (2011). Making Data Maximally Available. Science, 331, 649.

Harley, D., Acord, S.K., Earl-Novell, S., Lawrence, S., & King, C.J. (2010). Assessing the Future Landscape of Scholarly Communication: An Exploration of Faculty Values and Needs in Seven Disciplines – Executive Summary. UC Berkeley: Center for Studies in Higher Education. Retrieved January 23, 2012, from http://escholarship.org/uc/item/0kr8s78v

Hey, T. and Trefethen, A. (2003). The Data Deluge: An e-Science Perspective. In F. Berman, G. Fox, and A. Hey (Eds.), Grid Computing – Making the Global Infrastructure a Reality (pp. 809-824). Chichester, England: John Wiley & Sons. Retrieved January 23, 2012, from http://eprints.ecs.soton.ac.uk/7648/

Hilbert, M. & López, P. (2011). The World’s Technological Capacity to Store, Communicate, and Compute. Science Express, 332(6025), 60-65.

Hough, M.G. (2009). Keeping it to ourselves: technology, privacy, and the loss of reserve. Technology in Society, 31, 406-413. Retrieved February 1, 2010, from http://libproxy.lib.unc.edu/login?url=http://dx.doi.org/10.1016/j.techsoc.2009.10.005

Hswe, P. & Holt, A. (2010). Guide for Research Libraries: The NSF Data Sharing Policy. E-Science. Association of Research Libraries. Retrieved January 6, 2012, from http://www.arl.org/rtl/eresearch/escien/nsf/index.shtml

Interagency Working Group on Digital Data. (2009). Harnessing the power of digital data for science and society. Report of the Interagency Working Group on Digital Data to the Committee on Science of the National Science and Technology Council. Washington, DC: Office of Science and Technology Policy. Retrieved April 9, 2009, from http://www.nitrd.gov/about/Harnessing_Power_Web.pdf

Inter-university Consortium for Political and Social Research (ICPSR). (2012). Guide to Social Science Data Preparation and Archiving: Best Practice Throughout the Data Life Cycle (5th ed.). Ann Arbor, MI. Retrieved January 5, 2012, from http://www.icpsr.umich.edu/icpsrweb/content/ICPSR/access/deposit/guide/

Inter-university Consortium for Political and Social Research (ICPSR). (2011). Elements of a Data Management Plan. Data Deposit and Findings. Ann Arbor, MI: University of Michigan, Institute for Social Research. Retrieved March 10, 2012, from http://www.icpsr.umich.edu/icpsrweb/content/ICPSR/dmp/elements.html

Jones, S. (2011). Summary of UK research funders’ expectations for the content of data management and data sharing plans. University of Glasgow: Digital Curation Centre (DCC). Retrieved January 26, 2012, from http://www.dcc.ac.uk/webfm_send/499

Jones, S. (2009). A report on the range of policies required for and related to digital curation. DCC Policies Report, v. 1.2. University of Glasgow: Digital Curation Centre. Retrieved January 26, 2012, from http://www.dcc.ac.uk/webfm_send/129

Kling, R. & McKim, G.W. (2000). Not just a matter of time: field differences and the shaping of electronic media in supporting scientific communication. Journal of the American Society for Information Science and Technology, 51(14), 1306-1320.

Lazorchak, B. (2011). Digital Preservation, Digital Curation, Digital Stewardship: What’s in (Some) Names? Retrieved March 11, 2012, from http://blogs.loc.gov/digitalpreservation/2011/08/digital-preservation-digital-curation-digital-stewardship-what’s-in-some-names/

Lord, P. & Macdonald, A. (2003). Data curation for e-Science in the UK: An audit to establish requirements for future curation and provision (E-Science Curation Report). London: JISC. Retrieved January 26, 2012, from http://www.jisc.ac.uk/media/documents/programmes/preservation/e-science reportfinal.pdf

Lynch, C. (2008). How do your data grow? Nature, 455, 28-29.

Lynch, C. (2008). The institutional challenges of cyberinfrastructure and e-research. Educause Review, 43(6). Washington, DC: Educause. Retrieved January 22, 2009, from http://www.educause.edu/EDUCAUSE+Review/EDUCAUSEReviewMagazineVolume43/TheInstitutionalChallengesofCy/163264

Lyon, L. (2007). Dealing with Data: Roles, Rights, Responsibilities and Relationships. Consultancy Report. University of Bath: UKOLN. Retrieved January 10, 2012, from http://opus.bath.ac.uk/412/

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C. & Byers, A.H. (2011). Big data: the next frontier for innovation, competition, and productivity. Report. Seoul: McKinsey Global Institute. Retrieved June 1, 2011, from http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation

Moore, R., Rajasekar, A., & Marciano, R. (2007). Implementing Trusted Digital Repositories. In Proceedings of the DigCCurr2007 International Symposium in Digital Curation, University of North Carolina – Chapel Hill, Chapel Hill, NC USA, 2007. Retrieved September 24, 2010, from http://www.ils.unc.edu/digccurr2007/papers/moore_paper_6-4.pdf

Moore, R. (2005). Persistent collections. In S.H. Kostow & S. Subramaniam (Eds.), Databasing the brain: from data to knowledge (neuroinformatics) (pp. 69-82). Hoboken, NJ: John Wiley and Sons.

Moore, R. & Smith, M. (2007). Automated Validation of Trusted Digital Repository Assessment Criteria. Journal of Digital Information, 8(2). Retrieved March 2, 2010, from http://journals.tdl.org/jodi/article/view/198/181

Narayanan, A. & Shmatikov, V. (2007). How To Break Anonymity of the Netflix Prize Dataset. Retrieved March 7, 2012, from http://arxiv.org/abs/cs/0610105

National Aeronautics and Space Administration. (2010). The National Aeronautics and Space Administration (NASA) Earth Observing System (EOS) Data Processing Levels. NASA Science Earth. Retrieved March 14, 2012, from http://science.nasa.gov/earth-science/earth-science-data/data-processing-levels-for-eosdis-data-products/

National Institutes of Health. (2010). Data Sharing Policy. NIH Grants Policy Statement (10/10) – Part II: Terms and Conditions of NIH Grant Awards, Subpart A: General – File 6 of 6. Retrieved March 7, 2012, from http://grants.nih.gov/grants/policy/nihgps_2010/nihgps_ch8.htm#_Toc271264951

National Institutes of Health. (2003). NIH Data Sharing Policy and Implementation Guidance. Grants Policy. Retrieved March 7, 2011, from http://grants.nih.gov/grants/policy/data_sharing/data_sharing_guidance.htm#fin

National Research Council. (2010). Steps toward large-scale data integration in the science summary of a workshop. Reported by S. Weidman and T. Arrison, National Research Council. Washington, D.C.: The National Academies Press.

National Science Board. (2011). Digital Research Data Sharing and Management. NSB-11-79, December 14, 2011. Arlington, VA: National Science Board. Retrieved January 18, 2012, from http://www.nsf.gov/nsb/publications/2011/nsb1124.pdf

National Science Foundation. (2011). NSF 11-1 January 2011 Chapter II – Proposal Preparation Instructions. Grant Proposal Guide. Retrieved January 16, 2011, from http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_2.jsp

National Science Foundation. (2011). Dissemination and Sharing of Research Results. NSF Data Sharing Policy. Retrieved January 15, 2011, from http://www.nsf.gov/bfa/dias/policy/dmp.jsp

National Science Foundation. (2010). Data Management for NSF Engineering Directorate Proposals and Awards. Engineering (ENG), the National Science Foundation. Retrieved September 2, 2010, from http://nsf.gov/eng/general/ENG_DMP_Policy.pdf

National Science Foundation. (2005). Long-lived digital data collections enabling research and education in the 21st century (NSB-05-40). Arlington, VA: National Science Foundation. Retrieved May 5, 2008, from http://www.nsf.gov/pubs/2005/nsb0540/

National Science Foundation Cyberinfrastructure Council. (2007). Cyberinfrastructure vision for 21st century discovery (NSF 07-28). Arlington, VA: National Science Foundation. Retrieved November 12, 2007, from http://www.nsf.gov/pubs/2007/nsf0728/index.jsp

Nelson, B. (2009). Data sharing: empty archives. Nature, 461, 160-163.

Pienta, A.M., Alter, G. & Lyle, J. (2010). The Enduring Value of Social Science Research: The Use and Reuse of Primary Research Data. Paper presented at the workshop on “the Organisation, Economics and Policy of Scientific Research”, held April 23-24, 2010, Torino, Italy. Retrieved January 5, 2012, from http://deepblue.lib.umich.edu/handle/2027.42/78307

Rajasekar, A., Wan, M., Moore, R. & Schroeder, W. (2006). A prototype rule-based distributed data management system. Paper presented at a workshop on “next generation distributed data management” at the High Performance Distributed Computing Conference, June 19-23, 2006, Paris, France.

Research Information Network, Institute of Physics, Institute of Physics Publishing, & Royal Astronomical Society. (2011). Collaborative yet independent: information practices in the physical sciences. A Research Information Network Report. London, UK: Research Information Network, December 2011. Retrieved January 26, 2012, from http://www.iop.org/publications/iop/2012/page_53560.html

Research Information Network. (2011). Data centres: their use, value, and impact. A Research Information Network report. London, UK: Research Information Network, September 2011.

Research Information Network. (2009). Patterns of information use and exchange: case studies of researchers in the life sciences. A Research Information Network Report. London, UK: Research Information Network, November 2009. Retrieved January 25, 2012, from http://www.rin.ac.uk/our-work/using-and-accessing-information-resources/patterns-information-use-and-exchange-case-studie

Research Information Network. (2008). Stewardship of digital research data: A framework of principles and guidelines. A Research Information Network report. London, UK: Research Information Network, January 2008.

Research Information Network. (2008). To Share or not to Share: Publication and Quality Assurance of Research Data Outputs. A Research Information Network report. London, UK: Research Information Network, June 2008.

Schofield, P.N., Bubela, T., Weaver, T., Portilla, L., Brown, S.D., Hancock, J.M., Einhorn, D., Tocchini-Valentini, G., Hrabe de Angelis, M., Rosenthal, N. & CASIMIR Rome Meeting participants. (2009). Post-publication sharing of data and tools. Nature, 461, 171-173.

Science and Technology Council. (2007). The digital dilemma strategic issues in archiving and accessing digital motion picture materials. The Science and Technology Council of the Academy of Motion Picture Arts and Sciences. Hollywood, CA: Academy of Motion Picture Arts and Sciences.

SHAMAN. (2011). Automation of Preservation Management Policies. SHAMAN – WP3-D3.4 (Report). Seventh Framework Programme and European Union.

Soehner, C., Steeves, C., & Ward, J. (2010, August). E-science and data support services: a study of ARL member institutions. Washington, D.C.: Association of Research Libraries. Retrieved November 18, 2010, from http://www.arl.org/bm~doc/escience_report2010.pdf

Solove, D.J. (2007). “I’ve got nothing to hide” and other misunderstandings of privacy. San Diego Law Review, 44, 745-772.

Speake, J. & Simpson, J. (2008). Oxford Dictionary of Proverbs. Oxford, UK: Oxford University Press.

Stewardship. (2012). ForestInfo.org. Dovetail Partners, Inc. Retrieved March 9, 2012, from http://bit.ly/zmNzy1

Stewardship. (2012). Free Merriam-Webster Dictionary. An Encyclopaedia Britannica Company. Retrieved March 9, 2012, from http://www.merriam-webster.com/dictionary/stewardship

Sullivan, B. (2012, March 6). Govt. agencies, colleges demand applicants’ Facebook passwords. MSNBC. Retrieved March 7, 2012, from http://redtape.msnbc.msn.com/_news/2012/03/06/10585353-govt-agencies-colleges-demand-applicants-facebook-passwords

Swan, A. & Brown, S. (2008). The skills, role and career structure of data scientists and curators: an assessment of current practice and future needs report to JISC. Truro, UK: Key Perspectives, Ltd. Retrieved January 18, 2012, from http://www.jisc.ac.uk/publications/reports/2008/dataskillscareersfinalreport.aspx

Sweeney, L. (2002). K-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5), 557-570.

Tickletux. (2007). Did Bill Gates say the 640k line? Retrieved from http://imranontech.com/2007/02/20/did-bill-gates-say-the-640k-line/

Toronto International Data Release Workshop Authors. (2009). Prepublication data sharing. Nature, 461, 168-170.

UKRDS. (2008). UKRDS interim report UKRDS the UK research data service feasibility study (v0.1a.030708). London: Serco Ltd. Retrieved April 9, 2009, from http://www.ukrds.ac.uk/UKRDS%20SC%2010%20July%2008%20Item%205%20(2).doc

Van den Eynden, V., Corti, L., Woollard, M., Bishop, L. & Horton, L. (2011). Managing and Sharing Data: Best Practices for Researchers, 3rd edition. University of Essex: UK Data Archive. Retrieved January 5, 2012, from http://www.data-archive.ac.uk/media/2894/managingsharing.pdf

Walters, T. & Skinner, K. (2011). New roles for new times: digital curation for preservation. Report prepared for the Association of Research Libraries. Washington, D.C.: Association of Research Libraries. Retrieved April 2, 2011, from http://www.arl.org/bm~doc/nrnt_digital_curation17mar11.pdf

Waters, D. and Garrett, J. (1996). Preserving Digital Information. Report of the Task Force on Archiving of Digital Information. Washington, DC: CLIR, May 1996.


Appendix A

The following tables (figures) of organizations, individuals, roles, sectors, and types involved with data management are from the Interagency Working Group on Digital Data (2009).

  1. Entities by Role
  2. Entities by Individual
  3. Entities by Sector
  4. Individuals by Role
  5. Individuals by Life Cycle Phase/Function
  6. Entities by Life Cycle Phase/Function

Figure 6 – Entities by Role, 1 of 3 (Interagency Working Group on Digital Data, 2009).

Figure 7 – Entities by Role, 2 of 3 (Interagency Working Group on Digital Data, 2009).

Figure 8 – Entities by Role, 3 of 3 (Interagency Working Group on Digital Data, 2009).

Figure 9 – Entities by Individuals (Interagency Working Group on Digital Data, 2009).

Figure 10 – Entities by Sector with footnotes (Interagency Working Group on Digital Data, 2009).

Figure 11 – Individuals by Role (Interagency Working Group on Digital Data, 2009).

Figure 12 – Individuals by Life Cycle Phase/Function (Interagency Working Group on Digital Data, 2009).

Figure 13 – Entities by Life Cycle Phase/Function (Interagency Working Group on Digital Data, 2009).



Content Analysis Methodology Literature Review

Content Analysis Methodology | Literature Review and comprehensive exams

Abstract

Content analysis is a systematic research technique that provides a method for the qualitative and quantitative analysis of a corpus of information, generally text. This section introduces content analysis and describes applications of the technique, the types of content measured, and sampling considerations. The reliability and validity of studies and research results are discussed, particularly as they apply to human coding versus computer-aided analysis. The similarities and differences between quantitative and qualitative content analysis are explored and outlined. Finally, the section concludes with a methodological assessment of two peer-reviewed articles that used the content analysis method to obtain answers to specific research questions.

Citation

Ward, J.H. (2012). Managing Data: Content Analysis Methodology. Unpublished manuscript, University of North Carolina at Chapel Hill. (pdf)

Creative Commons License
This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.

 

Table of Contents

Abstract

Introduction

Applications of Content Analysis

Content Types—Manifest Versus Latent

Population, Sampling and the Unit of Analysis

Reliability and Validity

Human Versus Computer Coding—Reliability and Validity Examined

Quantitative Versus Qualitative Content Analysis

The Steps: Quantitative Data Analysis
The Steps: Qualitative Data Analysis

Examples

Example 1: “Cataloging Professionals in the Digital Environment: A Content Analysis of Job Descriptions”
Example 2: “Research Anxiety and Students’ Perceptions of Research: An Experiment. Part II. Content Analysis of Their Writings on Two Experiences”

Conclusion

References

 

Introduction

Content analysis is a research technique that involves the systematic analysis of text, including images and symbolic matter, in order to make replicable and valid inferences from the material examined (Krippendorf, 2004; Weber, 1990). The method may be used in qualitative, quantitative, or mixed-methods studies with a multitude of research objectives and questions. It “is the study of recorded human communications” (Babbie, 2001) with a “systematic, objective, quantitative analysis of message characteristics” (Neuendorf, 2002). The flexibility and objectives of this process make it particularly suitable for Information Science research, given that the domain is the “study of gathering, organizing, storing, retrieving, and dissemination of information” (Bates, 1999).

A researcher applying content analysis methods would be interested in the “aboutness” of the content, more so than the content itself. For example, how often is a particular word used or not used? What can one infer from the text that is not directly stated? What themes or trends do the data indicate? How does the sample population feel about X, Y, or Z based on an analysis of the text? Thus, an Information Science researcher may utilize content analysis to answer questions about the underlying structure, form, and organization of the information contained in survey responses, books, transcribed interviews, journal articles, newspapers, web content, recorded conversations, etc.

While it is primarily a product of the 20th Century, content analysis has some long historical roots. Precursors to content analysis range from the analysis of texts by the Catholic Church in the 1600s in order to monitor and enforce orthodoxy, to the dissection of hymns in 18th Century Sweden, on to the statistical evaluation of news and novels in the late 1800s to early 1900s.

The rise of mass communication during the 1920s in the form of radio and, later, in the 1950s in the form of television, combined with the 1929 economic crash, the Depression, World War II, and the start of the Cold War, created the conditions ripe for the evolution of content analysis from a journalism-driven quantitative analysis to an established and codified research method with both qualitative and quantitative variants (Krippendorf, 2004). The public and researchers wanted answers to questions related to everything from the buying trends of a particular demographic to an analysis of Soviet propaganda. Berelson (1952) provided the first consolidated text about content analysis, and, as a result, its use spread beyond newspapers, espionage, and sociology to other disciplines and fields as diverse as psychology, anthropology, and history.

The development of computers in the mid-20th century and the rise of computer-aided text analysis (CATA) further integrated content analysis into mainstream human communications research. Over the past half-century, researchers have repeatedly demonstrated that computers using a variety of software may be used to reliably process large tracts of text much faster than humans. Computer software is available to support both quantitative (deductive) and qualitative (inductive) content analysis.

Computer-aided text analysis works by providing a standard dictionary against which the software processes the text. Alternately, a researcher may create a custom dictionary based on variables relevant to the study (Neuendorf, 2002). The computer may perform a quantitative analysis of word count, for example, or a more nuanced “analysis” of textual patterns (Evans, 1996). One example of pattern analysis would be to predict stock market fluctuations by analyzing Twitter posts (Bollen, Mao & Zeng, 2010). However, in spite of more than 50 years of computer-aided text analysis, human and computer coding of the same text have produced markedly different findings (Spurgin & Wildemuth, 2009). While some studies have concluded that computers and humans may code and analyze text equally badly or well (Nacos, et al., 1991), at this point in time, computers are viewed as aids to the human process of coding and analysis, not a substitute (Krippendorf, 2004; Spurgin & Wildemuth, 2009).
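
The dictionary-based approach can be illustrated with a short sketch. The category names and terms below are invented for the example; production CATA packages rely on much larger, validated dictionaries.

import re
from collections import Counter

# A tiny, invented custom dictionary mapping coding categories to terms.
custom_dictionary = {
    "positive": {"good", "happy", "excellent", "improved"},
    "negative": {"bad", "anxious", "poor", "worse"},
}

def code_text(text: str, dictionary: dict) -> Counter:
    """Count how many tokens in the text fall into each dictionary category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for category, terms in dictionary.items():
        counts[category] = sum(1 for token in tokens if token in terms)
    return counts

print(code_text("The results were good, but coders felt anxious about the poor audio.",
                custom_dictionary))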

Whether or not a researcher or research team chooses to use computer software to aid in the analysis of text, they must follow the scientific method and be systematic in their approach. The quantitative approach to content analysis requires a deductive method in which a hypothesis is formed and valid and replicable inferences may be made from the text (White & Marsh, 2006; Krippendorf, 2004). The investigator will choose the data via random, systematic sampling and all data will be gathered prior to coding. The researcher will develop the coding scheme a priori, and she may re-use existing coding schemes. The coding objective is to test for reliability and validity using statistical analysis (White & Marsh, 2006).

A researcher who applies the qualitative content analysis method will use an inductive, grounded theory approach where the research questions guide the iterative data gathering and analysis. The investigator uses purposive sampling and may continue to gather data after coding has begun. As themes arise in the course of coding and analyzing the data, the researcher will determine the important patterns and concepts, and may add additional coding schemes as needed. This is a subjective method that still requires the systematic application of techniques to ensure the credibility, transferability, dependability, and confirmability of the eventual results (Lincoln & Guba, 1985; White & Marsh, 2006). Thus, the results of a qualitative content analysis are subjective and descriptive, but they are systematically grounded in the themes and concepts that emerge from the data. Weber (1990) writes that the best content analyses use both quantitative and qualitative operations, while Krippendorf (2004) states that both methods are indispensable to the analysis of texts.

Applications of Content Analysis

A common application of content analysis in ILS research is the study of position announcements. White (1999) analyzed electronic resource position announcements posted between 1990 and 1998 to determine whether or not position requirements had changed with the rise of the World Wide Web in the mid-1990s. He quantitatively examined the words that appeared in the postings to produce tables of salaries offered, position titles, job responsibilities, required skills and qualifications, and educational requirements. The results of the study indicated that technology-related skills were increasingly important, and that salaries increased above inflation and were higher than average. Over the long term, this type of study may inform LIS curricula, as well as provide information to practitioners on what skills they need to develop and/or maintain in order to remain relevant.

A similar LIS study by Park, Lu, & Marion (2009) ten years later examined cataloging professionals’ job descriptions to re-assess the current skill set requirements. Study researchers applied the quantitative content analysis method, but with the added “layer” of additional statistical analysis to check the results. That is, in addition to the straight “count” of terms, Park, Lu & Marion (2009) converted the category counts to co-occurrence similarity values to “compensate for large differences in counts for commonly occurring terms”. Similar to White (1999), the results of this study indicate that technological advances in the 2000s have influenced job responsibilities, position titles, job descriptions, skills, and the qualifications required for a cataloging position.
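
One way to move beyond raw counts, in the spirit of the similarity values described above, is to compare category count profiles with a standard similarity measure. The counts below are invented, and cosine similarity is used here as a stand-in rather than as the authors’ exact co-occurrence computation.

import math

def cosine_similarity(u, v):
    """Compare two count profiles independently of their absolute magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical counts of two skill terms across five job postings.
metadata_counts = [3, 0, 2, 1, 4]
xml_counts = [2, 0, 1, 1, 3]
print(round(cosine_similarity(metadata_counts, xml_counts), 3))  # close to 1.0 despite different totals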

Researchers may also use content analysis to gauge users’ perceptions of the phenomenon of interest or to predict how those sentiments may drive changes to other indicators. Kracker and Wang (2002) qualitatively compared participants’ perceptions of a past research project with their current perceptions of a research project. The investigators examined students’ emotional states, perceptions, and like or dislike of the various research stages; they cross-referenced feelings and thoughts with regard to demographics. The results confirmed Kuhlthau’s Information Search Process (ISP), and the study will be discussed in more depth later in this paper.

One application of sentiment analysis is to examine Twitter posts to determine if the emotions expressed predict the direction of the Dow Jones Industrial Average (DJIA) (Bollen, Mao, & Zeng, 2010). The researchers in this study used two software programs, plus a Granger causality analysis and a Self-Organizing Fuzzy Neural Network, to determine the collective mood of Twitter users. They then compared the results to the up and down movement of the DJIA. The data indicates that the collective mood as determined via Twitter can predict the direction of the stock market. The team cross-validated the results by examining users’ moods on Twitter prior to the Presidential Election and Thanksgiving 2008. Again, communal Twitter sentiment predicted the 2008 Presidential Election and events around Thanksgiving 2008. The results of the content analysis indicate that public mood may be correlated to and predictive of economic events.

Content Types—Manifest Versus Latent

Initially, content analysis operations focused on manifest content — that is, communications that are objective, systematic, and quantitative (Berelson, 1952). Researchers focused on those facets of the text that were present, easily observable, and countable. For example, White’s (1999) study of electronic resource position announcements considered primarily manifest content. Either a word or phrase appeared and was counted, or it did not and, therefore, could not be counted.

As the method evolved, content analysis researchers have examined the latent meanings held within the text, not just manifest content. For example, during World War II, intelligence agents were able to predict Axis military campaigns by examining the underlying meanings of manifest communications to the public that were designed by Axis governments to build popular support for a forthcoming political or military campaign (Krippendorf, 2004). Allied intelligence agents determined these campaigns were impending by reading between the lines of seemingly innocuous news stories and announcements.

Two modern applications of latent content analysis are the analysis of sentiment, mentioned previously in Kracker and Wang’s (2002) analysis of students’ perceptions of research, and Bollen, Mao, and Zeng’s (2010) analysis of Twitter posts to predict the DJIA. In both studies, manifest content was examined to determine latent content. Strictly speaking, content analysis should only consider manifest content (Berelson, 1952), but leading content analysis methodologists such as Neuendorf (2002) and Krippendorf (2004) agree that study results obtained via latent analysis of manifest content can produce results that are both reliable and valid. However, the researcher who examines latent content must be sure to pay strict attention to the issues of reliability and validity to ensure a solid study design (Spurgin & Wildemuth, 2009).

Population, Sampling and the Unit of Analysis

When a researcher is designing a content analysis study, she must first determine the sample population from which she will draw her data. Then, she must determine the unit to be examined. The unit of analysis, sometimes referred to as the recording or coding unit, is “distinguished for separate description, transcription, recording, or coding” (Krippendorf, 2004) so that the population may be identified, the variables measured, or the analysis reported. The unit itself may be physical, temporal, or conceptual (Spurgin & Wildemuth, 2009). An example of a physical unit of analysis is a word, sentence, or paragraph. If the unit of analysis is temporal, then an investigator would count some amount of time, e.g., a minute or an hour, of an audio or video recording. A conceptual count involves examining every instance of an argument or statement.

For example, when Nacos, et al. (1991) chose to compare human versus computer coding of content, they chose a sample that consisted of articles about the invasion of Grenada from The New York Times and The Washington Post. The news articles ranged in date from January 1, 1983, to November 25, 1983. The team only examined articles in the first sections of the newspapers, and excluded Op-Ed pieces. Within each article, the unit of analysis they chose to examine was the paragraph. Park, Lu, & Marion (2009) examined cataloging job descriptions posted on a listserv between January 2005 and December 2006 as the sample. As part of a pilot study, they coded several dozen job descriptions and determined the unit of analysis to be the most frequently occurring categories, such as responsibilities, job titles, required job qualifications and skills, and preferred job qualifications and skills. An alternate unit of analysis within the latter study might have been the job posting itself.

There are as many as nine types of sampling methods that could be applied to an examined text as part of a quantitative content analysis — random, systematic, stratified, varying probability, cluster, snowball, relevance, census, and convenience (Krippendorf, 2004). The sampling technique the investigator will use depends on the type of content analysis to be performed on the material chosen. If a researcher applies the quantitative method, then one type of systematic sampling is used to provide for the generalization of the results to a larger population. In this instance, random sampling is preferred (White & Marsh, 2006). If a researcher applies the qualitative content analysis method, then purposive sampling is applied. Data may continue to be gathered throughout the project as themes and patterns emerge.
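
The difference between random and systematic sampling, two of the methods named above, can be shown in a few lines; the corpus of document identifiers is hypothetical.

import random

corpus = [f"article_{i:04d}" for i in range(1, 501)]  # 500 candidate documents

# Random sampling: every document has an equal chance of selection.
random_sample = random.sample(corpus, k=50)

# Systematic sampling: a random starting point, then every k-th document thereafter.
k = len(corpus) // 50
start = random.randrange(k)
systematic_sample = corpus[start::k]

print(len(random_sample), len(systematic_sample))  # both draw 50 documents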

Reliability and Validity

The reliability of a content analysis depends on whether and to what extent agreement can be achieved among coders, judges, observers, or measuring instruments (Krippendorf, 2004). Inter-coder reliability implies, for example, that all coders have consistently and repeatedly coded material the same way, regardless of which or what texts they examined. Reliability provides an empirical grounding for the confidence that the interpretation of the data will mean the same to anyone who analyzes it, and that as much bias as possible has been removed from the interpretation. Reliability ensures that the results of a study may be replicated when the same research procedure is applied; it ensures that a measurement is consistently the same throughout a study. A researcher may check the reliability of a variable by using Spearman’s rho, Scott’s pi, or Pearson’s r (Neuendorf, 2002). Krippendorf (2004) has also developed an alpha to aid reliability testing.
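
As an illustration of one of these statistics, the sketch below computes Scott’s pi for two coders assigning nominal categories to the same units; the codings themselves are invented.

from collections import Counter

def scotts_pi(coder_a, coder_b):
    """Observed agreement corrected by expected agreement from pooled category proportions."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    pooled = Counter(coder_a) + Counter(coder_b)
    expected = sum((count / (2 * n)) ** 2 for count in pooled.values())
    return (observed - expected) / (1 - expected)

coder_a = ["skill", "skill", "title", "duty", "skill", "title"]
coder_b = ["skill", "duty", "title", "duty", "skill", "skill"]
print(round(scotts_pi(coder_a, coder_b), 3))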

Validity ensures that evidence independent of the study itself, and available for scrutiny, may corroborate the research results. The accuracy of the measurement is gauged — it must measure what the researcher intends to measure (Neuendorf, 2002). This evidence may be in the form of new observations, other available texts, open data, or competing theories and interpretations. The quality of the study results must be “true” — they must be what the researcher states they are.

If a measure is not reliable, then it cannot be considered valid (Neuendorf, 2002). It can be challenging for a researcher to balance reliability and validity; however, if the measurement is not accurate (valid), then it is less important that it has been consistently measured (reliable). Thus, it is better for an investigator to aim for high validity rather than high reliability.

Human Versus Computer Coding—Reliability and Validity Examined

One question researchers have considered as part of CATA is whether or not human coding is more reliable and valid than computer coding. After all, in spite of easy access to computers and the Internet, human coders often perform content analysis. This makes it a labor-intensive, costly, time-consuming, and tedious operation. Software that aids in content analysis while providing high validity and reliability would be highly desirable as a way to cut costs and increase the speed at which a corpus may be measured. Evans (1996) examined the available tools and techniques for computer-supported content analysis, but he did not evaluate the effectiveness of the tools against human coders.

When Nacos, et al., (1991) took an existing corpus that had already been examined by human coders, and compared the results of a computer analysis of the same data set, they concluded that computers have the advantage when it comes to processing large volumes of text consistently, accurately and quickly, especially when the goal of the study is a combined measure of content. However, they found that the advantages of using human coders over computers are not trivial. For example, when it comes to coding text, computers cannot recognize when there is a problem — such as ambiguity — when the rules and data dictionaries are not as precise or comprehensive as needed. Nor can a computer determine when a particular paragraph that is being coded does not make sense within the context of the preceding paragraphs, and adjust accordingly. The study results indicate that the computer provides high reliability at the expense of validity, while the human coders provide high validity at some expense to reliability.

However, a previous study by Rosenberg, Schnurr, & Oxmann (1990) concluded that human-scored methods provided less validity than the computerized method when used to make inferences about the psychological states and traits of a writer or speaker. These researchers compared one simple and one sophisticated computerized approach with a context-sensitive, human-scored system. Their final recommendation is that a simple, computerized content analysis should be the first procedure of any content analysis study design. The picture is further complicated by Morris’ (1994) comparison of human and computer coding results in the management research domain — she found no significant difference overall between the results of human and computer coding regardless of the unit of analysis.

Whereas Nacos et al. (1991) compared human and computer coding at the paragraph level, and Rosenberg, Schnurr, & Oxman (1990) compared human and computer coding of speech samples of five minutes in length, Morris compared human and computer coding at the sentence, word or phrase, paragraph, sentence density, paragraph density, and hit density units of analysis. She designed the study to compare not just human and computer coding, but also to determine whether the unit of analysis affects the results. She drew her sample from the mission statements and letters to shareholders of Fortune 500 firms. She found no significant difference between the results obtained by human coders and the results of the computer analysis for any unit of analysis.

Where differences between human and computer coding did occur, they were attributable to two possible sources of error: either the human coders did not receive accurate training and coding instructions, or there was an error in the computer’s coding instructions. In both cases, coding errors may be minimized by revising the computer analysis programs during the study, in the same way that human coders sometimes receive additional training and experience during the course of an investigation.

Morris (1994) identifies several advantages to using CATA over human coders, among them high reliability, quantitative results (word counts, etc.) that would be time-consuming to produce manually, and the ability to process large volumes of data quickly and inexpensively. However, she also recognizes that computers have limitations that may impact validity (a brief dictionary-coding sketch illustrating these trade-offs follows the list below), such as:

  • an inability to recognize ambiguous language and the intent of the communication within context;
  • the inability of the computer to resolve references to words appearing elsewhere, such as pronouns referring to nouns in other sentences;
  • word crunching that produces quantitative data may produce spurious results; and,
  • the reliability and validity of the computer results must be checked through a pilot test of human versus computer inter-coder reliability.
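
To make these trade-offs concrete, here is a minimal, hypothetical sketch of dictionary-based computer coding in Python; the categories, dictionary terms, and sentences are invented for illustration and do not come from Morris (1994) or any other study discussed here.

```python
# Minimal sketch of dictionary-based computer coding (CATA).
# The dictionary and sentences are invented for illustration; a real study
# would build its dictionary from the literature, pilot coding, and pretests.
import re
from collections import Counter

dictionary = {
    "conflict": {"attack", "strike", "dispute"},
    "economy":  {"market", "trade", "strike"},   # "strike" is ambiguous
}

texts = [
    "Workers voted to strike over the trade dispute.",
    "The air strike ended the border dispute.",
]

def code_text(text: str) -> Counter:
    """Count dictionary hits per category. The computer applies the same
    rules every time (high reliability), but it cannot tell a labor strike
    from an air strike (a validity problem a human coder would catch)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for category, terms in dictionary.items():
        counts[category] = sum(1 for t in tokens if t in terms)
    return counts

for t in texts:
    print(t, dict(code_text(t)))
```

The program produces identical counts on every run, yet it codes a labor “strike” and an air “strike” the same way, which is precisely the kind of unresolved ambiguity and spurious quantification described in the list above.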

Her final conclusion is that although there was no significant difference between the results of human and computer coding in her study, if a researcher wishes to use machine coding over human coders, the study design and research question must be appropriate for CATA.

In a more recent study, King & Lowe (2003) compared the results of human versus machine coding of international conflict events data, including automatically generated events data. Like Morris (1994), they found no significant difference between the results obtained by human coders and those obtained by the computer. Unlike Morris, however, King & Lowe (2003) recommend using computers over humans in all studies because of the reduced expense; they do not recommend making the choice on a case-by-case basis.

In conclusion, while there are advantages to using computers over humans because of the high reliability, Spurgin and Wildemuth (2009) caution that if the rule sets are not consistent, there may be questions about internal consistency (i.e., reliability). Therefore, in order to use CATA for a content analysis, a researcher must have appropriate research questions and an appropriate study design, must understand the software she is using, and must choose the right software for the job. If a sample is small enough, it may be faster for two people to code the data than to set up the software and data dictionary to process it. Again, the decision to use human or computer coders should be made on a case-by-case basis.

Quantitative Versus Qualitative Content Analysis

The most basic form of text analysis is quantification of text, yet to do so reduces text analysis to a simple tallying activity. The value of a content analysis lies in discovering any context and meaning that may be hidden within the categorized message. However, while the best content analyses apply both quantitative and qualitative methods (Krippendorf, 2004; Weber, 1990), each method is based on a slightly different process. A quantitative content analysis is based on the deductive, scientific method, while the qualitative approach is based on an inductive, grounded theory process.

The Steps: Quantitative Data Analysis

The core steps of the scientific method applied to any study in any domain are: theory, operationalization, and observation. The scientific method operationalizes deductive logic, which goes from the more general to the more specific in the following order: theory, hypothesis, observation, and empirical generalization (Babbie, 2001). As applied in ILS, Crawford and Stucki (1990) identified eight steps:

  1. Establish a question.
  2. Devise a hypothesis or question to be tested.
  3. Design the study methodology.
  4. Create a research team, write a proposal, and receive funds.
  5. Set up the research team.
  6. Gather the data, code the data, and test the hypothesis.
  7. Analyze the data to determine if it supports the hypotheses or provides an answer to the research question.
  8. Report the results to the larger community for peer-review and to contribute to the field.

In theory, a quantitative content analysis follows the general outline of the scientific method. According to Neuendorf (2002), a typical content analysis process comprises nine steps.

  1. Theory and rationale: What are the questions? The hypotheses? What body of work will be examined, and why? Why is this important? Does the current literature address this question or these questions?
  2. Conceptualization: What dictionary-type definitions will you use with what variables? What will you sample, and what sample will you gather and why?
  3. Operationalization: What type of a priori coding scheme will the researcher use? Do the measures match the conceptualizations? What units will be sampled? How do you determine and verify validity and reliability?
  4. Develop the Coding Scheme: If human coders are used, what codebook and coding form will be used? If a computer is used, then what dictionary will be created or re-used?
  5. Sample: What sample size does the researcher need for valid results? How will the researcher randomly sample the data? (A brief sampling sketch follows this list.)
  6. Run a Pilot Test and Check Inter-coder Reliability: How much do the coders agree during a pilot test? Are the variables reliable? Have the codebook and form been revised as needed? Has the researcher run a spot test of humans versus the computer to check for human-computer reliability?
  7. Code the Data: If human coders are used, are there at least two coders? Does the data overlap by at least 10% to check for reliability? If a computer is used, has the researcher spot-checked for validity?
  8. Calculate the Reliability: What reliability figure is used for each variable? Pearson’s r, Spearman’s rho, Krippendorf’s alpha, Cohen’s kappa, or Scott’s pi? And why?
  9. Tabulate and Report the Results: What statistical operation is appropriate for the data? Univariate? Cross-tabulation? Are there other bivariate and multivariate techniques that may be run on the data? Why were these techniques used?
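
As a small, hypothetical illustration of steps 5 and 7, the following Python sketch draws a simple random sample from a corpus and sets aside a 10% overlap to be coded by both coders for the reliability check; the corpus size, sample size, and document identifiers are placeholders, not values prescribed by any of the cited authors.

```python
# Hedged sketch of steps 5 and 7: draw a simple random sample of documents
# and a 10% overlap subset to be coded by both coders as a reliability check.
# The corpus size, sample size, and identifiers are placeholders.
import random

random.seed(42)  # fixed seed so the draw can be reproduced and reported

population = [f"doc_{i:04d}" for i in range(1, 2001)]         # hypothetical corpus
sample = random.sample(population, k=200)                     # units to be coded
overlap = random.sample(sample, k=max(1, len(sample) // 10))  # 10% double-coded

print(len(sample), len(overlap))  # 200 20
```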

Content analysis as a method examines not just the mere count of some unit within a corpus; a researcher applying the method must also be concerned with latent meaning within the text. The study design itself must be rigorous and follow a logical design. It is entirely possible to design a study that examines both latent and manifest content, yet also follows the reasoning of the scientific method.

The Steps: Qualitative Data Analysis

There are multiple approaches to qualitative data analysis. Miles & Huberman (1994) identified three: interpretivism, social anthropology, and collaborative social research. An investigator applying a qualitative content analysis design would be applying the second, because she is examining both manifest and latent content for patterns. A researcher applying a qualitative, inductive content analysis uses the same four steps as with the scientific method, except that instead of moving from the general to the specific, the process flows from the specific to the general (Babbie, 2001).

The steps involved are almost the reverse of the deductive method: the researcher begins with an observation, discovers patterns, creates a hypothesis, and then proposes a theory. Glaser (1965) named this approach the constant comparative method, and it became the foundation of Glaser & Strauss’ (1967) Grounded Theory, which is a systematic, iterative method for developing a theory from raw data. The research questions guide the data gathering and analysis, but as patterns and themes arise from the data analysis, additional questions may be proposed (White & Marsh, 2006) and new categories may be coded until the saturation point is reached (Glaser, 1965). If an investigator is attempting to develop a theory, then the coding scheme develops from the data, but a priori schemes may be used if the researcher is verifying an existing theory or describing a particular phenomenon (Zhang & Wildemuth, 2009; White & Marsh, 2006). The use of a qualitative design for a content analysis study does not preclude the use of deductive reasoning, or the re-use of concepts or variables from previous studies (Zhang & Wildemuth, 2009).

A qualitative content analysis follows a systematic series of steps, some of which overlap with quantitative content analysis. Krippendorf (2004) writes that both quantitative and qualitative content analysis sample text; unitize text; contextualize the text; and have specific research questions in mind. Zhang & Wildemuth (2009) outlined the process of qualitative content analysis as a series of eight steps, once the initial research question has been developed.

  1. Prepare the data: Can your data be transformed into written text? Is the choice of content justified by what the researcher wants to know?
  2. Define the Unit of Analysis: What theme is the coding unit? How large is the instance of that theme? Is the theme reflected in a paragraph or within an entire document?
  3. Develop Categories and a Coding Scheme: Will the coding scheme be developed as patterns and themes emerge, or will it be developed from previous studies or theories?
  4. Test Your Coding Scheme on a Sample of Text: How consistent is the inter-coder agreement in the pilot test?
  5. Code All the Text: Has the researcher repeatedly checked the consistency of the inter-coder agreement?
  6. Assess Your Coding Consistency: As new coding categories are added, are the coders still in agreement for the entire corpus?
  7. Draw Conclusions from the Coded Data: What themes and patterns have emerged from the data? What sense can you make of these patterns?
  8. Report Your Methods and Findings: How well can the study be replicated? Has the researcher presented all of the necessary information to replicate the study? Are the results important? If so, why?

As with validity in quantitative work, the study design, data gathering, and results of a qualitative content analysis must have a degree of “truth” so that peer review of the results will give other researchers and students confidence that the study results are accurate. Lincoln & Guba (1985) describe this “truth” as having four dimensions: credibility, transferability, dependability, and confirmability. Credibility is similar to internal validity, in that the data gathered accurately reflects the research question. That is, the study data will measure what the research questions seek to measure. Transferability is similar to external validity, wherein the results of a study are applicable from one frame of reference to another. Dependability ensures that a study may be replicated, and confirmability is assessed by examining intercoder reliability. Confirmability ensures the objectivity of the researchers such that there is “conceptual consistency between observation and conclusion” (White & Marsh, 2006).

Examples

Two examples of content analysis are discussed in this section. The first study is a job description analysis, which is a fairly common application of content analysis in ILS. Park, Lu, & Marion (2009) examined job descriptions for catalogers over a two-year period to determine what skills and competencies are desired by employers. The authors provided an analysis that used both straight frequency counts and statistical analysis to examine the data. The second study (Kracker & Wang, 2002) analyzed students’ perceptions of research and research anxiety by using a mixed-methods (qualitative and quantitative) design. The study results confirmed Kuhlthau’s Information Search Process (ISP) model.

Example 1: “Cataloging Professionals in the Digital Environment: A Content Analysis of Job Descriptions”

As noted previously, Park, Lu, & Marion (2009) applied a quantitative content analysis to assess the current skill requirements for catalogers. The authors identified emerging technology-related roles and competencies and discussed how these new requirements related to traditional cataloging skills.

Park, Lu, & Marion (2009) gathered 349 distinct cataloging job descriptions from an established online listserv over a two-year period. The researchers followed procedures for data analysis used by Marion in previous peer-reviewed publications, which included co-term and co-citation analysis. The investigators determined the coding scheme a priori, and no additional categories were added once the data-gathering phase began after the initial pilot study. They initially achieved intercoder agreement by manually coding 55 job descriptions. The authors used content-analysis software and created the dictionary based on a combination of sources, including counts of the most frequently occurring terms, a literature review, and their own combined professional knowledge.

The research team entered all complete job descriptions into the content-analysis software. The initial output of the software was a frequency count of terms. The researchers then converted this count of terms into a matrix of co-occurrence similarity in order to offset any large differences in commonly occurring terms. More importantly, the co-occurrence similarity provided more useful information about the structure of the cataloging profession. Finally, the team created a visual graph of the data and used cluster analysis to explore a co-occurrence profile for each category term. The researchers also used hierarchical cluster analysis and multi-dimensional scaling to identify clusters of categories. Using these clusters, they generated a map to determine patterns in the data.
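
As a rough, hypothetical illustration of the general technique described above (term counts converted into a term co-occurrence matrix, which is then clustered hierarchically), the following Python sketch uses invented job-advertisement snippets; it is not the authors’ software, dictionary, or data.

```python
# Rough, invented illustration of the general technique: term counts ->
# term co-occurrence matrix -> hierarchical clustering of terms.
# This is NOT the authors' software, dictionary, or data.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from scipy.cluster.hierarchy import linkage, dendrogram

job_ads = [
    "catalog records using marc and rda standards",
    "metadata creation using marc dublin core and xml",
    "supervise staff and manage cataloging workflows",
    "manage digital projects and xml metadata workflows",
]

# Term-by-document count matrix (rows: ads, columns: terms).
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(job_ads).toarray()
terms = list(vectorizer.get_feature_names_out())

# Term-by-term co-occurrence: how often two terms appear in the same ad.
cooccurrence = X.T @ X
np.fill_diagonal(cooccurrence, 0)

# Cluster terms with similar co-occurrence profiles; the resulting tree is
# the kind of structure a dendrogram or MDS map would display.
Z = linkage(cooccurrence, method="average", metric="cosine")
tree = dendrogram(Z, labels=terms, no_plot=True)  # set no_plot=False to draw
print(tree["ivl"])  # leaf order groups related terms together
```

In the actual study, the raw counts were further converted into co-occurrence similarity values before clustering and multi-dimensional scaling, which offsets large differences between commonly and rarely occurring terms.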

The authors presented the results based around four categories: job titles, required qualifications and skills, preferred qualifications and skills, and responsibilities. The tables that detail the most frequently occurring job titles, categories and skills, and responsibilities, using frequency, percentage, and terms and phrases, are clearly rendered. The dendrograms and map of the same information provided an easy visual cue as to where the clusters were in the data.

There are four areas where the study could be improved. First, the authors mention manually coding 55 job descriptions, but they do not describe how the results of this manual coding compared to the output of the content analysis software during the pilot phase. Did they pilot test the software results against human coders? The researchers do not mention spot-checking the results from the software against a human coder during the main study, either. Second, the coding scheme consisted of eight categories based on commonly used terms in job descriptions (e.g., “background information”, “job responsibilities”), and the dictionary for the content analysis software was also custom built. Are there existing schemes available via psychology, business, or human resources that could have been used in place of these custom schemes? The authors do not say whether or not they looked for existing job-related categorizations prior to custom building the manual coding scheme and dictionary.

Third, do the results of the study correlate with any existing theories, for example on how job requirements change over time? The authors do not state whether they were trying to support an existing theory, or whether they could have done so. Fourth, the authors did not provide any indication that they statistically determined the sample size. How do we know that 349 is a valid sample of the population? The authors did cite this as a limitation in the conclusion of the paper.

In conclusion, Park, Lu, & Marion (2009) used quantitative techniques to perform a content analysis of cataloging job positions in order to inform current catalogers and LIS curricula developers of evolving skill sets. The authors performed a basic term frequency count (which provided a manifest content analysis) supplemented by established statistical techniques such as co-occurrence similarity values (which provided a latent content analysis). They used content analysis software to analyze the full texts, thus aiding reliability. The investigators manually coded the text during the pilot phase. The sample population was chosen from a publicly available listserv, so the study may be replicated. The authors provided a sample job description and a list of digital environment job titles in the appendices. While this content analysis has some limitations, overall, the authors achieved their goal of assessing the (then) current state of cataloging skill sets and responsibilities.

Example 2: “Research Anxiety and Students’ Perceptions of Research: An Experiment. Part II. Content Analysis of Their Writings on Two Experiences”

Kracker and Wang (2002) conducted a two-part experiment that examined both quantitative and qualitative data. The results of the quantitative study were presented in a separate first paper and are not discussed in this section. The second paper, which is described here, presented the results of the qualitative analysis. That content analysis examined study participants’ descriptions of both a past memorable research experience and a current research paper in order to determine students’ perceptions of research.

The researchers’ sample consisted of 90 students from a technical and professional writing course. Each student was assigned either to a control group or an experimental group. Each person in both groups completed a pre-test questionnaire that asked the students to recall their most memorable research experience to date and to write a paragraph describing their thoughts and feelings as they worked through that assignment from start to completion. The students in the experimental group then attended a lecture on Kuhlthau’s ISP model; the control group attended a placebo lecture. The students in the class were required to complete a research paper as part of the course. Once the research paper was turned in at the end of the term, the students in both the experimental and control groups were asked to describe their thoughts and feelings about this recent research experience.

The researchers initially used the 16 feelings identified in Kuhlthau’s ISP model as coding categories. They added categories as themes emerged from the data and classified feelings into three meta-groups: emotional states related to the process, perceptions of the process, and affinity to research. The units of text were coded at the subcategory level, and the authors provided examples of the coding schemes and classifications in the appendices. The two coders cross-checked their coding and achieved 90% intercoder agreement within two rounds for eight of the thirteen categories.

However, methodologists such as Krippendorf (2004) and Neuendorf (2002) are firm that percentage agreement is a misleading measure that overstates the real degree of intercoder agreement. In addition, the authors determined intercoder agreement for affective and cognitive coding by using Holsti’s (1969) method. This method, like percent agreement, does not take chance into account and is not as useful as other intercoder agreement statistics (Spurgin & Wildemuth, 2009). The study design could therefore have been improved by using Cohen’s kappa, which is often used in behavioral research and is a modification of Scott’s pi; Scott’s pi is also applicable to nominal data with two coders.
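
To see why percent agreement can mislead, consider a small hypothetical calculation (the labels below are invented and are not Kracker and Wang’s categories or data): two coders can agree on 90% of the units while a chance-corrected statistic such as Cohen’s kappa indicates no agreement beyond chance.

```python
# Hypothetical illustration (invented labels, not Kracker and Wang's data):
# high percent agreement can coexist with no agreement beyond chance when
# the category distribution is heavily skewed.
from collections import Counter

coder_a = ["anxious"] * 9 + ["confident"]  # this coder uses two categories
coder_b = ["anxious"] * 10                 # this coder codes everything the same

n = len(coder_a)
p_observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n          # 0.90

freq_a, freq_b = Counter(coder_a), Counter(coder_b)
categories = set(freq_a) | set(freq_b)
p_chance = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)   # 0.90

kappa = (p_observed - p_chance) / (1 - p_chance)                        # 0.00

print(f"percent agreement = {p_observed:.2f}, Cohen's kappa = {kappa:.2f}")
```

Because one coder assigns nearly every unit to a single category, almost all of the observed agreement is expected by chance alone, which is exactly what percent agreement and Holsti’s method fail to take into account.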

The numbers and percentages for feelings and thoughts, respectively, as well as the relationships between thoughts and feelings by participant, were clearly presented in tables. The authors used the content analysis data to examine feelings in relation to demographic factors and groups, and broke negative emotional states down into clusters. Kracker and Wang (2002) also examined thoughts across groups in relation to Kuhlthau’s ISP model, and the results of the analysis confirmed the model. While this study is a simple content analysis, the authors examined manifest content, discovered latent content, and integrated both qualitative and quantitative content analysis methods into their study.

This experiment is an example of using both qualitative and quantitative content analysis methods to measure perceptions, that is, the feelings and thoughts of study participants about a particular topic. The authors performed a basic quantitative content analysis to count words related to the study participants’ thoughts and feelings. The researchers began with a defined coding scheme (quantitative content analysis), but added to it as themes emerged (qualitative content analysis). They mapped the themes that emerged to an existing theory (Kuhlthau’s ISP model), an example of using deductive reasoning to support an existing theory and add to the current body of knowledge. The authors provided coding words, classifications, and other schemes in the appendices, which adds to the validity and reliability, as well as the replicability and generalizability, of the results. This example included both quantitative results and description, providing the reader with information about the factors that affect students’ perceptions of research.

Conclusion

Content analysis is a systematic approach to the analysis of a corpus of information categorized as data. The approach offers a deductive, quantitative path for researchers who wish to test an existing theory, yet it is flexible enough to be used by an investigator who wishes to establish a new theory grounded in data. The qualitative and quantitative content analysis methods overlap somewhat in their operationalization, but each is grounded in established theory: the quantitative approach is based on the deductive scientific method, and the qualitative approach is based on the inductive grounded theory model. Both sample texts, unitize the text, contextualize what is being read, and seek answers to defined research questions (Krippendorf, 2004). Both approaches to content analysis require the evaluation of reliability and validity, i.e., trustworthiness, and may use human and/or computer coding and analysis. Through careful study design, data gathering, coding, analysis, and reporting, content analysis can provide valuable insight into both manifest and latent content.

 

References

Babbie, E. (2001). The Practice of Social Research (9th Edition). Belmont, CA: Wadsworth/Thomson Learning.

Bates, M.J. (1999). The Invisible Substrate of Information Science. Journal of the American Society for Information Science, 50(12), 1043-1050.

Berelson, B. (1952). Content analysis in communications research. New York, NY: Free Press.

Bollen, J., Mao, H., & Zeng, X. (2010). Twitter mood predicts the stock market. Journal of Computational Science, 2(1), 1-8.

Crawford, S. & Stucki, L. (1990). Peer Review and the Changing Research Record. Journal of the American Society for Information Science, 41(3), 223-228.

Evans, W. (1996). Computer-Supported Content Analysis: Trends, Tools, and Techniques. Social Science Computer Review, 14(3), 269-279.

Glaser, B.G. (1965). The Constant Comparative Method of Qualitative Analysis. Social Problems, 12(4), 436-445.

Glaser, B.G. & Strauss, A.L. (1967). The Discovery of Grounded Theory: Strategies for Qualitative Research. Chicago, IL: Aldine Publishing Company.

Holsti, O.R. (1969). Content Analysis for the Social Sciences and Humanities. Reading, MA: Addison-Wesley.

King, G. & Lowe, W. (2003). An Automated Information Extraction Tool for International Conflict Data with Performance as Good as Human Coders: A Rare Events Evaluation Design. International Organization, 57(3), 617-642.

Kracker, J. & Wang, P. (2002). Research Anxiety and Students’ Perceptions of Research: An Experiment. Part II. Content Analysis of Their Writings on Two Experiences. Journal of the American Society for Information Science and Technology, 53(4), 295-307.

Krippendorf, K. (2004). Content Analysis: An Introduction to Its Methodology (2nd ed.). Thousand Oaks, CA: Sage Publications.

Lincoln, Y.S. & Guba, E.G. (1985). Naturalistic Inquiry. Beverly Hills, CA: Sage Publications.

Miles, M.B. & Huberman, A.M. (1994). Qualitative Data Analysis (2nd ed.). Thousand Oaks, CA: Sage Publications.

Morris, R. (1994). Computerized Content Analysis in Management Research: A Demonstration of Advantages and Limitations. Journal of Management, 20(4), 903-931.

Nacos, B.L., Shapiro, R.Y., Young, J.T., Fan, D.P., Kjellstrand, T., & McCaa, C. (1991). Content Analysis of News Reports: Comparing Human Coding and a Computer-Assisted Method. Communication, 12(2), 111-128.

Neuendorf, K.A. (2002). The Content Analysis Guidebook. Thousand Oaks, CA: Sage Publications.

Park, J., Lu, C. & Marion, L. (2009). Cataloging Professionals in the Digital Environment: A Content Analysis of Job Descriptions. Journal of the American Society for Information Science and Technology, 60(4), 844-857.

Rosenberg, S. D., Schnurr, P. P., & Oxman, T. E. (1990). Content Analysis: A Comparison of Manual and Computerized Systems. Journal of Personality Assessment, 54(1/2), 298-310.

Spurgin, K.M. & Wildemuth, B.M. (2009). Content Analysis. In Applications of Social Research Methods to Questions in Information and Library Science (pp. 297-307). Westport, CT: Libraries Unlimited.

Weber, R.P. (1990). Basic Content Analysis (2nd Ed). Newbury Park, CA: Sage Publications.

White, G.W. (1999). Academic Subject Specialist Positions in the United States: A Content Analysis of Announcements from 1990 through 1998. The Journal of Academic Librarianship, 25(5), 372-382.

White, M.D. & Marsh, E.E. (2006). Content Analysis: A Flexible Methodology. Library Trends, 55(1), 22-45.

Zhang, Y. & Wildemuth, B.M. (2009). Qualitative Analysis of Content. In Applications of Social Research Methods to Questions in Information and Library Science (pp. 308-319). Westport, CT: Libraries Unlimited.
