Preservation Standards and Digital Policy Enforcement


Preservation Standards and Audit and Certification Mechanisms Question

What types of policies would you expect to be enforced on a digital repository, based on the emerging Trustworthiness assessment criteria? What types of additional policies would you expect to find related to administrative or management functions?

Citation

Ward, J.H. (2012). Doctoral Comprehensive Exam No.4, Managing Data: Preservation Standards and Audit and Certification Mechanisms (e.g., “policies”). Unpublished, University of North Carolina at Chapel Hill. (pdf)

Creative Commons License
This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.

Note: All errors are mine. I have posted the question and the result “as-is”. The comprehensive exams are held as follows. You have five closed-book examinations five days in a row; one exam is given each day. You are mailed a question at a set time. Four hours later, you return your answer. If you pass, you pass. If not…well, then it depends. The student will need to have a very long talk with his or her advisor. I passed all of mine. — Jewel H. Ward, 24 December 2015

Preservation Standards and Audit and Certification Mechanisms Response

The CCSDS’ “Audit and Certification of Trustworthy Digital Repositories” (2011) describes policies in terms of (1) the technical framework, (2) the organizational framework, and (3) the digital object itself. These policies may be applied and enforced manually (by humans) or at the machine level (by computers using computer code). Some of the policies required for a repository to be considered a Trusted Digital Repository (TDR) are also required for the day-to-day management of any repository; other types of policies fall completely outside the TDR requirements, yet they are still important for day-to-day management. This essay will address both types of policies.

Some examples of the types of technical policies this author would expect to be enforced on a digital repository in practice, based on the TDR assessment criteria, are as follows. In some of the examples below, the repository administrators’ policy may be, for example, either to save the original file format/SIP…or not to save it. The enforced policy will depend on the mission of the repository and on the implicit and explicit policies developed and applied by its human managers. (A brief machine-level sketch of such enforcement follows the list.)

  1. The hardware, software, and file formats must/must not be migrated.
  2. A copy of the original file format, and the original software version needed to render it, must be/must not be retained for provenance purposes.
  3. At least two off-site backups must be implemented, and the backups must be tested periodically to ensure they are actually backing up the data as required and expected.
  4. The contents of the repository must be catalogued; i.e., the administrators of the repository have logged what objects are in the repository.
  5. The administrator of the repository must be able to audit all actions performed on an object, including what was done, by whom, and when.
  6. Upon ingest, the digital object must be scanned for viruses and a checksum computed.
  7. The administrator must be able to access, retrieve, and render all digital objects in the repository, either for his or her own erudition or, if appropriate, for users.
  8. Any software required to render the digital object will be maintained and migrated (if possible; some software may not have newer versions).
  9. If a digital object is to be deleted on X date, then it must be deleted, and a follow-up audit run to ensure the object was actually deleted.
  10. If the content rendered via a digital object requires any cleanup, then the cleanup of the data/content will be documented. The original (uncleaned) file must be saved for provenance purposes, although some organizations may decide not to save the original (uncleaned) digital object.
  11. The administrator of the repository must enforce appropriate restrictions on access to the data. For example, some digital objects may only be available to users via a certain IP (Internet Protocol) range.
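Several of the technical policies above, such as the audit trail (item 5) and the checksum at ingest (item 6), lend themselves to machine-level enforcement. The following is a minimal, hypothetical Python sketch of what that enforcement might look like; the function names, audit-log format, and file paths are illustrative assumptions, not features of any particular repository system.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

AUDIT_LOG = Path("audit_log.jsonl")  # hypothetical append-only audit trail


def compute_checksum(path: Path, algorithm: str = "sha256") -> str:
    """Compute a checksum for a stored digital object."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def log_event(object_id: str, action: str, actor: str, outcome: str) -> None:
    """Record what was done to which object, by whom, and when (item 5)."""
    event = {
        "object_id": object_id,
        "action": action,
        "actor": actor,
        "outcome": outcome,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with AUDIT_LOG.open("a") as f:
        f.write(json.dumps(event) + "\n")


def verify_fixity(path: Path, object_id: str, expected: str, actor: str = "system") -> bool:
    """Enforce a fixity policy: recompute the checksum and compare it to the
    value recorded at ingest (item 6), logging the result either way."""
    ok = compute_checksum(path) == expected
    log_event(object_id, "fixity-check", actor, "pass" if ok else "FAIL")
    return ok
```

A scheduler such as cron could run a check like this over the repository’s holdings periodically, writing pass/fail results to the same audit trail for later review.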

Some examples of the types of organizational policies this author would expect to be enforced on a digital repository in practice based on the TDR assessment criteria are as follows.

  1. The organization maintaining the digital repository commits to employing an appropriate level of staff with an appropriate level of training in order to maintain the archive based on Information and Library Science (ILS) best practices and standards.
  2. The organization maintaining the digital repository commits to providing an appropriate level of funding for the (preservation) maintenance of the repository and its content.
  3. The organization commits to finding an appropriate organization to take over the repository in the event the original managing organization can no longer do so.
  4. The staff of the organization commit to documenting the policies, procedures, workflows, and system design of the preservation repository.
  5. The management and staff maintaining the repository agree to periodically audit the policies and procedures of the repository in order to ensure that they are doing what they say they are doing. This may be a self-assessment using a standard self-audit such as DRAMBORA, or via an outside auditor who will certify that the repository meets Trusted Digital Repository (TDR) criteria.
  6. Barring any extenuating circumstances, the organization commits to honoring all contracts signed and agreed to at the time the content was acquired or created in-house. This includes the spirit and intent of the agreement, especially if the originating party no longer exists (either a person or an institution).
  7. The management and organization maintaining the repository agree to honor and enforce all copyright, intellectual property rights, and other legal obligations related to the digital object and repository. These agreements may be separate from any agreements entered into in order to acquire or create the content.

Some examples of the types of digital object management policies this author would expect to be enforced on a digital repository in practice, based on the TDR assessment criteria, are as follows. These example policies relate to ingest: the files are SIPs sitting in a staging area, awaiting upload into the preservation repository as AIPs. These policies supplement the policy examples provided above. (A sketch of such an ingest workflow follows the list.)

  1. If the digital object does not have a unique ID, or the current unique ID will not be used, then a new unique identifier will be assigned. A record of the changed ID or new ID assignment will be logged.
  2. A virus scan and a checksum will be run and the fact that these actions were taken on the digital object will be logged. In the event of a virus, the object will be quarantined until the virus is eliminated.
  3. Any metadata associated with the digital object will be checked for quality and appropriateness. If necessary, the metadata may be supplemented by additional information. If there is no associated metadata, then some metadata will be created.
  4. Storage and presentation derivatives will be created, if appropriate. For example, if the policy is to store the original .tiff file but create .jpeg files for Web rendering and storage via a database, then the .jpeg files may be created in the staging area and stored. Another possible policy may be to create .jpeg files on the fly from the .tiff as needed, once the collection is live and online; this type of policy would save on storage space.
  5. If the SIP, AIP, and DIP are different, then the final version of the file must be created prior to upload into the repository from the staging area. The original SIP may be stored or deleted, per the policy of the repository. Deleting the SIP is not recommended for files that have been cleaned up, as the original “dirty” file may need to be viewed later for provenance and data accuracy purposes.
  6. Set access privileges, both for internal staff accessing the digital object, and for any external users, assuming the content of the repository is publicly accessible.
  7. Upload the digital object to the repository, log that the object has been uploaded, and test that the files are retrievable and “renderable”.
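To make the ingest policies above concrete, the following is a small, hypothetical sketch of a routine that walks one SIP through steps of the kind listed (identifier assignment, checksum, a minimal metadata check, and a verified copy into preservation storage). The directory layout and function names are assumptions for illustration only, and virus scanning and derivative creation are deliberately left as comments because they would normally be delegated to other tools.

```python
import hashlib
import shutil
import uuid
from pathlib import Path

REPOSITORY = Path("repository")  # hypothetical preservation storage for AIPs


def sha256(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ingest(sip_path: Path, metadata: dict) -> dict:
    """Walk one SIP from the staging area through the example policies above."""
    # Step 1: assign a unique identifier if none exists, and record it.
    object_id = metadata.get("identifier") or str(uuid.uuid4())

    # Step 2: compute a checksum; a virus scan would be delegated to an
    # external scanner here, with the object quarantined on a positive result.
    checksum = sha256(sip_path)

    # Step 3: minimal metadata check: supply a title if none was provided.
    metadata.setdefault("title", sip_path.stem)

    # Step 4: derivative creation (e.g., JPEG from TIFF) would happen here,
    # either now or on the fly at access time, depending on the policy.

    # Steps 5-7: build the AIP, copy it into the repository, and verify that
    # the stored copy matches the SIP before reporting success.
    aip_dir = REPOSITORY / object_id
    aip_dir.mkdir(parents=True, exist_ok=True)
    stored = aip_dir / sip_path.name
    shutil.copy2(sip_path, stored)
    assert sha256(stored) == checksum, "stored copy does not match the SIP"

    return {"identifier": object_id, "checksum": checksum, "metadata": metadata}
```

Access privileges (step 6) would be set in whatever access-control layer the repository uses; they are omitted here because they are highly system-specific.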

As for the additional policies this author would expect to find related to administrative or management functions that are not part of the TDR assessment criteria, the following types of policies might be applied to a preservation repository. These are not preservation policies per se, but they may (or may not) affect the policies enforced for preservation.

  1. Collection policies. For example, what types of collections are included or not included in the archive? Images? Documents? Data sets? Only peer-reviewed articles related to Physics? Only Social Science data sets?
  2. File format policies. Are there any limitations on the type of file formats the repository will or will not store and make available to users? For example, the policy may be to store a .tiff file but only make .jpegs available to users.
  3. Type of archive policies. Is the repository a dark archive only? A public archive? An archive with limited public access?
  4. “This is not a preservation repository” policy. The policy may be not to plan to preserve any of the material in the repository, because that is neither the mission nor the concern of the repository managers or the reason for the existence of the repository itself.
  5. WYSIWYG content and metadata policies. The policy of the repository may be not to invest in quality control on the content or metadata. Therefore, there is no clean up of the digital object or any vetting of the metadata. If and when a user accesses the material, it is What-You-See-Is-What-You-Get (WYSIWYG). This is sometimes related to the limitations of personnel time and funding. For example, in the early 2000s the developers of the National Science Digital Library had to accept what the content owners and creators could provide regarding metadata quality, which was “non-existent” or “terrible”, and rarely “good” or “excellent” (Hillmann & Dushay, 2003).
  6. Legal, financial, ethical, and collection policies. What types of material will the repository accept and acquire, even when the material falls within the collection policy purview? For example, at the University of Southern California, the focus of the digital archive was “Southern California”, and L.A. specifically. The archive primarily consisted of images. In the mid-2000s, the staff discussed acquiring photographic images related to L.A. gangs with the idea of building a gang archive, but the legal issues were deemed extremely challenging by all involved. The only way to acquire the material and work around the legal issues would have been to require that no access to the photos be allowed until 100 years had passed. The staff could not justify the costs of acquiring the collection for the purpose of embargoing it for that long a period, including the costs associated with maintaining the collection as a dark archive. All digital archive staff agreed, however, that such a collection would be very valuable to historians.
     
    More recently, an archive in the northeastern United States faced legal action by the British government over oral histories of living former IRA members. The historian who recorded the oral histories had promised the former IRA members that the recordings would remain private and would not subject them to legal action. The courts are saying otherwise. Thus, a repository manager may have to take multiple types of policies into account with regards to content.
  7. Software, hardware, and repository design policies. Will the repository use off-the-shelf or one-off/home-grown software? What hardware will the repository run on? Whether home-grown or off-the-shelf, will the software comply with preservation repository recommendations, per the OAIS Reference Model (CCSDS, 2002)? Is compliance with the OAIS Reference Model part of the policies guiding the repository design?
  8. Policies regarding conflicts between international standards, domain standards, and local rules and regulations. Which policies, standards, rules, and/or regulations will take priority over others? For example, if your national standard (Beedham, et al., 2004 (?)) requires providing access to handicapped citizens, but fulfilling this requirement means that the repository is not compliant with international standards or the standards of the domain represented by the archive and, therefore, will not be considered a TDR, whose rules do you follow? (In this case, Beedham, et al. (2004?) followed their national laws, but criticized the authors of the OAIS Reference Model for not taking local laws into account.)
  9. Federation policy. Will the repository federate with other repositories? This excludes reciprocal backup agreements. The federation may include providing metadata for metadata harvesting, or the sharing of the content and metadata itself. For example, the Odum Data Archive provides metadata via an OAI-PMH Data Provider, and also provides users of their data archive with access to ICPSR metadata. A user may or may not be able to access the actual non-Odum Institute ICPSR data sets, however. Therefore, the policy applied by the managers of the Odum Institute data archive is to provide access to the metadata of non-Odum Institute data sets, but not to the data sets themselves. (A brief harvesting sketch follows this list.)
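As an illustration of the harvesting side of such a federation policy, the sketch below issues a single OAI-PMH ListRecords request for unqualified Dublin Core metadata and iterates over the returned records. The base URL is a placeholder rather than the Odum Institute’s actual endpoint, and error handling and resumption tokens are omitted for brevity.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

BASE_URL = "https://example.org/oai"  # placeholder OAI-PMH data provider
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records(metadata_prefix: str = "oai_dc"):
    """Fetch one page of records exposed by the data provider."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    with urlopen(f"{BASE_URL}?{query}") as response:
        tree = ET.parse(response)
    for record in tree.iterfind(".//oai:record", OAI_NS):
        identifier = record.findtext(".//oai:identifier", namespaces=OAI_NS)
        yield identifier, record


# Example: print the identifiers of the harvested records.
# for identifier, _ in list_records():
#     print(identifier)
```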

In conclusion, the CCSDS’ recommendation, “Audit and Certification of Trustworthy Digital Repositories” (2011), divides policies into three main types: Technical, Organizational, and Digital Object Management. The policies required to be a Trustworthy Digital Repository encompass many of the policies required to manage a digital archive generally. This means that even if a repository administrator’s policy is not to preserve the content, many of the policies required for a Trusted Digital Repository will still be implemented, as many of them are required for general repository management anyway.

Repository managers and administrators must also implement managerial and administrative policies that are not part of preserving the content, yet reflect important decisions that must be made with regards to the repository and the content it contains. This essay has outlined a sample of policy types related both to a Trusted Digital Repository and to a non-Trusted Digital Repository.

If you would like to work with us on a digital preservation and curation or data governance project, please review our services page.


Repository Design: Understand the Value of the OAIS’ Preservation


The OAIS Reference Model Repository Design Question

In your literature review #3 you state that “the conclusion from a variety of experienced repository managers is that the authors of the OAIS Reference Model created flexible concepts and common terminology that any repository administrator or manager may use and apply, regardless of content, size, or domain.”

  1. Does this one-size-fits-all model really work for repositories large and small? Please discuss.
  2. You also note that Higgins and Boyle (2008) in their critique of OAIS for the DCC talk about the need for an OAIS lite. Please discuss what that might look like, who would be its primary audience, and how useful it could be.
  3. Finally, how can repositories such as the US National Archives work with the concept of designated community, given that their mission is to serve all citizens? Is the notion of designated audience generally useful? Why or why not, and under which conditions is it most valuable?

Citation

Ward, J.H. (2012). Doctoral Comprehensive Exam No.3, Managing Data: Preservation Repository Design (the OAIS Reference Model). Unpublished, University of North Carolina at Chapel Hill. (pdf)

Creative Commons License
This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.

Note: All errors are mine. I have posted the question and the result “as-is”. The comprehensive exams are held as follows. You have five closed-book examinations five days in a row; one exam is given each day. You are mailed a question at a set time. Four hours later, you return your answer. If you pass, you pass. If not…well, then it depends. The student will need to have a very long talk with his or her advisor. I passed all of mine. — Jewel H. Ward, 24 December 2015

The OAIS Reference Model Repository Design Response

Based on the feedback this author received from participants attending the DigCCurr Professional Institute in May 2011, no, the one-size-fits-all OAIS Reference Model recommendation does not work for repositories both large and small. The repository administrators in question were discussing digital curation concepts in general, but their feedback also applies to the OAIS Reference Model, as it is one part of digital curation. The administrators wanted to know which parts of what they had learned at the Institute they should apply, and which parts they could safely leave out. The attendees thought the information presented to them was useful, but that it would be “overkill” for their particular repositories.

Beedham, et al. (2004?) noted that the OAIS Reference Model assumes a repository within a large bureaucracy. The authors wrote that the Reference Model is not designed for a small archival collection with a limited audience and limited funding and personnel to build, maintain, and preserve the collections in the repository. It is designed for an institution with a team of personnel working on the repository, not one or two people responsible for all aspects of creating, maintaining, and preserving it. This author would add that the OAIS Reference Model is designed with an archive in mind whose collections consist of tens of thousands to n number of objects. It is not designed for an archive of a few hundred or a few thousand objects with one person to administer it, who may or may not be trained in digital library/digital archive Information and Library Science (ILS) best practices.

The Reference Model has been designed such that it may federate with other OAIS archives, presumably to create access to one Really Large Preservation Repository. It has also been designed so that the object may have three different “versions”: the Submission Information Package (SIP); the Archival Information Package (AIP); and the Dissemination Information Package (DIP). As a concept, these are three different things, but in practice, a SIP may equal an AIP, which may equal a DIP. For a large repository with different audiences, the DIP may need to be different from the AIP. For a small archive with a homogeneous audience, the AIP and DIP may be exactly the same.

Therefore, with regards to my statement, “…any repository administrator or manager may use and apply, regardless of content, size, or domain”, the key is in the use of the word “may”. They may use it. It is not that they must use it, or that they will use it; it is simply that the repository administrator may use it. A repository administrator must take into account the rules and regulations that apply to his or her repository when applying the OAIS Reference Model. These rules and regulations may be domain best practices that differ from ILS practices, or federal, state, institutional, or other local policies that differ from what the OAIS recommends. The OAIS Reference Model is a recommendation, not a requirement or a law. As Thibodeau wrote (2010?), any evaluation of a repository must be made on a case-by-case basis. In other words, one size does not fit all.

The primary responsibility of a repository manager is to ensure the near-term availability of the objects in the repository, and the long-term availability as well, if that is part of the mission of the digital archive. This author has two views of what an “OAIS Lite” might look like. The first is to determine what is actually required to preserve content for the long term, regardless of the model used. The second is how the documentation of the recommendation could be adapted to create an “OAIS Lite”. The primary audience for an “OAIS Lite” would be the managers of small- to medium-sized repositories who do not operate within large bureaucracies, and who perhaps have some kind of computer science knowledge but will generally not have an ILS background.

Jon Crabtree of the Odum Institute at the University of North Carolina at Chapel Hill supports the use of standards, but he has noted on several occasions that the Odum Institute “preserved” their digital data for decades without explicit preservation standards or policies. They did this because they hired competent people who did their job, and because it was understood that the data itself must be migrated, and the software and hardware must be migrated, replaced, upgraded, etc. This author’s own work experience seconds Crabtree’s comments.

At the time of this writing, the following must occur in order for data to be preserved, without following any particular preservation recommendation. Although this section is designed to illustrate “bare bones” preservation requirements, the bracketed label designates the part of the OAIS Reference Model into which each item would fit; i.e., either the “Information Model” or the “Functional Model”. (A minimal sketch of the fixity and backup checks in item 3 follows the list.)

  1. [Functional Model] Document the holdings of the archive and its system design. Update the documentation if and when there are any changes to numbers 2-4 below.
  2. [Information Model] Ensure the digital objects have appropriate metadata.
  3. [Functional Model] Migrate and refresh the hardware and software periodically, as well as any software required to render the objects in the repository (for example, CAD files). Upon ingest, run integrity checks and virus scans, and periodically re-run these checks on the data. Set up at least two off-site backups, and check that the backups are actually backing up the data. Ensure all of the objects in the repository may actually be found and accessed, assuming access is permitted and desired.
  4. [Functional Model] Find someone to take the data if the organization in charge of the data goes out of existence. Keep (1) above updated in order to facilitate a takeover of the archive’s contents.
  5. [Functional Model] Hire competent people who ensure that numbers 1-4 above occur.
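A minimal sketch of the backup checks in item 3 might look like the following. The directory paths are hypothetical, and a real implementation would compare every copy against checksums recorded at ingest rather than recomputing the primary copy’s checksum on each run.

```python
import hashlib
from pathlib import Path


def sha256(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_backups(primary: Path, backups: list[Path]) -> list[str]:
    """Confirm that every object in the primary store exists, byte for byte,
    in each backup location; return a list of human-readable problems."""
    problems = []
    for obj in primary.rglob("*"):
        if not obj.is_file():
            continue
        expected = sha256(obj)
        relative = obj.relative_to(primary)
        for backup in backups:
            copy = backup / relative
            if not copy.exists():
                problems.append(f"{relative} is missing from {backup}")
            elif sha256(copy) != expected:
                problems.append(f"{relative} differs in {backup}")
    return problems


# Example (hypothetical paths); any problems found would be recorded in the
# documentation described in item 1 above.
# issues = check_backups(Path("/archive/primary"),
#                        [Path("/backup/site-a"), Path("/backup/site-b")])
```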

An additional step a repository administrator may take is to take the documentation from (1) above, map it to the OAIS Reference Model, and identify gaps. Then, as time and resources permit, address any existing gaps between the current system design and content and the OAIS Reference Model. At the least, identify that the gaps exist and document this in (1) above.

This author’s vision of an “OAIS Lite”, therefore, would be very general guidelines for the type of administration and management required to maintain a digital repository over time. This may not be what Higgins & Boyle (2008) had in mind.

However, if this author were to create an “OAIS Lite” based purely on the OAIS Reference Model recommendation itself, then it would be the current recommendation, but with each subsection designated as:

  1. “Must have”/required.
  2. “Nice to have”/recommended.
  3. “Optional”.

The assumption is that if some part of the recommendation is not necessary, then it won’t be in the OAIS Reference Model recommendation at all. Thus, “not needed” is not provided as an option. This also assumes the same audience as outlined above for the “bare bones” preservation guidelines. This would have the advantage of breaking down the Reference Model into manageable chunks. A repository manager of any size could begin by implementing the “must haves”; as time permits, add in the “nice to haves”; and, again, as time permits, add in any “optional” sections.

Another possibility is to divide the recommendations in the Reference Model by repository size, and then break those down by “required”, “recommended”, and “optional”. A committee of experienced repository administrators working with small repository owners could set up the Reference Model in this way. Either of these formats would be a useful version of the recommendation.

Thus, an “OAIS Lite” could consist of two types of recommendations. The first is a description of the bare-bones functions required to maintain a repository and its contents over the long term, mapped to the general OAIS models. The second would be to take the recommendation itself and break it down into required, recommended, and optional sections. Breaking down the recommendations would be useful to the managers of both large and small repositories. The challenge would be to get a committee of repository experts to agree on what constitutes “required”, “recommended”, and “optional” within the OAIS Reference Model.

The concept of a Designated Community is useful within the OAIS Reference Model, as it reminds repository managers that the goal of the repository is to serve a set of users. The goal is not necessarily to serve the needs of the repository managers! The concept is most useful when the users of a repository are homogeneous, and it is least useful when the users are heterogeneous. This is because the more heterogeneous the population using a repository, the less “one size fits all” fits all users. It is easier to serve a specific set of users (“scholars”) than all users (“hobbyists” and “scholars”).

Having said that, an organization like the National Archives may work around this limitation by aiming collections at specific users, once a baseline standard has been met. So, for example, the Southern Historical Collection at UNC was initially put online for scholars and, to some extent, “to serve the people of North Carolina (NC)” (as that is also the stated mission of the University of North Carolina at Chapel Hill), but the administrators of the collection soon realized that K-12 educators were using the resource. Thus, the administrators of the digital library still serve their “generic” audience (“the people of NC”) and scholars of Southern history, but they have developed K-12 educational materials for teachers to use as part of the state curriculum.

This author believes it is possible for the National Archives to serve “the people of the United States” by breaking down the digital collections by themes, collections, etc., and determining who uses which collections, and how. They can thus better serve specific audiences and tailor the site as needed. The administrators of an archive must still determine who their “general” Designated Community is, and set standards for that community, but they can, as needed, serve targeted communities.

In conclusion, the “one size fits all” model of the OAIS Reference Model does not fit all. It is important to have standards for preservation repository design, but when the preservation repository design is more suited to a large bureaucratic institution than to a small repository with fewer resources, then not all of those standards may be useful. If not all of the standards are applicable, or they seem like “overkill”, then the repository manager will need to decide which of the standards to use, and how. One way to ease this “cherry picking” of preservation repository standards is to determine the processes required to ensure preservation, regardless of repository design. A second way is for ILS and Computer Science experts to break down the OAIS Reference Model recommendations into “required”, “recommended”, and “optional”, possibly also based on a repository’s size. This would be useful to managers of repositories of all sizes, as it would help the manager figure out what he or she has right so far, or where to start, and what gaps remain.

A downside to this idea is that if a repository only implements the “required” recommendations of the OAIS Reference Model, then they may be only partially OAIS-compliant, and it might encourage laziness among repository administrators.

Regardless, “content is king”, so the important issue is that the content and its metadata, along with any required software to run it, are preserved. The model used to preserve it is secondary. Finally, while the concept of a Designated Community is important, it is a more valuable term when the users of an archive are more homogeneous, and less useful when the user base is heterogeneous. Large archives at the national level may work around this limitation by setting a baseline standard of quality for all users, and then targeting the archive’s collections to particular audiences who use those collections.

If you would like to work with us on a data governance or digital preservation project, please review our services page.


Learn the Priorities of the Digital Preservation Community


Digital Preservation Question

Since 1996, the digital preservation community has been emerging as evidenced by the increasing number of formalized standards, conferences, publishing options, and discussion venues.

What are the most significant community developments and why? What gaps remain in terms of community standards and practice? What roles should/could academic programs, professional associations, curatorial organizations, and individual researchers and practitioners play in those developments? What priorities and desired outcomes should there be for building the community’s literature?

Citation

Ward, J.H. (2012). Doctoral Comprehensive Exam No.2, Managing Data: the Emergence & Development of Digital Curation & Digital Preservation Standards. Unpublished, University of North Carolina at Chapel Hill. (pdf)

Creative Commons License
This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.

Note: All errors are mine. I have posted the question and the result “as-is”. The comprehensive exams are held as follows. You have five closed-book examinations five days in a row; one exam is given each day. You are mailed a question at a set time. Four hours later, you return your answer. If you pass, you pass. If not…well, then it depends. The student will need to have a very long talk with his or her advisor. I passed all of mine. — Jewel H. Ward, 24 December 2015

Digital Preservation Response

Ascertaining the most significant community developments is a task likely to cause a few religious wars amongst long-time digital preservationists. However, the following are the most significant events in this author’s humble opinion.

  1. The realization by the overall Computer Science (CS) and Information & Library Science (ILS) communities, among other domains and industries, that there is a digital preservation problem in the first place. This realization took place among individuals and organizations over the course of several decades, from the 1960s to the early 1990s. This is important because you cannot fix a problem if you don’t know you have one.
  2. The 1996 Waters et al. report on the digital preservation problem. This report was a significant event because it outlined the problem(s) and what steps needed to be taken to ameliorate them.
  3. The development of the OAIS Reference Model (RM) by the Consultative Committee for Space Data Systems (CCSDS) and the standardization of the model in 2002. The development of the OAIS RM is important because the committee that created it consisted of, and was informed by, practitioners and users of data beyond the Space Data Community. It defined common preservation terms that would mean the same thing to all who used them (or at least should). And, finally, the OAIS RM defined a common preservation repository standard against which digital repository managers could compare their own systems to determine, at least subjectively, their own preservation-worthiness.
  4. The creation of the Digital Curation Centre (DCC) in the UK in the 2000s. Although the DCC is designed to serve UK Higher Education Institutions (HEIs), it has provided a central location for digital preservation practitioners to go to for information related to digital preservation. The centre has also provided a platform from which further research and standardization in digital curation and preservation may continue.
  5. The creation of the National Digital Information Infrastructure and Preservation Program (NDIIPP) in the United States in the 2000s. Much like the DCC in the UK, NDIIPP has provided a central location in the USA from which digital preservation research and development, and the application of it, is promoted. As well, the program has provided an avenue through which private industry, government, research, and academia may come together to address the common problem of digital preservation. The Science and Technology Council of the Academy of Motion Picture Arts & Sciences (AMPAS) mentions NDIIPP in its report, “The Digital Dilemma” (2007), as an important program with which private industry should be involved in order to coordinate resources to solve a problem (digital preservation) that all industries are facing.
  6. The publication of the AMPAS report (2007) on the digital preservation problem within the movie industry. The movie industry’s products and libraries represent a large source of profit for the industry, as well as the cultural heritage of the respective countries that produce movies. The AMPAS report about the digital preservation problem is important because:
    1. It meant that a major, high-dollar industry was also seeking solutions to the digital preservation problem. This made it “not just a library issue”, and provided additional clout (financial and political) to the task of finding solutions.
    2. The authors of the AMPAS report clearly stated that the digital preservation problem was not just a movie industry problem; it was a problem for everyone who used digital data, and thus the solutions must be found by working together across private industry, government, academia, and other research institutions.
    3. Reflecting the work done in ILS, the industry stated that digital preservation costs were far higher (1100% more) than the costs of non-digital preservation. This is the only non-research, non-academic report this author has read that shows the costs as determined by private industry. The authors of the report stated that standards will reduce costs, and that the movie industry should resist implementing one-off solutions. This promotes the use of standards as an integral part of addressing the digital preservation problem, even within a high-profit commercial industry.
  7. The development of the concept of a “Trusted Digital Repository”, as well as the mechanisms to audit and certify that a repository is actually “trustworthy”. This includes the development of TRAC (“Trustworthy Repositories Audit & Certification”), DRAMBORA (a quantitative self-assessment of a repository’s trustworthiness), other assessment criteria developed in Europe, and the development of TRAC into an ISO standard via the CCSDS called “Audit and Certification of Trustworthy Digital Repositories”. This development also includes standards with which to certify the certifiers. The significance of an ISO standard for a “trusted digital repository” is that it gives practitioners and other repository managers a base set of policies from which they can build or assess their repository’s ability to survive over the indefinite long term, especially when used in conjunction with the OAIS RM.
  8. The development of publication outlets, discussion forums, web sites with information on preservation, etc., has given practitioners and researchers avenues for their work that can be used for their own professional advancement. Providing incentives for researchers and practitioners to do preservation work is one way to ensure the necessary preservation is done. It also provides an iterative feedback loop, such that researchers and practitioners can adapt policies and standards as new information and research become available.

Some gaps do remain in terms of community standards and practice. Some of these gaps are managerial; others are more technical.

  1. The standards for preservation policies and repository design, such as “Audit and Certification of Trustworthy Digital Repositories” and the OAIS RM, are designed with large organizations in mind. What if you aren’t NASA or the Library of Congress? For example, what if you are the lone digital archivist for the Harley-Davidson archive? Or for a digital library on quilts? The standards outlined for preservation policies and repositories are stacked in favor of large organizations with large bureaucracies. The ILS community ought to develop a “lite” version aimed at small “mom and pop” repositories whose administrators curate important material but do not need all of the overhead presented in the ISO standard for trusted digital repositories and the OAIS RM.
  2. The same idea applies to data management training for researchers and other administrators of data archives in non-ILS domains. These researchers do not want to spend their time on the full curation of the data, but neither are many of them likely to turn the data sets over to libraries and archives for stewardship in the near term. (The long term is another issue.) Yet, in order to support science and the requirements of funding agencies such as the NSF and the NIH, the data must be preserved and shareable. The development of a “lite” curriculum for data management, aimed at non-ILS data managers, would be useful to that audience, and would strengthen librarians’ and archivists’ roles as information managers by providing a consulting and outreach function to scientists and researchers.
  3. Certification as “trustworthy” does not seem to take into account the “local” rules and regulations that a repository may have to follow when designing the preservation system and the policies that must be applied to it. Some repositories may have to forgo international preservation standards in order to follow national, state, county, or other regulations. Does that mean the repository is not “trustworthy”? Will the repository now be considered “second class” because it is not certified as “trustworthy”? Thibodeau has discussed looking at each repository on a case-by-case basis.
  4. Preservation policy and repository standards have been designed from the top down. Granted, the people designing the standards have (usually) worked with repositories themselves, and so based the standards on their own experience running a repository. However, it is one thing to define standards; it is another to ensure their implementation. For example, this author’s master’s paper (2002) involved studying 100 Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) data providers to determine which Dublin Core elements were or were not used (a small counting sketch of this kind of analysis follows this list). At the time, practitioners were arguing for and against qualifying DC to make it more detailed, per other metadata standards (Lagoze, c. 2001). Practitioners have certain standards for metadata quality, yet no one had actually examined how people were using DC. This author found that out of 15 DC elements, only 3 (title, author, & date) were being used the majority of the time. A separate but related study by Dushay & Hillmann around 2003 with the National Science Digital Library found the quality of the metadata content to be abysmal. Follow-up studies by Shreeves, et al. in the mid-2000s examined metadata quality in the OAI-PMH and also found it to be abysmal. These studies combined made the religious war over qualifying or not qualifying DC moot. Why qualify DC if only 3 elements out of 15 are being used the majority of the time, and the quality of the metadata content in those elements is abysmal? Perhaps the quality of the metadata content should be improved first, then more elements used, and then practitioners can worry about qualifying DC. The same discrepancy may not exist for preservation policies, but it would be interesting to find out what people are actually doing, as opposed to what they say or think they are doing, with regards to compliance with standards. Then again, that is the purpose of auditing and certifying trusted digital repositories, so one may consider this argument circular!
  5. Further examination of what digital preservation is going to cost. If material must be curated from its birth, then it is also true that decisions will have to be made early on as to whether or not material should be preserved at all. Even AMPAS noted that the movie industry must change its mindset from “save everything”, which worked fine with film, and must now curate its digital movie data.
  6. A large gap in digital preservation is the transfer of data from one system to another, whether external or internal. The development of standard ingest tools would help reduce the costs of preservation. The Producer-Archive Interface Methodology Abstract Standard (PAIMAS) by the CCSDS (c. 2003-2004) has been one step in this direction, but it only outlines a method. A technically simple way to transfer data between repositories remains a problem in need of a solution.
  7. Metadata is another large gap still in need of a solution. The problem relates both to metadata quality (mentioned earlier) and to the tools with which to create metadata. Scientists and researchers who work with data have repeatedly stated, both in the literature and in this author’s work on the DataNet project, that the “killer app” is a tool that helps them appropriately and simply annotate their data. Like ingest, metadata is a challenging problem in search of a solution.
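The element-usage analysis described in item 4 can be illustrated with a short sketch that tallies which unqualified Dublin Core elements appear in a directory of harvested oai_dc records. The file layout is hypothetical, and this is not the code used in the original study.

```python
from collections import Counter
from pathlib import Path
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # unqualified Dublin Core namespace


def tally_dc_elements(record_files: list[Path]) -> Counter:
    """Count how many records use each Dublin Core element at least once."""
    usage = Counter()
    for path in record_files:
        root = ET.parse(path).getroot()
        elements = {elem.tag.split("}")[1]
                    for elem in root.iter()
                    if elem.tag.startswith(f"{{{DC_NS}}}")}
        usage.update(elements)
    return usage


# Example (hypothetical directory of harvested records):
# counts = tally_dc_elements(list(Path("harvested_records").glob("*.xml")))
# for element, n in counts.most_common():
#     print(f"{element}: used in {n} records")
```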

The role that academic programs, professional associations, curatorial organizations, and individual researchers and practitioners play in these developments is, first, that individuals must identify the gaps. As a community, individuals must agree on those gaps, and then apply to their organizations for the time to work on the problems, or else obtain grants so that they may work on the solutions. Eventually, the solutions may be taught as part of the ILS and CS curricula.

In terms of the gaps identified above, researchers and practitioners should continue to work on the metadata and ingest problems, recognizing that these are two huge gaps in preserving materials across all domains.

Another role organizations and individuals may play involves deciding: what are the penalties if a repository is not an “OAIS RM” “TDR”? What are the rewards for being “trustworthy”? Should there be rewards or penalties at all? If so, what? For example, Charity Navigator provides certain criteria against which a possible donor may determine whether or not they wish to give money. It does help prevent people from giving money to organizations that, say, waste a lot of money on administration. But it is also true that some smaller organizations may not have the money to re-fit themselves to meet Charity Navigator’s criteria. Does this mean that they are less worthy of donations, or that the money will go to waste? Not necessarily. It may mean that a charity receives less money than it would otherwise, because Charity Navigator gives it a lower rating than an organization with more funding. This in turn gives more funding to the charity that already receives more funding, and less funding to a charity that already has less. If an organization does not have the time and resources to self-assess or receive certification as “trustworthy”, will it encounter any penalties, whether implicit or explicit?

This implies that standards should be a guide, and viewed as one part of a whole package.

One role individuals and organizations in the preservation field might play is that of consultant to small organizations that manage data and to individual researchers. One output of this could be an OAIS RM “lite” and an “audit and certification of trustworthy digital repositories” “lite”, aimed at repository managers who do not work for large bureaucratic organizations. This could be a document, a standard, or online training that a practitioner could complete on his or her own time. Currently, even the DRAMBORA self-assessment requires a large time commitment from at least one, if not more, repository administrators. Part of this consultant role would involve educating graduate students and researchers on data management. One output of this could be an online certification program, completed as time allows, that teaches scientists and researchers how to manage their data. This is slightly different from, but related to, personal information management: the data in this sense would be data gathered in the course of one’s work, not personal data such as a digital photo album. This would include learning how to tag metadata, and thus begin to fix this problem where it starts: with the data creator.

One possible output for the above problems is for librarians and archivists to continue providing consulting and outreach to scientists and researchers regarding preservation standards. The creation of “lite” versions of preservation policy standards and repository designs would be helpful for small repository administrators and for those whose local standards might supersede international preservation standards. Technology is not an ILS strength; information management is. ILS practitioners and researchers must continue to work with CS and other technical folks on developing ingest and metadata tools, especially with preservation in mind. These tools may also need to be designed with individual researchers and small repository administrators in mind. ILS practitioners and researchers must build upon their strength in information management, and not cede ground to CS, if they wish to remain relevant.

One outcome of a consulting role within other domains regarding preservation, and providing tools that aid in preservation, is to raise standards for data preservation within other communities. This will make the long-term preservation of that data and information easier for those who must eventually manage it. And, thus, make it more likely that the data will be preserved at all. This will also strengthen ILS as a field.

If you would like to work with us on a data governance or digital preservation and curation project, please see our consulting services.


Understand the Criteria to Permanently Curate Data


Curate Data Question

You have characterized the growing data deluge in scholarship well (though heavily focused towards the sciences). Unfortunately, there is likely to be much more data, today and in the future, than there is funding to permanently curate this data. What criteria should we use to decide what to keep, and for how long? How should we make decisions among data collections that all have legitimate and worthy claims to preservation and curation? What data should be privileged over other data, and why? Please speak to both the underlying theoretical and philosophical issues here and also to possible process and organizational/institutional approaches to making operational choices.

In addition, what should happen when data is condemned to oblivion, either through some decision-making process at the dataset/data collection level, or because of larger-scale events, such as the defunding of a repository?

Citation

Ward, J.H. (2012). Doctoral Comprehensive Exam No.5, Managing Data: the Data Deluge and the Implications for Data Stewardship. Unpublished, University of North Carolina at Chapel Hill. (pdf)

Creative Commons License
This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.

Note: All errors are mine. I have posted the question and the result “as-is”. The comprehensive exams are held as follows. You have five closed-book examinations five days in a row; one exam is given each day. You are mailed a question at a set time. Four hours later, you return your answer. If you pass, you pass. If not…well, then it depends. The student will need to have a very long talk with his or her advisor. I passed all of mine. — Jewel H. Ward, 18 December 2015

Curate Data Response

The dilemma faced by data stewards with regards to the amount of data and its intrinsic value (or lack thereof) versus available funding is not a new problem. Librarians and archivists have faced this same problem for centuries. The solution is for data managers to develop a rigorous collection development policy and apply it to the data sets and data collections in question, while taking political issues into consideration. This is exactly what must be done when deciding what to cull and what to keep with regards to objects in the physical world. The difference between preserving a digital data set and a physical data set (say, a paper notebook of temperatures at a given location at a given time of day), however, is that digital data must be migrated and refreshed over time. This means that collection and preservation must occur almost at the time of creation, rather than decades or centuries later.

Therein lies the crux of the problem. It is easier to decide what to preserve out of what has survived at the end of someone’s career, for example, because an archivist will be able to determine whether that person’s scholarly contribution was significant enough to warrant saving various aspects of his or her scholarly work and personal effects. It is much more difficult to develop a collection “looking forward”. An archivist will not know if a person’s contributions will be important enough to warrant the cost of providing storage for decades of that person’s research, for example. However, even the movie industry has had to shift from the “save everything” policy of film-based movie making to culling material up front in order to save money on storage costs (Science and Technology Council, 2007). Therefore, data stewards will have to make some up-front decisions that will involve both procedural and theoretical issues.

A data steward appraising a data set or collection for inclusion into a repository should use the standard toolkit for archivists (acquisition, appraisal, accession, etc.). Some of the criteria that might be used as part of a digital data set collection development policy might include the following.

  1. Is the data set replaceable? For example, is it observational data, which cannot be replaced? Or, is it experimental data, which may be regenerated?
  2. How re-usable is the data? Does it have appropriate metadata and annotations, and is it of a reasonable quality to support replicable research? Does it require special software or hardware to render and make it readable and usable?
  3. If the data is re-usable, has it been re-used, by whom, and how often?
  4. If the data set were to be deleted because it is replaceable, how expensive would it be to re-gather this data? Would it be less expensive to store it?
  5. What type of research does the data support? Have the research results from the data set been highly cited or never cited? Has there been a high demand for the data set, or no demand at all?
  6. How expensive will it be to maintain the data set, and does the organization have the resources to maintain it so that it is accessible and re-usable? Does the data set require any special software, scripts, programs, or hardware to run?
  7. What are the national, international, institutional, local laws or policies, domain, or other regulations with regards to the disposition of this data set? For example, does this data set have to be maintained indefinitely, for 10 years, or can it be deleted as soon as a project has been completed?

As part of establishing a collection development policy, some philosophical and methodological considerations must be made. For example, if a data set consists of observational data, which is not replaceable, then saving that type of data should take priority over saving experimental data, which is presumably replaceable. An assistant professor’s basic data sets that he or she has used or re-used for a course have a certain level of value, but it is likely that a Nobel Prize winner’s data set will take priority over the former. One might also assume that course data sets are likely to be replaceable, whereas a Nobel Prize winner’s data set may not be. However, if the course data sets belong to the Nobel Prize winner, well, then one might make the argument to save both, assuming funding is available to maintain both data sets. If a data steward can choose only one, however, then the data set behind the research results that brought the researcher a Nobel Prize is more likely to be of long-term value.

However, with regards to deciding which collections “have legitimate and worthy claims to preservation and curation” and which do not, and what data should be privileged over other data, the answer, again, lies in the creation of a rigorous collection development policy with regards to data sets. Some of the considerations to be made, in addition to the ones listed above, might be one or more of the following. The considerations below are from the perspective of an Information and Library Science (ILS) trained practitioner working in an academic library, not from the perspective of a domain scientist seeking to preserve his or her data set(s).

  1. Does this data set support the mission of the repository? That is, a data manager should not be adding Physics data sets to a Social Science data archive, unless there is a legitimate reason for doing so.
  2. What will this data set add to or detract from the existing collection? What is the answer to the “so-what” factor? (Why should the data steward add this data set to the digital repository?)
  3. What is the quality of the data itself? What is the value of the data set, both current and projected? How replaceable is the data, if it isn’t saved?
  4. What is the quality of the metadata and any additional annotations or included information?
  5. How much time and effort will it take to add this data set to the current repository? Is it worth the effort, if it will take a lot of effort to clean up the data to make it re-usable?
  6. What are the Intellectual Property and Copyright issues associated with this data set? Are there any other legal issues to consider? Will the repository own the “rights” to the data, will the researcher(s), or is it public domain?

When this author worked as a Program Manager of a digital archive at the University of Southern California, we had a standard set of questions and an entire matrix we would use to determine whether or not a collection should be added to the digital archive. The above criteria reflect some of the collection development decisions that we made. If we thought a collection was worthwhile, but that it would not be a good fit (whatever the reason), it was not uncommon to refer the potential donor to another organization that might be a better fit for the content. That should also be the case with regards to most data sets.
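The original matrix is not reproduced here, but the following purely hypothetical sketch shows the general shape of a weighted appraisal score built from criteria like those listed above. The criteria, weights, and threshold are illustrative only, not a published standard.

```python
# Hypothetical weighted appraisal matrix; weights and threshold are illustrative.
APPRAISAL_CRITERIA = {
    "fits_collection_mission": 3,       # supports the repository's mission
    "irreplaceable_data": 3,            # observational vs. regenerable data
    "metadata_quality": 2,
    "expected_reuse": 2,
    "rights_cleared": 2,
    "maintenance_cost_acceptable": 1,
}
ACCEPT_THRESHOLD = 9  # illustrative cut-off


def appraise(scores: dict[str, int]) -> tuple[int, str]:
    """Combine per-criterion scores (0 = no, 1 = yes) into a weighted total and
    a recommendation. Real decisions would also weigh the political
    considerations discussed below."""
    total = sum(weight * scores.get(criterion, 0)
                for criterion, weight in APPRAISAL_CRITERIA.items())
    decision = "accept" if total >= ACCEPT_THRESHOLD else "refer elsewhere"
    return total, decision


# Example:
# total, decision = appraise({"fits_collection_mission": 1, "irreplaceable_data": 1,
#                             "metadata_quality": 1, "expected_reuse": 0,
#                             "rights_cleared": 1, "maintenance_cost_acceptable": 1})
```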

Perhaps data center X or library Y cannot archive certain digital material, but the data steward may point the donor to a more appropriate archive. In many cases, however, data sets that are valuable to researchers will be lost, regardless of the availability of funding. This also happens with information in the physical world. At some point, data stewards have had to accept, and will have to accept, that not all data can be saved and that there will be some loss.

Other criteria to consider with regards to which data to save and which not to save involve political issues. The points above apply to most collection development. However, some collection development decisions have nothing to do with logic and a rigorous collection development policy, and everything to do with politics, whether local, institutional, national, or international. The following considerations will be made, either consciously and explicitly or unconsciously and implicitly, when deciding which data sets to collect.

  1. Does the data set belong to a major donor, alumnus, or other affiliate of the organization for which the data steward works?
  2. Does the data set belong to anyone who is “buddies” with anyone in the chain of command above the data steward?
  3. Does the data set belong to anyone with power that either the data steward or others within the organization (large or small) that maintains the repository would like to please or make feel important by archiving his or her data sets?
  4. Are there any other political issues, not mentioned above, that the data steward ought to consider when deciding whether to accept or reject a particular data set, regardless of the logic of including the data set in the repository and regardless of the data steward’s personal opinion as to the wisdom of including it?

In theory, the decision as to whether one should “save” one data set over another ought to be made based on a well-thought-out, rigorous, and logical collection development policy grounded in standard archival processes applied to the digital realm. In most cases, the decision to save or not save a particular data set will be made based on a standard set of policies. In reality, however, politics may play a role in deciding what is saved or not saved. Those politics may exist within the department, within the institution (especially large ones), within the domain, or elsewhere. It is unfortunate that, because of this very human tendency, some valuable data sets will be lost and some not very useful data sets will be carefully curated. One can hope, however, that a responsible data steward will be able to point a potential donor, or send a potentially valuable data set, to another institution or repository whose administrators can provide a home for it.

With regard to “what should happen when data is condemned to oblivion, either through some decision-making process at the dataset/data collection level, or because of larger scale events, such as the defunding of a repository”, it depends. If the policy for the data set is that the data is to be deleted at X time, then it should be deleted. If the data has intrinsic value but the repository can no longer maintain it at the collection level, or because the repository has been defunded, then the current repository manager may be able to locate a new “home” elsewhere for the data collection by contacting a network of colleagues. Another option is to contact non-ILS practitioners who may have an interest in the data and may be able to maintain it out of their own funding stream. If none of the above applies to the situation at hand, then a practitioner may simply have to delete the data, regardless of his or her personal opinion. This is also often the case in the physical world, with regard to books, paper archives, photographs, etc.

Librarians often have to cull their collections. They often cull books or move them to off-site storage. If, say, 20% of the books haven’t been checked out in 10 years, then perhaps those books should be culled or moved to off-site storage before the library pays to add more square footage and shelving. The challenge with data sets is that, unlike books, there are not likely to be other copies available for users. However, books also go out of print, so that, too, has an analogy in the physical world. Data stewards may have to make culling data collections part of their repository maintenance policy, just as librarians have had to do with books. This does not mean that the culled books or data sets have no intrinsic value, or that only the most popular and most used books and data sets should be stored, but there has to be a realistic collection policy in place. If something isn’t used, then perhaps a new home should be found for it where it will be used, or the data set can be safely culled (deleted).
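
As a sketch of how such a usage-based culling rule might be applied in practice, the example below flags catalog records that have not been accessed in ten years; the record fields, dates, and the ten-year threshold are hypothetical illustrations, not a recommended policy.

```python
# A minimal sketch of a usage-based culling report, assuming a hypothetical
# catalog where each record carries a last-access date.

from datetime import date, timedelta

TEN_YEARS = timedelta(days=3652)  # roughly ten years

catalog = [
    {"id": "dataset-001", "title": "County survey responses", "last_accessed": date(2009, 5, 1)},
    {"id": "dataset-002", "title": "Sensor calibration runs", "last_accessed": date(2021, 11, 15)},
]

def culling_candidates(records, today=None):
    """Return records not accessed within the last ten years."""
    today = today or date.today()
    return [r for r in records if today - r["last_accessed"] > TEN_YEARS]

for record in culling_candidates(catalog):
    print(f"{record['id']}: review for off-site storage, transfer, or deletion")
```

As with books, the output of such a report is only a list of candidates for review; the decision to transfer or delete still rests with the data steward and the collection policy.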

In conclusion, data stewards face, and will continue to face, the same types of collection development decisions that “traditional” librarians and archivists have faced for centuries. Funding and resources are limited; therefore, not all information can or will be saved. The decisions regarding what to keep versus what to delete should be based on a well-thought-out and rigorous collection development plan, but the reality is that politics may supersede the best-laid plans and policies. Unlike the physical world, however, the decisions to keep or delete data items must be made up front, not after the fact. This does require a shift in thinking, but even so, it still requires a sound methodological approach based on established collection development and archival principles.

If you would like to work with us on a digital curation and preservation or data governance project, please see our services page.

Know the Trade-offs: Human Coding vs. Computer Coding

content analysis methodology mixed methods

Human versus Computer Coding Question

In your literature review of content analysis, you discuss the trade-offs between human coding and computer coding in terms of reliability, validity, and costs. Thinking specifically about data policies and management documents, what dimensions do you see that would be well-suited for computer analysis and what dimensions would require human coding? What processes would you use to ensure reliability and validity? How would you resolve disagreements between computer coding and human coding?

Citation

Ward, J.H. (2012). Doctoral Comprehensive Exam No.1, Content Analysis Methodology. Unpublished, University of North Carolina at Chapel Hill. (pdf)

Creative Commons License
This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.

Note: All errors are mine. I have posted the question and the result “as-is”. The comprehensive exams are held as follows. You have five closed book examinations five days in a row, one exam is given each day. You are mailed a question at a set time. Four hours later, you return your answer. If you pass, you pass. If not…well, then it depends. The student will need to have a very long talk with his or her advisor. I passed all of mine. — Jewel H. Ward, 25 October 2015

Human versus Computer Coding Response

There are five primary dimensions to consider when deciding whether to use computer coding or human coding for an analysis of data policies and management documents. Most of these dimensions may be decided when setting up the research study in the first place.

  1. What does one intend to measure? Is one examining latent or manifest content? Both?
  2. Is this a qualitative Content Analysis? A quantitative Content Analysis? Both?
  3. What is the unit of analysis? Is it the entire document or interview? Is it the sentence? The word?
  4. What types of policies is one examining and analyzing? Standards? Best practices? Local policies (e.g., the policies of the archive itself, and its governing department and/or institution)? Policies implemented at the machine-level, in the code?
  5. What is the goal of the analysis? Is it to create a typology of policies? To classify policies by domain? To classify policies by the issue(s) that drove their creation?

Each of these dimensions will drive the choice of human versus computer coding, or the decision that either one would be an appropriate choice.

Policy implementation tends to be both top-down and bottom-up; it is both explicit and implicit. It is a bi-directional process. An example of a top-down implementation would be the administrator of a government digital data archive who provides public access to the data and who must comply with federal or national and/or state/provincial laws; archival standards and best practices; international standards (such as the OAIS Reference Model); and the archive’s own policies, which are determined by the archive staff and, if applicable, by its department and institution. Thus, any written policies would be considered explicit, and any “understood” policies would be considered implicit.

A competent technical administrator of the same archive should implement certain policies based on computer science best practices, even without the written policies. Those policies should include regular backups of the data to at least two off-site locations; the testing of the backups to make sure they are actually backing up the data; fixity/integrity checks of the data upon ingest and regularly thereafter to ensure the data has not been corrupted or changed; virus scans upon ingest; the use of unique identifiers; the migration of hardware and software; and, a method for accessing all stored files, among others. The technical administrator may implement these policies because of written, explicit policies, or simply because it is implicitly understood that this is what is done.
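
As an illustration of one of those technical policies, the sketch below shows a minimal fixity check: checksums are recorded at ingest and re-computed later to detect corruption or change. The choice of SHA-256, the JSON manifest, and the function names are assumptions made for the example, not a prescribed implementation.

```python
# A minimal sketch of an ingest-time fixity check with later verification.

import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_fixity(paths: list[Path], manifest: Path) -> None:
    """Record checksums at ingest into a simple JSON manifest."""
    manifest.write_text(json.dumps({str(p): sha256_of(p) for p in paths}, indent=2))


def verify_fixity(manifest: Path) -> list[str]:
    """Re-compute checksums and return the paths whose fixity has changed."""
    recorded = json.loads(manifest.read_text())
    return [p for p, digest in recorded.items() if sha256_of(Path(p)) != digest]
```

Whether such a check is run because a written policy requires it, or simply because the administrator understands it to be good practice, is exactly the explicit/implicit distinction discussed below.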

The policies themselves may be thus classified as:

  1. Standards (“explicit”). For data policy and management documents, these include international standards such as the “OAIS Reference Model” and the “Audit & Certification of Trusted Digital Repositories”. One can assume that the wording within these documents has been rigorously examined to avoid ambiguity, although some ambiguity may exist.
    Standards are thus explicit, written policies.
  2. Best Practices (“explicit or implicit”). For data policy and management documents, these would encompass two types of Best Practices. The first are the best practice policies of the domain that created the data, such as Social Science, Hydrology, or Physics. Some domains have no explicit written data policies; from an ILS perspective, it is a kind of Wild West. Other domains, such as Social Science, have very explicit policies.

    The second are the Best Practices of the domain that is stewarding the data. In some cases, the domain that created the data is also the domain that is stewarding the data for the indefinite long term. In other cases, the domain that is stewarding the data is a separate domain, such as ILS. If the domain stewarding the data is ILS, then policies tend to be explicit.

    Best Practices are thus either implicit (not written) or explicit (written) and usually vary by domain. If written, the wording may or may not be rigorous enough to avoid ambiguity.

  3. Implicit. These are policies implemented knowingly or unknowingly by the managerial and technical administrators of an archive. An example is the regular migration of the Social Science data and supporting software and hardware by Odum Institute employees from the 1960s to the late 2000s. The Odum Institute Data Archive currently has a stated preservation policy, but from the early 1960s to the late 2000s, the archive did not. The Data Archive administrators had an implicit goal of preserving the data and ensuring access to it over the long term, but they did not know they had a preservation policy. The preservation policy was not explicitly stated in any documentation until sometime around 2008.

    Thus, implicit policies are, well, “implicit”. The wording may or may not be unambiguous.

  4. Machine code (“explicit”, but based on both “explicit” and “implicit” policies). These may be scripts or programs that implement human-created policies at the machine level (a minimal sketch follows after this list). As previously stated, these policies may be implicit or explicit, and may conform to national, local, or other standards and Best Practices.

    Machine code policies are explicit in that they are implemented in code; the source of the policy itself may be written (explicit) or unwritten (implicit). The policy implemented by the code should be unambiguous, but in some instances may not be.

  5. Local policies (“explicit or implicit”). Local policies are the policies of the archive itself, its department and/or institution, and any relevant federal, state or county rules and regulations (within the USA). Local policies may or may not contradict Best Practices and International Standards. For example, archivists in the UK had to forgo certain requirements of the OAIS Reference Model in order to comply with UK guidelines for providing access to handicapped persons.
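
Returning to the machine-code category above, the sketch below illustrates how a written (explicit) retention policy might be enforced in code. The ten-year retention period, the record fields, and the legal-hold flag are hypothetical; they stand in for whatever the archive’s own explicit or implicit policies specify.

```python
# A minimal sketch of a human-written retention policy ("dispose of data
# sets X years after ingest, unless a hold applies") enforced in code.

from datetime import date

RETENTION_YEARS = 10  # hypothetical local policy

records = [
    {"id": "acc-1998-004", "ingest_date": date(1998, 3, 2), "hold": False},
    {"id": "acc-2020-117", "ingest_date": date(2020, 7, 9), "hold": True},
]

def expired(record, today=None):
    """True when the retention period has passed and no legal hold applies."""
    today = today or date.today()
    ingest = record["ingest_date"]
    cutoff = ingest.replace(year=ingest.year + RETENTION_YEARS)
    return today >= cutoff and not record["hold"]

for r in records:
    if expired(r):
        print(f"{r['id']}: flagged for disposition review under the retention policy")
```

The code is explicit by definition, but, as noted above, the policy it encodes may originate in a written standard, a local rule, or an unwritten understanding.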

Thus, it may be more appropriate to use computer coding for one kind of policy and human coding for another. I have outlined the following dimensions, based on the above. Table 1 assumes that the corpus to be analyzed is small enough to be coded by one or two human coders. If the corpus is too large, either more humans need to work on it, or computer coding must be utilized.

Table 1 – Dimensions of Human Coding Versus Computer Coding Trade-offs, Content Analysis Methodology

All of the above would require cross-checking and cross-validation. For example, if one wants to analyze domain Best Practices that are written, a small subset coded by two humans may determine whether or not the language is ambiguous. If the wording is reasonably unambiguous, then once agreement is reached between the two humans, a cross-check between the human coding and the computer software may determine whether the Best Practices are better analyzed by a human or by a computer.

Another factor in the decision to use human or computer coding – not mentioned in the table above – is the unit of analysis. If one simply wishes to perform a purely quantitative Content Analysis consisting of a straight word count, then computer coding is the best choice, regardless of the quality of the source material. If the unit of analysis is the sentence, the paragraph, or the document itself, then the implication is that the analysis also includes an interest in latent content. In that case, a computer or a human may analyze the policy documents, based on the matrix above.

If the goal of the analysis is simply to classify policies by type (“security”, for example) and by domain (“Physics archives use these types of policies”), a straightforward word count may provide enough data to answer that question. In that instance, a computer-coded analysis should be sufficient. If one is also interested in, for example, which issues the implemented policies represent, then a qualitative analysis is a better choice. This is likely to involve both a quantitative word-count analysis and human coding of the chosen corpus, with computer software then used to analyze the coded data.
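
A minimal sketch of that word-count approach follows; the keyword lists, category labels, and sample sentence are illustrative assumptions, and a real study would use a validated coding dictionary.

```python
# A minimal sketch of the quantitative, word-count approach: counting
# keyword hits to classify a policy document by type.

from collections import Counter
import re

KEYWORDS = {
    "security": {"password", "encryption", "access", "authentication"},
    "preservation": {"migration", "fixity", "checksum", "backup"},
}

def classify(text: str) -> dict[str, int]:
    """Count keyword occurrences per policy type in a document."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    return {label: sum(tokens[w] for w in words) for label, words in KEYWORDS.items()}

sample = "Each backup is verified with a checksum; access requires authentication."
print(classify(sample))  # {'security': 2, 'preservation': 2}
```

Such counts can classify documents cheaply at scale, but they say nothing about latent content, which is where human coding enters the picture.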

In order to ensure reliability and validity, one should explicitly state which existing policy documents (including computer code) were used, from whom they were obtained, when they were obtained, what version was examined, and any observed anomalies. If the policies are generated via an interview or survey, that must also be stated. If human coding is used, then at least two human coders should code the same material, and the results must be compared. The two or more coders must come to agreement on their results to ensure reliability.
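
One common way to make that comparison concrete is to compute simple percent agreement alongside Cohen’s kappa, which corrects for chance agreement. The sketch below assumes two coders have each assigned one label per document; the labels themselves are illustrative.

```python
# A minimal sketch of inter-coder reliability: percent agreement plus
# Cohen's kappa for two coders assigning one label per document.

from collections import Counter

def percent_agreement(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    n = len(a)
    observed = percent_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum((freq_a[k] / n) * (freq_b[k] / n) for k in set(a) | set(b))
    return (observed - expected) / (1 - expected)

coder_1 = ["security", "preservation", "access", "security", "access"]
coder_2 = ["security", "preservation", "security", "security", "access"]
print(percent_agreement(coder_1, coder_2))        # 0.8
print(round(cohens_kappa(coder_1, coder_2), 2))   # 0.69
```

Low agreement is a signal to revisit the code book and re-code, rather than a reason to discard the coders’ judgments.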

In order to ensure validity, one must be sure to design the study so that it measures what it is supposed to measure. If one is interested in determining which issues drive the creation of policies, then one should not design the study to measure the number of times “the” is used in a particular policy document. The study must gather data that is relevant to the question, not spurious. Again, inter-coder reliability should help ameliorate any unintentional study design flaw that would affect validity. With regard to Content Analysis, however, it is generally better to aim for high validity over high reliability.

Another way to ensure validity is to use human coding over computer coding. For example, since computers cannot always determine meaning, if one is examining latent content and the corpus is small enough, it may be better to use human coders. A computer analysis will certainly plow through a far larger corpus than a human or group of humans will be able to, but if the coding is not set up correctly, or the computer cannot determine meaning, then the results may be spurious. By contrast, one or more humans analyzing the same corpus may come up with valid and important results, because a human is generally better at assessing the implied meanings within a text, especially in terms of what is not stated.

If and when human coding does not match well with computer coding, there are several avenues for addressing the problem. The first is to compare the coding instructions for the software against the code book for the humans, in order to determine whether there are any discrepancies. Another is to determine whether the software has any particular bugs. One could make small adjustments to the configuration of the software to test if and how those changes affect the results. If the human training and the software configuration are in sync, and the software does not contain any bugs, then one can try installing the software on another machine as a completely fresh install, re-use the configuration, and run the analysis again. The second instance of the software may provide different results, as the first instance may have some unknown bugs. One can also check that the human coders and the machine are actually examining the same corpus. A human may “eyeball” the corpus the computer just examined to determine whether or not the computer is even ingesting the material correctly. If the results still do not match up, then the researcher will have to make a judgment call as to which results to use.
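
A small script can make that comparison systematic by isolating the documents on which the human and computer labels diverge, and the documents one side never ingested at all. The sketch below is illustrative; the document identifiers and labels are hypothetical.

```python
# A minimal sketch of isolating human/computer coding disagreements so they
# can be inspected for code book discrepancies, configuration problems, or
# corpus mismatches.

human = {"doc-01": "security", "doc-02": "preservation", "doc-03": "access"}
machine = {"doc-01": "security", "doc-02": "access", "doc-04": "security"}

# Documents one side coded but the other never ingested (a corpus mismatch).
missing = set(human) ^ set(machine)

# Documents both sides coded, but with different labels.
disagreements = {
    doc: (human[doc], machine[doc])
    for doc in set(human) & set(machine)
    if human[doc] != machine[doc]
}

print("check corpus ingest for:", sorted(missing))   # ['doc-03', 'doc-04']
print("review by hand:", disagreements)              # {'doc-02': ('preservation', 'access')}
```

The output narrows the judgment call to the specific documents in dispute, rather than forcing a wholesale choice between the two sets of results.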

In conclusion, there are multiple dimensions to consider when conducting a Content Analysis of data policy and management documents. Reliability and validity must be considered as well, as must inter-coder reliability and human-computer inter-coding reliability. There are trade-offs for all of these.

If you would like to work with us on a content analysis or data analysis and analytics project, please see our services page.
