Preservation standards for repositories do not exist in a void; they were created to address a particular issue: the long-term preservation of digital objects. Preservation repository and policy standards are designed to address long-term digital storage (i.e., digital curation and preservation) by defining “the what” (preservation repository design) and “the how” (preservation policies). This essay focuses primarily on the research data deluge and the implications for the long-term stewardship of data. The conclusion is that researchers want to focus on creating and analyzing data; some researchers care about the long-term stewardship of their data, while others do not. Effective data stewardship requires not just technical and standards-based solutions, but also people, financial, and managerial solutions. It remains to be seen whether or not funders’ requirements for data sharing will affect how much data is actually made available for re-purposing, re-use, and preservation.
Ward, J.H. (2012). Managing Data: the Data Deluge and the Implications for Data Stewardship. Unpublished manuscript, University of North Carolina at Chapel Hill.
This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.
Table of Contents
Table of Figures
Figure 1 – The National Aeronautics and Space Administration (NASA) Earth Observing System (EOS) Data Processing Levels (National Aeronautics and Space Administration, 2010; Ball, 2010).
Figure 2 – Space Science Board Committee on Data Management and Computation (CODMAC) Space Science Data Levels and Types (Ball, 2010).
Figure 4 – LIFE (Life Cycle Information for E-Literature) Project (Blue Ribbon Task Force on Sustainable Digital Preservation and Access, 2008).
Figure 12 – Individuals by Life Cycle Phase/Function (Interagency Working Group on Digital Data, 2009).
Figure 13 – Entities by Life Cycle Phase/Function (Interagency Working Group on Digital Data, 2009).
Preservation standards for repositories do not exist in a void. They were created to address a particular issue, which is the long-term preservation of digital objects, i.e., “data”. Waters & Garrett (1996) wrote that these standards must be created in order for archives to demonstrate that “they are what they say they are” and that they can “meet or exceed the standards and criteria of an independently-administered program”. Preservation repository and policy standards are designed to address long-term digital storage (i.e., digital curation and preservation) by defining “the what” (preservation repository design) and “the how” (preservation policies). The next step is to examine what types of data are being curated and preserved, that is, put into an “OAIS Reference Model inside” repository and managed with the Audit and Certification of Trustworthy Digital Repositories recommended practices, as well as to examine any related issues and factors.
Hey and Trefethen (2003) defined the data deluge through an examination of eScience, calling for “new” types of digital libraries for science data that would provide data-specific services and management. While the data deluge cuts across all sectors (Manyika, et al., 2011; Science and Technology Council, 2007), this essay focuses primarily on the research data deluge. It defines research data and the types of research data and collections; attempts to determine how much data exists; and examines “big data” versus privacy. It also describes the reasons researchers do and do not share their data and the role of data curators, and provides an overview of infrastructure. Finally, this literature review describes research data curation; examines example applications of general data management policies to repositories and to the data itself; and discusses the implications for the long-term stewardship of research data based on the literature reviewed.
What does it mean to “steward” data? The editors of Merriam-Webster (2012) defined stewardship as “the conducting, supervising, or managing of something; especially: the careful and responsible management of something entrusted to one’s care”. The authors of ForestInfo.org (2012) wrote that stewardship is “the concept of responsible caretaking; the concept is based on the premise that we do not own resources, but are managers of resources and are responsible to future generations for their condition”. Therefore, one may extrapolate that “data stewardship” is the “careful and responsible management of something entrusted to one’s care” so that future generations may access the data with full confidence that the data is what the provider says it is.
How does data stewardship differ from digital curation and digital preservation? Lazorchak (2011) wrote that he has used the terms interchangeably, but they are really three different processes. The detailed definitions for digital curation and digital preservation are available in the previous section, “Managing Data: the Emergence & Development of Digital Curation & Digital Preservation Standards”. However, in short, digital curation addresses the whole life cycle of digital preservation. Lazorchak (2011) stated that the concept of “digital stewardship…brings preservation and curation together…pulling in the lifecycle approach of curation along with research in digital libraries and electronic records archiving, broadening the emphasis from the e-science community on scientific data to address all digital materials, while continuing to emphasize digital preservation as a core component of action”.
Thus, one might say that digital preservation is the “what”; digital curation is the “how” for preserving the data; and digital or data stewardship is the “why” (to manage entrusted resources for future generations). Lynch (2008) wrote that the best data stewardship “will come from disciplinary engagement with preservation institutions”. That is, if scientists wish to manage their data so that it will be accessible for the indefinite long-term, then they will need to work with librarians, archivists, computer scientists, domain specialists, and other information professionals whose expertise lies in the curation and preservation of data.
What are data, metadata, and ontologies in the context of science research data? The National Science Foundation Cyberinfrastructure Council (2007) defined these terms. They wrote that “data are any and all complex data entities from observations, experiments, simulations, models, and higher order assemblies, along with the associated documentation needed to describe and interpret the data”. Next, the authors described metadata as a subset of, and about, data. They wrote that “metadata summarize data content, context, structure, interrelationships, and provenance…. They add relevance and purpose to data, and enable the identification of similar data in different data collections” (National Science Foundation Cyberinfrastructure Council, 2007). Finally, the council members defined ontology as “the systematic description of a given phenomenon. It often includes a controlled vocabulary and relationships, captures nuances in meaning and enables knowledge sharing and reuse” (National Science Foundation Cyberinfrastructure Council, 2007).
Employees of the U.S. Office of Management and Budget defined research data as, “the recorded factual material commonly accepted in the scientific community as necessary to validate research findings, but not any of the following: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues” (National Science Board, 2011). As part of this definition, the authors also included metadata and the analyzed data. The former may include computational codes, apparatuses, input conditions, and so forth, while the latter may include published tables, digital images, and tables of numbers from which graphs and charts may be generated, among others. Furthermore, they differentiated “digital research data” from research data by including a separate definition. They wrote that digital research data is “any digital data, as well as the methods and techniques used in the creation and analysis of that data, that a researcher needs to verify results or extend scientific conclusions, including digital data associated with non-digital information, such as the metadata associated with physical samples” (National Science Board, 2011).
Last, the members of the National Science and Technology Council Interagency Working Group on Digital Data (2009) wrote that:
“digital scientific data” refers to born digital and digitized data produced by, in the custody of, or controlled by federal agencies, or as a result of research funded by those agencies, that are appropriate for use or repurposing for scientific or technical research and educational applications when used under conditions of proper protection and authorization and in accordance with all applicable legal and regulatory requirements. It refers to the full range of data types and formats relevant to all aspects of science and engineering research and education in local, regional, national, and global contexts with the corresponding breadth of potential scientific applications and uses (National Science and Technology Council Interagency Working Group on Digital Data, 2009).
Thus, while there is some variation between the definitions of research data, the general consensus is that it consists of the items or objects that scientists analyze, create, and use in the process of conducting research.
When data are organized, they become collections. The National Science Foundation (2005) and the National Science Foundation Cyberinfrastructure Council (2007) defined three types of data collections: research, resource, and reference collections. The authors of the 2005 National Science Foundation report chose to speak of collections rather than databases because they wanted to encompass the individuals, infrastructure, and organizations indispensable to the management of the collection. Thus, the board members wrote that data collections fall under one of the three functional categories mentioned previously.
- Research Data Collections: these collections are created for a limited group, supported by a small budget, as part of one or more focused research projects, and may vary in size. The researchers who collect the data do not intend to preserve, curate, or process it, although this is often due to lack of funding. They may apply rudimentary standards for metadata structure, file formats, or content access policies. Often, there are no standards because the community-of-interest is very small. Some recent examples of these types of collections include Fluxes Over Snow Surfaces (FLOSS) and the Ares Lab Yeast Intron database.
- Resource/Community Data Collections: these types of data collections are maintained and created to serve an engineering or science community. The budgets to maintain the collection(s) are provided directly by agency funding and are generally intermediate in size. This funding model can make it challenging to gauge how long the collection will be available, due to changes in budget priorities. However, the community does tend to apply standards for the maintenance of the collection, either by developing community standards or re-purposing existing standards. Two examples of these types of collections include The Canopy Database Project and the PlasmoDB.
- Reference Data Collections: Characteristic features of these types of collections are a diverse set of user communities that represent large segments of the education, research, and scientific community. Users of these data sets include students, educators, and scientists across a variety of institutional, geographical, and disciplinary domains. The managers of these data collections tend to follow or create comprehensive, well-established standards. The creators, users, and managers of these data collections intend to make them available for the indefinite long-term, and budgetary support tends to come from multiple sources over the long-term. The examples for these types of data collections include The Protein Data Bank, SIMBAD, and the National Space Science Data Center (NSSDC) (The National Science Foundation, 2005; National Science Foundation Cyberinfrastructure Council, 2007).
The type of data collection does not necessarily indicate its long-term value to future researchers, but the collection type does affect the odds of the collection remaining usable and accessible within one or more generations. A small, under-funded, poorly documented research data collection may prove to be of great value to a future researcher who can figure out what the data is and how to access it, while a large, well-funded, and well-documented data collection may have no users after the original research study closes.
The types of data researchers create fall into three primary categories: structured, unstructured, or semistructured [sic]. Members of the National Research Council (2010) described structured data as rigidly formatted, while unstructured data consists of text. They provided personnel data, want ads, and so forth as examples of semi-structured data. The data in any of these categories may be created by a variety of processes that generally fall into one of three areas: scientific experiments, models or simulations, and observations.
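The three categories above can be made concrete with small, invented examples; the samples below are illustrative only, with the category labels following the National Research Council’s (2010) distinction.

```python
import csv
import io
import json

# Structured: a rigid schema; every record has the same fields.
structured = io.StringIO("id,name,salary\n1,Ada,120000\n2,Alan,110000\n")
rows = list(csv.DictReader(structured))

# Semi-structured: tagged fields with a flexible, possibly nested shape
# (personnel records and want ads often look like this).
semi_structured = json.loads(
    '{"name": "Ada", "roles": ["engineer", "reviewer"], "office": {"city": "London"}}'
)

# Unstructured: free text; any structure must be inferred after the fact.
unstructured = "The sensor failed twice during the June observation run; see field notes."
```

Parsing tools reflect the distinction: structured data maps directly onto rows and columns, semi-structured data needs a tree-aware parser, and unstructured data requires text analysis before it can be queried at all.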
The data generated from a scientific experiment is intended to be reproducible, at least in theory; in practice, researchers often do not have the time and funding to reproduce many experiments (Lynch, 2008). With regard to model or simulation data, researchers have preferred to retain the model and related metadata rather than the output data. Scientists have considered observational data to be irreplaceable, as it is usually the result of data gathering at a specific location and time that may not be reproducible. They have gathered raw data in the course of observations and/or experiments, while derived data results from combining or processing raw data (Research Information Network, 2008).
The National Aeronautics and Space Administration (NASA) Earth Observing System (EOS) developed a set of terminology to describe the degree to which data has been processed (Ball, 2010; National Aeronautics and Space Administration, 2010). The scheme defines five data levels, 0 through 4, each with subsets; Level 0 is the least processed and Level 4 is the most processed (see Figure 1, below). Ball wrote that under this scheme, “data do not have significant scientific utility until they reach…Level 1”.
The author noted that Level 2 has the greatest long-term usefulness, and that most scientific applications require data processed to at least that level. He described Level 3 data as being the most “shareable”; those data contain smaller sets than Level 2 data, and are thus easier to combine with other data.
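The level scheme above amounts to a small lookup table. In the sketch below, the one-line descriptions are paraphrases of the NASA EOS definitions rather than official wording, and the helper function encodes only Ball’s (2010) observation about Level 1.

```python
# Paraphrased summary of the NASA EOS data processing levels (Figure 1).
EOS_LEVELS = {
    0: "Raw, unprocessed instrument data at full resolution",
    1: "Time-referenced data annotated with calibration and ancillary information",
    2: "Derived geophysical variables at the Level 1 resolution",
    3: "Variables mapped onto uniform space-time grids",
    4: "Model output or results from analyses of lower-level data",
}

def scientifically_useful(level: int) -> bool:
    """Ball (2010): data lack significant scientific utility below Level 1."""
    return level >= 1
```

A repository ingest workflow could use such a table to record a data set’s processing level alongside the data, so that future users know whether further calibration or gridding is needed before analysis.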
Alternatively, members of the Space Science Board have developed specific definitions for space science data levels and types that range from raw data to a user description. See Figure 2, below.
These board members considered that space data is not just the data itself, but also any related documentation needed to access, run, correlate, calibrate, or extract information from the data.
The authors of the Research Information Network (2008) paper on research data sharing wrote that researchers and curators further process this data, either by reduction, annotation, or curation. They noted that researchers often share derived or reduced data; they do not often share raw data. The authors described how — once data has gone through this last process — it might be made available to other users and researchers, depending on the implicit and explicit policies of a particular domain. However, they stated that the trade-off to using derived data is that reproducibility may be compromised because something may have been lost in the processing.
In addition, the authors noted that if a researcher adds metadata to describe the processing techniques used, then the original provenance might be compromised. They reiterated that most researchers prefer to work with raw data, but practical reasons often prohibit its use by anyone other than the originating researchers. They described how, when researchers cannot or will not share raw data, it is sometimes because the data is in a proprietary format that must be converted to a more common format, and “something” is lost in the conversion. Other times, the raw data set may simply be too unwieldy to share, or the researcher(s) may not be willing to share it (Research Information Network, 2008).
Researchers and authors have found it challenging to determine how much data currently exists, let alone how much exists within science or how much will exist at any given point in the future in any field. In order to make an educated estimate, a researcher must determine what does and does not constitute data. Is it the actual data created by someone, or the information about that person and their data, such as metadata or someone’s digital exhaust? How do you de-duplicate data? Do you count a compressed file or folder, or an uncompressed file or folder? Another question to consider is: how much is “a lot of data”?
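One of the questions above, how to de-duplicate data, is commonly answered in practice with content hashing: byte-identical copies produce identical digests and are counted once. A minimal sketch follows, using byte strings to stand in for file contents; a real repository would hash files chunk by chunk.

```python
import hashlib

def deduplicated_size(blobs) -> int:
    """Total size with each distinct content counted exactly once."""
    seen = {}
    for blob in blobs:
        digest = hashlib.sha256(blob).hexdigest()  # identical bytes -> identical digest
        seen.setdefault(digest, len(blob))
    return sum(seen.values())

copies = [b"observation run 1" * 100,   # original
          b"observation run 1" * 100,   # byte-identical duplicate
          b"calibration table" * 10]    # distinct content

raw_total = sum(len(b) for b in copies)   # counts the duplicate twice
unique_total = deduplicated_size(copies)  # counts the duplicate once
```

The gap between the two totals is exactly why estimates of “how much data exists” diverge: counting raw copies, de-duplicated content, or compressed representations gives three different numbers for the same holdings.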
Tony Hey and Anne Trefethen’s seminal paper (2003) brought attention to the imminent e-Science data deluge and attempted to quantify the amount of data by examining Astronomy, Bioinformatics, Environmental Science, Particle Physics, Medicine and Health, and the Social Sciences. Lord and MacDonald (2003) also attempted to quantify the amount of research data by domain. At the time of this literature review, however, the figures in the two papers are around ten years out of date, so they will not be quoted here, although the authors’ argument that a deluge of data has arrived remains relevant. The point is that any researcher attempting to quantify and describe “the data deluge” must take into account the standards of the time, because what is considered “a lot of data” at one time may seem like “not much data” a generation later.
For example, technologists have often quoted Bill Gates as saying in 1981 that “640K ought to be enough for anybody” (Tickletux, 2007). (Various authors have written that he later denied making this statement; whether or not he made it, the point is that users tend to fill up whatever amount of digital storage is made available to them, and then complain that they need more.) Thus, in 1981, researchers used to measuring storage in kilobytes may have considered 10 gigabytes of data to be a “data deluge”. Researchers currently speak of data in terms of exabytes, zettabytes, and yottabytes; many, if not most, will concede that “a lot of data” or a “data deluge” is a relative phrase. One imagines that ancient archivists managing clay tablets and papyri considered themselves in the midst of a “data deluge”, and that a generation from now, future technologists will wonder why curators in the early 2000s considered exabytes “a lot of data”. However, whether the amount of data currently in existence is “a lot” or “not very much”, analysts have attempted to quantify the current data deluge using a variety of methodologies.
Thus, more recently, Hilbert and Lopez (2011) examined “all information that has some relevance for an individual” and did not try to distinguish between duplicate or different information. They considered the computation of information, in addition to its transmission through time (storage) and space (communication). Their study spanned two decades (1986-2007) and 60 categories worldwide (39 digital and 21 analog). Their research indicated that as of 2007, “humankind was able to store 2.9 × 10^20 optimally compressed bytes, communicate almost 2 × 10^21 bytes, and carry out 6.4 × 10^18 instructions per second on general purpose computers” (Hilbert & Lopez, 2011).
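Figures at these magnitudes are easier to compare when converted to the prefixes researchers actually quote. A small helper, using decimal (SI) prefixes, can put the estimates above on one scale; the function name and formatting are illustrative, not drawn from any cited source.

```python
# Decimal (SI) byte prefixes, as used in storage-capacity estimates.
PREFIXES = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def humanize(n_bytes: float) -> str:
    """Express a byte count with the largest SI prefix keeping the value >= 1."""
    value, i = float(n_bytes), 0
    while value >= 1000 and i < len(PREFIXES) - 1:
        value /= 1000
        i += 1
    return f"{value:.3g} {PREFIXES[i]}"

print(humanize(2.9e20))  # Hilbert & Lopez's 2007 storage estimate: 290 EB
print(humanize(1.8e21))  # IDC's 2011 projection: 1.8 ZB
```

On this scale, the world’s 2007 storage total (about 290 exabytes) and IDC’s projected 1.8 zettabytes of information created and replicated in 2011 differ by roughly a factor of six, which makes the growth rates discussed below easier to follow.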
Beginning in 2007, the research firm IDC has produced an annual report estimating the amount of new digital information generated and replicated; the study is sponsored by EMC, an information-management company. IDC’s 2007 study concluded that our ability to generate new data had exceeded the world’s capacity to store it, and projected an annual growth rate for data of 40% through 2020 (Manyika, et al., 2011). In June 2011, the authors of the IDC report wrote that “the amount of information created and replicated will surpass 1.8 zettabytes…growing by a factor of 9 in just five years” (Gantz & Reinsel, 2011).
While researchers have examined the amount of data that individuals and organizations are generating, there is little insight into how much variation exists among the different sectors, such as education, industry, and government. However, Manyika, et al.’s (2011) research indicated that while the Library of Congress (LC) had collected 235 terabytes of data as of April 2011, fifteen out of seventeen sectors in the USA each store more data than the LC. For example, James Hamilton, a Vice President at Amazon, has noted that the capacity Amazon ran on in all of 2001 is now added to its data centers daily (Gallagher, 2012). Hamilton’s comment reinforces the earlier point that “a lot of data” is a relative term; one imagines that Amazon’s employees considered that they processed and stored “a lot of data” in 2001, especially relative to their storage and processing capacities in the 1990s.
Regardless of whether or not the current data-intensive environment is a “deluge”, one must consider current technology and demands versus processing and storage requirements. Manyika, et al. (2011) determined that critical mass has been reached in every sector, but that the intensity of the data generated varies. They derived these aggregate results by examining four factors: utilization rate, duplication rate, average replacement cycle of storage, and annual storage capacity shipped by sector (please see Figure 3, above). The consultants found that for the year 2010, the amount of data stored in enterprise external disk storage, including replicas, was 7.4 × 10^18 bytes. Their research indicated that for the same year, consumers generated 6.8 × 10^18 bytes. Furthermore, Gallagher (2012) wrote that “Google processes over 20 petabytes of data per day” on searches alone. One must concede that, given current technologies versus user demands and expectations, that is a lot of data.
It is important to note that data is not just about the content that is created; it is also about the information generated around it, such as browsing histories, geographic locations, and other metadata and “digital exhaust” (Gantz & Reinsel, 2011). The two authors wrote that the amount of information being created about users of data is greater than the amount of data and information users are creating themselves. Evans and Foster (2011) stated that this “metaknowledge” (knowledge about knowledge) may include idioms particular to a domain or scientist, the status and history of researchers when included in a paper, as well as the focus and audience of a particular journal. The authors argued that studying metaknowledge could provide useful information about the spread of ideas within a research domain, particularly from teacher to student.
However, metaknowledge may also be considered digital exhaust. Evans and Foster (2011) described metaknowledge as the explicit information about someone that is publicly available, such as a short biography submitted by an author as part of a paper. Burgess (2011) defined digital exhaust as the information all users leave behind when using digital technology. This exhaust ranges from something as innocuous as a name in the metadata of a Microsoft Word document, which may allow a researcher to determine who his or her anonymous reviewer is, to browsing history, to one’s physical location as determined by proximity to a cell phone tower.
The data generated about any one individual may be unimportant but, en masse, it gives governments and corporations an incredible amount of data and information about individuals that was previously private. This information may include Tweets, photos, emails, Facebook posts, and so on. For example, Hough (2009) discussed a study in which 75% of Facebook users posted information indicating that they were out of town, thus putting themselves at risk of a break-in.
Sullivan (2012) described university and government agencies’ demands for athletes’ and job applicants’ Facebook user names and passwords in order to better monitor each person’s personal habits and preferences; some state legislators are banning the practice, citing the First Amendment. Solove (2007) argued that just because individuals may have nothing to hide does not mean that they must share their personal data, while Hough (2009) declared that individuals should not be so willing to give up their privacy as the price of using technology. Hough cited a study by Sweeney (2002) in which 87% of the population of the United States could be uniquely identified using only 1990 census data: gender, date of birth, and a five-digit zip code. Sweeney also showed it is fairly easy to determine an individual’s Social Security Number, particularly for individuals born after 1980, simply by knowing their date and place of birth.
As well, Sweeney (2002) provided one of the most famous examples of how easy it is to find individual information. The researcher correlated a public data set provided by the primary health care provider for Massachusetts state employees with publicly available voter registration data. The voter rolls contained each individual’s name, birth date, address, gender, and zip code. The health care data set contained each anonymized individual’s birth date, zip code, and gender, along with individual medical information such as medications and procedures. Sweeney used this information to find then-Massachusetts Governor Weld’s medical records, and promptly mailed the Governor his own records! She found his records by matching the shared attributes: Governor Weld then lived in Cambridge, Massachusetts, and based on the voter rolls, six people in Cambridge shared his birth date, three of them were men, and only one lived in his five-digit zip code.
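At its core, this re-identification is a join on shared quasi-identifiers. A toy sketch follows; all records below are invented, and only the shape of the attack follows the account above.

```python
# Public voter roll: names plus quasi-identifiers (invented records).
voter_roll = [
    {"name": "W. Weld", "birth_date": "1945-07-31", "zip": "02138", "sex": "M"},
    {"name": "J. Doe",  "birth_date": "1945-07-31", "zip": "02139", "sex": "M"},
    {"name": "A. Roe",  "birth_date": "1945-07-31", "zip": "02138", "sex": "F"},
]

# "Anonymized" medical data: names removed, quasi-identifiers intact.
medical_records = [
    {"birth_date": "1945-07-31", "zip": "02138", "sex": "M", "diagnosis": "..."},
]

def quasi_id(record):
    """The attribute combination shared by both data sets."""
    return (record["birth_date"], record["zip"], record["sex"])

def reidentify(medical, voters):
    """Map each medical record's quasi-identifier to the matching voter names."""
    return {
        quasi_id(m): [v["name"] for v in voters if quasi_id(v) == quasi_id(m)]
        for m in medical
    }
```

When exactly one name matches, as here, the “anonymized” record is re-identified. This is the failure mode that Sweeney’s later k-anonymity work addresses, by requiring every quasi-identifier combination in a released data set to match at least k individuals.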
A few years later, in March 2010, Netflix cancelled an annual prize competition to develop better recommendation algorithms, due to privacy concerns. Narayanan and Shmatikov (2007) had correlated the supposedly anonymized user data Netflix provided to the contest’s participants with data from the Internet Movie Database. The researchers claimed they successfully identified the Netflix records of known users, thus revealing their implied political views and other potentially sensitive information.
Thus, researchers must be careful about what data they release, how much, and to whom. Even supposedly anonymized data may provide enough detail to be dangerous when correlated with other publicly available data.
The National Science Foundation and the National Institutes of Health in the United States, as well as major research funders in the United Kingdom, now require the researchers they fund to provide data management plans and be prepared to share the data generated from their research (National Science Foundation, 2010; National Institutes of Health, 2010; Jones, 2011). The policy arguments for sharing data rest primarily on two grounds: ensuring the reproducibility and replicability of science, and making the results of taxpayer-funded research re-usable in order to maximize the returns on the high costs of gathering the data in the first place (Borgman, 2010; National Science Board, 2011).
As noted above, observational data is the most vulnerable with regard to reproducibility because it is tied to a specific time and place; experimental data and model data are replicable in theory. If these data are curated in the appropriate formats with the required software, hardware, and any related scripts, then the research results should be replicable. Borgman (2010), Lynch (2008), Fry, et al. (2008), and Lord and MacDonald (2003) stated that the reasons librarians and libraries should curate the outputs of scientific research are straightforward: curation is not an end in itself, but a way of supporting science by providing methods for access, use, and re-use, and a more complete and transparent record of science. However, the members of the National Science Board (2011) have made the point that a one-size-fits-all approach to data sharing is neither desirable nor feasible; instead, the National Science Foundation (2010) has encouraged each domain to establish its own standards for data management.
Other policy reasons cited by the National Research Council (2010) and Borgman (2010) included the creation of new science from new questions asked of existing data, such as finding patterns, and the advancement of research in general through a new set of data-intensive methods that move science beyond theory, simulation, and empiricism, i.e., “the Fourth Paradigm”. Wired Magazine’s Chris Anderson (2008) took the Fourth Paradigm idea too far, however, when he declared that “the data deluge makes the scientific method obsolete”. As Borgman (2010) observed, “access to data does not a scientist make”: rigorous data analysis requires a certain amount of expertise to accurately interpret often-complex information and associated metadata. Fry, et al. (2008) cited a study in which researchers expressed concern that public access to research data would only increase confusion, rather than transfer any useful knowledge to the general public.
Given the potential dangers of providing data to others for use and re-use, as noted in a previous section, why should researchers share their data with anyone? The reasons vary, but generally involve coercion (i.e., a funder requires it); a requirement for reciprocal data sharing; the value of collaboration; reduced costs from preventing duplicate data collection; and a desire to support the scientific method and ensure that studies are replicable (Borgman, 2010; Van den Eynden, et al., 2011). Researchers’ willingness to share varies by domain; for example, it is rare for climate scientists to share their data or to re-use another researcher’s model-run data, so climate scientists have little incentive to repurpose data for re-use.
However, for those researchers who work in a domain that shares data formally or informally, such as Astrophysics (Harley, et al., 2010), the Research Information Network (2008) study indicated that other incentives for sharing include paper co-authorship opportunities, greater collaboration opportunities, and greater visibility for the researcher’s institution and research group. Regardless of whether or not a particular domain encourages data sharing, Borgman (2010, 2008) wrote that publication is still the route to success and rewards, not data sharing, although research productivity is shown to increase with both informal and formal data sharing, especially with secondary publications (Pienta, Alter, & Lyle, 2010).
Borgman (2010, 2008) and Fry, et al. (2008) also noted other disincentives to sharing data: the time and resources required to re-purpose the data; researchers’ inability to control their intellectual property; and concerns that their research results will be “scooped” by another researcher if no embargo period on data sharing is required and enforced. In addition, Lynch (2008), Fry, et al. (2008), and Cragin, et al. (2010) listed legal and ethical constraints, lack of expertise in data management, a lack of time to handle data requests, and a lack of technical infrastructure in which to publicly archive the data.
Scholars prefer to perform research and write the publications rather than curate data for re-use and storage (Lynch, 2008; Harley, et al., 2010). However, Pienta, Alter and Lyle (2010) studied the use and re-use of Social Science primary research data, and their research indicates that while informal data sharing is the norm in the Social Sciences, the sharing of data via an archive “leads to many more times the publications than not sharing data”.
Publications such as Science and Nature have called upon the larger science communities to create the infrastructure to share and curate data for the indefinite near term (Hanson, Sugden & Alberts, 2011; Editor, 2009, 2005). The editors of Science, for example, require authors to submit not just a copy of the data itself, but any computer code required to read the data. The Toronto International Data Release Workshop Authors (2009) examined prepublication data sharing within genomics, and they recommended that it be extended to related domains. At the opposite end, Schofield, et al., (2009) discussed ways to promote data sharing among mouse researchers in an opinion piece. The authors concluded that a research commons must be created, but that data sharing would require an entire culture change for their field.
Curry (2011) provided an example of particle physicists who rescued an old data set from the 1990s; these physicists then wrote more than a dozen new high-impact papers from this same set. In spite of these examples, and the support of major publications, Nelson (2009) wrote that the power to “prod” researchers to share their data must come from the organizations that have real clout with researchers: the funding agencies, scientific societies, and journals. However, as Lynch (2008) noted, the best use of scientists’ time is to devote it to practicing science. He wrote that researchers are not the best at data management, and this area should be left to professional data stewards.
Thus, it appears that most managers of major funding agencies, librarians and archivists, scientists, and journal editors and authors have been encouraging or requiring data sharing among researchers. Whether a given researcher is willing to do so, however, may depend on a variety of factors, including personal preference. So long as data analysis takes up the majority of researchers’ time, they may not have the resources to share data, even with the appropriate infrastructure and policies in place, given the amount of time it takes to prepare data for use, re-use, and long-term preservation (Research Information Network, Institute of Physics, Institute of Physics Publishing, & Royal Astronomical Society, 2011). Given these resource constraints, how well and how often researchers will share their data, even when willing, remains to be determined, in spite of funders’ requirements.
Researchers may find incentives to share their data, as more data-centric infrastructure becomes the norm, even in domains in which data sharing is not the norm. However, as Lynch (2008) concluded, one of the issues that must be clarified concerns what institution or domain is responsible for providing the underlying infrastructure and data stewardship. Some librarians think that it is the library’s responsibility to provide this infrastructure; others believe it is better for each domain to come together and create this infrastructure, given the proprietary nature of data formats, software, etc.; while still others promote the concept of national data centers; and, finally, some data managers prefer institution-based infrastructure (Walters & Skinner, 2011; Research Information Network, 2011; UKRDS, 2008; Soehner, Steeves & Ward, 2010).
The members of the Association of Research Libraries (ARL) institutions have described four models of data infrastructure to support e-science: multi-institutional collaborations; a decentralized or unit-by-unit approach; a centralized or institution-wide response; or, a hybrid centralized and decentralized approach (Soehner, Steeves, & Ward, 2010). Lyon (2007) derived a “domain deposit model” and a “federation deposit model” from her study results. She described the domain deposit model as a “strong integrated community…with well-established common standards, policy and practice”, and defined the federation deposit model as a group of repositories which have come together “based on some agreed level of commonality” in a documented partnership. The author wrote that the “federation deposit model” might be built around an institution, regional geographic boundaries, format type, or software platform.
The debate over who will provide infrastructure, and what model that support service will follow, is similar to the problems that arose with the development of Institutional Repositories (IRs) in the 2000s. arXiv, while not an Institutional Repository per se, grew out of the Physics community’s culture of sharing research results immediately, and has grown to encompass Computer Science, Astrophysics, and Mathematics, among others; but that does not mean the arXiv model fits all e-print needs for all domains or institutions (Ginsparg, 2011). Researchers’ needs have been heterogeneous, as are each field’s communication styles and technical expertise (Kling & McKim, 2000; Borgman, 2008). Foster and Gibbons’ (2005) study found that librarians eagerly built Institutional Repositories, only to meet a lukewarm reception from faculty and researchers, which led to a lack of IR content.
The Research Information Network (2009) studied life sciences researchers and noted that one infrastructure and data sharing model will not fit all research domains, and that the information practices of life scientists do not match those of information practitioners and policy makers. Librarians may wish to grow data-sharing infrastructure more carefully than they did IRs, building it based on need rather than the latest trend. So far, however, researchers have seemed to value data centers, stating that their existence has improved their ability to “undertake high-quality research” (Research Information Network, 2011). Whether one or more of the above-mentioned ARL models will prove to be the best choice remains in flux, generally because each institution and domain has different needs and requirements.
As regards other areas of big data infrastructure, such as preservation repository design and policies, those topics were covered in the previous sections, “Managing Data: Preservation Repository Design (the OAIS Reference Model),” and “Managing Data: Preservation Standards and Audit and Certification Mechanisms (i.e., ‘policies’)”. Other, more technical discussions, such as over-the-network and local data processing, data discoverability and indexing, physical networking infrastructure, interoperability, security, data center design, syncing, data replication, data backups, etc., are beyond the scope of this essay.
In conclusion, the results of the studies discussed in this essay indicate that for data to be stewarded for the long term, research scientists will need a support infrastructure that is technical, financial, and managerial.
Lyon (2007) observed that there was a “dearth of skilled practitioners, and data curators play an important role in maintaining and enhancing the data collections that represent the foundations of our scientific heritage”. The author wrote that in time, “native data scientists” would emerge from within each domain’s curriculum as data management becomes integrated into graduate research training. Gray, Carozzi and Woan (2011) noted, “normal data management practice…corresponds to notably good practice in most other areas”. Their recommendation was for administrators to formalize data management planning in order to make it more auditable. One aspect of this formalization is to define the roles and responsibilities by individual, role, and sector.
The members of the National Science Board (National Science Foundation, 2005) defined the primary roles and responsibilities of institutions and individuals. They defined four primary individual roles: data authors, data managers, data scientists, and data users.
- Data Author: this individual is involved in research that produces digital data. This person should receive credit for the production of the data, and ensure that it may be broadly disseminated, if appropriate. The data author must ensure that the metadata, and data recording, context, and quality all conform to community standards.
- Data Manager: this individual is responsible for the maintenance and operation of the database. This person must follow best practices for technical management such as replication, backups, fixity checks, security, enforcement of legal provisions, and implement and enforce community standards and preferences for data management. The data manager must provide appropriate contextual information for the data, and design and maintain a system that encourages data deposit by making it as simple and easy as possible.
- Data Scientist: the individuals who are data scientists have a variety of roles. This person may be a librarian, archivist, computer or information scientist, software engineer, database manager or other disciplinary expert. His or her contributions involve advising on the implementation of technology and best practices to the data for long-term stewardship and ensuring that it is implemented properly, as well as enhancing the ability of domain scientists to conduct their research using digital data. This role involves creative inquiry, analysis, and outreach, as well as participating in research appropriate to the data scientist’s own domain, for the purposes of publication and contributing to research progress.
- Data User: this individual is a member of the larger research and scientific community, and will benefit from having access to data sets that are well-defined, searchable, robust, and well-documented. The data user must credit the data author, adhere to copyright and other restrictions, and notify the appropriate data managers or data authors of any data errors (National Science Foundation, 2005).
The National Science Foundation (2005) authors also defined the responsibilities of the funding agencies. They stated that these agencies must provide a science commons to enable data sharing, help to create a culture in which data sharing is rewarded, and enable access to data across research communities. The board members were adamant that the representatives of the funding agencies, the agencies themselves, the various individuals, and their respective institutions, all have a part to play in ensuring the long-term stewardship of data.
Swan and Brown (2008) examined the roles and career structures of data scientists and curators in order to provide recommendations for their career development. They defined and examined both the roles and the career trajectories of those who manage the data itself. First, the authors distinguished the following four roles based on interviews of practicing data scientists and curators.
- Data creator: researchers with domain expertise who produce data. These people may have a high level of expertise in handling, manipulating and using data.
- Data scientist: people who work where the research is carried out – or, in the case of data centre personnel, in close collaboration with the creators of the data – and may be involved in creative enquiry and analysis, enabling others to work with digital data, and developments in data base technology.
- Data manager: computer scientists, information technologists or information scientists who take responsibility for computing facilities, storage, continuing access and preservation of data.
- Data librarian: people originating from the library community, trained and specialising in the curation, preservation and archiving of data (Swan & Brown, 2008).
Next, the authors interviewed practitioners regarding their roles, responsibilities, and career satisfaction. They discovered that most data scientists moved into their roles “by accident rather than design”; that “there is no defined career structure”; and that they feel undervalued within their research group due to the lack of professional training and/or a defined career path. Swan & Brown (2008) described three primary roles for libraries with respect to data stewardship. First, librarians must provide preservation and archiving services for data, particularly through Institutional Repositories. Second, they must provide consulting and training for data creators. Third, librarians must develop a training curriculum for data librarians.
The Interagency Working Group on Digital Data (2009) defined the various roles involved with “harnessing the power of digital data for science and society”. They described the entities by role, individual, sector, and life cycle phase/function, and the individuals by role and life cycle phase/function. They defined entities as research projects, data centers, libraries, archives, etc., and defined the role for each one, providing an existing example: for instance, the authors listed eleven tasks under “role” for the entity “archives”, and named the National Archives and Records Administration as an example. They defined eleven different types of individual roles, including data scientist, librarian, and researcher, along with a corresponding definition for each role. Please go to Appendix A to view the complete set of tables as Figures 6-13.
In conclusion, the authors above have demonstrated that while one person may take on the multiple roles of data creator, data scientist, data manager, and data user, it ultimately takes an entire team and community to ensure the long-term survivability of research data.
General funding and sustainability estimates are covered in the section, “Managing Data: the Emergence & Development of Digital Curation & Digital Preservation Standards”. This section will focus on sustainability issues related to data sets, per se.
It is as challenging for information practitioners to determine the true cost of data stewardship as it is for them to measure the amount of digital data. Gray, Carozzi and Woan (2011) cited several studies and existing science archives, including one that had been built recently by an experienced archive staff. The authors wrote that staff costs, as well as acquisition and ingest costs, account for a substantial portion of preservation project funding, which reflected Lord and MacDonald’s (2003) earlier findings. They did not provide any hard numbers, though, and noted that those costs scale only weakly as an archive grows larger. In other words, they learned that an archive’s initial size governs its costs: when an archive starts small and grows larger, its costs grow far more slowly than its holdings.
Gray, Carozzi and Woan (2011) called for a costing model to be developed, as they found that there is a lack of consensus on the long-term costs related to the preservation of large-scale data. Lord and MacDonald (2003), Lyon (2007), Fry, Lockyer, and Oppenheim (2008), and Ball (2010) had all previously called for the development of a solid cost model as well, having found it challenging to determine the “full costs of curating data”. One of the primary sources of confusion regarding how much data stewardship will cost is determining who is responsible (i.e., who will pay) for data stewardship, along with the differing degrees of data curation (Blue Ribbon Task Force on Sustainable Digital Preservation and Access, 2008; Fry, Lockyer, and Oppenheim, 2008).
The problems the Blue Ribbon Task Force on Sustainable Digital Preservation and Access (2008) defined as barriers to developing an accurate cost model are systemic, rather than simply about finding and setting a price for the product. The problems they identified include: the idea that “current practices are good enough”; the fear of addressing adequate data stewardship because it is “too big”; inadequate incentives to support the group effort needed to create sustainable economic models; a lack of long-term thinking regarding funding models; and lack of clarity and alignment with regards to the various responsibilities and roles between data stakeholders.
The Task Force reviewed several models including the LIFE (Life Cycle Information for E-Literature) project and the model by Beagrie, Chruszcz, and Lavoie (2008). The members of the LIFE project aimed the model towards libraries, and one of their discoveries has been that “upfront (i.e., one-time) costs of a project are often distinct in structure from the recurring maintenance aspects of the same project” (Blue Ribbon Task Force on Sustainable Digital Preservation and Access, 2008).
So, for example, when SHERPA-DP IR used the LIFE model to determine their full lifecycle costs, they determined that, excluding interest rates and depreciation, their costs measured at the unit for which metadata is created (e.g., per object cost for analogue, per page cost for digital) are:
- Year 1: 18.40 English pounds per year total cost
- Year 5: 9.70 English pounds per year total cost
- Year 10: 8.10 English pounds per year total cost (Blue Ribbon Task Force on Sustainable Digital Preservation and Access, 2008).
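Treating the three published figures as data points, the rate of decline can be checked with a short calculation (the script below is purely illustrative; it uses nothing beyond the figures above):

```python
# Per-unit total cost (GBP) reported for SHERPA-DP under the LIFE model.
costs = {1: 18.40, 5: 9.70, 10: 8.10}

years = sorted(costs)
for start, end in zip(years, years[1:]):
    # Percentage decline in per-unit cost between two reported years.
    drop = (costs[start] - costs[end]) / costs[start] * 100
    print(f"Year {start} -> Year {end}: {drop:.1f}% decline")
```

The decline from Year 1 to Year 5 (roughly 47%) is far steeper than from Year 5 to Year 10 (roughly 16%), consistent with the broader finding that preservation costs fall at a decreasing rate as the retention period lengthens.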
Beagrie, Chruszcz, and Lavoie (2008) developed a model to inform institutions of higher learning of their preservation costs. They built upon the work of the LIFE project team, and mapped it to the Trustworthy Repositories Audit & Certification: Criteria and Checklist and the OAIS Reference Model. The authors discovered upon the application and testing of the model “that the costs of preservation increase, but at a decreasing rate, as the retention period increases” (Blue Ribbon Task Force on Sustainable Digital Preservation and Access, 2008). The administrators of the Archaeology Data Service applied Beagrie, Chruszcz, and Lavoie’s (2008) method to examine staff salaries, time, and days, and re-adjusted their charging policy accordingly, reaching a more realistic assessment of costs.
Beagrie and JISC (2010) summarized the model in a fact sheet that outlined recommendations to funders and institutions regarding what costs most (acquisition and ingest), the impact of fixed costs (they do not vary by the size of the collection and staff costs remain high), and the declining costs of preservation over time (they decline to minimal levels after 20 years). The authors outlined the benefits (direct, indirect, near- and long-term, private and public) to preserving research data; those benefits have been outlined throughout this paper. The authors discussed the various types of repositories and recommended a federated model with local storage at the departmental level, with additional back up at the institutional level. They also encouraged institutions to work with existing archives over creating new ones. Finally, they pointed out that research data are heterogeneous and are less likely to be stored in an Institutional Repository.
In conclusion, while Beagrie, Chruszcz, and Lavoie (2008) and the LIFE project, among others, have developed substantive cost models that provide very useful financial information for repository managers, these will need to be revised and updated over the long-term in order to determine the accuracy of the respective models.
General data curation is covered in another section, “Managing Data: the Emergence & Development of Digital Curation & Digital Preservation Standards”. Therefore, the remainder of this section will address only those areas related to research data curation that were not covered in the previous literature review on digital curation.
According to Ball (2010), the curation of research data is best understood in terms of the research data life cycle, data repositories, and funders’ requirements and guidance.
What is data curation, and how does it differ from digital curation, if at all? First, it is important to note that the curation of scientific data goes back centuries. Data curation is an older term than “digital curation”. It applied to journals, reports, or databases that were selected, annotated, normalized and integrated to be used and re-used by other researchers or historians. These data were not and are not always in digital form. Data curation is a narrower concept than digital curation, and although the two phrases are often used synonymously, they are not interchangeable (Ball, 2010).
Second, to further clarify, Lord and MacDonald (2003) included the following tasks as part of data curation:
- Selection of datasets to curate.
- Bit-level preservation of the data.
- Creation, collection and bit-level (or hard-copy) preservation of metadata to support contemporaneous and continuing use of the data: explanatory, technical, contextual, provenance, fixity, and rights information. Surveillance of the state of practice within the research community, and updating of metadata accordingly.
- Storage of the data and metadata, with levels of security and accessibility appropriate to the content.
- Provision of discovery services for the data; e.g. surfacing descriptive information about the data in local or third-party catalogues, enabling such information to be harvested by arbitrary third-party services.
- Maintenance of linkages with published works, annotation services, and so on; e.g., ensuring data URLs continue to refer correctly, ensuring identifiers remain unique.
- Identification and addition of potential new linkages to emerging data sources.
- Updating of open datasets.
- Provision of transformations/refinements of the data (by hand or automatically) to allow compatibility with previously unsupported workflows, processes and data models.
- Repackaging of data and metadata to allow compatibility with new workflows, processes and (meta)data models (Ball, 2010).
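Several of these tasks, notably bit-level preservation and the maintenance of fixity information, reduce in practice to computing and periodically re-verifying checksums. A minimal sketch in Python (the manifest format and function names here are illustrative, not drawn from any cited system):

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large data files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_manifest(manifest: dict, root: Path) -> list:
    """Return the relative paths whose current checksum no longer matches
    the fixity value recorded in the manifest (i.e., possible bit rot)."""
    return [rel for rel, expected in manifest.items()
            if sha256_of(root / rel) != expected]
```

In a production repository such verification would run on a schedule, with any mismatch triggering restoration from a replica.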
The authors included curation tasks that are part of the broader concept of digital curation, such as bit-level preservation, metadata creation, and selection. They also made specific provision for data curation by including data transformation, refinement, and repackaging (e.g., data clean-up), tasks not normally associated with the curation of, say, digital objects consisting of e-prints or photographic images.
Ball (2012) wrote that lifecycle models help practitioners plan in advance for the various stages involved in the stewardship of digital data. There are several lifecycle models available for guidance. The author described the “I2S2 Idealized Scientific Research Activity Lifecycle Model” as a model produced from the researchers’ perspective, while the “DCC Curation Lifecycle Model” is produced from the perspective of information professionals. These two lifecycle models were chosen as a representative sample of the information available across the various lifecycle models; time and space limitations prohibit a longer discussion of the pros and cons of each.
Thus, this section will discuss the “I2S2 Idealized Scientific Research Activity Lifecycle Model”, and will attempt to describe the common themes across several available data management lifecycle models. The “DCC Curation Lifecycle Model” is covered in a previous essay, “Managing Data: the Emergence & Development of Digital Curation & Digital Preservation Standards”.
The members of the I2S2 project created the “I2S2 Idealized Scientific Research Activity Lifecycle Model” with the researcher’s perspective in mind, not the data manager’s; thus, Ball (2012) wrote, archiving is a very small part of the lifecycle. The project team’s goal was to integrate, accelerate, and automate the research process, and they created this lifecycle model in support of those goals. They designed the model to support research activity, not data management per se, outlining the tasks involved throughout the lifecycle of a research project.
The project team designed the model with four key elements: curation activity, research activity, publication activity, and administrative activity. They sketched out the curation activity as a task performed by a data archive or repository. They outlined the administrative activity as the process of applying for funding, providing for reports, and writing final reports. The authors of the model defined publication activity as those tasks involved with preparing the data for public use and the writing and publication of papers. And, finally, they defined the research activity as that part of the project that involves conducting the research itself.
The data management lifecycle models included in this section for purposes of describing themes common across all life cycles are: the Interagency Working Group on Digital Data (IWGDD) Digital Data Lifecycle Model (Interagency Working Group on Digital Data, 2009); the Data Documentation Initiative (DDI) Combined Life Cycle Model; the Australian National Data Service (ANDS) Data Sharing Verbs; the DataONE Data Lifecycle; the UK Data Archive Data Lifecycle; the Research360 Institutional Research Lifecycle; and, the Capability Maturity Model for Scientific Data Management (Ball, 2012).
The themes common across all lifecycle models include planning the project; gathering, processing, analyzing, describing and storing the data; and, archiving the data for future use. It is interesting to note that only the “DCC Curation Lifecycle Model” provides for the deletion of data; an unstated assumption by the authors in the remaining models is that all data will be re-used and re-purposed.
This section’s content is discussed in the previous essay, “Managing Data: Preservation Repository Design (the OAIS Reference Model)”.
Administrators at both the National Institutes of Health and the National Science Foundation now require researchers to provide data management plans in their grant proposals. They have instituted policies that require researchers to make the resulting research data from the grant-funded project available for re-use within a reasonable length of time.
The National Institutes of Health (NIH) requirements (2010; 2003) mandate that researchers share the final data set once the publication of the primary research findings has been accepted. The NIH has allowed for large studies; the data sets from such studies may be released in a series, as the results from each data set are published or as each data set becomes available.
The administrators at the NIH have required that all organizations and individuals receiving grants make the results of their research available to the public and to the larger research community. They have required a simple data management plan for any grant proposals requesting more than $500,000. If a researcher cannot share the data, then they must provide a compelling reason to the NIH in the data management plan.
The grantors at the NIH have asked grantees to provide the following information in the data management plan: mode of data sharing; the need, if any, for a data sharing agreement; what analytical tools will be provided; what documentation will be provided; the format of the final data set; and, the schedule for sharing the data.
The following are three example data management plans that the NIH has provided to grant applicants.
- Example 1: The proposed research will involve a small sample (less than 20 subjects) recruited from clinical facilities in the New York City area with Williams syndrome. This rare craniofacial disorder is associated with distinguishing facial features, as well as mental retardation. Even with the removal of all identifiers, we believe that it would be difficult if not impossible to protect the identities of subjects given the physical characteristics of subjects, the type of clinical data (including imaging) that we will be collecting, and the relatively restricted area from which we are recruiting subjects. Therefore, we are not planning to share the data.
- Example 2: The proposed research will include data from approximately 500 subjects being screened for three bacterial sexually transmitted diseases (STDs) at an inner city STD clinic. The final dataset will include self-reported demographic and behavioral data from interviews with the subjects and laboratory data from urine specimens provided. Because the STDs being studied are reportable diseases, we will be collecting identifying information. Even though the final dataset will be stripped of identifiers prior to release for sharing, we believe that there remains the possibility of deductive disclosure of subjects with unusual characteristics. Thus, we will make the data and associated documentation available to users only under a data-sharing agreement that provides for: (1) a commitment to using the data only for research purposes and not to identify any individual participant; (2) a commitment to securing the data using appropriate computer technology; and (3) a commitment to destroying or returning the data after analyses are completed.
- Example 3: This application requests support to collect public-use data from a survey of more than 22,000 Americans over the age of 50 every 2 years. Data products from this study will be made available without cost to researchers and analysts. https://ssl.isr.umich.edu/hrs/
User registration is required in order to access or download files. As part of the registration process, users must agree to the conditions of use governing access to the public release data, including restrictions against attempting to identify study participants, destruction of the data after analyses are completed, reporting responsibilities, restrictions on redistribution of the data to third parties, and proper acknowledgement of the data resource. Registered users will receive user support, as well as information related to errors in the data, future releases, workshops, and publication lists. The information provided to users will not be used for commercial purposes, and will not be redistributed to third parties. (National Institutes of Health, 2003)
The implementers of the NIH data management plans wanted to make them as simple as possible, as these plans are but one part of the NIH grant application. However, it is evident to most information professionals that these plans are not adequate for long-term data stewardship.
The authors of the National Science Foundation (2011) policy on data management wanted to provide a way to share data within a community while recognizing intellectual property rights, allow for the preparation and submission of publications, and protect proprietary or confidential information. They have made it clear to grant recipients that they must facilitate and encourage data sharing.
The NSF has required grant applicants to include a 2-page supplementary document entitled “Data Management Plan”, describing how any data resulting from the NSF-funded research will be disseminated and shared in accordance with NSF policy. The authors of the NSF’s data management plan (DMP) policy have recognized that each of the seven directorates has a different culture and different requirements for data sharing. Therefore, the administrators at the NSF have given each directorate leeway to determine the best data management practices for its domain, including whether or not researchers must deposit data in a public data archive (Hswe and Holt, 2010).
These policy makers have defined the following areas as items that may be included in a data management plan.
- The types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project.
- The standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies).
- Policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements.
- Policies and provisions for re-use, re-distribution, and the production of derivatives.
- Plans for archiving data, samples, and other research products, and for preservation of access to them (National Science Foundation, 2011).
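A plan covering these elements can also be checked mechanically. The sketch below is hypothetical (it is not an NSF or NIH tool, and the element keys are invented labels paraphrasing the policy text); it simply flags which of the five NSF elements a draft plan leaves unaddressed.

```python
# Hypothetical completeness check for the five NSF DMP elements.
# The key names paraphrase the policy text; none of this code
# comes from an official NSF tool.

NSF_DMP_ELEMENTS = [
    "data_types",        # data, samples, software, curriculum materials
    "standards",         # data and metadata format and content standards
    "access_policies",   # access, sharing, privacy, confidentiality, IP
    "reuse_policies",    # re-use, re-distribution, derivatives
    "archiving_plans",   # archiving and preservation of access
]

def missing_elements(plan: dict) -> list:
    """Return the NSF DMP elements a draft plan has not addressed."""
    return [e for e in NSF_DMP_ELEMENTS if not plan.get(e, "").strip()]

draft = {
    "data_types": "Survey responses (CSV) and analysis scripts (R).",
    "archiving_plans": "Deposit in ICPSR at project close.",
}
print(missing_elements(draft))  # flags the three unaddressed elements
```

A check like this enforces only coverage, not quality; whether each section is adequate remains a human judgment.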
The authors of the data management plan policy have allowed for exceptions to the policy. They stated that grant applicants may include a data management plan that includes “the statement that no detailed plan is needed, as long as the statement is accompanied by a clear justification” (National Science Foundation, 2011).
Because grant administrators at the NSF have required data management plans from grant applicants only since January 2011, examples of data management plans from successful grant applications are not yet available, unlike with the NIH.
Librarians and archivists in the United States have drawn heavily upon the work performed by the employees and researchers of the Joint Information Systems Committee (JISC) and the Digital Curation Centre (DCC) in the United Kingdom. Most academic and research librarians at major research universities and related institutions have provided a plethora of online templates, tools, and resources for NSF grant applicants to use. While there is some variation in minor details, most of the data management plans created by information professionals contain the same elements. The Interuniversity Consortium for Political and Social Research (ICPSR) (2012) and the California Digital Library (2012) are among those institutions and individuals that have developed extensive data management plan guidance for researchers.
Information professionals at ICPSR compiled their recommended elements for a data management plan that researchers may draw from when compiling a plan for either the NSF or NIH. They recommended that researchers include a description of the data; a survey of existing data; the existing formats of the data; any and all relevant metadata; data storage methods and backups; data security; the names of individuals responsible for the data; intellectual property rights; access and sharing; the intended audience; the selection and retention period; any procedures in place for archiving and preservation; ethics and privacy concerns; data preparation and archiving budget; data organization; quality assurance; and, legal requirements (ICPSR, 2011).
Researchers and employees of the California Digital Library created the “Data Management Plan Tool” (2012) based on prior work by the Digital Curation Centre (2012) to allow researchers to quickly create a legible plan suitable to their particular funder’s requirements. For example, the authors of the tool took into account each NSF directorate’s requirements and created a separate template based on those requirements. They included funding agencies such as the Institute of Museum and Library Services (IMLS), the Gordon and Betty Moore Foundation, the National Endowment for the Humanities (NEH), and, of course, the NSF. They did not include a template for the NIH. The authors created the templates so that outputs in the final document created by the researcher may include information about data types, metadata and data standards, access and sharing policies, redistribution and re-use policies, and archiving and preservation policies. They designed the templates to output only the fields the researcher completes, so while there are standard templates based on requirements, the output may vary based on the information provided by the researcher (California Digital Library, 2012).
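The “output only the completed fields” behavior can be illustrated in a few lines. This is a hypothetical sketch of the idea, not DMPTool’s actual implementation: a funder template defines an ordered list of prompts, and the rendered plan includes only the sections the researcher answered.

```python
# Hypothetical sketch of conditional template rendering (not DMPTool
# code): the template and prompts below are invented for illustration.

NSF_TEMPLATE = [
    ("Data Types", "What data will the project produce?"),
    ("Metadata Standards", "Which data and metadata standards apply?"),
    ("Access and Sharing", "How will the data be shared?"),
    ("Archiving and Preservation", "Where will the data be preserved?"),
]

def render_plan(template, answers: dict) -> str:
    """Render only the sections the researcher completed."""
    sections = []
    for heading, _prompt in template:
        answer = answers.get(heading, "").strip()
        if answer:  # skip empty fields, so output varies by researcher
            sections.append(f"{heading}\n{answer}")
    return "\n\n".join(sections)

plan = render_plan(NSF_TEMPLATE, {
    "Data Types": "Strain time series in HDF5.",
    "Archiving and Preservation": "Deposited in the project archive.",
})
print(plan)  # two sections; the two unanswered prompts are omitted
```

The same template can thus produce differently shaped plans for different researchers, which matches the variation the CDL tool allows.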
Carlson (2012) created a data curation profile toolkit for librarians and archivists to use when interviewing researchers about their data. While Carlson did not create this toolkit in support of the NSF requirements, reference librarians may find it a useful resource for questions to draw upon when they collaborate with a scientist. The author designed the toolkit as a semi-structured interview to assist librarians in conducting a data curation assessment with a researcher. Carlson created a user guide, an interviewer’s manual, an interview worksheet, and a data curation profile template. He designed the questions to elicit the information required to curate data; most of the information required from the researcher maps to the recommended elements of the ICPSR Data Management Plan, above.
In conclusion, information professionals have been working hard to assist researchers in developing appropriate planning tools with which the researchers may steward the data. However, many researchers are unaware of these services, or consider them to be yet another bureaucratic hurdle (Research Information Network, 2008). It remains to be seen whether or not data creators will use the services information professionals have made available. It also remains to be seen whether or not the data management plans required and approved by the National Institutes of Health and the National Science Foundation will be adequate for long-term data stewardship, at least by the standards of information professionals.
This section briefly discusses the automation of preservation policies and the application of policies to data curation.
How can information professionals tame the data deluge while stewarding data? One way is for these professionals to take human-readable data stewardship policies and implement them at the machine-level (Rajasekar, et al., 2006; Moore, 2005). This “policy virtualization” is discussed in a previous section, “Managing Data: the Emergence & Development of Digital Curation & Digital Preservation Standards”, and an example is presented.
Reagan W. Moore has stated that the challenge in translating human-readable policies into machine-readable code is that most groups cannot prove that they are doing what they say they are doing (personal communication, January 6, 2012). This is a known problem; as Waters and Garrett (1996) stated in the Executive Summary of their seminal report, archives must be able to prove that “they are who they say they are by meeting or exceeding the standards and criteria of an independently-administered program”.
Moore & Smith (2007) automated the validation of Trusted Digital Repository (TDR) assessment criteria. They created four levels of assessment criteria mapped to TDR Assessment Criteria: Enterprise Level Rules, such as descriptive metadata; Archives Level Rules, such as consistency rules for persistent identifiers; Collection Level Rules, such as flags for service level agreements; and, Item Level Rules, such as periodic rule checks for format consistency. The authors implemented these rules using iRODS with DSpace.
The researchers successfully demonstrated that preservation policies could be implemented automatically at the machine level, and that an administrator could audit the system and prove that the TDR assessment criteria had been successfully implemented. In other words, Moore & Smith (2007) were able to prove that they were preserving what they said they were preserving by translating human-readable policies into machine-readable code. One application of this work is the SHAMAN (2011) project. Moore and colleagues have also successfully implemented an automated preservation system by virtualizing policies using iRODS (Moore, et al., 2007).
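The layered-rules idea can be sketched as follows. Moore & Smith implemented their rules in iRODS over DSpace; the Python below is only an illustrative stand-in, with invented rule bodies and record fields, showing how rules at each assessment level become small predicates evaluated over the repository to produce an auditable report.

```python
# Illustrative stand-in for layered assessment rules in the style of
# Moore & Smith (2007). The real implementation used iRODS rules over
# DSpace; these rule bodies and item fields are invented for the sketch.

def enterprise_rule(item):
    # Enterprise level: descriptive metadata must be present.
    return bool(item.get("title")) and bool(item.get("creator"))

def archives_rule(item):
    # Archives level: persistent identifiers follow a consistent scheme.
    return str(item.get("pid", "")).startswith("hdl:")

def item_rule(item):
    # Item level: recorded format is on the approved format list.
    return item.get("format") in {"application/pdf", "text/plain"}

RULES = {"enterprise": enterprise_rule,
         "archives": archives_rule,
         "item": item_rule}

def audit(collection):
    """Run every rule over every item; report failures per rule level."""
    report = {level: [] for level in RULES}
    for item in collection:
        for level, rule in RULES.items():
            if not rule(item):
                report[level].append(item["pid"])
    return report

collection = [
    {"pid": "hdl:1/1", "title": "Report", "creator": "Ward",
     "format": "application/pdf"},
    {"pid": "doi:10/x", "title": "", "creator": "Ward",
     "format": "image/tiff"},
]
print(audit(collection))  # the second item fails at all three levels
```

Because the report is generated mechanically from the same rules that enforce the policy, it doubles as audit evidence that the repository is doing what it says it is doing.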
Another method is to encode all metadata with the object itself. Gladney and Lorie (2005) and Gladney (2004) have proposed the creation of durable objects in which all relevant information is encoded with the object itself. This was briefly discussed in a previous essay, “Managing Data: Preservation Standards and Audit and Certification Mechanisms (e.g., “policies”)”.
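The general idea of a self-describing object can be sketched briefly. This is not Gladney’s actual durable-object encoding, only an assumed minimal illustration: content, descriptive metadata, and a fixity digest travel together in one package, so the object can be verified with no outside registry.

```python
# Sketch of the self-describing-object idea (not Gladney's actual
# durable-object encoding): content, metadata, and a fixity digest
# are bundled into a single blob that can verify itself.

import hashlib
import json

def make_durable_package(content: bytes, metadata: dict) -> bytes:
    """Bundle content, metadata, and a SHA-256 digest into one blob."""
    package = {
        "metadata": metadata,
        "sha256": hashlib.sha256(content).hexdigest(),
        "content_hex": content.hex(),
    }
    return json.dumps(package, sort_keys=True).encode("utf-8")

def verify_package(blob: bytes) -> bool:
    """Check, using only the blob itself, that the content is intact."""
    package = json.loads(blob)
    content = bytes.fromhex(package["content_hex"])
    return hashlib.sha256(content).hexdigest() == package["sha256"]

blob = make_durable_package(b"1996 survey microdata",
                            {"title": "Survey", "creator": "Ward"})
print(verify_package(blob))  # True: content still matches its digest
```

The appeal for long-term stewardship is that verification requires nothing beyond the object, though format obsolescence of the encoding itself remains the harder problem.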
Beagrie, Semple, Williams, and Wright (2008) outlined a model of digital preservation policies and analyzed how those policies could underpin key strategies for United Kingdom (UK) Higher Education Institutions (HEI). They mapped digital preservation links to other key strategies for other higher education institutions, such as records management policies. They also examined current digital preservation policies and modeled a digital preservation policy. The authors proposed that funders use their study to evaluate the implementation of best practices within UK HEIs.
Similarly, Jones (2009) examined the range of policies required in HEIs for digital curation in order to support open access to research outputs. She argued that curation only begins once effective policies and strategies are in place. She mapped then-current curation policies to pinpoint the areas that needed further development and support so that open access to research would be supported. The author wrote that the implementation of curation policies in UK HEIs is patchy, although there have been some improvements. She concluded that for effective digital curation of open access research to occur, a robust infrastructure must be in place; financing and actual costs must be determined; and, the differing roles and responsibilities must be defined and put in place.
As noted earlier in this paper, research data has slightly different policy requirements than general digital library collections, such as ePrint archives. Green, MacDonald, and Rice (2009) addressed those policy differences and created a planning tool and decision-making guide for institutions with existing digital repositories that may add research data sets to their collections.
The authors based the guide on the OAIS Reference Model (CCSDS, 2002), the Trusted Digital Repository Assessment Criteria (CCSDS, 2011) and the OpenDOAR Policy Tool (Green, MacDonald, and Rice, 2009). They addressed policies related to datasets, primarily social science, but they included policies for content such as grey literature, video and audio files, images, and other non-traditional scholarly publications. The authors designed the guide with the idea of supporting sound data management practice, data sharing, and long-term access in a simplified format.
Thus, sound, strategically applied policies must underpin the efforts to steward data for the indefinite long term, whether those policies are applied at the machine level or via human effort.
Research data management is in flux, much as digital libraries were in their early days. In spite of all of this work to create standards, and various funder requirements, some data will be lost. The questions are: how much data will be lost, and by whom; whether the data is replaceable; and how valuable the actual data set itself is, versus knowing the reported results of any published analysis of the lost data set(s). It is also likely that some data sets will languish, unused but very carefully curated.
Having said that, much less data will be lost than if no repository and policy standards, and no funder requirements, had been created in the first place. Standards and funder requirements can only do so much; the data creators themselves have to want to ensure the data is shareable and accessible for the long term, and the infrastructure must be in place for them to do so. This infrastructure includes not only the physical hardware and software, but also defined policies, standards, metadata, funding, and roles and responsibilities, among others.
First among these infrastructure elements must be explicit incentives for researchers to take the time to annotate and clean up their data and any related software and scripts for re-use, or to ensure that someone else does it for them. Information professionals must provide the data stewardship services, but it is up to the data creators to provide the data.
The final conclusion is that researchers want to focus on creating and analyzing data. Some researchers care about the long-term stewardship of their data, while others do not. It remains to be seen whether or not funders’ requirements for data sharing will impact how much data is actually made available for re-purposing, re-use, and, preservation.
Effective data stewardship requires not just technical and standards-based solutions, but also people, financial, and managerial solutions. As the old proverb states, “You can lead a horse to water, but you cannot make him drink” (Speake & Simpson, 2008).
Anderson, C. (2008, October 23). The end of theory: the data deluge makes the scientific method obsolete. Wired, 16.07. Retrieved November 18, 2010, from http://www.wired.com/science/discoveries/magazine/16-07/pb_theory
Ball, A. (2012). Review of Data Management Lifecycle Models. Project Report. Bath, UK: University of Bath. Retrieved March 10, 2012, from http://opus.bath.ac.uk/28587/1/redm1rep120110ab10.pdf
Ball, A. (2010). Review of the State of the Art of the Digital Curation of Research Data. Project Report. Bath, UK: University of Bath, (ERIM Project Document erim1rep091103ab12). Retrieved January 25th, 2012, from http://opus.bath.ac.uk/19022/
Beagrie, C. & JISC. (2010). Keeping Research Data Safe Factsheet Cost issues in digital preservation of research data. Charles Beagrie Ltd and JISC. Retrieved September 29, 2010 from http://www.beagrie.com/KRDS_Factsheet_0910.pdf
Beagrie, C., Chruszcz, J. & Lavoie, B. (2008). Keeping Research Data Safe. JISC. Retrieved September 9, 2009, from http://www.jisc.ac.uk/publications/publications/keepingresearchdatasafe.aspx
Beagrie, N., Chruszcz, J. & Lavoie, B. (2008). Executive summary. In Keeping research data safe. JISC. Retrieved January 24, 2009, from http://www.jisc.ac.uk/publications/publications/keepingresearchdatasafe.aspx
Beagrie, N., Semple, N., Williams, P. & Wright, R. (2008). Digital Preservation Policies Study Part 1: Final Report October 2008. Salisbury, UK: Charles Beagrie, Limited. Retrieved January 24, 2012 from http://www.jisc.ac.uk/media/documents/programmes/preservation/jiscpolicy_p1finalreport.pdf
Blue Ribbon Task Force on Sustainable Digital Preservation and Access. (2008, December). Sustaining the digital investment: issues and challenges of economically sustainable digital preservation. San Diego, CA: San Diego Supercomputer Center. Retrieved January 24, 2009, from http://brtf.sdsc.edu/biblio/BRTF_Interim_Report.pdf
Borgman, C.L. (2008). Data, disciplines, and scholarly publishing. Learned Publishing, 21, 29-38. Retrieved January 25, 2012, from http://www.ingentaconnect.com/content/alpsp/lp/2008/00000021/00000001/art00005
Borgman, C.L. (2010). Research Data: Who will share what, with whom, when, and why? Fifth China-North America Library Conference, September 8-12, 2010, Beijing, China. Retrieved December 15, 2010, from http://works.bepress.com/borgman/238
Burgess, C. (2011, January 31). Your Name, Your Privacy, Your Digital Exhaust. Infosec Island. Retrieved March 7, 2011, from http://infosecisland.com/blogview/11450-Your-Name-Your-Privacy-Your-Digital-Exhaust.html
California Digital Library. (2012). DMPTool. Retrieved February 12, 2012, from https://dmp.cdlib.org/
Carlson, J. (2012). Demystifying the data interview: developing a foundation for reference librarians to talk with researchers about their data. Reference Services Review, 40(1), 7-23. Retrieved February 9, 2012, from http://dx.doi.org/10.1108/00907321211203603
CCSDS. (2011). Audit and certification of trustworthy digital repositories recommended practice (CCSDS 652.0-M-1). Magenta Book, September 2011. Washington, DC: National Aeronautics and Space Administration (NASA).
CCSDS. (2002). Reference model for an Open Archival Information System (OAIS) (CCSDS 650.0-B-1). Washington, DC: National Aeronautics and Space Administration (NASA). Retrieved April 3, 2007, from http://nost.gsfc.nasa.gov/isoas/
Cragin, M.H., Palmer, C.L., Carlson, J.R. & Witt, M. (2010). Data sharing, small science, and institutional repositories. Philosophical Transactions of the Royal Society, 368, 4023-4038.
Curry, A. (2011). Rescue of Old Data Offers Lesson for Particle Physicists. Science, 331, 694-695.
Digital Curation Centre. (2012). DMPOnline. Retrieved February 12, 2012, from http://www.dcc.ac.uk/dmponline
Editor. (2009). Data’s shameful neglect. Nature, 461, 145.
Editor. (2005). Let data speak to data. Nature, 438, 531.
Evans, J.A. & Foster, J.G. (2011). Metaknowledge. Science, 331, 721-725.
Foster, N.F. & Gibbons, S. (2005). Understanding Faculty to Improve Content Recruitment for Institutional Repositories. D-Lib Magazine, 11(1). Retrieved March 8, 2012, from http://www.dlib.org/dlib/january05/foster/01foster.html
Fry, J., Lockyer, S., Oppenheim, C., Houghton, J., & Rasmussen, B. (2008). Identifying benefits arising from the curation and open sharing of research data produced by UK Higher Education and research institutes (Final report). London: JISC. Retrieved January 25, 2012, from http://ie-repository.jisc.ac.uk/279/
Gallagher, S. (2012, January). The Great Disk Drive in the Sky: How Web giants store big—and we mean big—data. Ars Technica. Retrieved March 7, 2012, from http://arstechnica.com/business/news/2012/01/the-big-disk-drive-in-the-sky-how-the-giants-of-the-web-store-big-data.ars
Gantz, J. & Reinsel, D. (2011). Extracting Value from Chaos. IDC #1142. Retrieved February 21, 2012, from http://idcdocserv.com/1142
Ginsparg, P. (2011). ArXiv at 20. Nature, 476, 145-147.
Gladney, H.M. & Lorie, R.A. (2005). Trustworthy 100-Year digital objects: durable encoding for when it is too late to ask. ACM Transactions on Information Systems, 23(3), 229-324. Retrieved December 17, 2011, from http://eprints.erpanet.org/7/
Gladney, H.M. (2004). Trustworthy 100-Year digital objects: evidence after every witness is dead. ACM Transactions on Information Systems, 22(3), 406-436. Retrieved July 12, 2008, from http://doi.acm.org/10.1145/1010614.1010617
Gray, N., Carozzi, T., & Woan, G. (2011). Managing Research Data — Gravitational Waves. Draft final report to the Joint Information Systems Committee (JISC). University of Glasgow: Research Data Management Planning (RDMP). Retrieved March 3, 2011, from https://dcc.ligo.org/public/0021/P1000188/006/report.pdf
Green, A., Macdonald, S., & Rice, R. (2009). Policy-making for research data in repositories: a guide. Edinburgh, UK: University of Edinburgh.
Hanson, B., Sugden, A., & Alberts, B. (2011). Making Data Maximally Available. Science, 331, 649.
Harley, D., Acord, S.K., Earl-Novell, S., Lawrence, S., & King, C.J. (2010). Assessing the Future Landscape of Scholarly Communication: An Exploration of Faculty Values and Needs in Seven Disciplines – Executive Summary. UC Berkeley: Center for Studies in Higher Education. Retrieved January 23, 2012, from http://escholarship.org/uc/item/0kr8s78v
Hey, T. and Trefethen, A. (2003). The Data Deluge: An e-Science Perspective. In F. Berman, G. Fox, and A. Hey (Eds.), Grid Computing – Making the Global Infrastructure a Reality (pp. 809-824). Chichester, England: John Wiley & Sons. Retrieved January 23, 2012, from http://eprints.ecs.soton.ac.uk/7648/
Hilbert, M. & López, P. (2011). The World’s Technological Capacity to Store, Communicate, and Compute. Science Express, 332(6025), 60-65.
Hough, M.G. (2009). Keeping it to ourselves: technology, privacy, and the loss of reserve. Technology in Society, 31, 406-413. Retrieved February 1, 2010, from http://libproxy.lib.unc.edu/login?url=http://dx.doi.org/10.1016/j.techsoc.2009.10.005
Hswe, P. & Holt, A. (2010). Guide for Research Libraries: The NSF Data Sharing Policy. E-Science. Association of Research Libraries. Retrieved January 6, 2012, from http://www.arl.org/rtl/eresearch/escien/nsf/index.shtml
Interagency Working Group on Digital Data. (2009). Harnessing the power of digital data for science and society. Report of the Interagency Working Group on Digital Data to the Committee on Science of the National Science and Technology Council. Washington, DC: Office of Science and Technology Policy. Retrieved April 9, 2009, from http://www.nitrd.gov/about/Harnessing_Power_Web.pdf
Inter-university Consortium for Political and Social Research (ICPSR). (2012). Guide to Social Science Data Preparation and Archiving: Best Practice Throughout the Data Life Cycle (5th ed.). Ann Arbor, MI. Retrieved January 5, 2012, from http://www.icpsr.umich.edu/icpsrweb/content/ICPSR/access/deposit/guide/
Inter-university Consortium for Political and Social Research (ICPSR). (2011). Elements of a Data Management Plan. Data Deposit and Findings. Ann Arbor, MI: University of Michigan, Institute for Social Research. Retrieved March 10, 2012, from http://www.icpsr.umich.edu/icpsrweb/content/ICPSR/dmp/elements.html
Jones, S. (2011). Summary of UK research funders’ expectations for the content of data management and data sharing plans. University of Glasgow: Digital Curation Centre (DCC). Retrieved January 26, 2012, from http://www.dcc.ac.uk/webfm_send/499
Jones, S. (2009). A report on the range of policies required for and related to digital curation. DCC Policies Report, v. 1.2. University of Glasgow: Digital Curation Centre. Retrieved January 26, 2012, from http://www.dcc.ac.uk/webfm_send/129
Kling, R. & McKim, G.W. (2000). Not just a matter of time: field differences and the shaping of electronic media in supporting scientific communication. Journal of the American Society for Information Science and Technology, 51(14), 1306-1320.
Lazorchak, B. (2011). Digital Preservation, Digital Curation, Digital Stewardship: What’s in (Some) Names? Retrieved March 11, 2012, from http://blogs.loc.gov/digitalpreservation/2011/08/digital-preservation-digital-curation-digital-stewardship-what’s-in-some-names/
Lord, P. & Macdonald, A. (2003). Data curation for e-Science in the UK: An audit to establish requirements for future curation and provision (E-Science Curation Report). London: JISC. Retrieved January 26th, 2012 from http://www.jisc.ac.uk/media/documents/programmes/preservation/e-science reportfinal.pdf
Lynch, C. (2008). How do your data grow? Nature, 455, 28-29.
Lynch, C. (2008). The institutional challenges of cyberinfrastructure and e-research. Educause Review, 43(6). Washington, DC: Educause. Retrieved January 22, 2009, from http://www.educause.edu/EDUCAUSE+Review/EDUCAUSEReviewMagazineVolume43/TheInstitutionalChallengesofCy/163264
Lyon, L. (2007). Dealing with Data: Roles, Rights, Responsibilities and Relationships. Consultancy Report. University of Bath: UKOLN. Retrieved January 10, 2012, from http://opus.bath.ac.uk/412/
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C. & Byers, A.H. (2011). Big data: the next frontier for innovation, competition, and productivity. Report. Seoul: McKinsey Global Institute. Retrieved June 1, 2011, from http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation
Moore, R., Rajasekar, A., & Marciano, R. (2007). Implementing Trusted Digital Repositories. In Proceedings of the DigCCurr2007 International Symposium in Digital Curation, University of North Carolina – Chapel Hill, Chapel Hill, NC USA, 2007. Retrieved September 24, 2010, from http://www.ils.unc.edu/digccurr2007/papers/moore_paper_6-4.pdf
Moore, R. (2005). Persistent collections. In S.H. Kostow & S. Subramaniam (Eds.), Databasing the brain: from data to knowledge (neuroinformatics) (pp. 69-82). Hoboken, NJ: John Wiley and Sons.
Moore, R. & Smith, M. (2007). Automated Validation of Trusted Digital Repository Assessment Criteria. Journal of Digital Information, 8(2). Retrieved March 2, 2010, from http://journals.tdl.org/jodi/article/view/198/181
Narayanan, A. & Shmatikov, V. (2007). How To Break Anonymity of the Netflix Prize Dataset. Retrieved March 7, 2012, from http://arxiv.org/abs/cs/0610105
National Aeronautics and Space Administration. (2010). The National Aeronautics and Space Administration (NASA) Earth Observing System (EOS) Data Processing Levels. NASA Science Earth. Retrieved March 14, 2012, from http://science.nasa.gov/earth-science/earth-science-data/data-processing-levels-for-eosdis-data-products/
National Institutes of Health. (2010). Data Sharing Policy. NIH Grants Policy Statement (10/10) – Part II: Terms and Conditions of NIH Grant Awards, Subpart A: General – File 6 of 6. Retrieved March 7, 2012, from http://grants.nih.gov/grants/policy/nihgps_2010/nihgps_ch8.htm#_Toc271264951
National Institutes of Health. (2003). NIH Data Sharing Policy and Implementation Guidance. Grants Policy. Retrieved March 7, 2011, from http://grants.nih.gov/grants/policy/data_sharing/data_sharing_guidance.htm#fin
National Research Council. (2010). Steps toward large-scale data integration in the science summary of a workshop. Reported by S. Weidman and T. Arrison, National Research Council. Washington, D.C.: The National Academies Press.
National Science Board. (2011). Digital Research Data Sharing and Management. NSB-11-79, December 14, 2011. Arlington, VA: National Science Board. Retrieved January 18, 2012, from http://www.nsf.gov/nsb/publications/2011/nsb1124.pdf
National Science Foundation. (2011). NSF 11-1 January 2011 Chapter II – Proposal Preparation Instructions. Grant Proposal Guide. Retrieved January 16, 2011, from http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_2.jsp
National Science Foundation. (2011). Dissemination and Sharing of Research Results. NSF Data Sharing Policy. Retrieved January 15, 2011, from http://www.nsf.gov/bfa/dias/policy/dmp.jsp
National Science Foundation. (2010). Data Management for NSF Engineering Directorate Proposals and Awards. Directorate for Engineering (ENG), National Science Foundation. Retrieved September 2, 2010, from http://nsf.gov/eng/general/ENG_DMP_Policy.pdf
National Science Foundation. (2005). Long-lived digital data collections enabling research and education in the 21st century (NSB-05-40). Arlington, VA: National Science Foundation. Retrieved May 5, 2008, from http://www.nsf.gov/pubs/2005/nsb0540/
National Science Foundation Cyberinfrastructure Council. (2007). Cyberinfrastructure vision for 21st century discovery (NSF 07-28). Arlington, VA: National Science Foundation. Retrieved November 12, 2007, from http://www.nsf.gov/pubs/2007/nsf0728/index.jsp
Nelson, B. (2009). Data sharing: empty archives. Nature, 461, 160-163.
Pienta, A.M., Alter, G. & Lyle, J. (2010). The Enduring Value of Social Science Research: The Use and Reuse of Primary Research Data. Paper presented at the workshop on “the Organisation, Economics and Policy of Scientific Research”, held April 23-24, 2010, in Torino, Italy. Retrieved January 5, 2012, from http://deepblue.lib.umich.edu/handle/2027.42/78307
Rajasekar, A., Wan, M., Moore, R. & Schroeder, W. (2006). A prototype rule-based distributed data management system. Paper presented at a workshop on “next generation distributed data management” at the High Performance Distributed Computing Conference, June 19-23, 2006, Paris, France.
Research Information Network, Institute of Physics, Institute of Physics Publishing, & Royal Astronomical Society. (2011). Collaborative yet independent: information practices in the physical sciences. A Research Information Network Report. London, UK: Research Information Network, December 2011. Retrieved January 26, 2012, from http://www.iop.org/publications/iop/2012/page_53560.html
Research Information Network. (2011). Data centres: their use, value, and impact. A Research Information Network report. London, UK: Research Information Network, September 2011.
Research Information Network. (2009). Patterns of information use and exchange: case studies of researchers in the life sciences. A Research Information Network Report. London, UK: Research Information Network, November 2009. Retrieved January 25, 2012, from http://www.rin.ac.uk/our-work/using-and-accessing-information-resources/patterns-information-use-and-exchange-case-studie
Research Information Network. (2008). Stewardship of digital research data: A framework of principles and guidelines. A Research Information Network report. London, UK: Research Information Network, January 2008.
Research Information Network. (2008). To Share or not to Share: Publication and Quality Assurance of Research Data Outputs. A Research Information Network report. London, UK: Research Information Network, June 2008.
Schofield, P.N., Bubela, T., Weaver, T., Portilla, L., Brown, S.D., Hancock, J.M., Einhorn, D., Tocchini-Valentini, G., Hrabe de Angelis, M., Rosenthal, N. & CASIMIR Rome Meeting participants. (2009). Post-publication sharing of data and tools. Nature, 461, 171-173.
Science and Technology Council. (2007). The digital dilemma strategic issues in archiving and accessing digital motion picture materials. The Science and Technology Council of the Academy of Motion Picture Arts and Sciences. Hollywood, CA: Academy of Motion Picture Arts and Sciences.
SHAMAN. (2011). Automation of Preservation Management Policies. SHAMAN – WP3-D3.4 (Report). Seventh Framework Programme and European Union.
Soehner, C., Steeves, C., Ward, J. (2010, August). E-science and data support services: a study of ARL member institutions. Washington, D.C.: Association of Research Libraries. Retrieved November 18, 2010, from http://www.arl.org/bm~doc/escience_report2010.pdf
Solove, D.J. (2007). “I’ve got nothing to hide” and other misunderstandings of privacy. San Diego Law Review, 44, 745-772.
Speake, J. & Simpson, J. (2008). Oxford Dictionary of Proverbs. Oxford, UK: Oxford University Press.
Stewardship. (2012). ForestInfo.org. Dovetail Partners, Inc. Retrieved March 9, 2012, from http://bit.ly/zmNzy1
Stewardship. (2012). Free Merriam-Webster Dictionary. An Encyclopaedia Brittannica Company. Retrieved March 9, 2012, from http://www.merriam-webster.com/dictionary/stewardship
Sullivan, B. (2012, March 6). Govt. agencies, colleges demand applicants’ Facebook passwords. MSNBC. Retrieved March 7, 2012, from http://redtape.msnbc.msn.com/_news/2012/03/06/10585353-govt-agencies-colleges-demand-applicants-facebook-passwords
Swan, A. & Brown, S. (2008). The skills, role and career structure of data scientists and curators: an assessment of current practice and future needs report to JISC. Truro, UK: Key Perspectives, Ltd. Retrieved January 18, 2012, from http://www.jisc.ac.uk/publications/reports/2008/dataskillscareersfinalreport.aspx
Sweeney, L. (2002). K-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5), 557-570.
Tickletux. (2007). Did Bill Gates say the 640k line? Retrieved from http://imranontech.com/2007/02/20/did-bill-gates-say-the-640k-line/
Toronto International Data Release Workshop Authors. (2009). Prepublication data sharing. Nature, 461, 168-170.
UKRDS. (2008). UKRDS interim report UKRDS the UK research data service feasibility study (v0.1a.030708). London: Serco Ltd. Retrieved April 9, 2009, from http://www.ukrds.ac.uk/UKRDS%20SC%2010%20July%2008%20Item%205%20(2).doc
Van den Eynden, V., Corti, L., Woollard, M., Bishop, L. & Horton, L. (2011). Managing and Sharing Data: Best Practices for Researchers, 3rd edition. University of Essex: UK Data Archive. Retrieved January 5, 2012, from http://www.data-archive.ac.uk/media/2894/managingsharing.pdf
Walters, T. & Skinner, K. (2011). New roles for new times: digital curation for preservation. Report prepared for the Association of Research Libraries. Washington, D.C.: Association of Research Libraries. Retrieved April 2, 2011, from http://www.arl.org/bm~doc/nrnt_digital_curation17mar11.pdf
Waters, D. and Garrett, J. (1996). Preserving Digital Information. Report of the Task Force on Archiving of Digital Information. Washington, DC: CLIR, May 1996.
The following tables (figures) of organizations, individuals, roles, sectors, and types involved with data management are from the Interagency Working Group on Digital Data (2009).
- Entities by Role
- Entities by Individual
- Entities by Sector
- Individuals by Role
- Individuals by Life Cycle Phase/Function
- Entities by Life Cycle Phase/Function