Curate Data Question
You have characterized the growing data deluge in scholarship well (though heavily focused towards the sciences). Unfortunately, there is likely to be much more data, today and in the future, than there is funding to permanently curate this data. What criteria should we use to decide what to keep, and for how long? How should we make decisions among data collections that all have legitimate and worthy claims to preservation and curation? What data should be privileged over other data, and why? Please speak to both the underlying theoretical and philosophical issues here and also to possible process and organizational/institutional approaches to making operational choices?
In addition, what should happen when data is condemned to oblivion, either through some decision making process at the dataset/data collection level, or because of larger scale events, such as the defunding of a repository?
Ward, J.H. (2012). Doctoral Comprehensive Exam No.5, Managing Data: the Data Deluge and the Implications for Data Stewardship. Unpublished, University of North Carolina at Chapel Hill. (pdf)
This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.
Note: All errors are mine. I have posted the question and the result “as-is”. The comprehensive exams are held as follows. You have five closed-book examinations five days in a row, with one exam given each day. You are mailed a question at a set time. Four hours later, you return your answer. If you pass, you pass. If not…well, then it depends. The student will need to have a very long talk with his or her advisor. I passed all of mine. - Jewel H. Ward, 18 December 2015
Curate Data Response
The dilemma that data stewards face, weighing the amount of data and its intrinsic value (or lack thereof) against available funding, is not a new problem. Librarians and archivists have faced this same problem for centuries. The solution is for data managers to develop a rigorous collection development policy and apply it to the data sets and data collections in question, while taking political issues into consideration. This is exactly what must be done when deciding what to cull and what to keep among objects in the physical world. The difference between preserving a digital data set and a physical data set (say, a paper notebook of temperatures at a given location at a given time of day), however, is that digital data must be migrated and refreshed over time. This means that data collection and preservation must occur almost at the time of creation, rather than decades or centuries later.
Therein lies the crux of the problem. It is easier to decide what to preserve out of what has survived at the end of someone’s career, for example, because an archivist will be able to determine whether this person’s scholarly contribution was significant enough to justify keeping various aspects of his or her scholarly work and personal effects. It is much more difficult to develop a collection “looking forward”. An archivist will not know if this person’s contributions will be important enough to warrant the cost of providing storage for decades of this person’s research, for example. However, even the movie industry has had to shift from the “save everything” policy of film-based movie making to culling material up front in order to save money on storage costs (Science and Technology Council, 2007). Therefore, data stewards will have to make some up-front decisions that involve both procedural and theoretical issues.
A data steward appraising a data set or collection for inclusion in a repository should use the standard toolkit for archivists (acquisition, appraisal, accession, etc.). Criteria for a digital data set collection development policy might include the following.
- Is the data set replaceable? For example, is it observational data, which cannot be replaced? Or, is it experimental data, which may be regenerated?
- How re-usable is the data? Does it have appropriate metadata and annotations, and is it of sufficient quality to support replicable research? Does it require special software or hardware to render and make readable and usable?
- If the data is re-usable, has it been re-used, by whom, and how often?
- If the data set were to be deleted because it is replaceable, how expensive would it be to re-gather this data? Would it be less expensive to store it?
- What type of research does the data support? Have the research results from the data set been highly cited or never cited? Has there been a high demand for the data set, or no demand at all?
- How expensive will it be to maintain the data set, and does the organization have the resources to maintain it so that it is accessible and re-usable? Does the data set require any special software, scripts, programs, or hardware to run?
- What national, international, institutional, local, domain, or other laws, policies, and regulations govern the disposition of this data set? For example, does this data set have to be maintained indefinitely, for 10 years, or can it be deleted as soon as a project has been completed?
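The store-or-regenerate question in the list above can be made concrete with a small cost comparison. The following is a minimal sketch, not a recommended costing model: the class name, field names, and all figures are hypothetical, and it ignores discounting, staff time, and the risk that regeneration fails.

```python
from dataclasses import dataclass


@dataclass
class DatasetCosts:
    """Hypothetical cost inputs for one data set (all figures are assumptions)."""
    name: str
    replaceable: bool            # False for observational data that cannot be re-gathered
    annual_storage_cost: float   # yearly cost to store, migrate, and refresh the data
    regeneration_cost: float     # one-time cost to re-run the experiment
    retention_years: int         # planning horizon set by the retention policy


def cheaper_to_store(d: DatasetCosts) -> bool:
    """Return True when keeping the data costs no more than regenerating it."""
    if not d.replaceable:
        # Irreplaceable data has no regeneration option, so storage is the only choice.
        return True
    total_storage = d.annual_storage_cost * d.retention_years
    return total_storage <= d.regeneration_cost


weather_log = DatasetCosts("station temperatures", False, 200.0, 0.0, 10)
simulation = DatasetCosts("simulation output", True, 500.0, 1500.0, 10)

print(cheaper_to_store(weather_log))  # True: observational, cannot be replaced
print(cheaper_to_store(simulation))   # False: 5000 to store versus 1500 to re-run
```

A comparison like this answers only the cost criterion; the re-use, demand, and legal criteria in the list still have to be weighed separately.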
As part of establishing a collection development policy, some philosophical and methodological considerations must be made. For example, if a data set consists of observational data, which is not replaceable, then saving that type of data should take priority over saving experimental data, which is presumably replaceable. An assistant professor’s basic data sets that he or she has used or re-used for a course have a certain level of value, but it is likely that a Nobel Prize winner’s data set will take priority over them. One might also assume that course data sets are likely to be replaceable, whereas a Nobel Prize winner’s data set may not be. However, if the course data sets belong to the Nobel Prize winner, then one might make the argument to save both, assuming funding is available to maintain both data sets. If a data steward can choose only one, however, then the data set behind the research results that brought the researcher a Nobel Prize is more likely to be of long-term value.
However, when deciding which collections “have legitimate and worthy claims to preservation and curation” and which do not, and what data should be privileged over other data, the answer, again, lies in the creation of a rigorous collection development policy for data sets. Some of the considerations to be made, in addition to the ones listed above, might be one or more of the following. The considerations below are from the perspective of an Information and Library Science (ILS) trained practitioner working in an academic library, not from the perspective of a domain scientist seeking to preserve his or her data set(s).
- Does this data set support the mission of the repository? That is, a data manager should not be adding Physics data sets to a Social Science data archive, unless there is a legitimate reason for doing so.
- What will this data set add to or detract from the existing collection? What is the answer to the “so-what” factor? (Why should the data steward add this data set to the digital repository?)
- What is the quality of the data itself? What is the value of the data set, both current and projected? How replaceable is the data, if it isn’t saved?
- What is the quality of the metadata and any additional annotations or included information?
- How much time and effort will it take to add this data set to the current repository? Is it worth the effort, if it will take a lot of effort to clean up the data to make it re-usable?
- What are the Intellectual Property and Copyright issues associated with this data set? Are there any other legal issues to consider? Will the repository own the “rights” to the data, will the researcher(s), or is it public domain?
When this author worked as a Program Manager of a digital archive at the University of Southern California, we had a standard set of questions and an entire matrix we would use to determine whether or not a collection should be added to the digital archive. The above criteria reflect some of the collection development decisions that we made. If we thought a collection was worthwhile, but that it would not be a good fit (whatever the reason), it was not uncommon to refer the potential donor to another organization that might be a better fit for the content. That should also be the case with regards to most data sets.
Perhaps data center X or library Y cannot archive certain digital material, but the data steward may point the donor to a more appropriate archive. In many cases, however, data sets that are valuable to researchers will be lost, regardless of the availability of funding. This also happens with information in the physical world. There is a point at which data stewards have to accept, and will continue to have to accept, that not all data can be saved, and that there will be some loss.
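The appraisal matrix and referral practice described above might be sketched as a weighted score. This is an illustration only and does not reproduce the matrix actually used at the University of Southern California: the criteria names, weights, thresholds, and ratings are all hypothetical and would come from a repository's own collection development policy.

```python
# Hypothetical criteria and weights; a real policy would define its own.
WEIGHTS = {
    "data_quality": 2,
    "metadata_quality": 2,
    "irreplaceable": 3,
    "rights_clear": 2,
    "low_ingest_effort": 1,
}


def appraise(ratings: dict, worth_threshold: float = 0.6, fit_threshold: float = 0.5) -> str:
    """Return 'accept', 'refer', or 'decline' for ratings between 0.0 and 1.0.

    `ratings` must contain every key in WEIGHTS plus 'mission_fit'.
    """
    missing = (set(WEIGHTS) | {"mission_fit"}) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings: {sorted(missing)}")
    # Weighted average of the criteria that describe the data set itself.
    worth = sum(WEIGHTS[c] * ratings[c] for c in WEIGHTS) / sum(WEIGHTS.values())
    if worth < worth_threshold:
        return "decline"
    # Worthwhile but a poor fit for this repository: point the donor elsewhere.
    if ratings["mission_fit"] < fit_threshold:
        return "refer"
    return "accept"


physics_offer = {  # offered to a social science archive
    "mission_fit": 0.1, "data_quality": 0.9, "metadata_quality": 0.8,
    "irreplaceable": 1.0, "rights_clear": 1.0, "low_ingest_effort": 0.5,
}
survey_offer = {
    "mission_fit": 0.9, "data_quality": 0.8, "metadata_quality": 0.7,
    "irreplaceable": 0.5, "rights_clear": 1.0, "low_ingest_effort": 0.6,
}

print(appraise(physics_offer))  # refer
print(appraise(survey_offer))   # accept
```

Keeping mission fit out of the weighted average is a design choice in this sketch: it lets a high-value collection that does not belong in this repository come out as a referral rather than a rejection.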
Other criteria to consider when deciding which data to save and which not to save involve political issues. The points above refer to most collection development. However, some collection development decisions have nothing to do with logic and a rigorous collection development policy, and everything to do with politics, whether local, institutional, or national and international. It will be the case that the following considerations will be made, either consciously and explicitly, or unconsciously and implicitly, when deciding which data sets to collect.
- Does the data set belong to a major donor, alumnus, or other affiliate of the organization for which the data steward works?
- Does the data set belong to anyone who is “buddies” with anyone in the chain of command above the data steward?
- Does the data set belong to anyone with power that either the data steward or others within the organization (large or small) that maintains the repository would like to please or make feel important by archiving his or her data sets?
- Are there any other political issues not mentioned above that the data steward ought to consider when deciding whether to accept or reject a particular data set, regardless of the logic of including the data set in the repository, and regardless of the personal opinion of the data steward as to the wisdom of including said data set?
In theory, the decision as to whether or not one should “save” one data set over another ought to be made based on a well-thought-out, rigorous, and logical collection development policy based on standard archival processes applied to the digital realm. In most cases, the decision to save or not save a particular data set will be made based on a standard set of policies. In reality, however, politics may play a role in deciding what is saved or not saved. Those politics may exist within the department, within the institution (especially large ones), within the domain, or elsewhere. It is unfortunate that because of this very human tendency, some valuable data sets will be lost, and some not very useful data sets will be carefully curated. One can hope, however, that a responsible data steward will be able to point a potential donor, or send a potentially valuable data set, to another institution or repository whose administrators can provide a home for the valuable data set.
As for “what should happen when data is condemned to oblivion, either through some decision making process at the dataset/data collection level, or because of larger scale events, such as the defunding of a repository”, it depends. If the policy for the data set is that the data is to be deleted at X time, then it should be deleted. If the data has intrinsic value but the repository can no longer maintain it at the collection level, or because a repository is defunded, then the current repository manager may be able to locate a new “home” elsewhere for the data collection by contacting a network of colleagues. Another option is to contact non-ILS practitioners who may have an interest in the data and may be able to maintain it out of their own funding stream. If none of the above applies to the situation at hand, then a practitioner may simply have to delete the data, regardless of their personal opinion. This is also often the case in the physical world, with books, paper archives, photographs, etc.
Librarians often have to cull their collections. They often cull books or move them to off-site storage. If, say, 20% of the books haven’t been checked out in 10 years, then perhaps those books should be culled or moved to off-site storage before the library pays to add more square footage and shelving. The challenge with data sets is that, unlike books, there are not likely to be more copies available for users. However, books also go out of print, so that, too, has an analogy in the physical world. Data stewards may have to make culling data collections part of their repository maintenance policy, just as librarians have had to do with books. It doesn’t mean that the books or data sets that are culled have no intrinsic value, or that only the most popular and most used books and data sets should be stored, but there has to be a realistic collection policy in place. If something isn’t used, then perhaps a new home should be found for it where it will be used, or the data set can be safely culled (deleted).
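The usage-based review described above can be sketched as a simple filter over last-use dates. This is a minimal illustration under stated assumptions: the data set names and dates are invented, the 10-year window comes from the book example, and being flagged here means "review for rehoming or culling", not automatic deletion.

```python
from datetime import date
from typing import Dict, List, Optional


def years_before(day: date, years: int) -> date:
    """Return the same calendar day `years` earlier (Feb 29 falls back to Feb 28)."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:
        return day.replace(year=day.year - years, day=28)


def review_candidates(
    last_used: Dict[str, Optional[date]], today: date, idle_years: int = 10
) -> List[str]:
    """List data sets never used, or not used within the last `idle_years` years."""
    cutoff = years_before(today, idle_years)
    return sorted(
        name for name, used in last_used.items()
        if used is None or used < cutoff
    )


# Hypothetical usage log: data set name -> date of last download (None = never).
last_used = {
    "ds-a": date(2001, 5, 1),
    "ds-b": date(2011, 3, 9),
    "ds-c": None,
    "ds-d": date(2009, 12, 31),
    "ds-e": date(2002, 1, 1),
}

candidates = review_candidates(last_used, today=date(2012, 6, 1))
idle_share = len(candidates) / len(last_used)

print(candidates)  # ['ds-a', 'ds-c', 'ds-e']
print(idle_share)  # 0.6
```

Because a data set, unlike most books, may have no other copy, a list like this is better treated as input to the appraisal criteria than as a deletion list.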
In conclusion, data stewards face, and will continue to face, the same type of collection development decisions that “traditional” librarians and archivists have faced for centuries. Funding and resources are limited; therefore, not all information can or will be saved. The decisions regarding what to keep versus what to delete should be based on a well-thought-out and rigorous collection development plan, but the reality is that politics may supersede the best-laid plans and policies. Unlike in the physical world, however, the decisions to keep or delete data items must be made up front, not after the fact. This does require a shift in thinking, but even so, it still requires a sound methodological approach based on established collection development and archival principles.