Future Proofing Civic Data: Exploring the Challenges of Preserving Open Civic Data for the Long Term
Permanent link to this record: http://hdl.handle.net/20.500.12613/8173
Abstract: Temple University Libraries received a Knight Foundation grant, “Knight News Challenge on Libraries,” to lead an exploratory research project, Future Proofing Civic Data, investigating the challenges of long-term preservation for open civic datasets. The project team interviewed over a dozen stakeholders about their use cases and needs and looked at several open civic data initiatives in Philadelphia, Boston, San Francisco, and the Pittsburgh area to compare practices and examine real-life examples. We found that there is still much to do in the community to develop systematic best practices for the long-term preservation of datasets. In this white paper we explore 10 important factors that need to be taken into consideration to tackle this challenge successfully. We also look at how libraries could take the lead, or at least participate in the process. First, awareness of existing digital preservation frameworks is key when putting in place a data curation and preservation plan and developing relevant workflows and budgets. The library community has developed strong “best practices” in that realm, and models such as OAIS (Open Archival Information System), TRAC (Trustworthy Repositories Audit & Certification), and LOCKSS (Lots of Copies Keep Stuff Safe) provide robust guidelines that apply to all types of digital materials. We then looked at the selection process for deciding what datasets should be archived. Selection decisions are made based on various criteria, such as the known or expected users of datasets and their needs, what datasets can be archived and made available more easily, and what datasets represent a city or state’s activities more comprehensively. Among other things, single file objects such as CSV or KML files are much easier to archive than more complex formats or API-mediated content.
Next we considered concerns related to the description of datasets, examining current metadata practices from a number of open civic data initiatives, and gave suggestions on possible improvements. We then turned our attention to the notion of dataset reliability and authenticity, that is, how do users know that an archived dataset has the same content as the original and can be trusted? We found that datasets require careful stewardship at several levels to remain complete and reliable over time. The loss of reliability or authenticity could be due to a multitude of unintentional causes, or derive from a more intentional temptation to “rewrite history” by one of the parties involved. Versioning is another important factor, as datasets may evolve over time and several versions might be generated for a single dataset, either through regularly scheduled harvests or occasional data restructuring. Versioning may require the development of policies and procedures to ensure that the collection of successive versions is done in an orderly and systematic manner, and that change requests and deletions are handled uniformly. To enable the successful discovery of archived datasets, we need to answer two questions: (1) how will users searching for open civic data know that preserved historical copies of the data exist, and (2) how can they distinguish between the current active copy of a dataset and the archived versions? The software interface must facilitate seamless navigation between the active copy and the archived versions. We looked at intellectual property rights and other legal issues, and the potential need to develop agreements between data-creating agencies and archiving agencies in order to clarify the rights to preserve and provide long-term access to a dataset. The organizational model and governance structure chosen for the overall civic data initiative also have consequences for the ability to ensure successful long-term preservation functions.
In particular, involving a multiplicity of partners and stakeholders is the best way to ensure that diverse voices are heard and that the project is run with a maximal level of transparency. Furthermore, open communication flows are essential to ensure that preservation-related policies are applied optimally. This includes communication among the archiving agencies, the civic data creators, and the civic data portal managers. One more important notion when looking at digital preservation endeavors is that technology is only a small part of successful long-term digital preservation, and thinking proactively about organizational commitment and economic sustainability is essential. Finally, we described two prototypes that we developed to explore concrete technical solutions to archive datasets, using OpenDataPhilly as a testbed. Archive-It, or Prototype 1, uses the Internet Archive’s web crawling platform, which takes scheduled virtual captures of websites over time. Dat, or Prototype 2, is a secure and distributed package manager that does versioning of datasets locally, or shares and syncs dataset versions through a peer-to-peer network. Each prototype has pros and cons. We believe that there are clear and advantageous opportunities for libraries, both academic and public, to take a role in supporting the long-term preservation of open civic data, especially given libraries’ pre-existing expertise and collection practices. Such a role comports with libraries’ commitments to serve their users, provide research resources, and provide access to information. Furthermore, libraries can get involved meaningfully in open civic data initiatives in other capacities, such as helping with outreach and community engagement, developing metadata standards, and providing search optimization techniques for discovery.
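As an illustration of the authenticity and versioning concerns raised in the abstract, the following is a minimal sketch of a fixity check: each harvested version of a dataset is recorded with a SHA-256 checksum, and a later copy can be compared against that record. It is not taken from the white paper or from either prototype; the function names, the manifest layout, and the sample CSV content are all hypothetical.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 checksum of a dataset's bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def record_version(manifest: list, label: str, data: bytes) -> dict:
    """Append a checksum entry for one harvested version (hypothetical layout)."""
    entry = {"version": label, "sha256": sha256_of(data), "size": len(data)}
    manifest.append(entry)
    return entry


def verify(manifest: list, label: str, data: bytes) -> bool:
    """Check whether a copy matches the checksum recorded for that version."""
    for entry in manifest:
        if entry["version"] == label:
            return entry["sha256"] == sha256_of(data)
    raise KeyError(f"no recorded version named {label!r}")


# Hypothetical usage with made-up CSV content.
manifest = []
original = b"id,name\n1,Park\n"
record_version(manifest, "2017-01", original)
unchanged = verify(manifest, "2017-01", original)
altered = verify(manifest, "2017-01", b"id,name\n1,Plaza\n")
```

A checksum only shows that a copy is bit-for-bit identical to what was recorded. It does not address who controls the manifest or how change requests and deletions are handled, which remain matters of policy and governance.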
Citation: Temple University Libraries. (2017). Future Proofing Civic Data: Exploring the Challenges of Preserving Open Civic Data for the Long Term [White paper]. Temple University.
Except where otherwise noted, this item's license is described as Attribution-NonCommercial-ShareAlike CC BY-NC-SA