Long-Term Preservation of High-Value Digital Content
Friday, March 15, 2019
Posted by: Craig Van Dyck
It’s 2040 and you want to read a decades-old book that analyzes Finnegans Wake. You can’t find the text of the book on the Web, and it’s not available on services like iTunes or Amazon. Libraries and bookstores have no print copies. There is no print-on-demand option available. The book has effectively disappeared.
This is the scenario that long-term digital preservation protects against.
Digital versions of content are often now the primary medium that users rely on, more than print. When print was the primary version, the community safely assumed that there were multiple copies of the content around the world, usually printed on long-lasting acid-free paper and often available for long periods in major libraries.
Now, with digital content in the forefront, libraries often do not own a digital copy. Users access the content on platforms, such as Apple, Amazon, Ebsco, Highwire, or the publisher’s own platform. Who, then, is responsible for safeguarding the long-term survival of the digital content and access to it?
Publishers themselves are not necessarily considered to be reliable long-term protectors of digital content. A publisher might lose interest in the content as its marketability declines. Or a publisher might go out of business or be combined with another publisher. And publishers normally do not have robust long-term digital preservation practices in place.
This issue is of high importance in scholarly publishing because scholarly research has a long shelf life. Scholarly content tends to be of high value and costly, and digital versions have strongly supplanted print as the primary resource for end users.
As a result, librarians at universities have lobbied publishers to participate in trusted third-party long-term digital preservation systems. This article reviews the state of play for digital preservation of high-value content and raises open issues that the community confronts.
It Is Not Only About Scientific Journals
Due to the importance that academic libraries place on digital preservation, scholarly journal publishers have stepped up by depositing their content into third-party systems such as CLOCKSS. There is excellent coverage for journals published by large and medium-sized publishers. However, there is a “long tail” of very small publishers (usually “Open Access”) who have not gotten the message. There is an ongoing effort to encourage these small publishers to participate in a preservation system.
When it comes to books, the picture is different. Even among scholarly book publishers (who often also publish journals), the coverage is not yet as strong as it should be. And there is little participation in digital preservation at more general-interest publishers. For example, book publishers for authors such as Malcolm Gladwell, magazines like The New Yorker, and newspapers like The Washington Post and The New York Times have not joined preservation systems.
The primary impetus for long-term digital preservation has come from libraries. As a result, for content whose market is strongly library-focused – like scholarly journals – preservation has become ubiquitous. But for other kinds of content, for which libraries are not the primary market, preservation is lagging. This is a problem because readers rely on the ability to go back to older content. For digital reading to become a full experience, readers should be able to access content today that they have accessed in previous years.
Digital Preservation Is a Wide World
Digital preservation is also important for museums and national libraries that are digitizing physical collections. The digital versions of these cultural artifacts are a long-term record of human heritage.
In addition, many digital artifacts are created at universities for teaching purposes and research. In the arts, the use of born-digital works is increasing. In the commercial world, hospitals have electronic patient records and financial institutions have key business data.
Such collections of high-value digital information should be preserved for the long term.
Preservation Is More Than Storage
There is a broad spectrum of preservation approaches. The National Digital Stewardship Alliance publishes “Levels of Digital Preservation,” which shows four levels of preservation for each of five key categories.
Some people think that if they have more than one copy they are in good shape. But that is far from satisfactory from a preservation viewpoint. Multiple copies in geographically distributed locations under multiple different governances are needed to avoid a single point of failure. And there must be a way to validate that the preserved bits remain healthy and are not victims of bit rot or other mishaps. Finally, there must be excellent security to avoid inadvertent or hostile access to the content and damage to or unauthorized use of the data.
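To make the validation step concrete, here is a minimal sketch of a fixity check across several copies of one object. It assumes SHA-256 digests and a simple majority vote. The site names, the sample text, and the function `audit_and_repair` are hypothetical, and this is an illustration of the general idea rather than the protocol that any particular preservation system uses.

```python
import hashlib
from collections import Counter


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as a hex string."""
    return hashlib.sha256(data).hexdigest()


def audit_and_repair(replicas: dict[str, bytes]) -> list[str]:
    """Compare replica digests, overwrite minority copies, return repaired sites.

    `replicas` maps a (hypothetical) site name to that site's copy of one object.
    Raises ValueError when no digest is held by a strict majority of sites.
    """
    digests = {site: sha256_hex(data) for site, data in replicas.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(replicas):
        raise ValueError("no majority digest; manual review needed")

    # Any site holding the majority digest can serve as the repair source.
    good_site = next(s for s, d in digests.items() if d == winner)
    repaired = []
    for site in sorted(replicas):
        if digests[site] != winner:
            replicas[site] = replicas[good_site]
            repaired.append(site)
    return repaired


# Three copies of the same object; one has a single corrupted character.
copies = {
    "site_a": b"Chapter 1: riverrun",
    "site_b": b"Chapter 1: riverrun",
    "site_c": b"Chapter 1: riverrum",  # simulated bit rot
}
repaired_sites = audit_and_repair(copies)
print(repaired_sites)
```

With only two copies, a mismatch reveals that something is wrong but not which copy is damaged. That is one reason more than two independently held copies matter.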
Digital Preservation Is Not Free
There are costs related to preserving digital content for the long term. The beneficiaries of long-term preservation are authors and readers. In the scholarly community, the costs are borne by publishers and university libraries, who in effect act as proxies for authors and readers.
Digital content is becoming more dynamic and complex. Publications have developed as the Web has evolved. Authors are creating works that integrate multimedia, third-party features, and real-time functionality. Some of these features are a challenge for a preservation system. If the content is ever-changing, what should be preserved?
Preservation systems confront these challenges and collaborate with publishers, libraries, and authors to address them. The Mellon Foundation supports a few projects that aim to improve our understanding of the technical, operational, and financial aspects of these new requirements. Guidelines for authors are needed so they can be aware of what works versus what might be unsupportable.
The goal is a sustainable, robust, integrated, and heterogeneous environment that ensures that high-value content will not disappear and will always be available for users.
This article is brought to you through a partnership with Amnet, a technology-led provider of services and solutions, catering to the needs of businesses for content transformation, design, and accessibility. The points of view expressed are those of the author and do not necessarily represent the perspectives of Amnet or of BISG.
Craig Van Dyck is the Executive Director of CLOCKSS Archive, a leading long-term digital preservation system for scholarly literature. Before CLOCKSS, he was at Wiley from 1996 to 2015 as Vice President, Content Management, and at Springer-Verlag New York from 1986 to 1996 as Senior VP and Chief Operating Officer.
Craig served as Chairman of the Enabling Technologies Committee of the Association of American Publishers from 1995 to 1998, and was instrumental in the development of the Digital Object Identifier (DOI) system and of CrossRef. He has served on the Boards of Directors of the International DOI Foundation, CLOCKSS, ORCID, CrossRef, and the Society for Scholarly Publishing, and was a member of the Portico Advisory Committee. Craig’s portfolio has always included industry collaboration to improve the infrastructure of scholarly communications.
A not-for-profit 501(c)(3) organization governed by libraries and publishers, CLOCKSS has twelve copies of all of its preserved content at leading academic libraries around the world. CLOCKSS uses the open source LOCKSS software to ensure the validity of the bits, with its unique polling-and-repair capability. CLOCKSS preserves 32 million digital journal articles, 85,000 books, and an evolving collection of supplementary materials and metadata. The archive is growing by 4 to 5 million items each year.