Relevance to archiving and digital preservation
Open standards for structured data are hugely beneficial when it comes to archiving and digital preservation.
Open data standards last longer.
Data in open formats tends to remain usable for longer before it becomes technically obsolete. Widely used formats tend to have longer-term support in software applications; for example, they tend to survive longer as supported import or export options.
Open data standards remove vendor lock-in.
Open standards increase the range of software applications that can understand the data. This in turn helps ensure data remains readable and usable for longer and at lower risk. For example, if an individual vendor stops supporting a format, other vendors are still available. Likewise, open standards are typically not encumbered by licensing or other restrictions that vendors can use to prevent or control their use.
Open data standards support data migrations.
There tend to be more migration pathways from old data formats to new ones if the formats are open and widely adopted. It is also easier to understand an open data format and to create new migration processes or tools if needed.
Open data standards support Data Integrity.
It is easier to check conformance of data against open specifications. It is easier to check data quality, for example by checking that data is complete and correct against controlled terminologies. It is easier to cross-reference data and check it from different angles, for example that study records (e.g. in the TMF) match the study protocol and that study data is supported by corresponding source records.
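As a minimal sketch of the kind of quality check described above, the following Python example tests a small set of records for completeness and for conformance to a controlled terminology. The field names, the allowed values and the sample data are all hypothetical and chosen for illustration only; they are not taken from any published standard or codelist.

```python
import csv
import io

# Hypothetical controlled terminology: allowed values per field (illustrative only).
CONTROLLED_TERMS = {
    "SEX": {"M", "F", "U"},
    "COUNTRY": {"GBR", "USA", "DEU"},
}

# Hypothetical list of fields that must be populated in every record.
REQUIRED_FIELDS = ["SUBJECT_ID", "SEX", "COUNTRY"]


def check_records(csv_text):
    """Return a list of (row_number, field, problem) tuples for the given CSV text."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        for field in REQUIRED_FIELDS:
            value = (row.get(field) or "").strip()
            if not value:
                # Completeness check: the field is absent or empty.
                problems.append((row_number, field, "missing value"))
            elif field in CONTROLLED_TERMS and value not in CONTROLLED_TERMS[field]:
                # Correctness check: the value is not an allowed term.
                problems.append(
                    (row_number, field, f"'{value}' not in controlled terminology")
                )
    return problems


# Made-up sample data with one invalid term and one missing value.
SAMPLE = """SUBJECT_ID,SEX,COUNTRY
001,M,GBR
002,X,USA
003,F,
"""

for problem in check_records(SAMPLE):
    print(problem)
```

Because the data is structured and the permitted values are published, a check like this can be written and run by anyone, independently of the system that produced the data.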
Open data standards support access and reuse.
It is easier to find data if it follows a standardised metadata model that makes clear what the data contains and enables searching using standardised terminologies. It is also easier to reuse data, because there will typically be a wider range of software applications that understand it.
Open data standards retain data functionality.
Structured data can be filtered, searched, transformed, visualised and interacted with in a way that helps maintain the functionality the data had when it was in a live system. This contrasts with storing data in PDF documents, which loses this functionality and makes the data static and harder to use.
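To illustrate the point, here is a small Python sketch that filters and summarises structured records held as JSON. The record layout, field names and values are invented for the example and do not represent any particular standard; the same operations would require manual transcription if the data were only available as a static PDF.

```python
import json

# Made-up structured records (illustrative layout, not a real standard).
RECORDS_JSON = """[
  {"subject": "001", "visit": "WEEK 1", "test": "HR", "value": 72},
  {"subject": "001", "visit": "WEEK 2", "test": "HR", "value": 78},
  {"subject": "002", "visit": "WEEK 1", "test": "HR", "value": 66},
  {"subject": "002", "visit": "WEEK 1", "test": "SYSBP", "value": 120}
]"""

records = json.loads(RECORDS_JSON)


def filter_records(rows, **criteria):
    """Return the rows whose fields equal every supplied criterion."""
    return [
        row for row in rows
        if all(row.get(key) == wanted for key, wanted in criteria.items())
    ]


def mean_value(rows):
    """Return the mean of the 'value' field, or None if there are no rows."""
    values = [row["value"] for row in rows]
    return sum(values) / len(values) if values else None


# Search: all heart-rate results recorded at the first visit.
week1_hr = filter_records(records, test="HR", visit="WEEK 1")

# Transform: summarise the filtered results.
print(len(week1_hr), mean_value(week1_hr))
```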
Open standards are well-documented and transparent.
This means that an organisation can more easily understand how to use its data independently of vendors. The organisation can contract third parties to carry out data migrations or transformations, and those third parties can use the open standards to understand the data and how it should be processed.
These benefits are aligned with digital preservation good practice, for example guidelines on choosing long-term data formats. A good example is the Library of Congress (LoC) Sustainability Factors, which assess data formats against criteria such as disclosure, adoption, transparency, self-documentation, external dependencies, patents and technical protection mechanisms.
Likewise, the factors above are aligned with a risk-based approach to digital preservation, for example as used by the National Archives and Records Administration (NARA) in their file format risk assessment matrix and preservation action plans. Similar to the LoC, NARA have a set of criteria used to assess the risk for file formats. These include disclosure, adoption, transparency, self-documentation, external dependencies, patents, encryption and rights management, and format age.
Open data standards, for example those from CDISC, stack up well against the LoC and NARA criteria. Whilst the main application of open data standards such as those from CDISC is for the ‘live’ part of the clinical data lifecycle, there are considerable benefits for the ‘archive’ stage of the data lifecycle too.
By using open data standards when archiving and preserving data, an organisation can gain multiple benefits. These include:
- More vendor independence, such as not being forced to archive data within a live system because that’s the only software that understands the data.
- Higher levels of Data Integrity assurance, such as better QC when data is exported from a live system and periodic checks when data is within an archive.
- Less frequent data migrations, which in turn reduces cost and the risk of Data Integrity loss. When data migrations do occur, using open data formats makes them quicker and smoother.
- Easier access and reuse, including inspection readiness. This includes the ability to interact with dynamic data in a way that meets regulatory requirements.
- Whole-study archiving, where all aspects of a study are consolidated into a single archive and linked together. This helps an organisation know that it has retained everything it should, and to find what it needs (protocols, data, records) quickly and easily.
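The periodic checks mentioned in the list above could take several forms. One common approach, assumed here rather than prescribed by the text, is a fixity check: record a checksum for every file when it enters the archive, then recompute and compare the checksums at intervals. The sketch below uses SHA-256 from the Python standard library; the function names and the manifest layout are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(archive_dir):
    """Map each file's path (relative to the archive root) to its checksum."""
    root = Path(archive_dir)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(archive_dir, manifest):
    """Compare the archive's current state against a stored manifest."""
    current = build_manifest(archive_dir)
    return {
        # Files recorded at ingest that are no longer present.
        "missing": sorted(set(manifest) - set(current)),
        # Files still present whose content has changed.
        "changed": sorted(
            name for name in manifest
            if name in current and current[name] != manifest[name]
        ),
        # Files present now that were not recorded at ingest.
        "unexpected": sorted(set(current) - set(manifest)),
    }
```

In use, `build_manifest` would be run once when data is exported from the live system into the archive, and `verify_manifest` would be run on a schedule; three empty lists in the result indicate that nothing has been lost, altered or added.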