The Data Standards Landscape for Clinical Trial Data
"Open data standards are of major importance from an archiving and digital preservation perspective."
Clinical trial data includes documentation of how a trial has been conducted, as captured in an electronic Trial Master File (eTMF) and an Investigator Site File (ISF). But the scope of clinical trial data is broader than that, as the definition of essential records in ICH E6 (R3) shows.
Under ICH E6 (R3), essential records include “data and relevant metadata in the data acquisition tools” and “source records”. The guideline describes source records as follows: “This may include trial participants' medical/health records/notes/charts; data provided/entered by trial participants (e.g., electronic patient-reported outcome (ePROs)); healthcare providers’ records from pharmacies, laboratories and other facilities involved in the clinical trial; and data from automated instruments, such as wearables and sensors.”
This aligns with other definitions of source data, for example as described in the EMA 2023 guidelines on computerised systems and electronic data in clinical trials where source data includes “hospital records, clinical and office charts, laboratory notes. Other examples are emails, spreadsheets, audio and/or video files, images, and tables in databases”.
Thankfully, there is a range of initiatives to describe this data in a machine-readable and interoperable form. Importantly, this takes us away from a paper-based mindset of using documents such as PDF files to record and transfer data. Instead, data standards take us towards a world of structured data that is well defined, searchable, reusable and transferable. Organisations working in this area include TransCelerate, CDISC, HL7 (through FHIR) and others.
As explored later, open data standards are of major importance and benefit from an archiving and digital preservation perspective. They should be embraced.
In some areas of clinical trial data, data standards have been in place for some time and are a regulatory requirement. An example is Study Data, which is collected using CDISC CDASH and submitted to regulators in SDTM and ADaM formats. These build on a set of underpinning formats such as ODM, Dataset-XML and Define-XML. Submissions are made using eCTD. Much of this is nicely illustrated in the CDISC standards diagram below.
Clinical trial data refers to all the information for a given clinical trial that is needed to “permit and contribute to the evaluation of the conduct of a trial and the reliability of the results produced” [ICH E6 (R3)].
More recent developments include the incorporation of the TMF Reference Model (TMF RM) into CDISC. Plans are now under way for V4 of the TMF RM, which will be consistent with the approach of the other CDISC standards. This includes the use of Controlled Terminology (CT) to properly define all the elements of the RM, a revised Data Exchange format that builds on the earlier Exchange Mechanism Standard (EMS), and the ability to access the TMF RM directly from software applications through the CDISC Library API.
Despite existing for several years and being targeted directly at interoperability, the eTMF EMS has seen limited adoption, and TMF has remained largely document focussed. This is likely to change now that TMF is part of CDISC. For example, regulators may start to request that TMFs be created in a CDISC standardised data format. Given the history of other CDISC standards, this could well be a case of when rather than if. It would give a major boost to making TMF more data driven and machine readable, which is the path that other CDISC standards have followed.
Another area of development worth noting is the set of data standards emerging around ICH M11, which aims to standardise the description of clinical trial protocols. A clinical protocol describes the processes and procedures directing the conduct and analysis of a clinical trial of medicinal product(s) in humans. ICH M11 aims to support consistency across sponsors and the electronic exchange of protocol information. The FDA are adopting the Technical Specification for ICH M11. CDISC, TransCelerate and HL7 FHIR Vulcan are working together on how ICH M11 can be supported by fully machine-readable data standards such as USDM (Unified Study Definitions Model), which in turn is part of the CDISC Digital Data Flow (DDF) initiative in partnership with TransCelerate. This is a complex landscape of collaborating organisations, joint standards development, and alignment with regulators and the ICH. A good summary was presented at the 2024 CDISC Interchange in Berlin by Peter Van Reusel, Chief Standards Officer, CDISC.
One of the important objectives of DDF and USDM is to allow data models and protocol specifications to be passed between systems, so that those systems become automatically aligned with each other and with a Study Definition. The concept is that a Study Definition is passed to downstream systems such as EDC, eTMF and CTMS, which read the definition and configure themselves to support the required protocol steps and data model. Data archiving and preservation systems are of course ‘downstream’ and will also benefit from being configurable using Study Definitions. That would allow them to be configured automatically to support all the artefacts of a Study (not just the TMF, but also study data, source data and study protocol definitions) and to do so for both the Site and the Sponsor side of retention. In other words, they would be configured to support whole study archiving.
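The idea of a downstream system configuring itself from a Study Definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the JSON layout, the field names (`study_id`, `activities`, `produces`) and the record categories are invented for this example and are not the USDM schema.

```python
import json

# Hypothetical, heavily simplified study definition.
# NOT the USDM schema: all field names and values are invented for illustration.
STUDY_DEFINITION_JSON = """
{
  "study_id": "STUDY-001",
  "activities": [
    {"name": "Informed consent", "produces": ["tmf_document"]},
    {"name": "Vital signs", "produces": ["study_data", "source_data"]},
    {"name": "MRI scan", "produces": ["source_data"]}
  ]
}
"""


def build_archive_config(definition: dict) -> dict:
    """Derive which record categories an archive must accept, and for which activities."""
    categories: dict[str, list[str]] = {}
    for activity in definition["activities"]:
        for category in activity["produces"]:
            # Group activity names under each category they produce records for.
            categories.setdefault(category, []).append(activity["name"])
    return {"study_id": definition["study_id"], "categories": categories}


definition = json.loads(STUDY_DEFINITION_JSON)
config = build_archive_config(definition)

for category, activities in sorted(config["categories"].items()):
    print(f"{category}: {', '.join(activities)}")
```

The point of the sketch is the direction of flow: the archive does not hard-code what a study contains, it derives its expected record categories from the definition it is given.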
Developments to the TMF RM are also being aligned with USDM and ICH M11, for example so that the artefacts collected in a TMF match the protocol steps that create them. This will likely see a common Controlled Terminology across USDM, ICH M11 and the TMF RM.
Open data standards for source data, for example from instruments and diagnostic medical devices, are manifold and often domain specific. Examples include DICOM for medical imaging, and BAM and VCF for genomics data. These standards are open and facilitate interoperability of data between systems. The Allotrope Foundation is working on further open data standards, with data models available for 50+ areas such as microscopy, spectroscopy, chromatography and cell analysis. It should be noted that adoption of such data standards takes time and requires support from manufacturers and suppliers.
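As a small illustration of why open, documented formats help interoperability, the sketch below reads a VCF fragment using only the Python standard library. It relies on the basic VCF layout (`##` meta-information lines, one `#CHROM` column header line, then tab-separated records); the variant data is invented, and a production system would use a dedicated, validated parser rather than this minimal reader.

```python
import io

# Invented two-variant VCF fragment; only the eight fixed columns are included.
VCF_TEXT = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t12345\t.\tA\tG\t50\tPASS\tDP=20\n"
    "2\t67890\t.\tC\tT,G\t99\tPASS\tDP=35\n"
)


def read_vcf(handle):
    """Return the VCF records as a list of dicts keyed by column name."""
    columns = None
    records = []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        if line.startswith("#"):
            columns = line[1:].split("\t")  # column header line
            continue
        if columns is None:
            raise ValueError("data line found before the #CHROM header line")
        record = dict(zip(columns, line.split("\t")))
        record["POS"] = int(record["POS"])        # position is an integer
        record["ALT"] = record["ALT"].split(",")  # comma-separated alternate alleles
        records.append(record)
    return records


records = read_vcf(io.StringIO(VCF_TEXT))
for record in records:
    print(record["CHROM"], record["POS"], record["REF"], record["ALT"])
```

Because the format is openly specified, any system that receives the file, including an archive years later, can recover the same structured records without the originating software.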
The initiatives above will take the community a long way towards a unified, consistent and data-driven approach to all aspects of a Study. This includes machine-readable data standards across the board, for example for Study Protocols (e.g. USDM), Study Data (e.g. CDASH, SDTM, ADaM) and associated Study Records (e.g. TMF).