DataSHIELD Conference 2024 - Programme
Current information
Provisional programme - subject to change
DAY 1 - Tuesday, September 24, 2024
9:00 - 12:00
DataSHIELD Beginners' Workshop (optional)
12:00 - 13:30
Registration & Lunch
13:30 - 13:50
Welcome
Alexander Effland (TRA1, University of Bonn) & Waldemar Kolanus (TRA3, University of Bonn)
13:50 - 14:35
Keynote talk: Lessons learned from the LifeCycle Project - EU Child Cohort Network
14:35 - 15:30
Longitudinal Studies
- Understanding social inequalities in childhood asthma: quantifying the mediating role of modifiable early-life risk factors in seven birth cohorts in the EU Child Cohort Network
Angela Pinot de Moira (Imperial College London)
- Estimating causal effects in the framework of potential outcomes and federated individual patient data
Bodil Svennblad (University of Uppsala)
For time-varying confounders, possibly affected by prior exposure, simply adjusting for them fails to give unbiased estimates of the causal effect. Under some untestable assumptions, the causal effect can instead be estimated with the (parametric) g-formula introduced by Robins.
We have developed a modified version of the Austin algorithm suitable for federated individual patient data. The algorithm is further generalized to allow for time-varying covariates, possibly affected by prior exposure, at specific time points during follow-up, e.g. a second cycle of questionnaires sent to study participants. The generalization includes landmark analysis and joint-distribution covariate models as well as simulations, and is shown to be a special case of the parametric g-formula.
Using the Swedish Mammography Cohort and the Cohort of Swedish Men, both part of the SIMPLER infrastructure, as an example, with covariates gathered through food questionnaires at baseline and after 12 years, we focus on the population-average absolute risk difference at a specific time point t. We will present the idea of the algorithm, explain why it can be viewed as a special case of the parametric g-formula, discuss the limitations introduced by the data being federated, and show how it can be implemented using functions already available in DataSHIELD.
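The standardisation step at the heart of the parametric g-formula can be illustrated with a minimal, single-time-point sketch (this is not the authors' federated algorithm; the simulated data and variable names are invented for illustration): fit an outcome model, predict each participant's risk under exposure and under no exposure, and average the difference.

```python
import numpy as np

def fit_logistic(X, y, iters=25):
    """Fit a logistic regression by Newton's method (IRLS)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        beta += np.linalg.solve((X * W[:, None]).T @ X, X.T @ (y - p))
    return beta

rng = np.random.default_rng(0)
n = 20000
L = rng.normal(size=n)                               # confounder
A = rng.binomial(1, 1.0 / (1.0 + np.exp(-0.5 * L)))  # exposure depends on L
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-1.0 + 1.0 * A + 0.8 * L))))

beta = fit_logistic(np.column_stack([np.ones(n), A, L]), Y)

# Standardisation: predict every participant's risk under A=1 and A=0, then average
risk1 = (1.0 / (1.0 + np.exp(-(beta[0] + beta[1] + beta[2] * L)))).mean()
risk0 = (1.0 / (1.0 + np.exp(-(beta[0] + beta[2] * L)))).mean()
rd = risk1 - risk0  # population-average absolute risk difference
```

In a federated setting, the per-site averages of the predicted risks would be combined instead of pooling individual predictions.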
- DataSHIELD in NFDI4Health: Updates and challenges
Sofia Maria Siampani (Max Delbrück Center, Berlin)
Objective: In NFDI4Health we aim to offer an infrastructure for federated analysis of multiple studies using DataSHIELD, enabling the sustainable reuse of research data from population-based studies.
Methods: To support the integration of DataSHIELD, we have focused on developing a robust and secure infrastructure. We are hosting a central R server, which aims to provide convenience for analysts and enhance security for Data Holding Organizations (DHOs). Additionally, we are expanding DataSHIELD nodes across Germany. We have also welcomed feedback from the DHOs and, combined with our own experiences, identified key areas for improvement to ensure the sustainability of DataSHIELD within NFDI4Health.
Results: The central R server is now ready for users, with analysis set to commence once data harmonization is completed. Currently, we have eight DataSHIELD nodes in place, with five more anticipated.
We encountered questions regarding the security of DataSHIELD. To address them, we are considering engaging a third-party security company to perform a security analysis and obtain certification.
Additionally, we recognized that DHOs might lack the funds or resources to support the installation and maintenance of the necessary infrastructure. To address this, we have developed reimbursement models for Opal/DataSHIELD setup and other essential processes such as metadata collection and data harmonization. These models incentivize data contributors and ensure the sustainability of the pipeline and infrastructure.
We also identified common needs with other German initiatives that use DataSHIELD and will collaborate with them to streamline efforts and avoid redundancy.
Conclusions: The DataSHIELD infrastructure has been successfully implemented in the NFDI4Health consortium. Moving forward, we aim to increase the number of DataSHIELD users utilizing the central R server and expand the number of DataSHIELD nodes. We will focus on advancing the service by addressing the challenges we encountered and incorporating feedback from stakeholders, ensuring the long-term success of DataSHIELD integration within the consortium framework.
15:30 - 16:00
Coffee break
16:00 - 16:45
Keynote talk: Swarm Learning in medical data analysis
16:45 - 17:25
DataSHIELD software development I
- Software demonstration: ds-tidyverse
Tim Cadman (University Medical Center Groningen)
- Adopting the Stats Barn framework in the DataSHIELD development lifecycle – a pathway for DataSHIELD package certification
Becca Wilson (University of Liverpool)
To date, DataSHIELD developers have used a variety of methods to describe and disseminate the statistical disclosure control (SDC) methodologies within their functions:
• Describing their SDC methods within their software documentation
• Including their package on the disclosure checks description page on the DataSHIELD wiki https://wiki.datashield.org/statdev/disclosure-checks
• Evidencing them through the package's software validation tests
I propose converging on new developments in the field of statistical disclosure control by implementing the ‘stats barn’ conceptual framework in the DataSHIELD development lifecycle. The stats barn framework defines the risk and minimum output checking requirements for categories of statistical functionality [2], based on current best practice of SDC as deployed in manual output checking.
Adoption of the framework in DataSHIELD will facilitate:
1. a consistent and scalable process by which the automated disclosure checks and SDC methods within DataSHIELD packages can be described at function level
2. a definition of the minimum software tests developers will be required to provide to demonstrate the correct application of automated disclosure control and output checks within a DataSHIELD package
Combined, these will provide a sustainable pathway towards a formal DataSHIELD package certification that evidences the extent to which the automated checks in a DataSHIELD package align with, exceed, or fall short of best practice in manual output checking.
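As a hedged illustration of what a function-level automated disclosure check might look like (the threshold and function names below are hypothetical, not DataSHIELD's actual implementation), consider a minimal cell-count rule of the kind applied in manual output checking:

```python
MIN_CELL_COUNT = 5  # hypothetical threshold; real deployments configure this server-side

def check_table(counts, min_cell=MIN_CELL_COUNT):
    """Release a frequency table only if every cell meets the minimum count."""
    if any(c < min_cell for c in counts.values()):
        raise ValueError("Output blocked: a cell falls below the minimum count")
    return counts

released = check_table({"exposed": 120, "unexposed": 98})
```

A certification process as proposed above would, among other things, require tests demonstrating that such checks fire on every disclosive output path of a package.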
[1] Desai, T., Ritchie, F., & Welpton, R. (2016). Five Safes: Designing data access for research. http://www1.uwe.ac.uk/bl/research/bristoleconomicanalysis/economicsworkingpapers/economicspapers2016.aspx
[2] Ritchie, F., Tilbrook, A., Green, E., White, P., Derrick, B., & Kendall, C. (2023). The SACRO guide to statistical output checking (Version 1). Zenodo. https://doi.org/10.5281/zenodo.10282526
17:25 - 18:00
Special Feature: A dinosaur’s eye view of the DataSHIELD project
from 18:00
Welcome Reception
DAY 2 - Wednesday, September 25, 2024
9:00 - 9:45
Keynote talk: An Industry Perspective of Federated Analysis as an Innovative Approach for Accessing Real World Data
9:45 - 10:25
Federated analysis of healthcare data
- Practical guidance to interpretation of federated real world data analyses: A holistic simulation study investigating statistical inference in presence of heterogeneity in data distributions across hospitals
Dominik Heinzmann (Roche, Basel)
Finally, practical guidance will be provided on how such simulation studies can support appropriate interpretation of the results of a federated analysis of real-world data when individual patient data at the hospitals cannot be accessed directly.
- CardioKit: Detection of Cardiac Anomalies through Distributed Optimization of Electrocardiogram Embeddings
Stephan Jonas (University Hospital Bonn) & Maximilian Kapsecker (Technical University Munich)
1 Introduction
The 12-lead electrocardiogram (ECG) provides extensive insights into cardiac health that usually require investigation by a physician. Wearable devices enable continuous single-lead ECG recording beyond the clinical environment [1]. In this context, an interdisciplinary team designed a system, CardioKit, to address the limitations of manual ECG review, such as time consumption and lack of reproducibility. CardioKit employs a privacy-preserving, semi-supervised approach to continuously learn ECG characteristics in an automated manner from decentralized data.
2 Methods
The core of CardioKit is built on Variational Autoencoders (VAEs), which embed the most relevant characteristics of ECG signals into a low-dimensional representation suitable for traditional anomaly detection. By optimizing VAEs for ECG data directly on client devices, such as mobile phones, the system leverages the benefits of statistical learning without requiring data to be shared with a central processing unit. Instead, the model weights are securely transmitted to a collaborative platform in an overall process known as federated learning [2]. This facilitates the development of a globally aggregated model for ECG embedding.
By annotating a few representative ECGs within the embedding space, such as marking anomalies and diseases, CardioKit can extrapolate similar ECGs based on the proximity of learned features. Distributing model interpretations back to client devices provides users with enhanced insights into the reasoning behind outliers. Further fine-tuning the global model on-device using local data improves predictive accuracy by accommodating individual variations, such as different isoelectric baselines.
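The proximity-based extrapolation described above can be sketched as a simple nearest-neighbour vote in the embedding space (a toy illustration with synthetic 2-D embeddings, not CardioKit's actual code):

```python
import numpy as np

def propagate_labels(labeled_emb, labels, queries, k=3):
    """Give each query the majority label of its k nearest annotated embeddings."""
    out = []
    for q in queries:
        nearest = np.argsort(np.linalg.norm(labeled_emb - q, axis=1))[:k]
        vals, counts = np.unique(labels[nearest], return_counts=True)
        out.append(vals[np.argmax(counts)])
    return np.array(out)

rng = np.random.default_rng(1)
normal_beats = rng.normal(0.0, 0.3, size=(20, 2))  # embeddings of annotated normal ECGs
anomalies = rng.normal(3.0, 0.3, size=(20, 2))     # embeddings of annotated anomalies
emb = np.vstack([normal_beats, anomalies])
lab = np.array([0] * 20 + [1] * 20)                # 0 = normal, 1 = anomaly

pred = propagate_labels(emb, lab, np.array([[0.1, 0.0], [3.1, 2.9]]))
```

In CardioKit the embeddings would come from the VAE encoder rather than being simulated, and the annotations from the physicians' web application.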
A proof-of-concept was achieved through a partial implementation of the prototype. The associated investigation revealed that VAEs are an effective method for encoding ECG signals. Furthermore, the prototype demonstrated the capability of federated learning to orchestrate model training while preserving privacy [3]. Additionally, a user-friendly web application was developed, enabling physicians to conveniently label ECG data.
CardioKit advances the computational assessment of cardiovascular risk and supports the quantified self for cardiac health. Researchers and physicians could benefit from the system’s collaborative and explorative nature, which enables efficient ECG annotation and the detection of anomalies and baseline drifts in long-term ECG recordings.
[1] Bouzid Z, Al-Zaiti SS, Bond R, Sejdić E. Remote and wearable ECG devices with diagnostic abilities in adults: A state-of-the-science scoping review. Heart Rhythm. 2022;19(7):1192-201.
[2] Kairouz P, McMahan HB, Avent B, Bellet A, Bennis M, Bhagoji AN, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning. 2021;14(1–2):1-210.
[3] Kapsecker M, Nugraha DN, Weinhuber C, Lane N, Jonas SM. Federated Learning with Swift: An Extension of Flower and Performance Evaluation. SoftwareX. 2023;24:101533.
10:25 - 10:55
Coffee break
10:55 - 11:40
Federated analysis of healthcare data (continuation)
- Software demonstration: CCPhos – A DataSHIELD-powered Framework for Data Harmonization, Augmentation, Exploration and Analysis of the German Cancer Consortium’s (DKTK) Clinical Communication Platform (CCP)
Bastian Reiter (Goethe University Frankfurt)
To establish secure, fast and scalable federated data analysis, a DataSHIELD-compliant infrastructure has been installed in the CCP network. With CCPhos (The CCP’s approach of handling oncological real-world data sets), we present a user-centered, comprehensive solution for the challenges in pre-analytic data preparation (i.e. harmonization, augmentation), exploration and analysis.
The CCP’s data model [4] forms a subset of the oncologic base data set (oBDS) [5] jointly developed by the Association of German Tumor Centers (Arbeitsgemeinschaft Deutscher Tumorzentren) and the Association of Epidemiologic Cancer Registries in Germany (Gesellschaft der epidemiologischen Krebsregister in Deutschland e.V.). As the data collection is conducted by trained cancer registrars within the participating cancer centers, the data is already in a well-harmonized state. However, multiple minor inconsistencies remain, which risk accumulating and resulting in invalid and biased statistical analyses.
Furthermore, using data augmentation by means of feature engineering and machine learning algorithms, the full potential of the data could be leveraged. Both aspects are addressed by the functionality implemented in the CCPhos framework. The CCPhos suite consists of two closely interlinked R-packages (dsCCPhos and dsCCPhosClient) and a complementary R Shiny application that aims to facilitate their usage for researchers.
The overarching goal is to provide researchers with a comprehensive set of tools to obtain valid and conclusive ready-for-analysis data, while offering a maximum of flexibility and transparency in the way these data are obtained.
__________________________________________________________________________________
1. Joos S, Nettelbeck DM, Reil-Held A, et al. German Cancer Consortium (DKTK) – A national consortium for translational cancer research. Mol Oncol. 2019;13(3):535-542. doi:10.1002/1878-0261.12430
2. Lablans M, Schmidt EE, Ückert F. An Architecture for Translational Cancer Research As Exemplified by the German Cancer Consortium. JCO Clin Cancer Inform. 2018;(2):1-8. doi:10.1200/CCI.17.00062
3. Maier D, Vehreschild JJ, Uhl B, et al. Profile of the multicenter cohort of the German Cancer Consortium’s Clinical Communication Platform. Eur J Epidemiol. 2023;38(5):573-586. doi:10.1007/s10654-023-00990-w
4. Deppenwiese N, Delpy P, Lambarki M, Lablans M. ADT2FHIR – A Tool for Converting ADT/GEKID Oncology Data to HL7 FHIR Resources. In: Röhrig R, Beißbarth T, König J, et al., eds. Studies in Health Technology and Informatics. IOS Press; 2021. doi:10.3233/SHTI210547
5. ADT/GEKID. Aktualisierter einheitlicher onkologischer Basisdatensatz der Arbeitsgemeinschaft Deutscher Tumorzentren e. V. (ADT) und der Gesellschaft der epidemiologischen Krebsregister in Deutschland e. V. (GEKID). https://www.basisdatensatz.de/download/Basisdatensatz12.7.pdf
6. Marcon Y, Gaye A, Burton P. DSI R package. https://CRAN.R-project.org/package=DSI
- A Secure and Scalable Workflow for Federated Data Analysis in the German Cancer Consortium’s (DKTK) Clinical Communication Platform (CCP)
David Juárez (German Cancer Research Center, Heidelberg)
To enhance the data analysis process with respect to data security and velocity, the CCP integrated DataSHIELD as a means for federated analysis into its platform. This integration is supported by a streamlined analysis workflow, most notably including automation of a) spawning of DataSHIELD partitions, including OPAL databases within each Bridgehead; b) data integration from FHIR [5,6] into OPAL; and c) secure and seamless firewall traversal for DataSHIELD requests using Samply.Beam [7], a framework for federated, end-to-end encrypted communication within strict network environments.
From the user’s perspective, the workflow begins with a feasibility check via a central web application (CCP Explorer) [8], allowing researchers to explore available data across the CCP Bridgeheads in a federated manner using several aggregated views. Once sufficient data is identified, the project undergoes ethical and scientific review, and the following technical steps are initiated to ensure secure and efficient data analysis:
1. Project Request Submission: The researcher submits a project request through a project management tool, including data selection using the CCP Explorer query (currently HL7 CQL [9]). This request undergoes a scientific review by the DKTK Clinical Data Science Group and a formal review by the CCP Office to ensure compliance with technical, legal, and ethical requirements.
2. Approval and Data Export: The project management tool sends data import queries to the Bridgeheads of participating sites. The Bridgehead administrators review and approve the request. Upon approval, the Bridgehead at each site creates an OPAL partition for this specific project and user, exports the requested data from the Bridgehead’s FHIR store [5,6] to OPAL (FHIR to SQL [10]), retaining it for a limited duration. To authenticate DataSHIELD requests, a unique, ephemeral token bound to this project is generated and sent to the researcher.
3. Authentication and Authorization: Once data is imported into each OPAL, authentication scripts including project-specific authorization tokens are generated and sent to the researcher.
4. Secure Access: Researchers access their own project partition on their local Bridgehead, which is equipped with RStudio, DataSHIELD, and OPAL, using their existing DKTK credentials (federated authentication via OpenID Connect).
5. Data Harmonization, Augmentation and Analysis: After authentication, researchers may enter an interactive RStudio session to access, process and analyze the respective data. To address pre-analytical data preparation challenges, data harmonization and cleansing are performed using the CCP-customized CCPhos suite of R packages built for application within the DataSHIELD framework (CCPhos [11]). The HTTP requests generated by DataSHIELD are relayed via Samply.Beam to the R server of each site, where they are then processed.
To address data privacy and protection, a comprehensive data protection concept was developed and coordinated with all DKTK sites.
This federated approach, supported by the DataSHIELD infrastructure, exemplifies the principle of "bringing the analysis to the data". It provides a robust mechanism for secure and compliant multi-site data analysis, enabling DKTK researchers to conduct comprehensive analyses while maintaining patient privacy and data sovereignty. Moreover, the integration of the CCP’s federated data warehouse system with the comprehensive and adaptable, federated DataSHIELD-based analysis environment will facilitate and accelerate future use scenarios of real-world clinical cancer data.
__________________________________________________________________________________
REFERENCES
[1] S. Joos, D.M. Nettelbeck, A. Reil-Held, K. Engelmann, A. Moosmann, A. Eggert, W. Hiddemann, M. Krause, C. Peters, M. Schuler, K. Schulze-Osthoff, H. Serve, W. Wick, J. Puchta, and M. Baumann, German Cancer Consortium (DKTK) - a national consortium for translational cancer research, Universität, Freiburg, 2019.
[2] A. Borg and M. Lablans, Clinical Communication Platform (CCP-IT): Datenschutzkonzept, http://www.unimedizin-mainz.de/typo3temp/secure_downloads/19402/0/3826542f323206d948a330d5705d0463564669b1/Datenschutzkonzept_CCP-IT__10.10.2014.pdf [cited 2015 October 8].
[3] M. Lablans, E.E. Schmidt, and F. Ückert, An Architecture for Translational Cancer Research As Exemplified by the German Cancer Consortium. JCO Clin Cancer Inform (2017), 1–8.
[4] M. Lablans, D. Kadioglu, M. Muscholl, and F. Ückert, Exploiting Distributed, Heterogeneous and Sensitive Data Stocks while Maintaining the Owner's Data Sovereignty. Methods Inf Med 54 (2015), 346–352.
[5] Samply Open Source Community, Blaze, https://github.com/samply/blaze#readme [cited 2022 October 13].
[6] M. Lambarki, J. Kern, D. Croft, C. Engels, N. Deppenwiese, A. Kerscher, A. Kiel, S. Palm, and M. Lablans, Oncology on FHIR: A Data Model for Distributed Cancer Research. Stud Health Technol Inform 278 (2021), 203–210.
[7] Samply Open Source Community, Samply.Beam README.md [cited 2023 May 12].
[8] Samply Open Source Community, samply/lens: A reusable toolkit for rich federated data exploration., https://github.com/samply/lens [cited 2024 June 13].
[9] J. Kern, N. Deppenwiese, C. Engels, A. Kiel, M. Lambarki, and M. Lablans, Complex queries on distributed FHIR data: the limits of FHIR Search, German Medical Science GMS Publishing House, 2021.
[10] Samply Open Source Community, samply/exporter: Exports data from the datawarehouses of the bridgehead in different formats, https://github.com/samply/exporter [cited 2024 June 13].
[11] B. Reiter, D. Maier, M. Lambarki, D. Juárez, J. Skiba, P. Delpy, T. Kussel, M. Lablans, and J. Vehreschild, CCPhos – A DataSHIELD-powered Framework for Data Harmonization, Augmentation, Exploration and Analysis of the German Cancer Consortium’s (DKTK) Clinical Communication Platform (CCP): (Abstract Submitted to DataSHIELD Conference 2024).
11:40 - 12:25
Keynote talk: How can DataSHIELD contribute to health economics studies?
12:25 - 13:30
Lunch (provided at university canteen)
13:30 - 13:50
Lightning Talks
- Trusted Research Environments and Governance of Personal Health Data in Chile: Foundations for a Population Health Laboratory
Miguel Cordero (Universidad del Desarrollo, Santiago de Chile)
- DataSHIELD stakeholder expectations: useability, hopes and next steps
Becca Wilson (University of Liverpool)
1. Wilson RC, Butters OW, Clark T et al. (2016). Digital methodology to implement the ECOUTER engagement process [version 1; referees: 2 approved]. F1000Research, 5:1307 (doi: 10.12688/f1000research.8786.1)
- Measuring the impact of DataSHIELD via research publications
Becca Wilson (University of Liverpool)
13:50 - 15:30
Updates from the DataSHIELD open source community
- Core DataSHIELD Infrastructure updates
MOLGENIS Armadillo - Mariska Slofstra, Tim Cadman & Dick Postma (University Medical Center Groningen), Opal - Yannick Marcon (Epigeny, France), DataSHIELD - Stuart Wheater (Arjuna Technologies, Newcastle upon Tyne)
- DataSHIELD Community: Updates from themes, steering committee and advisory board
Andre Morgan (Inserm, Paris) & Simon Parker (German Cancer Research Center, Heidelberg) with contributions from DataSHIELD community theme leads
Social event & Conference dinner
15:30 - 16:30
Coffee break with snacks; walk to the Arithmeum (approx. 2 km / 30-minute walk)
16:30 - 19:00
Guided tour at Arithmeum (60 minutes), subsequently: option for independent exploration and walking tour to restaurant
from 19:00
Conference dinner at restaurant DelikArt
DAY 3 - Thursday, September 26, 2024
8:45 - 9:30
Welcome Coffee
9:30 - 10:15
Keynote talk: Building a modern infrastructure for secure, scalable, collaborative data science
10:15 - 11:10
DataSHIELD software development II
- A User-Friendly Interactive Dashboard for DataSHIELD: Enhancing Data Exploration and Visualization
Andreas Mändle (Leibniz Institute for Prevention Research and Epidemiology - BIPS, Bremen)
There is a growing need for tools that prioritize ease of use, aesthetic presentation, and interactivity. Our newly developed Shiny dashboard addresses these needs by offering an enhanced user experience tailored for researchers and stakeholders who demand intuitive, easy-to-use and visually appealing interfaces. This presentation will showcase an exemplary use case for the innovative features of our dashboard, emphasizing its capacity to facilitate effective data exploration within the DataSHIELD framework.
The key features are:
1. User-Friendly Interface
The dashboard leverages R Shiny's capabilities to provide a user-friendly and interactive platform for researchers. Our dashboard is designed with simplicity in mind, ensuring that users can easily navigate and utilize its features.
2. Enhanced Visualizations
We have incorporated advanced plotting capabilities to produce appealing, interactive visualizations. Users can generate and customize several chart and graph types including alluvial plots, making it easier to get data insights.
3. Interactivity
A key feature of our dashboard is the high level of interactivity it offers. Users can interact with plots and summary tables to explore different facets of their data.
4. Attractive Presentation
Beyond functionality, we have prioritized the aesthetic aspects of our dashboard. The clean and modern design not only improves usability but also ensures that the outputs are visually engaging.
5. High level of data privacy
By integrating DataSHIELD, the dashboard ensures that individual-level data remain secure and are never directly accessed or transferred. As an additional layer of security, advanced methods for synthetic data generation based on a non-parametric copula approach ensure that potentially disclosive outputs, such as scatterplots, provide valuable informational insights and meaningful analyses without exposing confidential information.
Our dashboard represents a significant advancement in the tools available for DataSHIELD users. By combining the strengths of R Shiny's interactive interface, DataSHIELD's privacy-preserving functionalities, and the generation of synthetic data, we aim to empower researchers to gain insight into complex datasets easily and effectively. In this way, the accessibility of research data is promoted. Designed to accommodate various types of datasets and research needs, the dashboard is highly scalable and can be adapted to a broad spectrum of applications, such as analyzing cohort data in epidemiological studies.
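One way such copula-based synthetic data can be produced (a sketch assuming a Gaussian copula with empirical, i.e. non-parametric, marginals; the dashboard's actual method may differ) is to map each variable to normal scores, sample correlated scores, and map back through the empirical quantile function:

```python
import numpy as np
from scipy.stats import norm

def synthesize(data, n_new, rng):
    """Gaussian-copula synthesis with empirical (non-parametric) marginals."""
    n, d = data.shape
    # 1. map each column to normal scores via its empirical CDF
    ranks = np.argsort(np.argsort(data, axis=0), axis=0) + 1
    z = norm.ppf(ranks / (n + 1))
    # 2. estimate the dependence structure and draw correlated normal scores
    corr = np.corrcoef(z, rowvar=False)
    z_new = rng.multivariate_normal(np.zeros(d), corr, size=n_new)
    # 3. map back through the empirical quantile function of each column
    u_new = norm.cdf(z_new)
    return np.column_stack([np.quantile(data[:, j], u_new[:, j]) for j in range(d)])

rng = np.random.default_rng(2)
x = rng.normal(size=1000)
real = np.column_stack([x, 0.8 * x + rng.normal(scale=0.6, size=1000)])
fake = synthesize(real, 1000, rng)
```

The synthetic rows preserve the marginal distributions and the correlation structure of the real data while containing no original records, which is what makes them suitable for disclosive displays such as scatterplots.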
- dsMatchIt
Roy Gusinow (Helmholtz Center Munich / University of Bonn)
- Updates from DSFunctionCreator: Working towards coding and package conventions
Florian Schwarz (German Institute of Human Nutrition Potsdam-Rehbruecke)
While some of the proposed changes might seem tedious to implement at first, the long-term benefit of a unified approach significantly enhances sustainability. To aid in this process, the DSFunctionCreator package was created; I will provide an update on its developer-support functionality. Discussions (and potentially agreements) on some concrete standards should be initiated at the conference, to lay the groundwork for the next major update of dsBase (7.0.0), which is envisioned to incorporate those changes.
11:10 - 11:40
Coffee break
11:40 - 12:50
DataSHIELD method development
- Privacy-preserving gradient boosting in DataSHIELD
Manuel Huth (Helmholtz Center Munich / University of Bonn)
To address this gap, we developed a federated software package for tree-based gradient boosting models, integrated within the DataSHIELD platform. Our package adheres to DataSHIELD's stringent security protocols and employs differential privacy as an additional security layer. Key features include compatibility with continuous and categorical features as well as applicability to both regression and classification problems. The tree model supports histogram-based as well as random splits, and it effectively reproduces non-federated estimates while ensuring data privacy.
We demonstrate the functionality of our software and evaluate the impact of different privacy budgets using human microbiome data. Our work provides a significant advancement in tool availability for privacy-preserving machine learning, offering a secure and effective means of analyzing sensitive data without compromising privacy.
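A single ingredient of such a model, histogram-based split finding on noise-perturbed gradient sums, can be sketched as follows (Laplace noise stands in for the package's differential-privacy layer; all names and parameters here are illustrative, not the package's API):

```python
import numpy as np

def best_split_dp(x, grad, n_bins=16, eps=1.0, rng=None):
    """Pick a split threshold from per-bin gradient sums perturbed with Laplace noise."""
    rng = rng if rng is not None else np.random.default_rng()
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    g = np.array([grad[bins == b].sum() for b in range(n_bins)])
    cnt = np.array([(bins == b).sum() for b in range(n_bins)], dtype=float)
    g = g + rng.laplace(scale=1.0 / eps, size=n_bins)                 # noisy gradient sums
    cnt = np.maximum(cnt + rng.laplace(scale=1.0 / eps, size=n_bins), 1.0)  # noisy counts
    best, best_gain = None, -np.inf
    for s in range(1, n_bins):
        gl, gr = g[:s].sum(), g[s:].sum()
        nl, nr = cnt[:s].sum(), cnt[s:].sum()
        gain = gl**2 / nl + gr**2 / nr - (gl + gr)**2 / (nl + nr)     # variance-gain criterion
        if gain > best_gain:
            best, best_gain = edges[s], gain
    return best

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 5000)
grad = np.where(x > 0.0, 1.0, -1.0) + rng.normal(scale=0.1, size=5000)  # step at 0
split = best_split_dp(x, grad, eps=5.0, rng=rng)
```

In a federated run, each site would contribute its own noisy per-bin sums and the server would aggregate them before scanning for the best split.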
- Vertical data analysis using DataSHIELD
Miron Banjac (Barcelona Institute for Global Health - ISGlobal)
One significant challenge with vertical partitioning is data alignment and record matching. To address this, we employ secure hashing methods, which allow for accurate row matching without information leakage, thereby ensuring data alignment across different partitions.
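A minimal sketch of hash-based row matching, using a keyed HMAC with a shared salt as one common construction (an assumption for illustration, not necessarily the authors' exact scheme):

```python
import hashlib
import hmac

SHARED_SALT = b"per-study secret agreed out of band"  # hypothetical shared key

def pseudonymize(record_id):
    """Keyed hash: sites can match rows on digests without revealing raw IDs."""
    return hmac.new(SHARED_SALT, record_id.encode(), hashlib.sha256).hexdigest()

# Each site hashes its own identifiers locally; only digests are compared
site_a = {pseudonymize(i): v for i, v in [("p001", 1.2), ("p002", 3.4)]}
site_b = {pseudonymize(i): v for i, v in [("p002", "yes"), ("p003", "no")]}
common = set(site_a) & set(site_b)  # rows present at both sites
```

Keying the hash with a secret salt prevents dictionary attacks on low-entropy identifiers, which a plain unsalted hash would not.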
A central component of our methodology is the use of Block Singular Value Decomposition (Block SVD) to approximate correlation coefficients between variables and conduct Principal Component Analysis (PCA). This technique enables efficient data processing without necessitating data centralization or the sharing of masked or encrypted data, preserving privacy and compliance with data protection regulations.
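A toy version of the Block SVD idea, in which two vertically partitioned blocks exchange only truncated SVD factors to approximate their cross-correlations (illustrative only; in a real deployment the exchanged factors would themselves need disclosure controls):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
z = rng.normal(size=(n, 3))                                       # shared latent signal
x1 = z @ rng.normal(size=(3, 6)) + 0.1 * rng.normal(size=(n, 6))  # columns at site 1
x2 = z @ rng.normal(size=(3, 5)) + 0.1 * rng.normal(size=(n, 5))  # columns at site 2
x1 = (x1 - x1.mean(0)) / x1.std(0)
x2 = (x2 - x2.mean(0)) / x2.std(0)

def block_factors(x, r):
    """Each site keeps only a rank-r SVD factorisation of its own block."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return u[:, :r] * s[:r], vt[:r]

a1, v1 = block_factors(x1, r=3)
a2, v2 = block_factors(x2, r=3)
# Cross-correlations recovered from the exchanged low-rank factors alone
approx = (v1.T @ (a1.T @ a2) @ v2) / n
exact = (x1.T @ x2) / n
err = np.abs(approx - exact).max()
```

Because X₁ᵀX₂ = V₁S₁U₁ᵀU₂S₂V₂ᵀ for exact SVDs, the truncated factors approximate the cross-correlation block without ever pooling the raw columns; the same factors feed directly into a PCA of the combined data.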
Additionally, we have developed a distributed block coordinate descent algorithm tailored for fitting various families of Generalized Linear Models (GLMs) on vertically partitioned data. This algorithm updates parameter estimates for each block iteratively, eliminating the need for raw data exchange and thus maintaining data confidentiality.
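The block update scheme can be sketched for logistic regression with two column blocks: each "site" refits its own coefficients while holding the other site's contribution to the linear predictor fixed (a toy sketch with simulated data, not the authors' implementation):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(5)
n = 2000
X = rng.normal(size=(n, 4))
beta_true = np.array([1.0, -2.0, 0.5, 1.5])
y = rng.binomial(1, sigmoid(X @ beta_true))

blocks = [X[:, :2], X[:, 2:]]                 # two sites hold disjoint sets of columns
betas = [np.zeros(2), np.zeros(2)]
eta = [B @ b for B, b in zip(blocks, betas)]  # each site's share of the linear predictor

for _ in range(50):                           # outer sweeps over the blocks
    for k in range(len(blocks)):
        off = sum(eta) - eta[k]               # other sites' contributions, held fixed
        p = sigmoid(off + eta[k])
        W = p * (1.0 - p)
        Xk = blocks[k]
        # one Newton step on this block's coefficients only
        betas[k] += np.linalg.solve((Xk * W[:, None]).T @ Xk, Xk.T @ (y - p))
        eta[k] = Xk @ betas[k]

est = np.concatenate(betas)
```

Only the per-row linear predictor shares (eta) cross site boundaries, never the raw columns; because the logistic log-likelihood is concave, the block updates converge to the joint maximum likelihood estimate.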
Our advancements extend the capabilities and scope of DataSHIELD by implementing a robust framework for the analysis and model fitting of vertically partitioned data. This work demonstrates that it is possible to perform sophisticated statistical analyses and model fitting in a federated environment while adhering strictly to non-disclosure mandates and privacy-preserving principles. Through these enhancements, we aim to broaden the applicability of DataSHIELD to diverse data partitioning scenarios, enabling more comprehensive and secure data analyses across various fields. As an example of application, our implementation and results will be illustrated using data from the CCShared project.
- Privacy-preserving impact evaluation using Differences-in-Differences
Carolina Alvarez (University of Bonn)
- Developing dsMediation, a DataSHIELD package for causal mediation analysis: Challenges and Potentials
Demetris Avraam (University of Copenhagen)
12:50 - 13:00
Closing remarks
Jan Hasenauer (University of Bonn)
from 13:00
Departure
14:00 - 17:00
DataSHIELD Advanced Users' Workshop (optional)