Risteys FinnGen R7 - Documentation

Mortality

The goal of the analysis is to calculate the association between an exposure endpoint and death.

Data pre-processing

Start of follow-up: 1998-01-01. We chose this date because we have complete coverage for all registries.
End of follow-up: death or 2019-12-31
If the date of diagnosis of the exposure endpoint is before 1998-01-01, we assume that it happened on 1998-01-01.
The analysis is run only if there are at least 10 deaths among individuals diagnosed with the exposure endpoint.
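
The pre-processing rules above can be sketched as follows. This is a minimal illustration, not the Risteys implementation: the function names `follow_up` and `enough_deaths` and the tuple layout are hypothetical, and the sketch assumes every individual is alive on 1998-01-01.

```python
from datetime import date

STUDY_START = date(1998, 1, 1)
STUDY_END = date(2019, 12, 31)
MIN_DEATHS = 10


def follow_up(exposure_date, death_date):
    """Return (start, exposure, end, died) for one individual.

    exposure_date and death_date are datetime.date objects, or None
    when the individual has no exposure diagnosis / no recorded death.
    """
    # An exposure diagnosed before the study start is moved to the study start.
    if exposure_date is not None and exposure_date < STUDY_START:
        exposure_date = STUDY_START
    # Follow-up ends at death or at the study end, whichever comes first.
    died = death_date is not None and death_date <= STUDY_END
    end = death_date if died else STUDY_END
    return STUDY_START, exposure_date, end, died


def enough_deaths(records):
    """True if at least MIN_DEATHS exposed individuals died during follow-up."""
    deaths = sum(1 for _, exposure, _, died in records
                 if exposure is not None and died)
    return deaths >= MIN_DEATHS
```

For example, `follow_up(date(1990, 5, 1), date(2005, 3, 2))` moves the exposure date to 1998-01-01 and ends follow-up at the date of death.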

Case-cohort design

To improve computational speed, we used a case-cohort design.

Briefly, we selected a subcohort from the original cohort at the start of follow-up. The subcohort can include individuals who died during follow-up. The subcohort contains 10,000 individuals. The final population includes everyone in the subcohort plus all individuals who died outside the subcohort.
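
A minimal sketch of this sampling step, using only the Python standard library. The function name `case_cohort` is hypothetical, and the weighting scheme is an assumption: the documentation only states that weights are the inverse of the sampling probability, so here cases (always included) get weight 1 and non-cases (included only through the subcohort) get 1 / sampling fraction.

```python
import random

SUBCOHORT_SIZE = 10_000


def case_cohort(cohort_ids, case_ids, subcohort_size=SUBCOHORT_SIZE, seed=0):
    """Return {individual_id: weight} for the case-cohort population.

    Assumed weighting (not confirmed by the documentation): cases are
    always included, so they get weight 1; non-cases enter only via
    the random subcohort, so they get 1 / sampling_fraction.
    """
    cohort_ids = list(cohort_ids)
    case_ids = set(case_ids)
    rng = random.Random(seed)
    size = min(subcohort_size, len(cohort_ids))
    subcohort = set(rng.sample(cohort_ids, size))
    sampling_fraction = size / len(cohort_ids)

    weights = {}
    for ind in cohort_ids:
        if ind in case_ids:
            weights[ind] = 1.0  # all cases are kept
        elif ind in subcohort:
            weights[ind] = 1.0 / sampling_fraction  # sampled non-case
    return weights
```

With a cohort of 50,000 individuals, each non-case in the subcohort would represent 5 individuals of the original cohort under this scheme.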

Cox regression

To perform the analyses, we used a Cox regression with a time-varying covariate, weighted by the inverse of the sampling probability to account for the case-cohort design. Robust standard errors were used. The model is defined as:
Surv(time,death) ~ exposure_endpoint + birth_year + sex
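
Written as a hazard function, the model formula above corresponds to the expression below. The symbols h_0 (baseline hazard) and β (regression coefficients) are standard Cox-model notation introduced here for illustration; they do not appear in the original documentation.

```latex
h_i(t) = h_0(t)\,\exp\bigl(\beta_1\,\mathrm{exposure\_endpoint}_i(t)
        + \beta_2\,\mathrm{birth\_year}_i + \beta_3\,\mathrm{sex}_i\bigr),
\qquad \mathrm{HR}_{\mathrm{exposure}} = e^{\beta_1}
```

The hazard ratio for the exposure endpoint is the exponential of its coefficient.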

time is calculated as (date of end of follow-up - date of entry in the study), both as defined in Data pre-processing. For individuals diagnosed with the exposure endpoint, time is split into two intervals: from entry to diagnosis, and from diagnosis to the end of follow-up (see below).
exposure_endpoint is treated as a time-varying covariate. This means that an individual is unexposed (the variable is set to 0) from 1998-01-01 until the diagnosis of the exposure endpoint, and exposed (the variable is set to 1) after that. That is, an individual who experiences the exposure endpoint has two rows in the dataset.
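
The row splitting described above can be sketched as follows. The function name `split_rows` and the tuple layout (id, start, stop, exposed, event) are hypothetical. The sketch also assumes that an individual whose diagnosis falls on the entry date gets a single exposed row, since the unexposed interval would have zero length.

```python
def split_rows(ind_id, entry, exit_, exposure_time, event):
    """Return counting-process rows (id, start, stop, exposed, event).

    Times are in years since 1998-01-01. exposure_time is None for
    individuals never diagnosed with the exposure endpoint.
    """
    if exposure_time is None or exposure_time >= exit_:
        # Never exposed during follow-up: a single unexposed row.
        return [(ind_id, entry, exit_, 0, event)]
    if exposure_time <= entry:
        # Exposed from the start (diagnosis moved to 1998-01-01).
        return [(ind_id, entry, exit_, 1, event)]
    # Unexposed until diagnosis, exposed afterwards. The event can only
    # be recorded at the end of the second (exposed) row.
    return [
        (ind_id, entry, exposure_time, 0, 0),
        (ind_id, exposure_time, exit_, 1, event),
    ]
```

For an individual diagnosed 4 years into follow-up who dies after 10 years, `split_rows("a", 0.0, 10.0, 4.0, 1)` yields one unexposed row covering years 0 to 4 and one exposed row covering years 4 to 10 that carries the event.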

Lagged hazard ratios are computed for the following follow-up time windows: less than 1 year, 1 to 5 years, and 5 to 15 years.

The Cox regression is implemented using the lifelines library.

Absolute Risk (AR)

The absolute risk represents the probability of dying. It is defined as AR = 1 - survival_probability. The survival probability is derived using Breslow's method, assuming the following values for the other covariates in the model:

year of birth: 1959
sex ratio: 50%
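
As an illustration of the formula AR = 1 - survival_probability, the sketch below computes the absolute risk from a baseline cumulative hazard and a set of coefficients under a Cox model. All numbers are made up for the example and are not FinnGen estimates. The function name `absolute_risk` is hypothetical, and encoding birth year relative to 1959 is an assumption made to keep the arithmetic readable. In practice the baseline cumulative hazard would come from the fitted model.

```python
import math


def absolute_risk(baseline_cum_hazard, coefs, covariates):
    """AR = 1 - S(t), with S(t) = exp(-H0(t) * exp(sum(beta * x)))."""
    linear_predictor = sum(coefs[name] * covariates[name] for name in coefs)
    survival = math.exp(-baseline_cum_hazard * math.exp(linear_predictor))
    return 1.0 - survival


# Hypothetical coefficients, for illustration only.
coefs = {"exposure_endpoint": 0.7, "birth_year": -0.05, "sex": 0.3}
# birth_year is expressed relative to 1959 here, so the reference value is 0;
# a sex ratio of 50% corresponds to a covariate value of 0.5.
reference = {"exposure_endpoint": 1, "birth_year": 0, "sex": 0.5}

ar_exposed = absolute_risk(0.10, coefs, reference)
ar_unexposed = absolute_risk(0.10, coefs, {**reference, "exposure_endpoint": 0})
```

Setting sex to 0.5 evaluates the model at the midpoint between the two sex categories, which is how a 50% sex ratio is represented in this sketch.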

Survival analyses between endpoints

Associations between endpoints are calculated loosely following the approach described in the NB-COMO study. The goal of the analysis is to study the association between an exposure endpoint and an outcome endpoint. For example: what is the association between a diagnosis of type 2 diabetes (exposure endpoint) and cardiovascular diseases (outcome endpoint)?

Data pre-processing

Start of follow-up: 1998-01-01. We chose this date because we have complete coverage for all registries.
End of follow-up: diagnosis of the outcome endpoint, death, or 2019-12-31, whichever comes first
Prevalent cases (i.e. individuals diagnosed with the outcome endpoint before 1998-01-01) are removed from the study. We consider only incident cases.
If the date of diagnosis of the exposure endpoint is before 1998-01-01, we assume that it happened on 1998-01-01.
Only endpoint pairs that meet all of the following criteria are considered:
- at least 10 individuals in each cell of the 2x2 contingency table between the two endpoints.
- at least 25 individuals with the outcome endpoint.
- the endpoints are not “overlapping”. That is, neither endpoint is a descendant of the other in the endpoint tree hierarchy, and the two endpoints do not have overlapping underlying ICD codes.
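
The pair-selection criteria above can be sketched as a single check. The function name `keep_pair` is hypothetical. The overlap test (tree hierarchy and ICD codes) is assumed to be computed elsewhere and passed in as a flag, and `exposed` and `outcome` are assumed to be subsets of `population`.

```python
MIN_CELL = 10
MIN_OUTCOME = 25


def keep_pair(exposed, outcome, population, overlapping=False):
    """Decide whether an (exposure, outcome) endpoint pair is analysed.

    exposed, outcome, population: sets of individual IDs.
    overlapping: True if the endpoints are related in the endpoint tree
    or share ICD codes (assumed to be determined elsewhere).
    """
    if overlapping:
        return False
    # The four cells of the 2x2 contingency table.
    both = len(exposed & outcome)
    exposed_only = len(exposed - outcome)
    outcome_only = len(outcome - exposed)
    neither = len(population - exposed - outcome)
    cells_ok = min(both, exposed_only, outcome_only, neither) >= MIN_CELL
    return cells_ok and len(outcome) >= MIN_OUTCOME
```

Note that the two count criteria are independent: a pair can have 10 or more individuals in every cell and still have fewer than 25 individuals with the outcome endpoint.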

Case-cohort design

To improve computational speed, we used a case-cohort design.

Briefly, we selected a subcohort from the original cohort at the start of follow-up. The subcohort can include individuals who experience the outcome endpoint. The subcohort always contains 10,000 individuals, randomly selected for each analysis. The final population includes everyone in the subcohort plus all individuals who experience the outcome endpoint outside the subcohort.

Cox regression

To perform the analyses, we used a Cox regression with a time-varying covariate, weighted by the inverse of the sampling probability to account for the case-cohort design. Robust standard errors were used. The model is defined as:
Surv(time,outcome_endpoint) ~ exposure_endpoint + birth_year + sex

time is calculated as (date of end of follow-up - date of entry in the study), both as defined in Data pre-processing. For individuals diagnosed with the exposure endpoint, time is split into two intervals: from entry to diagnosis, and from diagnosis to the end of follow-up (see below).
exposure_endpoint is treated as a time-varying covariate. This means that an individual is unexposed (the variable is set to 0) from 1998-01-01 until the diagnosis of the exposure endpoint, and exposed (the variable is set to 1) after that. That is, an individual who experiences the exposure endpoint has two rows in the dataset.

Lagged hazard ratios are computed for the following follow-up time windows: less than 1 year, 1 to 5 years, and 5 to 15 years. If an outcome endpoint happens outside the time window, the individual who experienced it is kept in the analysis, but the outcome endpoint is not counted (i.e. the outcome variable is set to 0).
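
The handling of outcomes outside a lag window can be sketched as below. Several details are assumptions, because the documentation does not spell them out: the window is measured in years from the exposure diagnosis, intervals are half-open (lower bound included, upper bound excluded), and the names `LAG_WINDOWS` and `outcome_in_window` are hypothetical.

```python
# (lower, upper) bounds in years; assumed to be measured from the
# exposure diagnosis and treated as half-open intervals [lower, upper).
LAG_WINDOWS = {"<1y": (0, 1), "1-5y": (1, 5), "5-15y": (5, 15)}


def outcome_in_window(years_to_outcome, window):
    """Return 1 if the outcome falls inside the lag window, else 0.

    years_to_outcome: time from exposure diagnosis to the outcome, or
    None if the individual never experiences the outcome. Individuals
    with an outcome outside the window stay in the data with outcome 0.
    """
    if years_to_outcome is None:
        return 0
    lower, upper = LAG_WINDOWS[window]
    return int(lower <= years_to_outcome < upper)
```

Under these assumptions, an outcome 6 months after the exposure diagnosis counts in the "<1y" window but is set to 0 in the "1-5y" and "5-15y" windows, while the individual remains in all three analyses.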

The Cox regression is implemented using the lifelines library.

Notes

Due to the sensitive nature of the data, the ages at entering and leaving the study are given with a precision of 1 year.