NMR metabolomics
In this work, we expand our previous GWAS of 123 human metabolic traits in ~25,000 individuals4 to include additional cohorts and a more comprehensive panel of metabolic traits. Up to 233 serum or plasma metabolic traits were quantified in 33 cohorts (total sample size up to 136,016) using an updated quantification version of the same NMR metabolomics platform17 as in the previous study. The NMR metabolomics platform provides data of lipoprotein subclasses and their lipid concentrations and compositions, apoAI and apoB, cholesterol and triglyceride measures, albumin, various fatty acids and low-molecular-weight metabolites—for example, amino acids, glycolysis-related measures and ketone bodies. In this work, the metabolic traits were quantified in the following cohorts (described in detail in Supplementary Notes and Supplementary Table 1): Avon Longitudinal Study of Parents and Children (ALSPAC), China Kadoorie Biobank (CKB), Estonian Genome Center of University of Tartu Cohort (EGCUT), The Erasmus Rucphen Family study (ERF), European Genetic Database (EUGENDA), FINRISK 1997 (FR97), FINRISK 2007 (FR07, that is, DILGOM), The INTERVAL Bioresource (INTERVAL), CROATIA-Korcula Study (KORCULA), LifeLines-DEEP (LLD), Leiden Longevity Study (LLS), eight subcohorts from the London Life Sciences Prospective Population Study (LOLIPOP), The Metabolic Syndrome in Men study (METSIM), The Netherlands Epidemiology of Obesity Study (NEO), The Netherlands Study of Depression and Anxiety (NESDA), Northern Finland Birth Cohort 1966 (NFBC1966), NFBC1986, The Netherlands Twin Register (NTR), Oxford Biobank (OBB), Orkney Complex Disease Study (ORCADES), PROspective Study of Pravastatin in the Elderly at Risk (PROSPER), three subcohorts from the Rotterdam Study (RS), TwinsUK (TUK), and The Cardiovascular Risk in Young Finns Study (YFS). Most of the cohorts consisted of individuals of European ancestry (six Finnish and 21 non-Finnish), and six cohorts had individuals of Asian ancestry (one Han Chinese and five South Asian). All participants gave informed consent and all studies were approved by the ethical committees of the participating centres.
Detailed description of the NMR method is given in the Supplementary Notes.
Genome-wide association study
A GWAS was performed for 233 metabolic traits (Supplementary Table 2) in each of 33 cohorts (Supplementary Table 1), leading to inclusion of up to 136,016 individuals with both NMR metabolic trait measurements and genome-wide SNP data available. Pregnant individuals or those using lipid-lowering medication were excluded from the study. SNPs were imputed using the Haplotype Reference Consortium release 1.1 or the 1000 Genomes Project phase 3 release, and GWAS was performed under the additive model separately in each cohort (details in Supplementary Table 3). Before analyses, the metabolic trait distributions were adjusted for age, sex, principal components and relevant study-specific covariates (see Supplementary Table 3), and inverse rank normal transformation of trait residuals was performed. The cohorts were combined in fixed-effect meta-analysis with METAL69, and the SNPs were filtered to those present in at least seven cohorts. The NMR metabolic traits are highly correlated and using the Bonferroni correction to account for multiple testing would result in an overconservative threshold for genome-wide significance. We therefore used the number of principal components (28) explaining >95% variation in the metabolic traits defined in the largest cohort, INTERVAL, to correct for multiple testing, and our genome-wide significance threshold was set to P < 1.8 × 10−9 (standard genome-wide significance level, P < 5 × 10−8, divided by 28). After the primary GWAS, fasting- and sample type-stratified analyses were performed for the 233 metabolic traits. In these analyses 26 of the cohorts were classified as fasted (n = 68,559), six cohorts were classified as non-fasted (n = 58,112), seventeen cohorts were classified as having serum samples (n = 90,223) and sixteen cohorts had plasma samples (n = 45,793; see Supplementary Table 1). To define associated loci across the metabolic traits, we defined a 500-kb window flanking each SNP meeting the significance threshold, pooled together these windows from all metabolic traits for each chromosome, and iteratively merged the windows. As this approach can lead to inclusion of multiple independent signals within these loci, we further defined potential independent signals that reside within the defined loci based on pairwise LD (r2 cut-off of 0.3, defined in INTERVAL and FINRISK97) of all the lead SNPs within each locus. Regional association plots were created in LocusZoom, v. 1.4. We assigned the associated lead SNPs to the most likely causal genes based on two criteria: (1) we prioritized genes with clear biological relevance to the associated metabolic traits; and (2) if no biologically plausible causal gene was detected and the lead SNP was a functional variant (missense, splice region or stop gained) or in high LD (r2 > 0.8 in INTERVAL) with such a variant, the gene with the functional variant was assigned as the most likely candidate gene. If criteria 1 and 2 were not fulfilled, the nearest gene was indicated as the candidate gene.
Ancestry-specific analyses
We conducted ancestry-stratified analyses within our primary discovery meta-analysis for South Asian (five cohorts, 11,340 participants), East Asian (one cohort, 4,435 participants), all European (27 cohorts, 120,241 participants), Finnish (six cohorts, 27,577 participants) and non-Finnish European (21 cohorts, 92,664 participants) participants. For these ancestry-specific analyses, we used the standard threshold for genome-wide significance (P < 5 × 10−8). To also compare to participants with African ancestry, we conducted an African-specific subgroup analysis using the UK Biobank dataset (March 2021 release). Using self-reported ethnicity information (Field 21000: Ethnicity background) from the baseline questionnaire, 1,405 participants with African ancestry were identified as having Caribbean (code 4001), African (code 4002), or any other Black background (code 4003). Variant QC was performed by excluding SNPs with minor allele frequency <1%, INFO score <0.4, and variants in complex LD regions. LD thinning was performed with r2 < 0.1, a window size of 1,000 and a step size of 80. Related individuals were identified and excluded using relatedness data provided by the UK Biobank (Field 22021: Genetic kinship to other participants). Outliers of the first 6 genetic principal components computed on the unrelated samples were removed from the analysis. NMR metabolic traits were adjusted for age, sex, fasting status and 10 genetic principal components, and trait residuals were inverse rank normal-transformed. Associations between SNPs and metabolic traits were tested using PLINK 2.0.
Replication in publicly available data
UK Biobank SNP–metabolic trait summary statistics were downloaded (https://gwas.mrcieu.ac.uk/datasets/?gwas_id__icontains=met-d) from the IEU Open GWAS Project70. These summary statistics were derived from the publicly available March 2021 release of the UK Biobank data in which the metabolic traits were measured with a similar NMR technology (newer version of the Nightingale Health platform) as in our study. The data were used to compare the association of our lead SNP–metabolic trait pairs within the 276 associated regions. Two thresholds were used to define an association in the UK Biobank data: the standard genome-wide significance level (P < 5 × 10−8) and the suggestive level of significance (p < 1 × 10−5).
Heritability and variance explained
We used GCTA-GREML71 v. 1.94 to estimate common variant heritability for each trait using an independent dataset, specifically the UK Biobank phase 1 NMR release. This research was conducted using the UKBB Resource under application number 7439. We randomly selected 10,000 unrelated UK Biobank participants of European ancestry with available NMR data and filtered imputed variants to minor allele frequency >0.005, missingness <0.1 and Hardy–Weinberg equilibrium P value <10−6. We removed technical variation from the traits using methods described previously72, and adjusted the traits for age, sex, lipid-lowering medication usage and the first 10 genetic principal components of ancestry. Traits were rank inverse normal-transformed prior to GREML analysis. Variance explained by the lead SNPs for each trait was estimated as described before73.
Comparing to previous associations
We performed an extensive comparison of our metabolic trait associations to previous GWASs of metabolic traits. Our comparisons were divided into three groups: (1) comparison to results of previously published large GWAS of circulating NMR traits4,5; (2) comparison with loci associated with clinical lipids (including those from the UK Biobank September 2019 version 3 release)21,25,26,27,74; and (3) comparison with an extensive list of associations from previous metabolite and metabolomic studies11,13,53,75,76,77,78,79,80,81,82,83,84,85,86,87. The comparisons were performed by indicating: (1) co-located known variants; (2) any known associations within a 500-kb flank of a lead SNP; or (3) known associations in LD (r2 > 0.3, defined in INTERVAL) with a lead SNP. Since we used the UK Biobank for replication, we did not compare the associations to those from studies that used UK Biobank NMR metabolomics as a single cohort without validation cohorts67,88.
In addition to comparing to previous metabolic trait associations, we screened previous disease and trait associations (P value cut-off 5 × 10−8) of the lead SNPs using PhenoScanner, v244,45, and NHGRI-EBI GWAS Catalog46 (associations downloaded on 30 March 2023 using the gwasrapidd R package, v. 0.99.1489). In addition, we screened the FinnGen43 data freeze 7 summary statistics of 3,095 disease endpoints for overlapping associations (P value cut-off 5 × 10−8). Associations with gene expression and protein levels were screened using PhenoScanner, v244,45.
Metabolic effects of lipoprotein loci
To compare the metabolic effects of lipoprotein, lipid and apolipoprotein-associated variants, the effect estimates were visualized as colour-coded heat maps. To allow comparison of SNP effects, the estimates were scaled relative to the highest absolute value of the estimate for each SNP. In this analysis, we included lead SNPs at the 276 initially defined regions that were associated with any of the lipoprotein lipids or apolipoproteins at genome-wide significance and nominally associated (P < 0.05) with apoB. We used these criteria to restrict the analysis to SNPs associated with apoB, because apoB is known to be a causal part of lipoprotein metabolism for cardiovascular disease30,31,32. To exclude signals with similar effects across the metabolic traits due to the same causal gene, we included only a single SNP from the initially defined genomic regions that had multiple independent signals if the patterns of metabolic traits associations were similar (R > 0.5). In the heat maps each line represents a single SNP, each column corresponds to a single metabolic measure, and the scaled effect estimates for the SNP-metabolite associations are visualized with a colour range. Directions of effects are shown in relation to the allele associated with increased apoB. To group SNPs with similar effects together, dendrograms were constructed based on hierarchical clustering of the scaled SNP effects. Heat maps were constructed using the heatmap.2 function of the gplots v. 3.0.3 R package. Pearson correlations were assessed in R, v. 4.0.0.
Intrahepatic cholestasis of pregnancy
We assessed overlap of our metabolic trait associations with ICP using summary statistics from the FinnGen study43 data freeze 7 (O15_ICP; 1,460 cases, 172,286 controls). ICP cases were defined through hospital discharge registry, ICD10 code O26.6 and ICD9 codes 6467A and 6467X. Using the nearest genes at each associated locus, we performed gene ontology (GO) enrichment analysis to search for enriched biological process and molecular function GO terms90,91. We assessed colocalizations of association signals using the hypothesis prioritization for multi-trait colocalization (HyPrColoc) R library, v. 1.0, in which an efficient deterministic Bayesian algorithm is used to detect colocalization across vast numbers of traits simultaneously92. We searched for colocalization at single causal variants and shared regional associations. To visualize SNP effects across lipid and lipoprotein traits, heat maps were constructed using the heatmap.2 function of the gplots v. 3.0.3 R package. The following SNPs were included in the heat maps: GCKR-rs1260326, ABCB11-rs10184673, ABCB1-rs17209837, CYP7A1-rs9297994, SERPINA1-rs28929474 and HNF4A-rs1800961. Effects of the metabolic trait-associated SNPs were scaled relative to an odds ratio of 1.5 for ICP.
Mendelian randomization
Two-sample Mendelian randomization was performed using 20 NMR non-lipid metabolic traits (including amino acids (alanine, glutamine, glycine, histidine, isoleucine, leucine, valine, phenylalanine and tyrosine), ketone bodies (acetate, acetone and 3-hydroxybutyrate), and glycolysis/gluconeogenesis (glucose, lactate, pyruvate, glycerol and citrate), fluid balance (albumin and creatinine) or inflammation-related (glycoprotein acetylation) metabolic traits) as exposures and 460 Phecodes and 52 quantitative traits from the UK Biobank21 as outcomes. We defined two sets of instruments for the analyses that are referred to as full and strict instruments. As initial instruments we used the 334 lead variants (a single instrument SNP per each defined associated locus) associated with these traits (‘full instruments’). To avoid potential bias due to pleiotropy, we also selected a subset of 193 variants (‘strict instruments’) that had fewer than 5 associations across all 233 metabolic traits. Our threshold of 5 associations was based on empirical assessment of the distribution of per-variant trait associations. To investigate the sensitivity of the Mendelian randomization analyses to the choice of threshold, we also tested using fewer than 3 associations and fewer than 7 associations. We defined disease outcomes in UK Biobank using a curated list of major Phecodes available in the PheWAS R package93,94. To restrict our analysis to major disease outcomes, we discarded any sub-categories (that is, Phecodes with 4 or more characters) and removed outcomes with fewer than 100 events across up to 367,542 unrelated UK Biobank participants with European ancestry. The resulting 460 diseases were grouped into 15 broad domains: circulatory system, dermatologic, digestive, endocrine/metabolic, genitourinary, haematopoietic, infectious diseases, mental disorders, musculoskeletal, neoplasms, neurological, pregnancy complications, respiratory, sense organs and symptoms. We also analysed 52 quantitative traits available in UK Biobank, including blood pressure, lung function measures, blood cell traits and clinical chemistry biomarkers. In our replication analysis (acetone as the exposure and hypertension as the outcome), we used essential hypertension from the FinnGen study43 data freeze 7 as the outcome (hypertension essential, I9_HYPTENSESS; 70,651 cases, 223,663 controls). Cases were defined through hospital discharge registry, ICD10 code I10, ICD9 codes 4019X and 4039A, ICD8 codes 40199, 40299, 40399, 40499, 40209, 40100, 40291, 40191 and 40290.
We performed univariable Mendelian randomization using the inverse variance-weighted method for each instrument95. We also performed sensitivity analyses using Mendelian randomization–Egger regression to account for unmeasured pleiotropy96 and weighted median regression to assess robustness to invalid genetic instruments97. Our primary analyses were based on fixed-effect models, but as sensitivity analyses we used random-effect models to account for between-variant heterogeneity, which we quantified using the I-squared statistic. The Mendelian randomization analyses were performed using the MendelianRandomization package v. 0.5.198 or the TwoSampleMR package v. 0.5.399. Single-SNP Mendelian randomization estimates were based on the Wald ratio. We considered the fixed-effects inverse variance-weighted method as the main Mendelian randomization model but report the results of all models in Supplementary Table 15. To account for multiple testing, associations with P < 4.88 × 10−6 were considered significant (Bonferroni correction to account for testing of 20 metabolic traits with 512 outcomes).
FinnGen study
In the present study, we used GWAS summary statistics of 3,095 disease endpoints from FinnGen data freeze 7. Full description of the FinnGen study43 and data analysis steps is provided in the Supplementary Notes. FinnGen contributors are listed in Supplementary Table 18.
Statistics and reproducibility
The meta-analyses were conducted independently by two investigators in two different centres (University of Oulu, Finland and University of Cambridge, UK), and the summary statistics were compared to verify consistency of results.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Sarah Carter is a health and wellness expert residing in the UK. With a background in healthcare, she offers evidence-based advice on fitness, nutrition, and mental well-being, promoting healthier living for readers.