Modern healthcare systems generate vast amounts of high-dimensional clinical data (HDCD), such as spirogram measurements, photoplethysmograms (PPG), electrocardiogram (ECG) recordings, CT scans, and MRI scans, that cannot be summarized as a single binary or continuous number (cf. “has asthma” or “height in centimeters”). Understanding the connection between our genomes and HDCD not only improves our understanding of diseases but is also crucial to the development of disease treatments.
HDCD are stored in electronic health records and large biobank projects, such as UK Biobank in the United Kingdom, BioBank Japan in Japan, and All of Us in the United States. These projects obtain participant consent before de-identifying data and sharing a portion of this valuable resource with qualified scientists. The goal is to enhance the prevention, diagnosis, and treatment of various life-threatening illnesses.
The genomics team at Google Research has made progress in using HDCD to characterize diseases and biological traits such as optic nerve head morphology and chronic obstructive pulmonary disease (COPD). In an effort to better understand the genetic architecture of these particular traits, we previously performed genome-wide association studies (GWAS) on the trait predictions generated by supervised machine learning (ML) models. However, obtaining enough data with disease labels to train supervised ML models is not always possible. Furthermore, simple disease labels cannot fully capture the biology embedded in the underlying data, and we lack statistical methods to directly utilize HDCD in genetic analyses such as GWAS.
To overcome these limitations, in “Unsupervised representation learning on high-dimensional clinical data improves genomic discovery and prediction”, published in Nature Genetics, we introduce a principled method to study the underlying genetic contributors to the general organ functions that are reflected in the HDCD. REpresentation learning for Genetic discovery on Low-dimensional Embeddings (REGLE) is a computationally efficient method that requires no disease labels, and can incorporate information from expert-defined features (EDFs) when they are available.