Using machine learning models for understanding the role of the non-coding genome in brain development and autism

Drs. Sara Mostafavi (Assistant Professor, Statistics; CIFAR AI Chair) and Daniel Goldowitz (Professor, Medical Genetics) have secured DSI postdoctoral matching funds for their project "Using machine learning models for understanding the role of the non-coding genome in brain development and autism". The goal of this project is to develop novel computational and statistical models to better understand epigenome perturbations and the role they play in neural development and in diseases like autism. Specifically, it will apply advances in Convolutional Neural Networks (CNNs) to epigenomic data and implement model interpretation strategies to derive biological insights from the learned models. A summary of the project follows.

Parallel advances in high-throughput sequencing and high-performance computing now allow us to produce a tremendous amount of genome-wide biological data at the genome, epigenome, and transcriptome levels, and at multiple cellular resolutions. By combining these data, we have an unprecedented opportunity to derive a mechanistic understanding of biological systems and identify causal factors that lead to human disease. Indeed, in theory, these data now allow us to infer the cellular function of non-coding regions across the human genome, which play a particularly important role in brain development and disease. However, to realize this opportunity, we need powerful computational and statistical methodologies for deriving novel biological insights from these high-throughput datasets.

The overall goal of the current project is to develop robust computational methodology that allows us to model the cellular impact of mutations (variation) in the non-coding genome, with the ultimate goal of identifying variations in the DNA sequence that underlie brain development and autism. We will achieve this objective as follows. We will use unique epigenome-wide data being generated by the Goldowitz lab at key developmental stages to train a Convolutional Neural Network (CNN). Specifically, we will train the CNN to predict, from DNA sequence alone, epigenomic features across the genome (in 200 bp intervals) that are specific to stages of brain development. In other words, given a 200 bp DNA sequence, the model will predict the epigenomic activity of that sequence at several brain developmental stages. We will then apply the trained model to DNA sequence regions associated with autism, to infer the impact of variation in these regions on epigenomic profiles across development.
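To make the sequence-to-activity idea concrete, the following is a minimal sketch in plain Python, not the project's actual model. It one-hot encodes a 200 bp sequence, passes it through a single convolutional layer with global max pooling, and produces one sigmoid output per developmental stage; it then scores a single-base variant by comparing predictions for the reference and altered sequences. The weights are random and untrained, and the filter count, kernel width, number of stages, and all function names are hypothetical choices made only for illustration.

```python
import math
import random

BASES = "ACGT"
WINDOW = 200  # interval length in bp, taken from the text


def one_hot(seq):
    """Encode a DNA string as rows of [A, C, G, T]; other symbols (e.g. N) become all zeros."""
    rows = []
    for base in seq.upper():
        row = [0.0, 0.0, 0.0, 0.0]
        idx = BASES.find(base)
        if idx >= 0:
            row[idx] = 1.0
        rows.append(row)
    return rows


def init_params(n_filters=8, kernel=11, n_stages=3, seed=0):
    """Random, untrained weights; every size here is a hypothetical choice."""
    rng = random.Random(seed)

    def w():
        return rng.uniform(-0.5, 0.5)

    return {
        "filters": [[[w() for _ in range(4)] for _ in range(kernel)]
                    for _ in range(n_filters)],
        "filter_bias": [0.0] * n_filters,
        "out_w": [[w() for _ in range(n_filters)] for _ in range(n_stages)],
        "out_b": [0.0] * n_stages,
    }


def predict(seq, params):
    """Convolution + ReLU, global max pooling, then one sigmoid output per stage."""
    x = one_hot(seq)
    kernel = len(params["filters"][0])
    pooled = []
    for filt, bias in zip(params["filters"], params["filter_bias"]):
        best = 0.0  # ReLU floor: pooled activations are never below zero
        for start in range(len(x) - kernel + 1):
            act = bias + sum(filt[j][c] * x[start + j][c]
                             for j in range(kernel) for c in range(4))
            best = max(best, act)
        pooled.append(best)
    outputs = []
    for weights, bias in zip(params["out_w"], params["out_b"]):
        z = bias + sum(wt * p for wt, p in zip(weights, pooled))
        outputs.append(1.0 / (1.0 + math.exp(-z)))
    return outputs


def variant_effect(ref_seq, pos, alt_base, params):
    """Change in each stage's prediction when the base at `pos` becomes `alt_base`."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    ref_pred = predict(ref_seq, params)
    alt_pred = predict(alt_seq, params)
    return [a - r for a, r in zip(alt_pred, ref_pred)]


if __name__ == "__main__":
    rng = random.Random(1)
    ref = "".join(rng.choice(BASES) for _ in range(WINDOW))
    params = init_params()
    alt = "A" if ref[100] != "A" else "C"
    print("per-stage predictions:", predict(ref, params))
    print("per-stage variant effect:", variant_effect(ref, 100, alt, params))
```

In a real setting the weights would be learned from the measured epigenomic profiles, and the per-stage differences returned by `variant_effect` would serve as the estimate of how a non-coding variant shifts predicted activity across development.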

This project will expand upon current work from Mostafavi’s lab, which builds similar types of CNNs for predicting the impact of non-coding regions across immune cells. However, adapting this approach to the brain requires addressing several non-trivial data-centric challenges. First, data from the brain are much sparser, with only a limited set of brain regions and cell types that can be measured at very early developmental stages. Addressing this challenge requires building robust imputation models that take advantage of the overall structure shared across the different types of epigenomic data that are measured, and handling this sparsity during the model-building phase. Second, most high-resolution data for brain development come from mouse. To utilize such data, we will have to carefully evaluate and compare cross-species prediction accuracy. Indeed, as a component of this work, we will investigate the utility of new domain adaptation models from machine learning in making full use of mouse data.
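One simple way to tolerate missing measurements during training is to compute the loss only over entries that were actually measured. This is shown here as an illustrative assumption, not as the project's stated method and not as a substitute for the imputation models described above. The sketch uses a masked binary cross-entropy in which `None` marks a stage or cell type with no measurement; the function name and the `None` convention are hypothetical.

```python
import math


def masked_bce(predictions, labels, eps=1e-7):
    """Mean binary cross-entropy over measured labels only.

    `labels` holds 0 or 1 for measured entries and None for missing ones;
    missing entries contribute nothing to the loss.
    """
    total, count = 0.0, 0
    for p, y in zip(predictions, labels):
        if y is None:
            continue  # skip unmeasured entries instead of guessing a value
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    # Three developmental stages, only the first and third measured.
    print(masked_bce([0.9, 0.2, 0.4], [1, None, 0]))
```

Because unmeasured entries are skipped rather than filled in, a sequence with only one measured stage can still contribute to training without a value being invented for the other stages.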

In summary, our work in this domain will enable us to overcome the informatics and computational challenges inherent to these data and, as a result, to better understand brain development. Furthermore, the methodology developed here can be applied in other contexts where multiple types of data are available for the analysis of a complex human trait or phenotype.