Using machine learning models for understanding the role of the non-coding genome in brain development and autism
Sara Mostafavi, Daniel Goldowitz
Parallel advances in high-throughput sequencing and high-performance computing now allow us to produce a tremendous amount of genome-wide biological data at the genome, epigenome, and transcriptome levels, at multiple cellular resolutions. By combining these data, we have an unprecedented opportunity to derive a mechanistic understanding of biological systems and to identify causal factors that lead to human disease. To realize this opportunity, however, we need powerful computational and statistical methodologies for deriving novel biological insights from these high-throughput datasets. This project therefore seeks to develop robust computational methodology that allows us to model the cellular impact of mutations (variation) in the non-coding genome, with the ultimate goal of identifying variations in the DNA sequence that underlie brain development and autism. We will use unique epigenome-wide data being generated by the Goldowitz lab at key developmental stages to train a Convolutional Neural Network (CNN). Specifically, we will train the CNN to predict, from DNA sequence alone, epigenomic features that are specific to stages of brain development, across the genome in 200 bp intervals. In other words, given a 200 bp DNA sequence, the model will predict the epigenomic activity of that sequence at several brain developmental stages. We will then apply the trained model to DNA sequence regions associated with autism, comparing its predictions for reference and variant sequences to infer the impact of variation in these regions on epigenomic profiles across development.
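The prediction and variant-scoring steps described in the abstract can be illustrated with a minimal sketch. It is not the project's model: the abstract does not specify an architecture, so a single hand-set convolutional filter per stage (with global max pooling and a sigmoid) stands in for a trained CNN. The stage names `early` and `late`, the motifs, the biases, and all function names are hypothetical and chosen only for illustration.

```python
import math

BASES = "ACGT"

def one_hot(seq):
    """Encode DNA as rows of 4 values (A, C, G, T); unknown bases (e.g. N) become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def motif_filter(motif, bias):
    """Hypothetical hand-set filter: +1 for the motif base, -1 otherwise (not learned weights)."""
    kernel = [[1.0 if b == m else -1.0 for b in BASES] for m in motif]
    return kernel, bias

def conv_max_sigmoid(x, kernel, bias):
    """Slide one convolutional filter along the sequence, max-pool, then apply a sigmoid."""
    width = len(kernel)
    best = -math.inf
    for start in range(len(x) - width + 1):
        score = sum(x[start + i][j] * kernel[i][j]
                    for i in range(width) for j in range(4))
        best = max(best, score)
    return 1.0 / (1.0 + math.exp(-(best + bias)))

def predict(seq, filters):
    """Return one predicted activity in (0, 1) per developmental stage."""
    x = one_hot(seq)
    return {stage: conv_max_sigmoid(x, k, b) for stage, (k, b) in filters.items()}

def variant_effect(ref_seq, pos, alt_base, filters):
    """Predicted activity of the variant sequence minus that of the reference, per stage."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    ref_pred = predict(ref_seq, filters)
    alt_pred = predict(alt_seq, filters)
    return {stage: alt_pred[stage] - ref_pred[stage] for stage in filters}

# Assumed toy setup: two stages, each represented by a single motif detector.
filters = {
    "early": motif_filter("TGAC", bias=-3.0),
    "late": motif_filter("CCGG", bias=-3.0),
}

# A 200 bp reference sequence with a TGAC site in the middle (positions 98-101).
ref_seq = "A" * 98 + "TGAC" + "A" * 98

# Score a single-base substitution G>T at 0-based position 99, which disrupts the site.
delta = variant_effect(ref_seq, pos=99, alt_base="T", filters=filters)
for stage, change in delta.items():
    print(f"{stage}: change in predicted activity = {change:+.3f}")
```

In this toy setup the substitution lowers the predicted `early` activity and leaves `late` unchanged, which is the kind of stage-specific difference the variant analysis is meant to surface. A trained model would replace the hand-set filters with learned weights and typically many filters and layers, but the reference-versus-variant comparison would have the same shape.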