A platform for interactive, collaborative, and repeatable genomic analysis
Computer systems, both hardware and software, are currently an active barrier to the scientific investigation of genomic data. Answering even relatively simple questions requires assembling disparate software tools (for alignment, variant calling, and filtering) into an analytics pipeline, and then solving practical IT problems to get that pipeline to run stably and at scale. This project will take a whole-system approach to providing a framework for genomic analysis. Building on an existing botany-based analysis pipeline and exploiting emerging high-density “rack-scale” computer hardware, the project will refactor and extend existing genomic analysis software into a platform on which many traditionally long-running analytical tasks run fast enough to support interactive analysis. The platform will also facilitate sharing of datasets and analysis code across the research community, and will capture enough data and analysis provenance to encourage reproducibility of published results.