Research Projects

2023 Projects:

Technology pipeline for the development of Machine-Learned Interatomic Potentials

Cellular communication and decision making rely primarily on proteins and their ability to catalyze chemical reactions and to interact with each other in a specific and tunable manner. Although immense technological innovations in biophysics and computer science over the last six decades have provided unprecedented insight into protein structure and function, the "dark matter" of the protein universe remains underexplored and undercharacterized: intrinsically disordered proteins (IDPs). IDPs are central integrators and decision makers in cellular systems because their lack of a single structure lets them act like Swiss army knives, with multiple functional "tools" that can be deployed as most appropriate for cellular challenges. However, the fact that they are dynamic, interconverting structural ensembles has hampered their characterization. Even molecular dynamics simulations fail to accurately capture IDP ensembles, because the available empirical force fields have been developed and optimized to reproduce the dynamics of folded proteins. The recent combination of machine learning and data science technology has led to the development of machine-learned interatomic potentials (MLIPs). MLIPs provide a realistic route to revolutionize the field of IDP characterization by enabling accurate, large-scale atomistic simulations and modeling. This project aims to establish and benchmark a technology pipeline for the development of MLIPs for IDPs, thereby establishing a pathway to close a major methodology gap that currently prevents significant progress in many areas of biochemistry and biomedicine.

Awarded to: Joerg Gsponer

Postdoctoral fellow: TBA

Innovative deep-learning based program for cervical cancer screening

Screening using cytological images plays a critical role in the early detection and treatment of cervical cancer. Current solutions often lack accuracy or necessitate extensive expert annotations, and a universal tool adaptable to multiple institutions remains elusive due to the diversity of imaging techniques and the complexity of the disease. This project aims to develop an end-to-end automatic deep learning-based cervical cancer screening pipeline that requires less labeling and addresses challenges in multi-institutional learning. The project team will investigate label-efficient cell segmentation, feature learning, and cytological image classification solutions using state-of-the-art self-supervised learning strategies and transfer learning from foundation models. Furthermore, the team will explore federated learning methods to facilitate learning and deployment, ensuring the adaptability of the proposed tool to various institutions.
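
To illustrate the federated learning component at a conceptual level, the sketch below runs one round of federated averaging on a toy linear model: each institution refines the shared weights on its own data, and only the weights (never the images) are pooled. The model, data, and round structure are illustrative stand-ins, not the project's actual segmentation or classification pipeline.

    import numpy as np

    def fedavg_round(global_w, client_data, lr=0.1, local_steps=5):
        """One round of federated averaging for a toy linear model."""
        updates, sizes = [], []
        for X, y in client_data:
            w = global_w.copy()
            for _ in range(local_steps):
                w -= lr * X.T @ (X @ w - y) / len(y)   # local gradient step
            updates.append(w)
            sizes.append(len(y))
        return np.average(updates, axis=0, weights=sizes)  # size-weighted merge

    rng = np.random.default_rng(0)
    institutions = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(4)]
    w = np.zeros(3)
    for _ in range(10):
        w = fedavg_round(w, institutions)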

Awarded to: Xiaoxiao Li and Gang Wang

Postdoctoral fellow: SangMook Kim

Anytime-Valid PAC-Bayes for Industrial Applications

Companies often need to track a variety of key business and performance metrics and take action if they change significantly. It is also important, though, to avoid unnecessary action in response to purely random fluctuations; the standard approach to identifying which changes are "real" is statistical hypothesis testing. Traditional methods, however, are designed for "looking" only once; checking continuously breaks their guarantees on false positive rates. A growing number of companies – including Adobe, Microsoft, Amazon, and Netflix – have now adopted safe "anytime-valid" inference tools, which remain valid when tested continuously. Currently, the anytime-valid tools deployed by these companies allow experimenters to, for instance, continuously monitor simple low-dimensional linear models that are amenable to well-understood statistical tests. But companies are becoming increasingly interested in going beyond these simple models and incorporating recent breakthroughs in deep learning and AI into their products and analyses (e.g., object recognition in autonomous cars). There is thus a growing need for an anytime-valid metric of generalization performance, which can be used as a decision criterion for determining when to retrain and redeploy sophisticated, non-linear, high-dimensional models. In collaboration with researchers at Carnegie Mellon University, this project aims to extend anytime-valid inference theory into the PAC-Bayes setting, with a focus on developing novel theory for concrete applications to be used in an industrial setting.
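
As a minimal illustration of the "anytime-valid" idea (not the PAC-Bayes theory the project will develop), the sketch below monitors a stream of binary outcomes with a likelihood-ratio test martingale; by Ville's inequality, stopping whenever the wealth exceeds 1/alpha keeps the false positive rate below alpha no matter how often the metric is checked. The null rate, alternative rate, and data stream are arbitrary stand-ins.

    import numpy as np

    def anytime_valid_test(xs, p0=0.5, p1=0.6, alpha=0.05):
        """Sequentially test H0: p = p0 against a fixed alternative p1
        with a likelihood-ratio test martingale; rejecting when the
        wealth exceeds 1/alpha is valid under continuous monitoring."""
        wealth = 1.0
        for t, x in enumerate(xs, start=1):
            # multiply in the likelihood ratio of the new observation
            wealth *= (p1 if x else 1 - p1) / (p0 if x else 1 - p0)
            if wealth >= 1 / alpha:
                return t          # safe to stop and flag a real change
        return None               # no evidence against H0 so far

    rng = np.random.default_rng(0)
    print(anytime_valid_test(rng.random(5000) < 0.57))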

Awarded to: Trevor Campbell, Danica Sutherland

 

2022 Projects:

Automating machine learning of interatomic potentials for green technologies

Machine learning approaches for the construction of interatomic potentials were introduced in 2007 and 2010. Despite significant evolution in methodology, the development of robust machine-learned potential (MLP) models remains an intricate and time-consuming process reserved for a small group of domain experts. As a result, their application remains largely limited to systems already treated by developers. In principle, a key advantage of MLPs over traditional empirical potentials is the possibility of rapidly and automatically generating custom, reactive interatomic potentials. These can achieve high accuracy in specific use cases that would be challenging for empirical models; examples include catalysis by established and emerging complex materials such as high-entropy oxides, and complex surface interactions such as those in coatings or in active materials for energy applications in batteries and fuel cells. For industrial applications, rapid turnaround between project conception and initial simulation results is necessary for the potential of MLPs to be realized. This project aims to develop and standardize methodology that allows for the rapid generation of new MLPs for interfacial systems such as those outlined above, thereby bridging the gap between emergent academic methods and practical applications.

Awarded to: Christoph Ortner, Joerg Rottler, Chad Sinclair

Postdoctoral fellow: Teemu Jarvinen

 

 

2021 Projects:

Quantifying the cascade effects of mining on terrestrial and aquatic ecosystems in the North American context

Human-made land developments can contribute to multi-dimensional environmental disturbances (e.g., wildfires and post-fire cascades of flood and drought cycles). The research team has been collaborating on research that combines hydrology and statistical science to investigate the hydrological causes and consequences of environmental disturbances, under a changing climate and land development, in key study regions within the North American context. This project will extend existing work by focusing on the influence of mining on hydrological variability and ultimately on multi-dimensional environmental disturbances. Additionally, it will contribute to ongoing research to investigate how future projections of climate change variability will influence hydrological cycles within key mining regions within the Canadian context.

Awarded to: Nadja Kunz, Ali Ameli, Will Welch, Jiguo Cao

Postdoctoral fellow: Asad Haris

Blessings and curses of overparameterized learning: Optimization and generalization principles

Deep neural networks are often exceedingly overparameterized and are typically trained without explicit regularization. What are the principles behind their state-of-the-art performance? How do different optimization algorithms and learning schedules affect generalization? Does overparameterization speed up convergence? What is the role of the loss function in training dynamics and accuracy? How do the answers to these questions depend on the data distribution (e.g., data imbalances)? This project aims to shed light on these questions by adopting a joint optimization and statistical viewpoint. Despite admirable progress over the past couple of years, existing theories are mostly limited to simplified data and architecture models (e.g., binary classification, linear or linearized architectures, random feature models, etc.). The goal of the project is to extend the theory to more realistic architectures (e.g., the tangent-kernel regime, shallow and deep neural networks) and data models (e.g., multiclass, imbalanced).

Awarded to: Mark Schmidt and Christos Thrampoulidis

Postdoctoral fellow: Liam Madden

Optimal placement of low-cost air quality sensors in Metro Vancouver to better predict air quality exposure

Air quality in cities varies both spatially and temporally. A limited number of existing fixed-site air quality monitoring stations fail to capture this variability. A dense network of low-cost air quality sensors can be used to create pollution maps, identify hotspots, drive public policy decision-making, and improve health outcomes attributable to air pollution. However, it is not possible to install and operate sensors in every neighborhood due to limited resources. It is also known that there is a risk of redundant information if sensors are not optimally placed. As part of this project, we will build an optimization model to determine the minimum number of sensors required to support air quality related decision-making while maximizing coverage area.
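
To give a hypothetical flavour of the optimization involved, the greedy maximum-coverage heuristic below picks, at each step, the candidate site that covers the most still-uncovered neighbourhood cells. The real model will also need to balance cost, pollutant variability, and redundancy, so this is only a conceptual sketch with made-up coverage sets.

    def greedy_sensor_placement(coverage, max_sensors):
        """coverage[i] is the set of cells covered by candidate site i;
        repeatedly pick the site covering the most uncovered cells."""
        covered, chosen = set(), []
        for _ in range(max_sensors):
            best = max(range(len(coverage)), key=lambda i: len(coverage[i] - covered))
            gain = coverage[best] - covered
            if not gain:
                break          # remaining sites would be redundant
            chosen.append(best)
            covered |= gain
        return chosen, covered

    # toy example: 4 candidate sites covering cells 0-5
    sites = [{0, 1, 2}, {2, 3}, {3, 4, 5}, {0, 5}]
    print(greedy_sensor_placement(sites, max_sensors=2))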

Awarded to: Drs. Amanda Giang & Naomi Zimmerman

Postdoctoral fellow: Surya Dhulipala

Robust, transferable and interpretable natural language processing of psychiatric clinical notes

Suicide is the second-leading cause of death in adolescents in Canada. The Child and Adolescent Psychiatric Emergency (CAPE) unit at BC Children's Hospital (BCCH) provides stabilization and emergency intervention for youth in psychiatric crisis across BC and the Yukon. Approximately 45% of patients are admitted annually to CAPE for suicidality. Patients admitted to CAPE receive extensive diagnostic evaluation and assessment at admission and discharge, mostly recorded as free-text clinical notes which are difficult to incorporate in large-scale analyses. The primary aim of this project is to investigate whether NLP approaches can be successfully applied to CAPE clinical notes to extract clinically-relevant information and identify suicidality. Both traditional and neural approaches will be studied with a focus on interpretability and transferability of the resulting models.

Awarded to: Drs. Elodie Portales-Casamar & Giuseppe Carenini

Postdoctoral fellow: Ahmed Abura'ed

2020 Projects:

A deep learning approach to analyzing retinal imaging for medical diagnosis and prediction

Deep learning (DL) techniques have seen tremendous interest in medical imaging, particularly via the use of convolutional neural networks (CNNs) for the development of automated diagnostic tools. Because it can be acquired non-invasively, retinal imaging of various types is particularly amenable to such automated approaches. Moreover, recent work has shown that traits not readily apparent to human graders in fundus images, such as patient age, sex, smoking status, and history of cardiovascular disease, can be extracted from these images using CNNs. This typically relies on access to massive datasets for training and validation, composed of hundreds of thousands of images. However, data residency and data privacy restrictions stymie the applicability of this approach in medical settings where patient confidentiality is a mandate. Therefore, the aims of this research project are two-fold: (1) "Go beyond eye diseases", i.e., use DL techniques to diagnose and predict a wide variety of medical disorders from retinal fundus images; (2) "Use small data", i.e., utilize generalization principles to achieve (1) with relatively modest-sized datasets. For example, preliminary work by the project team has demonstrated the ability of their DL model to extract "invisible" traits from fundus images, such as the sex of the patient, from relatively small datasets (e.g., >3600 images). The next step is to extend this work to a wide variety of clinically relevant classification questions, such as diagnosis and prediction of neurological diseases (e.g., Alzheimer's disease), cardiogenic diseases (e.g., stroke), and malignancies (e.g., cancers of various types).
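
"Use small data" typically means reusing features learned elsewhere. The sketch below fine-tunes only the classification head of an ImageNet-pretrained ResNet-18 on a stand-in batch of fundus-sized images; the backbone choice, the target label (patient sex), and the random data are illustrative assumptions, not the team's actual model.

    import torch
    import torch.nn as nn
    from torchvision import models

    # Freeze a pretrained backbone and retrain only a small head, the usual
    # way to get usable accuracy from a few thousand labelled images.
    backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    for p in backbone.parameters():
        p.requires_grad = False
    backbone.fc = nn.Linear(backbone.fc.in_features, 2)   # e.g. patient sex

    optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    images = torch.randn(8, 3, 224, 224)      # stand-in for a fundus image batch
    labels = torch.randint(0, 2, (8,))
    loss = criterion(backbone(images), labels)
    loss.backward()
    optimizer.step()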

Awarded to: Ipek Oruc & Ozgur Yilmaz

DSI Postdoctoral Fellow: Gülcenur Özturan

Personalized risk assessment in pediatric kidney transplantation using metabolomics data

Kidney transplantation is the most effective treatment for end-stage kidney failure and improves both survival and quality of life. It is not, however, a cure, and most young people will experience complications that precipitate allograft failure. At present, children are all treated with a standard protocol for immune suppression, which ignores the wide heterogeneity in both immune responses and susceptibility to complications. As a result, some children suffer complications from excessive immune suppression, whereas others may suffer rejection from insufficient immunosuppression. We aim to study how the metabolic state of the kidney recipient affects the evolution of the immune response to the allograft after transplant. Our goal is to identify a metabolomic signature using pre-transplant serum samples and machine learning techniques, supporting a precision-medicine approach in which immunosuppressive treatment is tailored to the alloimmune risk characteristics of each patient. Providing a personalized risk assessment would permit tailoring of treatment to optimize management of immunosuppression and avoid complications related to unnecessary treatment.

Awarded to: Gaby Cohen-Freue & Tom Blydt-Hansen

DSI Postdoctoral Fellow: Arafeh Bigdeli

The use of deep learning methods for virtual screening of COVID-19 therapeutic candidates

The COVID-19 pandemic caused by a recently emerged novel coronavirus (SARS-CoV-2) has dramatically reshaped society. There is as yet no approved therapeutic or effective treatment available to combat this disease, nor is a vaccine available. This project aims to respond to this health crisis by using deep learning techniques to rapidly identify therapeutic targets and accelerate the development of inhibitors of the SARS-CoV-2 virus. The overall goal is to develop robust virtual screening protocols that rely on deep docking algorithms to create custom scoring functions for ultra-large virtual screening against the SARS-CoV-2 3CL Main Protease, among other emerging viral targets. The developed scoring functions will be used in combination with the recently developed Deep Docking protocol, which is capable of processing billions of molecular structures against biological targets of interest. The most attractive features of deep models are that they favour very large and correlated inputs, allow simultaneous optimization of multiple dependent variables, and do not rely on strict feature selection (characteristics that are all typical of Big Data resources such as the Enamine REAL Space database currently used in Cherkasov’s lab, which consists of 13 billion molecular structures).
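
The general surrogate-scoring idea behind ultra-large virtual screening can be sketched as follows: dock only a small sample explicitly, train a model to predict docking scores from molecular fingerprints, and use that model to discard most of the library cheaply. The fingerprints, scores, and regressor below are random stand-ins, not the Deep Docking implementation itself.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(1)
    library = rng.integers(0, 2, size=(20_000, 256))     # stand-in fingerprints
    sampled = rng.choice(len(library), size=2_000, replace=False)
    scores = rng.normal(size=2_000)                      # stand-in docking scores

    # Train a cheap surrogate on the explicitly docked sample ...
    surrogate = MLPRegressor(hidden_layer_sizes=(128,), max_iter=50).fit(
        library[sampled], scores)

    # ... then rank the whole library and keep only the predicted best hits
    # for real (expensive) docking.
    predicted = surrogate.predict(library)
    shortlist = np.argsort(predicted)[:1_000]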

Awarded to: Artem Cherkasov & Faraz Hach

Visual analytics support for the HEiDi virtual physician COVID-19 deployment

In the current extreme situation of the COVID-19 pandemic, health systems face unprecedented medical and social challenges that data science can help address. The HealthLink BC Emergency iDoctor-in-assistance (HEiDi) project is one example of addressing this challenge. Specifically, HEiDi is augmenting the 811 service, which delivers health care guidance to the public through telephone access to nursing advice, by integrating virtual physicians (VPs) into the triage process to help balance the enormous increase in load due to this crisis. To optimize the delivery of this program, Drs. Ho and Munzner will apply visual analytics to better understand the multi-dimensional data being generated and collected by HEiDi. Their goals are to help health system experts observe and improve the clinical pathway of the 811 service, to foster the effectiveness and efficiency of VPs, and to improve the patient experience. This project will also develop an initial understanding of the stakeholders and ecosystem, providing a basis for a systematic, methodology-driven, multi-year data science research project. In conclusion, data science support for the 811 ecosystem will both address today’s urgent problems and serve to develop methods and tools that can be used in future extreme situations.

Awarded to: Kendall Ho & Tamara Munzner

DSI Postdoctoral Fellow: Juergen Bernard

2019 Projects:

Using machine learning models for understanding the role of the non-coding genome in brain development and autism

Parallel advances in high-throughput sequencing and high-performance computing now allow us to produce a tremendous amount of genome-wide biological data at the genome, epigenome, and transcriptome levels and at multiple cellular resolutions. By combining these data, we have an unprecedented opportunity to derive a mechanistic understanding of biological systems and identify causal factors that lead to human disease. However, to realize this opportunity, we need powerful computational and statistical methodologies for deriving novel biological insights from these high-throughput datasets. This project therefore seeks to develop robust computational methodology that allows us to model the cellular impact of mutations (variation) in the non-coding genome, with the ultimate goal of identifying variations in the DNA sequence that underlie brain development and autism. We will use unique epigenome-wide data being generated by the Goldowitz lab at key developmental stages to train a convolutional neural network (CNN). Specifically, we will train the CNN model to predict epigenomic features that show brain-development stage-specificity across the genome (in 200bp intervals) from DNA sequence alone. In other words, given a 200bp DNA sequence, the model will predict the epigenomic activity of that sequence at several brain developmental stages. We will then apply this trained model to DNA sequence regions associated with autism to infer the impact of variation in these regions on epigenomic profiles across development.
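
Schematically, the model maps a one-hot encoded 200 bp window to a vector of stage-specific epigenomic activities. The toy PyTorch sketch below fixes the shapes; the number of epigenomic marks, the number of developmental stages, and the layer sizes are placeholder assumptions.

    import torch
    import torch.nn as nn

    class EpigenomeCNN(nn.Module):
        """Toy 1-D CNN: one-hot 200 bp sequence (4 x 200) in, predicted
        activity for n_marks epigenomic features at n_stages stages out."""
        def __init__(self, n_marks=8, n_stages=4):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv1d(4, 64, kernel_size=11, padding=5), nn.ReLU(),
                nn.MaxPool1d(4),
                nn.Conv1d(64, 128, kernel_size=7, padding=3), nn.ReLU(),
                nn.AdaptiveMaxPool1d(1),
            )
            self.head = nn.Linear(128, n_marks * n_stages)

        def forward(self, x):              # x: (batch, 4, 200)
            z = self.conv(x).squeeze(-1)   # (batch, 128)
            return torch.sigmoid(self.head(z))

    model = EpigenomeCNN()
    one_hot = torch.zeros(1, 4, 200).scatter_(1, torch.randint(0, 4, (1, 1, 200)), 1.0)
    print(model(one_hot).shape)            # torch.Size([1, 32])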

Awarded to: Sara Mostafavi & Dan Goldowitz

DSI Postdoctoral Fellow: Chendi Wang

Quantifying individual differences from complex datasets in developmental psychology

Developmental psychology fundamentally relies on the robust measurement of individual differences, capturing variables at the level of the individual participant, in order to formulate hypotheses about what children know and how their psychology changes over time, and to characterize the best predictors of their long-term educational, health, and psychological outcomes. But newborns, infants, and young children are notoriously difficult experiment participants, leading to two major challenges in data analysis: (1) developmental data can be complex and multivariate but are simultaneously limited by the need for short testing sessions that collect sparse data (e.g., neural recordings on the scalp for durations of 15 minutes or less); and (2) data are often censored or missing, as many children may stop paying attention to the task, may need to be fed, might create too many motion artifacts, etc. The goal of this project is to capitalize on hundreds of existing data points from the UBC Department of Psychology's developmental psychology labs, including eye-tracking and neuroimaging data from infants and toddlers, to leverage existing solutions in machine learning and statistical survival analysis and to develop novel analytical pipelines for measuring individual differences in these heterogeneous datasets.
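
For the censoring problem, a natural starting point is a survival estimate that treats an interrupted session as censored rather than simply missing. The minimal Kaplan-Meier sketch below uses hypothetical looking-time data; the project's pipelines would of course go well beyond this.

    import numpy as np

    def kaplan_meier(durations, observed):
        """Kaplan-Meier survival estimate for censored data, e.g. how long
        infants keep attending before disengaging (observed=0 means the
        session ended before the infant disengaged)."""
        order = np.argsort(durations)
        durations = np.asarray(durations)[order]
        observed = np.asarray(observed)[order]
        at_risk, survival, curve = len(durations), 1.0, []
        for t, d in zip(durations, observed):
            if d:                               # an observed disengagement
                survival *= 1 - 1 / at_risk
            at_risk -= 1
            curve.append((t, survival))
        return curve

    print(kaplan_meier([3, 5, 6, 8, 12], [1, 1, 0, 1, 0]))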

Awarded to: Darko Odic, Cristina Conati, Lang Wu

DSI Postdoctoral Fellow: Cory Bonn

2018 Projects:

Automated diagnosis and prognostication of severity in COPD via deep learning frameworks using multi-modal data

Chronic Obstructive Pulmonary Disease (COPD) is a progressive, debilitating, chronic respiratory disease. It is currently the 4th leading cause of mortality, responsible for 100,000 hospitalizations and 10,000 deaths annually in Canada and 3 million deaths worldwide. Although our understanding of COPD pathogenesis has improved substantially over the past 20 years, there is a notable lack of treatments that can modify disease progression and reduce mortality. Furthermore, current methods to clinically diagnose COPD are non-specific and insufficient to advance knowledge. This project will build on recent successes in applying advanced machine learning (ML) techniques to automated analysis of medical scans, across various medical fields, to improve COPD diagnosis and prognostication. Specifically, this project will implement and test new frameworks based on deep learning (DL) to automate staging of COPD disease severity and to predict disease progression using multi-modal and/or heterogeneous data (e.g., non-imaging-based and imaging-based data). The outcome of this project will be new machine learning tools that better support clinicians treating COPD patients.

Awarded to: Leonid Sigal and Roger Tam

DSI Postdoctoral Fellow: Lisa Tang

Large-scale Bayesian modelling of drug resistance and evolution in human cancers at single-cell resolution

Recent advances in next-generation sequencing (NGS) technologies have made it possible to measure gene expression and DNA mutations across thousands of cells in cancer tumors at the single-cell level. This allows us to quantify the effect of chemotherapeutic drugs on the way tumors mutate and to answer questions about why particular groups of cells (known as clones) evade treatment and cause relapse. However, the vast quantities of data produced by such measurements, combined with the low signal-to-noise ratio, make analysis and interpretation particularly difficult. This project aims to develop a suite of state-of-the-art Bayesian methods (e.g., sequential Monte Carlo (SMC) and black-box variational inference) for learning from single-cell cancer genomics data, with a focus on scalable inference, to help address these challenges. Development of these tools will enable precision medicine by equipping clinicians with the ability to better predict which treatment(s) will work best, and to adjust appropriately, for each individual cancer patient.
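
As a reference point for the sequential Monte Carlo component, the sketch below is a bootstrap particle filter for a one-dimensional random-walk state observed in Gaussian noise: propagate, reweight by the likelihood of the new observation, resample. The model is a toy stand-in; the project's models of clonal dynamics, and its variational methods, are far richer.

    import numpy as np

    def bootstrap_particle_filter(obs, n_particles=500, sigma_x=0.5, sigma_y=1.0):
        """Minimal SMC (bootstrap particle filter) for a random-walk latent
        state observed with Gaussian noise."""
        rng = np.random.default_rng(0)
        particles = rng.normal(0.0, 1.0, n_particles)
        means = []
        for y in obs:
            particles = particles + rng.normal(0.0, sigma_x, n_particles)  # propagate
            log_w = -0.5 * ((y - particles) / sigma_y) ** 2                # reweight
            w = np.exp(log_w - log_w.max())
            w /= w.sum()
            idx = rng.choice(n_particles, size=n_particles, p=w)           # resample
            particles = particles[idx]
            means.append(particles.mean())
        return np.array(means)

    print(bootstrap_particle_filter(np.linspace(0, 3, 20))[-1])  # tracks the drift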

Awarded to: Alex Bouchard-Côté and Sohrab Shah

DSI Postdoctoral Fellow: Kieran Campbell

Leveraging more accurate and flexible discourse structures in question-answering and summarization

Existing systems for critical NLP tasks like question answering and summarization are still unable to accurately uncover and effectively leverage the discourse structure of text, i.e., how clauses and sentences are related to each other in a document. This is a serious limitation, because the relationships between clauses and sentences carry important information that allows the text to express a meaning as a whole, beyond the sum of its parts. The goal of discourse parsing is to automatically determine the coherence structure of text. In essence, a discourse parser takes a document as input and returns its discourse structure, or tree, showing how clauses and sentences are related to each other via various discourse relations. In this project, Dr. Carenini's team seeks to improve discourse parsing performance and to apply discourse parsing outputs to improve the performance of other NLP tasks, with a specific focus on state-of-the-art approaches to question-answering systems and text summarization.
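
Concretely, a discourse parser's output can be pictured as a binary tree whose leaves are clauses (elementary discourse units) and whose internal nodes carry relations. The sketch below is only an illustration; the relation labels and the example sentence are invented for clarity.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class DiscourseNode:
        """A node in an RST-style discourse tree: leaves hold clauses;
        internal nodes hold a relation between their two children."""
        text: Optional[str] = None          # set for leaves only
        relation: Optional[str] = None      # set for internal nodes only
        left: Optional["DiscourseNode"] = None
        right: Optional["DiscourseNode"] = None

    # "It was raining, so the match was cancelled, which upset the fans."
    tree = DiscourseNode(
        relation="Cause",
        left=DiscourseNode(text="It was raining,"),
        right=DiscourseNode(
            relation="Elaboration",
            left=DiscourseNode(text="so the match was cancelled,"),
            right=DiscourseNode(text="which upset the fans."),
        ),
    )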

Awarded to: Giuseppe Carenini

Trainees: Patrick Huber (PhD candidate), Wen Xiao (MSc candidate)

This project is sponsored by the DSI-Huawei Research Program

Computer vision and machine learning techniques for video and facial understanding

In this project, Drs. Sigal and Schmidt are pursuing a number of research goals at the intersection of computer vision and machine learning. In part one, the team will advance automatic video summarization by exploring richer joint video-linguistic and graph-structured representations to facilitate video retrieval, summarization, and, potentially, action recognition. In part two, the team will develop generative models that can effectively "imagine" what images of faces or objects would look like in a canonical configuration (e.g., a frontal face image in face recognition) or, more broadly, in any view or unobstructed configuration. In the final part of the project, the team aims to develop much faster methods for the deep neural networks underlying computer vision systems (such as those applied in parts one and two), both for tuning the "parameters" of deep neural networks and for tuning their "hyper-parameters", which include the structure of the network and other design choices, by developing automated techniques. The outcomes of this project will be significant improvements to computer vision performance and runtime, with applications in surveillance.
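
For the hyper-parameter part, the simplest automated baseline is random search: sample configurations, train, and keep the best. The sketch below uses a toy model and a made-up search space, and only illustrates the loop the project aims to make far more efficient.

    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=400, n_features=20, random_state=0)
    rng = np.random.default_rng(0)

    best = None
    for _ in range(10):                      # 10 random configurations
        cfg = {"hidden_layer_sizes": (int(rng.integers(16, 128)),),
               "alpha": float(10 ** rng.uniform(-5, -2)),
               "learning_rate_init": float(10 ** rng.uniform(-4, -1))}
        score = cross_val_score(MLPClassifier(max_iter=200, **cfg), X, y, cv=3).mean()
        if best is None or score > best[0]:
            best = (score, cfg)
    print(best)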

Awarded to: Mark Schmidt & Leonid Sigal

Trainees: Mohit Bajaj (MSc candidate), Polina Zablotskaia (MSc candidate)

This project is sponsored by the DSI-Huawei Research Program

Knowledge Graphs – Mining, Cleaning and Maintenance

The extraction of knowledge from information sources ranging from unstructured and semi-structured to structured has gained significant interest in both academia and industry, fueled by applications such as question answering and computational fact checking. Knowledge graphs (KGs) have lately emerged as a de facto standard for knowledge representation, whereby knowledge is expressed as a collection of "facts" represented as (subject, predicate, object) triples, where the subject and object are entities and the predicate is a relation between those entities. This collection can be conveniently stored, queried, and maintained as a graph, with entities modeled as vertices and relations as links or edges. In this project, Dr. Lakshmanan and his team will mine a large KG from information sources, with an emphasis on publicly available documents, including structured sources such as tables. They will also develop techniques for cleaning the KG and maintaining it against updates. Finally, they will exploit the resulting KG in question answering and computational fact checking applications, both of which will leverage the pattern search capabilities of a knowledge graph.
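
A minimal picture of the data model: facts are stored as (subject, predicate, object) triples and queried by following labelled edges. The tiny in-memory store below is illustrative only; a production KG would live in a graph database with indexing, cleaning, and update handling.

    from collections import defaultdict

    class TripleStore:
        """Minimal in-memory knowledge graph: facts are (subject, predicate,
        object) triples; entities are vertices, predicates are edge labels."""
        def __init__(self):
            self.by_subject = defaultdict(list)

        def add(self, subj, pred, obj):
            self.by_subject[subj].append((pred, obj))

        def query(self, subj, pred=None):
            """Objects linked to subj, optionally restricted to one predicate."""
            return [o for p, o in self.by_subject[subj] if pred is None or p == pred]

    kg = TripleStore()
    kg.add("Vancouver", "located_in", "British Columbia")
    kg.add("British Columbia", "part_of", "Canada")
    print(kg.query("Vancouver", "located_in"))   # ['British Columbia']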

Awarded to: Laks Lakshmanan

Trainees: Michael Simpson (PDF), Sarah Habashi (MSc candidate)

This project is sponsored by the DSI-Huawei Research Program

Leveraging eye-tracking data to improve reliable detection of Alzheimer’s Disease and related patient states

Reliable detection of Alzheimer’s Disease (AD) in its early stages continues to be a challenge. This project, led by Drs. Conati and Field, aims to investigate the value of eye-tracking data as one source of information for building machine learning detectors of AD. In addition, the team will investigate eye-tracking-based detectors of AD-related states such as confusion and distress during naturally occurring tasks. The project aims to 1) validate the concept of using spontaneous speech and eye-tracking data as clinical markers for early detection of AD through the use of machine learning algorithms; and 2) investigate ways to increase detection accuracy by exploring both alternative machine learning settings and alternative diagnostic tasks used for data collection. Eventually, the goal is to develop software that can detect states of confusion or distress in AD patients during day-to-day activities (e.g., reading an article) and automatically trigger interventions aimed at reducing discomfort and stress.

Awarded to: Cristina Conati and Thalia Field

DSI Postdoctoral Fellow: Oswald Barral

This project is sponsored by PHIX and VCHRI.

Using contact networks, administrative data, and linked genomic data to understand tuberculosis transmission in BC

Tuberculosis (TB) is still a problem in British Columbia, with approximately 250 cases diagnosed each year. In order to meet the WHO’s goal of achieving TB pre-elimination by 2030, TB rates in BC need to decline at a faster rate, and a change in how we manage TB prevention and care in the province is needed. Fortunately, all TB-related laboratory, epidemiology, clinical, and public health activities are centralized at the BC Centre for Disease Control (BCCDC). This provides a unique opportunity to harness these data to understand TB transmission in BC, which can in turn inform public health policy and action. This project, led by Drs. Jennifer Gardy and Matías Salibián-Barrera, seeks to develop and implement a predictive analytics platform within the TB Services Program at the BCCDC. Specifically, it will explore whether features such as i) contact or transmission network properties (static or over time), ii) clinical, epidemiological, and demographic attributes of early cases within a network, and/or iii) genomic data can be used to predict whether a newly diagnosed case is likely to lead to a sustained outbreak. In addition, the team will explore whether patterns of patient interaction with the healthcare system can be used to infer potentially undiagnosed TB infections.

Awarded to: Jennifer Gardy and Matías Salibián-Barrera

DSI Postdoctoral Fellow: Ben Sobkowiak

2017 Projects:

Data science over graphs, streams, and sequences: From the analysis of fake news to prediction and intervention

Fake news and misinformation have been a serious problem for a long time, and the advent of social media has made the problem more acute, as illustrated by the 2016 U.S. Presidential election. Social networks and media have started playing a fundamental role in the lives of most people: they influence the way we receive, assimilate, and share information with others. As such, the online lives of social media users leave behind a trail of data that can be harnessed to drive applications that either did not exist or could not be launched effectively before. This project will develop a multi-pronged approach to detecting fake news. It will develop text mining techniques to analyze tweets and explore the possible use of new statistical tools for the analysis of social network and media data. In particular, Twitter data will be used to learn user intents and to analyze the language for discriminative features of fake news by comparing it with genuine news. This project aims to devise strategies to combat and contain the propagation of fake news, reducing its consequences and harm to the effective functioning of a civil society.
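
A toy illustration of finding discriminative language features: the sketch below fits a TF-IDF plus logistic regression pipeline on a tiny made-up corpus and inspects the highest-weighted terms. The corpus, labels, and model are placeholders; real work would use large labelled tweet collections and far richer features.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    tweets = ["miracle cure doctors hate", "city council passes budget",
              "aliens endorse candidate", "team wins league final"]
    labels = [1, 0, 1, 0]                     # 1 = fake, 0 = genuine

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(tweets, labels)

    # The learned weights point at the most discriminative terms for "fake".
    vec = clf.named_steps["tfidfvectorizer"]
    lr = clf.named_steps["logisticregression"]
    terms = vec.get_feature_names_out()
    top = lr.coef_[0].argsort()[-3:]
    print([terms[i] for i in top])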

Awarded to: Laks Lakshmanan and Ruben Zamar

DSI Postdoctoral Fellow: Ezequiel Smucler

Modeling multiple types of "omics" data to understand the biology of human exposure to pollution and allergens

Inhaled environmental and occupational exposures such as air pollution and allergens are known to have a profound effect on our respiratory and immunological health. This collaborative project seeks to better understand how the human body responds adversely to these perturbants by developing and applying new computational models for the analysis of integrated molecular data sets, collectively known as 'omics profiles (e.g., genomics, proteomics, metabolomics, epigenomics, transcriptomics, and polymorphisms). Joint analyses of these high-dimensional data sets, derived from controlled experimental human exposures to inhaled particles and allergens, may unlock insights that lead to better treatments for asthma, COPD, and other respiratory diseases.

Awarded to: Chris Carlsten and Sara Mostafavi

DSI Postdoctoral Fellow: Zahra Jalali

This project is sponsored by PHIX and VCHRI.

From heuristics to guarantees: the mathematical foundations of algorithms for data science

Many of the most successful approaches commonly used in data-science applications (e.g., machine learning) come with little or no guarantees. Notable examples include convolutional neural networks (CNNs) and data-fitting formulations based on non-convex loss functions. In both cases, the training procedures are based on optimizing intractable problems. While these methods are undeniably successful in a wide variety of machine learning and signal-processing tasks (e.g., classification of images, speech, and text), the robustness that comes with theoretical guarantees is paramount for more critical applications, such as medical diagnosis or unsupervised algorithms embedded in electronic devices (e.g., self-driving cars). This project aims to build theoretical foundations for key algorithmic approaches used in data-science applications.

Awarded to: Michael Friedlander and Ozgur Yilmaz

DSI Postdoctoral Fellow: Halyun Jeong

A platform for interactive, collaborative, and repeatable genomic analysis

Computer systems – both hardware and software – currently represent an active barrier to the scientific investigation of genomic data. Answering even relatively simple questions requires assembling disparate software tools (for alignment, variant calling, and filtering) into an analytics pipeline, and then solving practical IT problems in order to get that pipeline to function stably and at scale. This project will employ a whole-system approach to providing a framework for genomic analysis. By building on an existing botany-based analysis pipeline and exploiting emerging high-density “rack-scale” computer hardware, the project will refactor and extend existing genomic analysis software to provide a platform on which many traditionally long-running analytical tasks run fast enough to enable interactive analysis. This will facilitate sharing of datasets and analysis code across the research community and will capture sufficient data and analysis provenance to encourage reproducibility of published results.
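
One way to picture the provenance-capture goal: every analysis step records a digest of its input and output, so a published result can be traced back to the exact data and code path that produced it. The toy pipeline below, with made-up steps and data, is only meant to convey that idea.

    import hashlib
    import json

    def run_pipeline(steps, data, log_path="provenance.json"):
        """Run each analysis step and log hashes of its input and output."""
        record = []
        for name, fn in steps:
            out = fn(data)
            record.append({
                "step": name,
                "input_sha1": hashlib.sha1(repr(data).encode()).hexdigest(),
                "output_sha1": hashlib.sha1(repr(out).encode()).hexdigest(),
            })
            data = out
        with open(log_path, "w") as f:
            json.dump(record, f, indent=2)
        return data

    reads = ["ACGT", "ACGA", "ACGT"]
    steps = [("dedupe", lambda r: sorted(set(r))),
             ("filter_short", lambda r: [x for x in r if len(x) >= 4])]
    print(run_pipeline(steps, reads))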

Awarded to: Loren Rieseberg and Andy Warfield

DSI Postdoctoral Fellow: Jean-Sébastien Légaré

Application of deep learning approaches in modelling cheminformatics data and discovery of novel therapeutic agents for prostate cancer

The recent explosion of chemical and biological information calls for fundamentally new ways of dealing with big data in the life sciences. This problem can potentially be addressed by the latest technological breakthroughs on both the software and hardware frontiers. In particular, the latest advances in artificial intelligence (AI) enable cognitive data processing at very large scale by means of deep learning (DL). This project will develop a deep neural network (DNN) environment with a reinforcement learning component that will utilize GPU power to capture all available information on hundreds of millions of existing small molecules (including their interactions with proteins and other cell components). The ultimate goal is to develop an “all chemistry on one chip” expert system that can accurately generate structures of small molecules with user-defined biological, physical, and chemical properties. Such a cognitive AI platform can be integrated with existing high-throughput synthesis technologies (click chemistry) to yield a paradigm-shifting ‘molecular printer’ that could revolutionize the life sciences.

Awarded to: Artem Cherkasov and Will Welch

DSI Postdoctoral Fellow: Michael Fernandez Llamosa

This project is sponsored by PHIX and VCHRI.

Using text analysis for chronic disease management

The diagnosis, management, and treatment of chronic diseases (e.g., diabetes, chronic obstructive pulmonary disease, and heart failure) have traditionally relied on longitudinal histories and physical examinations as the primary tools of assessment, augmented by laboratory testing and imaging. Equally important to history taking and physical examination is the objective assessment and understanding of the contribution of patients' states of mind to their disease states. This has historically been documented only qualitatively and is highly challenging to measure quantitatively. However, recent advances in data science techniques such as natural language processing are providing new opportunities. Speech and text analysis is an emerging strategy for quantifying cognition, sentiment, physical symptoms, and social influences. This project therefore seeks to integrate speech and text analysis into the longitudinal management of chronic diseases to maintain optimal stability, support recovery, and detect deterioration. Furthermore, the project will analyze how speech and text analysis can be combined with physiologic data captured by wearables and sensors used in chronic disease management to gauge the stability of patients’ chronic diseases.

Awarded to: Kendall Ho and Giuseppe Carenini

DSI Postdoctoral Fellow: Hyeju Jang

This project is sponsored by PHIX and VCHRI.

User Modeling and Adaptive Support for MOOCs

Massive open online courses (MOOCs) have great potential to innovate education, but they suffer from one key limitation typical of many online learning environments: lack of personalization. Intelligent Tutoring Systems (ITS) is a field that leverages artificial intelligence and machine learning to devise educational tools that provide instruction tailored to the needs of individual learners, as good teachers do. In this project, Drs. Conati and Roll aim to apply concepts and techniques from ITS research to MOOCs. Specifically, in previous work they developed a framework to: i) discover from data the student behaviors that can be detrimental or conducive to learning with specific educational software; and ii) use these behaviors to build classifiers that can detect ineffective learners in real time and provide personalized support accordingly. They have already successfully applied this framework to two different online educational tools, and now plan to extend it to make existing MOOCs more responsive to specific student needs.
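
In spirit, the framework works in two steps, which the hypothetical sketch below mirrors: cluster interaction features to discover groups of learner behaviors, then train a classifier that can assign new students to a group from the partial features available early in a session. The feature set, cluster count, and models are placeholder assumptions.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    features = rng.normal(size=(300, 6))             # stand-in interaction features

    # Step i: discover behavior groups from full-session data.
    groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

    # Step ii: predict a student's group from early-session features only.
    clf = LogisticRegression().fit(features[:, :3], groups)
    print(clf.predict(features[:2, :3]))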

Awarded to: Cristina Conati and Ido Roll

DSI Postdoctoral Fellow: Sébastien Lallé