Research Projects

2017 Projects:

Data science over graphs, streams, and sequences: From the analysis of fake news to prediction and intervention

Fake news and misinformation have long been a serious problem, and the advent of social media has made it more acute, particularly in the context of the 2016 U.S. Presidential election. This illustrates how social networks and media have come to play a fundamental role in the lives of most people--they influence the way we receive, assimilate, and share information with others. The online lives of social media users leave behind a trail of data that can be harnessed to drive applications that either did not exist or could not be launched effectively before. This project will develop a multi-pronged approach to detecting fake news. It will develop text mining techniques to analyze "tweets" and explore the possible use of new statistical tools for the analysis of social network and media data. In particular, Twitter data will be used to learn user intents and, specifically, to analyze language for discriminative features of fake news by comparing it with genuine news. This project aims to devise strategies to combat and contain the propagation of fake news, reducing its consequences and harm to the effective functioning of civil society.

Awarded to: Laks Lakshmanan and Ruben Zamar

DSI Postdoctoral Fellow: Ezequiel Smucler

Modeling multiple types of "omics" data to understand the biology of human exposure to pollution and allergens

Inhaled environmental and occupational exposures such as air pollution and allergens are known to have a profound effect on our respiratory and immunological health. This collaborative project seeks to better understand how the human body responds adversely to these perturbants by developing and applying new computational models for the analysis of integrated molecular data sets, collectively known as 'omics profiling (e.g., genomics, proteomics, metabolomics, epigenomics, transcriptomics, and polymorphisms). Joint analyses of these high-dimensional data sets, derived from controlled experimental human exposures to inhaled particles and allergens, may unlock insights into better treatment of asthma, COPD, and other respiratory diseases.

Awarded to: Chris Carlsten and Sara Mostafavi

DSI Postdoctoral Fellow: Zahra Jalali

This project is sponsored by PHIX and VCHRI.

From heuristics to guarantees: the mathematical foundations of algorithms for data science

Many of the most successful approaches commonly used in data-science applications (e.g., machine learning) come with few or no guarantees. Notable examples include convolutional neural networks (CNNs) and data-fitting formulations based on non-convex loss functions. In both cases, the training procedures are based on solving intractable optimization problems. While these methods are undeniably successful in a wide variety of machine learning and signal-processing tasks (e.g., classification of images, speech, and text), the robustness that comes with theoretical guarantees is paramount for more critical applications, such as medical diagnosis or unsupervised algorithms embedded in electronic devices (e.g., self-driving cars). This project aims to build theoretical foundations for key algorithmic approaches used in data-science applications.

Awarded to: Michael Friedlander and Ozgur Yilmaz

DSI Postdoctoral Fellow: Halyun Jeong

A platform for interactive, collaborative, and repeatable genomic analysis

Computer systems – both hardware and software – currently represent an active barrier to the scientific investigation of genomic data. Answering even relatively simple questions requires assembling disparate software tools (for alignment, variant calling, and filtering) into an analytics pipeline, and then solving practical IT problems in order to get that pipeline to function stably and at scale. This project will take a whole-system approach to providing a framework for genomic analysis. By building on an existing botany-based analysis pipeline and exploiting emerging high-density “rack-scale” computer hardware, the project will refactor and extend existing genomic analysis software to provide a platform on which many traditionally long-running analytical tasks run fast enough to enable interactive analysis. This will facilitate sharing of datasets and analysis code across the research community and will capture enough data and analysis provenance to encourage reproducibility of published results.

Awarded to: Loren Rieseberg and Andy Warfield

DSI Postdoctoral Fellow: Jean-Sébastien Légaré

Application of deep learning approaches in modelling cheminformatics data and discovery of novel therapeutic agents for prostate cancer

The recent explosion of chemical and biological information calls for fundamentally novel ways of dealing with big data in the life sciences. This problem can potentially be addressed by the latest technological breakthroughs on both the software and hardware frontiers. In particular, the latest advances in artificial intelligence (AI) enable cognitive data processing at very large scale by means of deep learning (DL). This project will develop a deep neural network (DNN) environment with a reinforcement learning component that will use GPU power to capture all available information on hundreds of millions of existing small molecules (including their interactions with proteins and other cell components). The ultimate goal is to develop an “all chemistry on one chip” expert system that can accurately generate structures of small molecules with user-defined biological, physical, and chemical properties. Such a cognitive AI platform can be integrated with existing technologies for high-throughput synthesis (click chemistry) to yield a paradigm-shifting ‘molecular printer’ that will revolutionize life science.

Awarded to: Artem Cherkasov and Will Welch

DSI Postdoctoral Fellow: Michael Fernandez Llamosa

This project is sponsored by PHIX and VCHRI.

Using text analysis for chronic disease management

The diagnosis, management, and treatment of chronic diseases (e.g., diabetes, chronic obstructive pulmonary disease, and heart failure) have traditionally focused on longitudinal histories and physical examinations as the primary tools of assessment, augmented by laboratory testing and imaging. Equally important to history taking and physical examination is the objective assessment and understanding of how patients' states of mind contribute to their disease states. This contribution has historically been documented only qualitatively and is highly challenging to measure quantitatively. However, recent advances in data science techniques such as natural language processing are providing new opportunities. Speech and text analysis is an emerging strategy for analyzing cognition, sentiment, physical symptoms, and social influences, and so for making such quantification possible. Thus, this project seeks to integrate speech and text analysis into the longitudinal management of chronic diseases to maintain optimal stability, support recovery, and detect deterioration. Furthermore, the project will combine measurements from speech and text analysis with physiologic data captured by the wearables and sensors used in chronic disease management to gauge the stability of patients’ chronic diseases.

Awarded to: Kendall Ho and Giuseppe Carenini

DSI Postdoctoral Fellow: Hyeju Jang

This project is sponsored by PHIX and VCHRI.

User Modeling and Adaptive Support for MOOCs

Massive open online courses (MOOCs) have great potential to innovate education, but suffer from one key limitation typical of many online learning environments: lack of personalization. Intelligent Tutoring Systems (ITS) is a field that leverages Artificial Intelligence and Machine Learning to devise educational tools that can provide instruction tailored to the needs of individual learners, as good teachers do. In this project, Drs. Conati and Roll aim to apply some of the concepts and techniques from ITS research to MOOCs. Specifically, in previous work they have developed a framework to: i) discover from data the student behaviors that can be detrimental or conducive to learning with specific educational software; ii) use these behaviors to build classifiers that can detect ineffective learners in real time and provide personalized support accordingly. They have already successfully applied this framework to two different online educational tools, and now plan to extend it to make existing MOOCs more responsive to specific student needs.

Awarded to: Cristina Conati and Ido Roll