Data Science

Research on data is typically referred to as data science. Broadly defined, it comprises the methods, processes and tools that allow a user to extract, interpret and understand insights from large and complex data collections. It is an interdisciplinary field that draws on many areas of computer science, statistics and mathematics, including data mining, machine learning, visualization, statistical inference, predictive modeling, data management and high-performance computing.

Data science methods, processes and tools can be divided into three broad categories:

Data Processing and Management

Data processing and management include methods and tools for assessing data quality, cleansing, integrating and linking data, and managing access, all of which are essential for understanding and leveraging complex data. Data can be structured, semi-structured or unstructured. Structured data typically refer to relational data stored in tables with a fixed schema (e.g., in databases or spreadsheets). Semi-structured data have looser schemas, which makes them more flexible for data exchange on the Internet (e.g., XML). Unstructured data today are predominantly images, videos or free-form text.
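As a rough illustration of the three forms of data and of basic cleansing and linking, the sketch below uses only the Python standard library. The sample records, the field names and the helper functions (`load_structured`, `load_semi_structured`, `link_by_id`) are hypothetical and chosen purely for illustration; real pipelines would typically rely on a database or a dedicated data-processing library.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: fixed columns, as in a relational table (made-up sample rows).
CSV_TEXT = "id,name,age\n1,Ada,36\n2,Grace,\n2,Grace,\n"

# Semi-structured: XML with a looser schema; some elements may be absent.
XML_TEXT = """
<patients>
  <patient id="1"><city>London</city></patient>
  <patient id="2"><city>New York</city><note>follow-up</note></patient>
</patients>
"""

# Unstructured: free-form text with no schema at all.
FREE_TEXT = "Patient reports mild headache after treatment."


def load_structured(text):
    """Parse CSV rows, drop exact duplicates, and mark missing ages as None."""
    seen, rows = set(), []
    for row in csv.DictReader(io.StringIO(text)):
        key = tuple(row.items())
        if key in seen:  # cleansing: skip verbatim duplicate rows
            continue
        seen.add(key)
        row["age"] = int(row["age"]) if row["age"] else None
        rows.append(row)
    return rows


def load_semi_structured(text):
    """Turn each <patient> element into a dict of whatever children it has."""
    records = {}
    for patient in ET.fromstring(text.strip()).iter("patient"):
        records[patient.get("id")] = {child.tag: child.text for child in patient}
    return records


def link_by_id(rows, extra):
    """Integration: merge the two sources on the shared id field."""
    return [{**row, **extra.get(row["id"], {})} for row in rows]


linked = link_by_id(load_structured(CSV_TEXT), load_semi_structured(XML_TEXT))
word_count = len(FREE_TEXT.split())  # about all that can be said without a model
```

Here the structured and semi-structured sources are linked on a shared `id`, while the unstructured text offers no fields to join on, which is why it usually needs further analysis before it can be integrated.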

Data Analysis and Modeling

Methods that allow joint modeling of diverse data types and data from disparate sources are especially important in data science. For example, in precision medicine it is increasingly important to integrate and analyze disparate, large data sets of genomic (e.g., whole genome sequencing, gene expression sequencing), epigenomic and proteomic information to gain a better understanding of disease mechanisms. As a second example, modeling the movement of animals across time and space, as captured by sensors and cameras, helps us understand changes in migration patterns due to climate and environmental factors.

Data Interpretation

Methods and tools that interpret data and provide insights to users form the last category. Given the volume and variety of data, visualization tools are critical for domain scientists to display their data, to observe correlations and discrepancies among different types of data, and to interpret the results of models. Beyond visualization, there are other valuable forms of interpretation. In genomics research, for instance, researchers can advance their understanding of unfamiliar genes by linking to related literature and phenotypic databases, or by joining online discussion forums of like-minded researchers.
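To make the idea of observing a correlation between two types of data concrete, here is a minimal sketch in plain Python. The two measurement series are made-up numbers, and the helpers `pearson` and `text_bars` are hypothetical names introduced only for this example; a real project would more likely use a statistics and plotting library.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


def text_bars(values, width=20):
    """Crude text 'visualization': one bar per value, scaled to the largest."""
    top = max(values)
    return ["#" * round(width * v / top) for v in values]


# Made-up measurements for the same four samples (illustration only).
gene_expression = [1.0, 2.0, 3.0, 4.0]
protein_level = [2.0, 4.0, 6.0, 8.0]

r = pearson(gene_expression, protein_level)
bars = text_bars(protein_level, width=8)
```

A coefficient near 1 or -1 suggests the two series move together, while a value near 0 suggests no linear relationship; a plot of the underlying data remains the better way to spot discrepancies that a single number hides.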