(Contributed by Ben Voight)
Skills in mathematics and computation are becoming increasingly central to scientific endeavor. Mathematical skills developed in predoctoral training relate most prominently to statistical inference, but also to kinetics, equilibria and underlying chemical energetics, modeling of systems, and imaging. Computational skills relate to the development of algorithms for processing large amounts of information, a highly overlapping skill set formalized by the term data science.
I discuss here only statistical inference, as skills developed in statistical inference have the widest applicability to the careers a PhD in biomedicine might pursue. Statistics continues to evolve as a science; it is as much about asking questions as it is about giving one the principles to answer those questions. It has permeated virtually every field of science, owing to the growing preponderance of data and the need for data scientists to draw meaningful conclusions from these data sets. Statistically minded thinking, moreover, is a natural expression of the scientific method: often, one is faced with two or more competing models and wants to determine which best explains the (often large) amount of data one has generated or collected. Whether it is hypothesis testing, significance assessment, or model comparison, there is a core set of statistical principles that is simply essential for life as a data-driven scientist and leader.
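To make hypothesis testing concrete, here is a minimal sketch of a two-sample permutation test, one simple form of significance assessment, written with only Python's standard library. The sample values are invented for illustration.

```python
# Minimal sketch of a two-sample permutation test (hypothesis testing).
# The measurements below are invented for illustration only.
import random

def permutation_test(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign group labels
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed:
            count += 1
    # add-one correction avoids reporting an impossible p-value of zero
    return (count + 1) / (n_perm + 1)

group_a = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8]
group_b = [6.0, 6.2, 5.9, 6.1, 6.3, 5.8]
p = permutation_test(group_a, group_b)
```

The design choice here is deliberate: by permuting labels rather than assuming a distribution, the test makes minimal assumptions about the data, which is often the right starting point for a new data set.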
An overview example of these points:
https://fivethirtyeight.com/. Nate Silver is a quintessential data scientist and practitioner of statistical inference in a large range of contexts, for example politics, sports, economics, and science.
My personal advice on the subject is twofold. First, it is important to have a solid, fundamental understanding of the basics of probability theory and the applications of that theory in statistical inference. In this regard, there are many great online resources to get started:
Basics of probability theory, statistics, and applications:
Second, identify a problem that you are interested in, or a data set that you are curious about. I learned a great deal of the statistical principles I personally know by working through biological and genetics research problems.
It is clear that students who develop computational fluency, in addition to other fluencies, during their dissertation work are able to excel within the biological sciences and also in other, alternative career paths. The key reason is simple: large amounts of computer-accessible information have made data science a complementary skill set (and even a profession!) within many fields, including biology. To me, computational fluency in data science is the ability to be accurate, efficient, flexible, and reproducible when working with large amounts of data and performing scientific inquiries.
It is impossible to talk about fluency without referring to a specific set of skills codified in generic computational toolkits. I outline four areas where basic competencies allow for a broad set of skills translatable within and outside of biological sciences:
(i) Effectively working in a command-line (non-graphical-user-interface) system environment, e.g., a UNIX environment.
(ii) Computer programming using a “scripting” language to facilitate the parsing, processing, and analysis of one (or many) data sets or files.
(iii) R programming to facilitate parsing of data files, routine to advanced statistical analysis, and visualization of results in interpretable and reproducible forms.
(iv) Reproducibility to ensure that the results one generates can be obtained by others, so that key findings stand the test of time.
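As a small illustration of competency (ii), the sketch below parses a tab-delimited expression table and computes a per-gene mean, using only the standard library. The gene names and expression values are invented for illustration.

```python
# Sketch of competency (ii): parsing and summarizing a delimited data file.
# The gene names and expression values are invented for illustration.
import csv
import io

# In practice this would be read from a file; an in-memory string keeps
# the example self-contained.
raw = """gene\tsample\texpression
BRCA1\ts1\t8.2
BRCA1\ts2\t7.9
TP53\ts1\t5.4
TP53\ts2\t5.1
"""

totals, counts = {}, {}
for row in csv.DictReader(io.StringIO(raw), delimiter="\t"):
    gene = row["gene"]
    totals[gene] = totals.get(gene, 0.0) + float(row["expression"])
    counts[gene] = counts.get(gene, 0) + 1

# Mean expression per gene across samples
means = {gene: totals[gene] / counts[gene] for gene in totals}
```

A short script like this, pointed at a real file instead of an in-memory string, is often the first step of the parsing-and-processing work described in (ii).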
A. Operating in a UNIX environment.
The key idea is that everything you can do in a “graphical” view of the world you can do on the command line, but not everything you want to do in data science – writing programs, analyzing data, etc. – can be done in a graphical user interface. If you’ve watched The Matrix, you might appreciate a little more what is involved here. There are several standard books and online materials to help students become familiar with this environment.
B. Programming in Python
One of the first things I taught myself in graduate school was how to program (script) – this completely changed the types of science I could do and, ultimately, what I now do for a living. Unlike when I was teaching myself many years ago, the Internet is now populated with a vast number of online resources and courses that can be used to learn this material. Today, Python seems to be the scripting language of choice for most students, as it provides both flexibility and extensive packages that capture the needs of most data scientists. In fact, one particularly popular package (scikit-learn) allows users to easily apply simple to advanced machine learning algorithms, with access to tools that help ensure reproducibility and accuracy.
Key package: SciKit-Learn: http://scikit-learn.org/stable/tutorial/
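As a hedged sketch of the workflow described above, the following fits a simple classifier on scikit-learn's bundled iris data. The `random_state` is fixed so the train/test split, and therefore the result, is reproducible; the choice of model and parameters is illustrative, not a recommendation.

```python
# Sketch of a reproducible scikit-learn workflow: split, fit, score.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# A fixed random_state makes the split reproducible run-to-run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Fraction of held-out samples classified correctly
accuracy = model.score(X_test, y_test)
```

Holding out a test set and fixing the random seed are two small habits that pay directly into the accuracy and reproducibility goals described above.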
C. Programming in R
It turns out that Python is a great tool that allows you to do a lot of things: parse files, read and process data, and create interactive tools to run larger-scale analyses. However, it does not address every need. For one, we often want to apply some level of statistical savvy to assess significance, compare models, or perform specific types of statistical and computational analysis (e.g., differential gene expression analysis for microarray data). Second, even for data and results generated in Python, we often need to visualize them using summaries that are informative, intuitive, and reproducible. While there are many choices here, R is a go-to programming language that can take on many of these challenges.
Visualization in R using the ggplot2 package (great data visuals):
Data handling: dplyr, tidyr packages:
D. Reproducibility in Research
One can think of an analysis as driven by a question: it takes in some set of inputs (data) and generates some output (results) that bring the data to bear on the question. Each step should be clear and, with proper input, file, and program management, should be recapitulable by another individual acting in good faith. Ideally, the pipeline should also be human-understandable: the exact details of, and the rationale for, each step in the process should be transparent to a person reading the pipeline.
It’s easy to state “Keep a good notebook.” Ideally, all biologists should. A computational notebook differs from a conventional notebook in that the pipeline has a version and may be updated as computer bugs are found, new data become available, new visualizations are articulated, and so on. It is the creator’s job to maintain the pipeline and to ensure that final, publishable-quality results can be issued from a specific, frozen version of it. The good news is that there are a number of tools that help users create pipelines with this type of reproducibility intrinsically tied to the scientific study.
In Python: Jupyter notebooks:
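One simple reproducibility habit can be sketched in a few lines of Python: tie each result to a checksum of its inputs and to the exact parameters used, so that another person can verify they are rerunning the same frozen version of the analysis. The function name, parameters, and data below are hypothetical, for illustration only.

```python
# Sketch of a provenance record: tie a result to its inputs and parameters.
# The file contents and parameter names below are hypothetical.
import hashlib
import json
import sys

def provenance_record(input_bytes, params):
    """Return a small dictionary linking a result to its inputs and environment."""
    return {
        # A checksum detects any silent change in the input data
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
        # The exact parameters (including seeds) used for this run
        "params": params,
        # The interpreter version, one piece of the software environment
        "python_version": sys.version.split()[0],
    }

data = b"gene\texpression\nBRCA1\t8.2\n"
record = provenance_record(data, {"seed": 42, "normalization": "log2"})
print(json.dumps(record, indent=2))
```

Saving such a record alongside each output file is a lightweight complement to the notebook tools above: if the checksum or parameters differ on a rerun, the pipeline is no longer the frozen version that produced the published result.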