Data Analysis with R

Introduction
Analyzing data
- Dataframes (and Tibbles)
Plotting
- Volcano Plot
GO Terms
- Violine and Boxplots
- Heatmaps
Putting it all together

Introduction

R is a language for statistical computing and graphics. In contrast to matlab or mathematica, R is free and open source. Hence, an ecosystem of diverse packages is growing rapidly. A package is a collection of computer code that does one or more things.

Packages can be written by anyone and are designed to make the R experience better. Examples include dplyr and tidyr, two packages that provide functions for data cleaning, processing, and manipulation. Other examples include ggplot2, used to improve R’s plotting capabilities. Packages are written to address specific needs in countless fields such as Finance, Astronomy, Linguistics,and of course, Biology. A lot of biological packages,especially those that deal with high throughput genomic data, can be found at a package repository called Bioconductor.

Useful commands before we get started

Basic navigation commands:

- getwd(): This will show you your working directory
- setwd(): This allows you to change the working directory. 
           For this command, you need to enter the location of
           the desired working directory in between the parenthesis.
- dir():   This lists the contents of the working directory.

Helper commands:

- ?some_command: This yields help pages for a command
- ??some_package: Same but as ?some_command for packages.
- str(some_object): Displays the internal structure of an R object

Masking command:

- #: The hash symbol is used at the beginning of any computer code 
     "sentence" you wish the R program to ignore.  This is helpful for
     annotating your code.

File location

Check your file location, i.e. working directory, using the getwd() command:

#remember to remove the # before running this command.
#getwd()

If you wish, you can change your working directory, as demonstrated below:

#again, remember to remove the # before running this command.
#setwd("C:/Users/jkaddis/Desktop/HIRN_tutorial_2017")

Loading packages

In R, to load a package, you use the library () command. The name of the package appears between the parenthesis. In order to load a package into your R session, you must first install it on your computer. Installation is required only once, loading is required each time you start a new R session.

To install multiple packages at once, use this command:

install.packages(c("tidyverse", "readr", "ggrepel", "ggthemes", "cowplot"),
                 repos="http://cran.us.r-project.org", dependencies = TRUE)

## Error in install.packages : Updating loaded packages

Load these packages to perform the tutorial:

library(tidyverse)
library(readr)
library(ggrepel)
library(ggthemes)
library(cowplot)

Simple math with R

40 + 2

## [1] 42

44 - 2

## [1] 42

6 * 7

## [1] 42

84 / 2

## [1] 42

sqrt(1764)

## [1] 42

6.480741**2

## [1] 42

The alternative syntax for this is:

6.480741^2

## [1] 42

Using variables

In R anyting can be assigned to a variable. This is done as follows:

a <- "Hello World"
b <- 21

Once the assignment is made, you can use the variables instead:

## [1] "Hello World"

(b + b)^2

## [1] 1764

Two more comments to assigning values to variables. Using the “=” sign instead of “<-” is supported but discouraged. Secondly using Rstudio pressing “Alt” and the “minus” sign is a shortcut for “<-”.

Analyzing data

In our multi-session webinar series (later this year) we will give a more detailed introduction to R. In this tutorial, however, we show you how you can analyze data and create compelling graphs with very little background knowledge.

Let’s go straight to some already processed RNA-seq data (In our webinar series we will also tell you how to process RNA-seq data yourself using open source software). The dataset we will work with, GSE79724, can be downloaded from NCBI’s GEO. There are R-packages that facilitate the transfer of data directly from GEO to your computer, but to limit the number of dependencies please download the data provided here and place it into your working directory.

Dataframes (and Tibbles)

Dataframes are R’s main workhorse. The packages combined in tidyverse facilitate handling these. In the process they are converted to tibbles, but that’s a detail we can ignore.

Loading the data

The data we address here comes in the form of csv files. CSV stands for comma separated values. It’s basically a text file that looks like this:

header1, header2, header3

value1a, value2a, value3a

value2a, value2b, value3b

another_value1, another_value2, another_value3

We can load the data into a tibble by using the function read_csv, which is part of the readr package. It’s also possible to achieve the same without using any special packages, i.e. the read.csv base R function; however, the readr functions are much faster. Another text-based format that is often used is tsv, tab separated values. TSV files tend to be more human readable and there are equivalent functions in R to load these files.

To find out more about read_csv we can execute the following command:

?read_csv

The data we want to have a closer look at is in the file data.csv. The right hand panel of Rstudio has a “Files” tab. It displays all the files in the current directory. Clicking on data.csv will show you the contents of the file. For larger files (and slower computers) it’s better not to aim to display the entire file though.

To read the data.csv file that you downloaded:

#Since we have placed the file in our working directory, all we need to 
#do is place the name of the file in quotation marks.  If the file was 
#in some other location, we would need to include the full path to the 
#file within the quotation marks.

# We also need to tell the function that we expects some column names, 
# hence col_names = TRUE

my_data <- read_csv("data.csv", col_names = TRUE)

## Parsed with column specification:
## cols(
##   gene = col_character(),
##   baseMean_all = col_double(),
##   `baseMean_LG3-LG6-LG9` = col_double(),
##   `baseMean_LG2-LG5-LG8` = col_double(),
##   foldChange = col_double(),
##   log2FoldChange = col_double(),
##   pval = col_double(),
##   padj = col_double()
## )

Now that the file is loaded we want to have a look at the data. There are several ways to do this. Rstudio offers an interactive option. In the top right panel there is the Environment tab. Here we see all data and all values that are currently in memory. Clicking on my_data will open a new tab in this panel and allows us to observe the file. We should look, for example, if the file really has column names or if this is not the case. The panel on the right also says that my_data has 25,559 rows and 8 columns. The swiftness of opening and displaying this file obviously depends on the size of the file and specs of the computer.

In any case, there is very little value in displaying several thousand lines as we are not going to visually inspect them anyway. We probably just need to look at a few lines at the top to see if the column headers were recognized as such and to get a feeling for what is in the file. So rather than “displaying” 25,559 lines we should be fine by just having a look at 10 lines or so. This can be achieved by base-R’s head() function. But here we are going to promote the use of the tidyverse functions. They are highly optimized and very fast, and the syntax is fairly intuitive. The core principle follows along these lines:

pick some data
perform a task on it
perform the next task
…
output the results

In this first case we want to take the data and show the first 10 lines:

my_data %>% head(10)

## # A tibble: 10 × 8
##         gene baseMean_all `baseMean_LG3-LG6-LG9` `baseMean_LG2-LG5-LG8`
##        <chr>        <dbl>                  <dbl>                  <dbl>
## 1  LOC643401    129.44063              15.371465              243.50980
## 2     ABCA12    145.15977              28.298929              262.02061
## 3  LOC401442     34.35765               2.480858               66.23444
## 4        MAL     48.23187               3.452497               93.01124
## 5       REG4    407.00315             131.200840              682.80545
## 6       IBSP     25.69337               3.097313               48.28943
## 7   HIST3H2A    250.54501              88.053916              413.03610
## 8      PRRT2     82.47317              25.022757              139.92358
## 9  C14orf176     48.30360              10.187104               86.42009
## 10 LOC644656    162.67651              57.176961              268.17605
## # ... with 4 more variables: foldChange <dbl>, log2FoldChange <dbl>,
## #   pval <dbl>, padj <dbl>

To string the commands together we use the pipe character (%>%). It’s so commonly used in modern R programming that it has its own keyboard shortcut in Rstudio (Shift-Control-M).

The file already contains a lot of interesting values, such as foldChange and pvalue. But for the purpose of this tutorial, we are going to recalculate some of them. First we create a data frame that contains only the gene names, the baseMean_LG3-LG6-LG9, and the baseMean_LG2-LG5-LG8 values.

The process of selecting and manipulating data in a way that suits the analysis is often referred to as data-wrangling. Rstudio offers a very good cheat-sheet for this here. The packages of the tidyverse we are going to use for this are dplyr and tidyr.

Let’s see how this works in practice:

#create a data frame with the selected columns only
my_calc <- my_data %>% 
  select(gene, `baseMean_LG3-LG6-LG9`, `baseMean_LG2-LG5-LG8`)

#the column names contain hyphens, this can lead to some confusion 
#as R might interpret it as a minus sign, hence we need to use the ticks.
#lets see how it worked
my_calc %>% head()

## # A tibble: 6 × 3
##        gene `baseMean_LG3-LG6-LG9` `baseMean_LG2-LG5-LG8`
##       <chr>                  <dbl>                  <dbl>
## 1 LOC643401              15.371465              243.50980
## 2    ABCA12              28.298929              262.02061
## 3 LOC401442               2.480858               66.23444
## 4       MAL               3.452497               93.01124
## 5      REG4             131.200840              682.80545
## 6      IBSP               3.097313               48.28943

Before we move on it makes sense to rename the columns so we don’t have to use ticks all the time:

colnames(my_calc) <- c("gene", "LG369", "LG258")
my_calc %>% head()

## # A tibble: 6 × 3
##        gene      LG369     LG258
##       <chr>      <dbl>     <dbl>
## 1 LOC643401  15.371465 243.50980
## 2    ABCA12  28.298929 262.02061
## 3 LOC401442   2.480858  66.23444
## 4       MAL   3.452497  93.01124
## 5      REG4 131.200840 682.80545
## 6      IBSP   3.097313  48.28943

Firstly, lets calculate the mean value and the standard deviation, SD:

#We reassign the output to the initial variable my_calc.
#An alternative lies in using another package (magrittr) which 
#supports the operator %<>% which feeds back the output into the 
#input variable

my_calc <- my_calc %>% 
  mutate(base_mean_all = (LG369 + LG258)/2)
my_calc

## # A tibble: 25,559 × 4
##         gene      LG369     LG258 base_mean_all
##        <chr>      <dbl>     <dbl>         <dbl>
## 1  LOC643401  15.371465 243.50980     129.44063
## 2     ABCA12  28.298929 262.02061     145.15977
## 3  LOC401442   2.480858  66.23444      34.35765
## 4        MAL   3.452497  93.01124      48.23187
## 5       REG4 131.200840 682.80545     407.00315
## 6       IBSP   3.097313  48.28943      25.69337
## 7   HIST3H2A  88.053916 413.03610     250.54501
## 8      PRRT2  25.022757 139.92358      82.47317
## 9  C14orf176  10.187104  86.42009      48.30360
## 10 LOC644656  57.176961 268.17605     162.67651
## # ... with 25,549 more rows

Of course R has a function mean that can do the same. In order for it to work here, we have to add the function rowwise which tells R that the following operations are done row-wise. This allows the user to do fairly complex operations.

Lets see how this works:

my_calc %>% 
  rowwise() %>% 
  mutate(base_mean_all = mean(c(LG369, LG258))) %>% 
  head()

## # A tibble: 6 × 4
##        gene      LG369     LG258 base_mean_all
##       <chr>      <dbl>     <dbl>         <dbl>
## 1 LOC643401  15.371465 243.50980     129.44063
## 2    ABCA12  28.298929 262.02061     145.15977
## 3 LOC401442   2.480858  66.23444      34.35765
## 4       MAL   3.452497  93.01124      48.23187
## 5      REG4 131.200840 682.80545     407.00315
## 6      IBSP   3.097313  48.28943      25.69337

We also want to calculate the fold-change:

my_calc <- my_calc %>% 
  mutate(foldchange = LG258/LG369)
my_calc

## # A tibble: 25,559 × 5
##         gene      LG369     LG258 base_mean_all foldchange
##        <chr>      <dbl>     <dbl>         <dbl>      <dbl>
## 1  LOC643401  15.371465 243.50980     129.44063  15.841678
## 2     ABCA12  28.298929 262.02061     145.15977   9.259029
## 3  LOC401442   2.480858  66.23444      34.35765  26.698195
## 4        MAL   3.452497  93.01124      48.23187  26.940282
## 5       REG4 131.200840 682.80545     407.00315   5.204277
## 6       IBSP   3.097313  48.28943      25.69337  15.590746
## 7   HIST3H2A  88.053916 413.03610     250.54501   4.690718
## 8      PRRT2  25.022757 139.92358      82.47317   5.591853
## 9  C14orf176  10.187104  86.42009      48.30360   8.483283
## 10 LOC644656  57.176961 268.17605     162.67651   4.690282
## # ... with 25,549 more rows

Obviously, it was also possible to perform the previous two steps in one command.

Now that we calculated the fold change, we can have a look at which genes are up- and which are down-regulated. Let’s take a look at everything that exceeds a 2-fold cut-off:

down_regulated <- my_calc %>% 
  filter(foldchange < 0.5)
up_regulated <- my_calc %>% 
  filter(foldchange > 2)

How many genes are downregulated? For this we count the number of rows (nrow) in the dataframe down_regulated:

nrow(down_regulated)

## [1] 1387

and up regulated:

nrow(up_regulated)

## [1] 3129

and how many genes did we have to start with? Lets see:

nrow(my_data)

## [1] 25559

Ok, now that we know how to select and rename columns, filter by cut-off values, and create new columns we can go back to the original dataset and do some other interesting things.

In case you got lost now it’s the time to catch up again. Let’s have a look at the my_data dataframe again:

my_data %>% head()

## # A tibble: 6 × 8
##        gene baseMean_all `baseMean_LG3-LG6-LG9` `baseMean_LG2-LG5-LG8`
##       <chr>        <dbl>                  <dbl>                  <dbl>
## 1 LOC643401    129.44063              15.371465              243.50980
## 2    ABCA12    145.15977              28.298929              262.02061
## 3 LOC401442     34.35765               2.480858               66.23444
## 4       MAL     48.23187               3.452497               93.01124
## 5      REG4    407.00315             131.200840              682.80545
## 6      IBSP     25.69337               3.097313               48.28943
## # ... with 4 more variables: foldChange <dbl>, log2FoldChange <dbl>,
## #   pval <dbl>, padj <dbl>

Let’s reapply the filtering techinque to this dataframe and create a new dataframe called my_data_p001 that contains only results with a p-value (pval) smaller than 0.05

Can you tell me how many rows this new dataframe has? Lets see:

my_data_p05 <- my_data %>% 
  filter(pval < 0.05)
nrow(my_data_p05)

## [1] 1124

Lets look at the dataframe again:

my_data_p05 %>% head()

## # A tibble: 6 × 8
##        gene baseMean_all `baseMean_LG3-LG6-LG9` `baseMean_LG2-LG5-LG8`
##       <chr>        <dbl>                  <dbl>                  <dbl>
## 1 LOC643401    129.44063              15.371465              243.50980
## 2    ABCA12    145.15977              28.298929              262.02061
## 3 LOC401442     34.35765               2.480858               66.23444
## 4       MAL     48.23187               3.452497               93.01124
## 5      REG4    407.00315             131.200840              682.80545
## 6      IBSP     25.69337               3.097313               48.28943
## # ... with 4 more variables: foldChange <dbl>, log2FoldChange <dbl>,
## #   pval <dbl>, padj <dbl>

Plotting

Volcano Plot

Let’s start of by doing a volcano plot. ggplot2 is a very powerful graphics package in R. It’s also part of the tidyverse package. ggplot2 uses grammar of graphics

Let’s see how this works in practice. First we create a cut-off for our data similar to what we did before:

#create a threshold with an absolute fold change greater than two 
#with a p-value smaller than the bonferroni determined 
#nominal alpha level. 
my_data$threshold <- 
  as.factor(abs(my_data$log2FoldChange) > 2 & 
              my_data$pval < 0.05/nrow(my_data))

Now we can create a simple volcano plot. First we tell ggplot what data we need to use and which column we want to plot and in aes we specify what we want to plot on the x and the y axis. Here is a fantastic cheat sheet.

Lets create the volcano plot:

p <- ggplot(data = my_data, 
             aes(x = log2FoldChange, y = -log10(pval), color = threshold))

Now we can add the style of our plot:

vp <- p + geom_point()
vp

## Warning: Removed 3696 rows containing missing values (geom_point).

Why do we get the warning that almost 4000 rows were removed? A look at my_data reveals that a number of rows have means of 0 and hence missing values for everything that got calculated. Missing values are called NA in R.

Lets look at the missing data:

missing_data <- my_data %>% 
  filter(is.na(threshold))
nrow(missing_data)

## [1] 3696

Now that this is clarified, wouldn’t it be nice to have a title and some proper axis labels rather than the column headers? Lets generate this:

vp <- vp + 
  xlab("log2 fold change") +
  ylab("-log10 p-value") +
  ggtitle("Volcano Plot")
vp

## Warning: Removed 3696 rows containing missing values (geom_point).

Let’s add something else:

vp <- vp + 
  geom_rug()
vp

## Warning: Removed 3696 rows containing missing values (geom_point).

Wouldn’t it be useful to see the gene names next to the dots? Given that we have many thousand dots we don’t expect to be able to read anything but maybe we can add the names to the dots that meet the threshold. For this we use the ifelse function:

?ifelse

We run a test (is threshold true) and then we add the label, if it’s False we add nothing (’’):

vp_labels <- vp +
  geom_text(aes(label = ifelse(threshold == TRUE, gene, '')))
vp_labels

## Warning: Removed 3696 rows containing missing values (geom_point).

## Warning: Removed 3696 rows containing missing values (geom_text).

We can improve on this:

vp_labels <- vp +
    geom_text_repel(aes(label = ifelse(threshold == TRUE, gene, '')))

vp_labels

## Warning: Removed 3696 rows containing missing values (geom_point).

## Warning: Removed 3696 rows containing missing values (geom_text_repel).

Finally, we may want to remove some clutter. If you haven’t heard of Edward Tufte you might want to check him out.

Lets remove the grid in the background and the legend:

vp_labels <- vp_labels +
    theme_tufte() +
    theme(legend.position="none")
vp_labels

## Warning: Removed 3696 rows containing missing values (geom_point).

## Warning: Removed 3696 rows containing missing values (geom_text_repel).

GO Terms

The data shows us which genes are enriched. Let’s add some pathway information to it.

As mentioned in the introduction, there are a lot of packages in R addressed to biologists, the majority of them are found in bioconductor. Some of these packages have complex dependencies and installation can take a few minutes. That’s why we don’t use any of these packages in this tutorial.

The biomaRt package allows us to get the GO terms for the genes in our dataset. The command looks like this:

#library(biomaRt)
#ensembl<- useMart("ensembl",dataset="hsapiens_gene_ensembl")
#gene_goid <- getBM(attributes=c('hgnc_symbol', 'go_id', 'name_1006'), 
#filters = 'hgnc_symbol', values = my_data_p05$gene, mart = ensembl)

This query takes a bit of time to run, so we have saved the results in a gene_goid.tsv and we are just going to load them. You can obtain the gene_goid.tsv file here.

This is a good opportunity to see if you remember how to load a csv file. Let’s load the file into a dataframe called “goid”. Use read_tsv this time:

goid <- read_tsv("gene_goid.tsv", col_names = TRUE)

## Parsed with column specification:
## cols(
##   gene = col_character(),
##   go_id = col_character(),
##   name_1006 = col_character()
## )

goid

## # A tibble: 16,594 × 3
##      gene      go_id
##     <chr>      <chr>
## 1  INPP5D GO:0004445
## 2  INPP5D GO:0005515
## 3  INPP5D GO:0016314
## 4  INPP5D GO:0016787
## 5  INPP5D GO:0017124
## 6  INPP5D GO:0034485
## 7  INPP5D GO:0052659
## 8  INPP5D GO:0002376
## 9  INPP5D GO:0006661
## 10 INPP5D GO:0006796
## # ... with 16,584 more rows, and 1 more variables: name_1006 <chr>

Now we can combine this dataframe with my_data_p05:

my_goid_data <- my_data_p05 %>% 
  inner_join(goid)

## Joining, by = "gene"

my_goid_data %>% head()

## # A tibble: 6 × 10
##     gene baseMean_all `baseMean_LG3-LG6-LG9` `baseMean_LG2-LG5-LG8`
##    <chr>        <dbl>                  <dbl>                  <dbl>
## 1 ABCA12     145.1598               28.29893               262.0206
## 2 ABCA12     145.1598               28.29893               262.0206
## 3 ABCA12     145.1598               28.29893               262.0206
## 4 ABCA12     145.1598               28.29893               262.0206
## 5 ABCA12     145.1598               28.29893               262.0206
## 6 ABCA12     145.1598               28.29893               262.0206
## # ... with 6 more variables: foldChange <dbl>, log2FoldChange <dbl>,
## #   pval <dbl>, padj <dbl>, go_id <chr>, name_1006 <chr>

How many individual GO terms do we have here? Lets see:

nrow(my_goid_data %>% select(go_id) %>% distinct)

## [1] 4586

Lets group our data a bit:

goid_occ <- my_goid_data %>% 
  group_by(go_id) %>% 
  summarise(occurrences = n()) %>% 
  na.omit(go_id)

my_goid_data <- my_goid_data %>% 
  inner_join(goid_occ)

## Joining, by = "go_id"

my_goid_data

## # A tibble: 16,210 × 11
##      gene baseMean_all `baseMean_LG3-LG6-LG9` `baseMean_LG2-LG5-LG8`
##     <chr>        <dbl>                  <dbl>                  <dbl>
## 1  ABCA12     145.1598               28.29893               262.0206
## 2  ABCA12     145.1598               28.29893               262.0206
## 3  ABCA12     145.1598               28.29893               262.0206
## 4  ABCA12     145.1598               28.29893               262.0206
## 5  ABCA12     145.1598               28.29893               262.0206
## 6  ABCA12     145.1598               28.29893               262.0206
## 7  ABCA12     145.1598               28.29893               262.0206
## 8  ABCA12     145.1598               28.29893               262.0206
## 9  ABCA12     145.1598               28.29893               262.0206
## 10 ABCA12     145.1598               28.29893               262.0206
## # ... with 16,200 more rows, and 7 more variables: foldChange <dbl>,
## #   log2FoldChange <dbl>, pval <dbl>, padj <dbl>, go_id <chr>,
## #   name_1006 <chr>, occurrences <int>

If we had more time we could do an enrichment analysis. But for now we are just going to have a look at what happens at different locations.

Which proteins undergo more changes, cytosolic or membrane proteins? GO:0005829 indicates cytosol, and GO:0016020 membrane proteins. Lets limit to data with these terms:

location_data <- my_goid_data %>% 
  filter(go_id == "GO:0005829" | go_id == "GO:0016020")

Violine and Boxplots

The next command might appear a bit esoteric, it calculates the number of proteins that are cytosolic or membrane proteins. The paste command prints outputs of various commands together, see ?paste:

xlabs <- 
  paste(levels(location_data$name_1006),"\n(N=",table(location_data$name_1006),")",
        "\n", location_data$name_1006, sep="")

p3 <- ggplot(data = location_data, aes(factor(name_1006), log2FoldChange))
violinep <- p3 + 
  geom_violin() +
  geom_tufteboxplot() +
  theme_tufte() +
  xlab("") +
  ylab("log 2 fold change") +
  ggtitle("Differential expression by location") +
  scale_x_discrete(labels = xlabs)
violinep

## Warning: Removed 1 rows containing non-finite values (stat_ydensity).

## Warning: Removed 1 rows containing non-finite values (stat_fivenumber).

Heatmaps

Bioconductor offers a number of packages that allows you to draw heatmaps including hierarchical clustering. The function heatmap() that is in base R does so too. Since we only have two groups hierarchical clustering doesn’t make a lot of sense. Hence we can just use ggplot2 again.

First we look for all results that have a GO term related to “glucose”. For this we use the filter command in combination with grepl. grepl is extremely powerful:

hm_data <- my_goid_data %>% 
  filter(grepl("glucose", name_1006)) %>% 
  select(gene, `baseMean_LG3-LG6-LG9`, `baseMean_LG2-LG5-LG8`)
#let's rename the columns again
colnames(hm_data) <- c("gene", "LG369","LG258")

Before we can use ggplot2 to create a heatmap we have to rearrange our data into a tidy format. Having tidy data has, as the name implies, a number of advantages. You can follow the link here to learn more.

lets rearrage our data:

hm_long <- hm_data %>% 
  gather(key, base_means, -gene)
hm_long %>% head()

## # A tibble: 6 × 3
##     gene   key base_means
##    <chr> <chr>      <dbl>
## 1 NKX2-2 LG369 1107.64860
## 2 PIK3R2 LG369 1317.15892
## 3 PIK3R2 LG369 1317.15892
## 4   TFF3 LG369   30.88613
## 5   OAS1 LG369  116.07490
## 6   OAS1 LG369  116.07490

hm_long

## # A tibble: 106 × 3
##      gene   key base_means
##     <chr> <chr>      <dbl>
## 1  NKX2-2 LG369 1107.64860
## 2  PIK3R2 LG369 1317.15892
## 3  PIK3R2 LG369 1317.15892
## 4    TFF3 LG369   30.88613
## 5    OAS1 LG369  116.07490
## 6    OAS1 LG369  116.07490
## 7  NKX6-1 LG369 1001.31079
## 8   PPARG LG369   22.88345
## 9    RHOQ LG369 2817.84798
## 10   GCLM LG369 2409.93897
## # ... with 96 more rows

The following will give us the desired result with 1 observation per row:

p2 <- ggplot(hm_long, aes(x = gene, y = as.factor(key)))
hm <- p2 + 
  geom_tile(aes(fill = base_means)) + 
  scale_fill_gradient(low="white", high="darkblue") + 
  xlab("") + ylab("") +
  ggtitle("Heatmap of genes connected to glucose") +
  theme(axis.text.x =
          element_text(size = 8,
                       angle = 90,
                       hjust = 0.5))
hm

Putting it all together

We created three plots, the violine plot (violinep), the heatmap (hm), and the volcano plot (vp_label). The package cowplot now allows us to combine these figures together with annotation and everything. A lot of people do this in Illustrator or Inkscape. While these vectorgraphics programs yield high quality graphics, working with them can be quite painful. Especially if the figure needs to undergo any sort of modification. cowplot provides a canvas to put on figures and tables and annotations. Its function ggdraw() allows the user to plot things in arbitrary positions. The canvas’s bottom left corner has the x- and y- coordinates of 0, 0 and the top right corner of 1, 1. drow_plot() takes the coordinates plus relative width and height. draw_plot_label adds the labels, it takes the x and y coordinates for each label, and the size as input.

Lets generate our manuscript figure:

ggdraw() +
  draw_plot(vp_labels, 0, 0.5, 0.5, 0.5) +
  draw_plot(violinep, 0.5, 0.5, 0.5, 0.5) +
  draw_plot(hm, 0, 0, 1, 0.5) +
  draw_plot_label(LETTERS[1:3], c(0, 0.5, 0), c(1, 1, 0.5), size = 15)

## Warning: Removed 3696 rows containing missing values (geom_point).

## Warning: Removed 3696 rows containing missing values (geom_text_repel).

## Warning: Removed 1 rows containing non-finite values (stat_ydensity).

## Warning: Removed 1 rows containing non-finite values (stat_fivenumber).

Congratulations! You have now written a program that you can re-use and adopt for different dataset analysis. Google the various packages we introduced in this tutorial to learn more about the functions we did OR did not use.