In this entry you will find examples of the most commonly used plots for data visualization: histograms, density plots, boxplots, ecd plots, scatter plots, barplots and plots for longitudinal data.

For every kind of plot I provide examples using base R code and using the ggpubr library, an R package that provides simple syntax functions to create ggplot2 plots.

Since I encourage the use of ggpubr package, I will not introduce all base R graphical options. I will present very simple base R plots that can be improved and customized using additional base R graphical parameters.

Base R graphical parameters cheatsheet: https://www.r-graph-gallery.com/6-graph-parameters-reminder.html

ggpubr package vignette: https://rpkgs.datanovia.com/ggpubr/index.html

Some of the examples in this entry have been adapted from http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/ and related entries.

library(RColorBrewer)  
library(ggpubr)
library(penalized)
library(glmnet)
library(lattice)
library(ggsci)
library(tidyr)

1 Data nki70

We will use data nki70 from library “penalized”, a data frame of 144 lymph node positive breast cancer patients on metastasis-free survival, 5 clinical risk factors, and gene expression measurements of 70 genes found to be prognostic for metastasis-free survival in an earlier study. Use the following code to get the data:

data(nki70)
data<- na.omit(nki70)   # na.omit removes all incomplete cases (rows) from the dataframe
data$event<-factor(data$event, levels=c(0,1), labels=c("no", "yes"))  # event = indicator of metastasis

Look the first rows and columns:

head(data[,(1:8)])  
##         time event  Diam   N       ER        Grade Age      TSPYL5
## 125 7.748118    no <=2cm 1-3 Positive Intermediate  50 -0.18752814
## 127 4.662560   yes <=2cm 1-3 Positive    Well diff  42  0.15099047
## 128 8.739220    no  >2cm 1-3 Positive    Well diff  50  0.11695046
## 129 7.567420    no <=2cm 1-3 Positive Intermediate  43  0.10493318
## 130 7.296372    no <=2cm 1-3 Negative  Poorly diff  47  0.30821656
## 132 6.718686    no <=2cm 1-3 Positive Intermediate  47 -0.09643536

Summary of the first 8 variables;

summary(data[,(1:8)]) 
##       time          event       Diam      N              ER     
##  Min.   : 0.05476   no :96   <=2cm:73   >=4: 38   Negative: 27  
##  1st Qu.: 4.70568   yes:48   >2cm :71   1-3:106   Positive:117  
##  Median : 6.99521                                               
##  Mean   : 7.35130                                               
##  3rd Qu.: 9.98631                                               
##  Max.   :17.65914                                               
##           Grade         Age            TSPYL5        
##  Poorly diff :48   Min.   :26.00   Min.   :-1.08275  
##  Intermediate:55   1st Qu.:41.00   1st Qu.:-0.33085  
##  Well diff   :41   Median :45.00   Median :-0.08905  
##                    Mean   :44.31   Mean   :-0.10853  
##                    3rd Qu.:49.00   3rd Qu.: 0.11728  
##                    Max.   :53.00   Max.   : 0.60179

2 Colors in R

One important element of data visualization is the selection of colors. Here you will find a good description of the different ways you can specify a color in R: https://www.r-graph-gallery.com/ggplot2-color.html

Color palette in ggpubr plots: (extracted from function ggpar{ggpubr} help) ggpubr allowed values include “grey” for grey color palettes; brewer palettes e.g. “RdBu”, “Blues”, …; or custom color palette e.g. c(“blue”, “red”); and scientific journal palettes from ggsci R package, e.g.: “npg”, “aaas”, “lancet”, “jco”, “ucscgb”, “uchicago”, “simpsons” and “rickandmorty”.

3 Histograms

Histograms are used to visualize the variation of a numerical variable.

3.1 Base R Histograms

hist(): base R function to create a histogram, with breaks you can specify the number of intervals, with col you specify the color of the plot.

Example: Histogram of expression values of gene NUSAP1:

hist(data$NUSAP1, breaks=20, col = c("cadetblue1"))

You can improve the plot by providing a title (main) and removing the x label and y label:

hist(data$NUSAP1, breaks=20, col = c("cadetblue1"), main="NUSAP1 gene expression", xlab=NULL, ylab=NULL)

3.2 Histograms with ggpubr

gghistogram(): ggpubr function to create a histogram. Option add = "mean" adds a vertical line at the mean value. You can use also add = "median" or even add = c("mean","median"). Option rug = TRUE adds small segments on the x axis corresponding to each observation.

gghistogram(data, x="NUSAP1", 
            add = "mean", rug = TRUE,
            fill = "cadetblue1",
            bins = 20)

By simply adding the option color = "event"we obtain two different histograms according to metastasis indicator (variable event):

gghistogram(data, x="NUSAP1", 
            add = "mean", rug = TRUE,
            color = "event", fill = "event",
            bins = 20,
            palette = c("aquamarine2", "coral1"))

You can easily perform histograms of several variables by specifying their names in x.

Example: Object p creates two histograms, one for gene NUSAP1 and one for gene MMP9:

p<-gghistogram(data, x=c("NUSAP1","MMP9"), 
            add = "mean", rug = TRUE,
            color = "event", fill = "event",
            bins = 20,
            palette = c("aquamarine2", "coral1"))

By specifying p$variablename you obtain the histogram for each variable:

Histogram of NUSAP1:

p$NUSAP1

Histogram of MMP9:

p$MMP9

You can put both histograms side-by-side by adding the option combine = T:

Example: Histograms for gene NUSAP1 and for gene MMP9 alongside each other:

gghistogram(data, x=c("NUSAP1","MMP9"), 
            add = "mean", rug = TRUE,
            combine = T,
            color = "event", fill = "event",
            bins = 20,
            palette = c("aquamarine2", "coral1"))

Function ggpar() can be used to specify some graphical parameters outside the plot. This is useful to obtain different displays of the same plot.

Example: Histogram of gene NUSAP1 expression levels (p) with some graphical parameters (palette, main and caption) specified outside the plot with function ggpar():

p<-gghistogram(data, x="NUSAP1", 
            add = "mean", rug = TRUE,
            combine = T,
            color = "event", fill = "event",
            bins = 20)

ggpar(p, palette = "jco", main="NUSAP1 expression levels", caption = "Figure 1: NUSAP1 expression levels according to event variable")

4 Density plots

Density plots, as histograms, provide a visual representation of the variation of a numerical variable. A density plot is a smoothed version of a histogram.

4.1 Base R density plots

density(): Base R function that provides an estimation of the density function of a variable.

Example: Density plot for NUSAP1:

plot(density(data$NUSAP1))

You can improve the plot by, for instance, specifying a title, the color line and a vertical line corresponding to the mean expression values using function abline(). Option lty specifies the type of line and lty=3 corresponds to a dotted line:

plot(density(data$NUSAP1), main="NUSAP1 density", col="blue")
abline(v=mean(data$NUSAP1), col="blue", lty=3)

4.2 Density plots with ggpubr

ggdensity(): ggpubr function to create a density plot. Similar graphical options as in gghistogramare available. For instance, option add = "mean"or add = "median" adds a vertical line with x value equal to the mean or the median, respectively. Option rug = TRUE adds small segments on the x axis corresponding to each observation.

Example: Density plot for NUSAP1 with a vertical line corresponding to the median value:

ggdensity(data, x="NUSAP1", 
            add = "median", rug = TRUE,
            fill = "cadetblue1")

By adding color = categorical_variable_nameyou obtain the density plots according to the different categories.

Example: Density plot of NUSAP1 expression levels according to metastasis indicator (“event”):

ggdensity(data, x="NUSAP1", 
            add = "mean", rug = TRUE,
            color = "event", fill = "event",
            palette = c("aquamarine2", "coral1"))

You can perform density plots of different variables by specifying their names in parameter “x”:

Example: Density plots of NUSAP1 and MMP9 gene expression levels (p):

p<-ggdensity(data, x=c("NUSAP1","MMP9"), 
               add = "mean", rug = TRUE,
               color = "event", fill = "event",
               palette = c("aquamarine2", "coral1"))

You can call every density plot as p$variable_name.

Density plot for NUSAP1 gene:

p$NUSAP1

Density plot for MMP9 gene:

p$MMP9

You can put the different density plots together by using the option combine = T:

Example: Density plots of NUSAP1 and MMP9 gene expression levels (p) combined:

ggdensity(data, x=c("NUSAP1","MMP9"), 
            add = "mean", rug = TRUE,
            combine = T,
            color = "event", fill = "event",
            palette = c("aquamarine2", "coral1"))

4.3 Density plots of theoretical distributions

curve(): Function to obtain the plot of the density function of a theoretical distribution on a range of values specified by xlim=c(a,b).

Example: Density function of a normal distribution with \(\mu =3\) and \(\sigma=0.6\):

curve(dnorm(x,3,0.6), xlim=c(1,5))

Example: You can add a second curve in the same plot by using add=T. Here we add the curve of a normal distribution with \(\mu =2\) and \(\sigma=1\) to the previous plot and also vertical lines at the mean for both curves. Option lty=3 specifies the type of line:

curve(dnorm(x,3,0.6), xlim=c(1,5), col="purple")
curve(dnorm(x,2,1), add=T, col="brown1")
abline(v=3, col="purple", lty=3)
abline(v=2, col="brown1", lty=3)

polygon(): this function can be used to color the area under a density curve. You should provide x and y, the coordinates of the vertices of the polygon. It is assumed that the polygon is to be closed by joining the last point to the first point.

Example: Density function of a F distribution with degrees of freedom 15 and 5. We use the function polygon() to color the area under the curve for values of x between 1.5 and 3:

curve(df(x,15,5), xlim=c(0,5))
x2 <- seq(1.5,3,0.01)
y2 <-  df(x2, 15, 5)
x2 = c(1.5,x2,3)
y2 = c(0,y2,0)
polygon(x2,y2, col="aliceblue", border=NA)
polygon(rep(1.5,length(y2)),y2, border=T)
y3<-seq(0,df(3,15,5),0.01)
polygon(rep(3,length(y3)),y3, border=T)
curve(df(x,15,5), xlim=c(0,5), add=T)   # we add again the curve on top of the colored region

Example: Density function of a normal distribution with with \(\mu =0\) and \(\sigma=1\) with the right tail colored beyond 1.25:

curve(dnorm(x,0,1), xlim=c(-4,4))
x2 <- seq(1.25,4,0.01)
y2 <-  dnorm(x2, 0, 1)
x2 = c(1.25,x2,4)
y2 = c(0,y2,0)
polygon(x2,y2, col="bisque", border=NA)
curve(dnorm(x,0,1), xlim=c(-4,4), add = TRUE)

5 Boxplots

A boxplot, also called box and whisker plot, is used to represent the variation of a continuous numerical variable. It displays the minimum, first quartile, median, third quartile, and maximum and also possible outliers: https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/box-whisker-plots/a/box-plot-review

5.1 Base R Boxplots

boxplot(): Base R function to obtain the boxplot of a numerical variable.

Example: Boxplot of expression levels of gene NUSAP1:

boxplot(data$NUSAP1, ylab="NUSAP1", col="darkseagreen1")

A multiple boxplot of a numerical variable y according to the different groups of a categorical variable x is obtained as boxplot(y~x).

Example: Boxplot of expression levels of gene NUSAP1 according to Grade:

boxplot(data$NUSAP1~data$Grade, col=c(2,3,4)) 

5.2 Boxplots with ggpubr

ggboxplot(): ggpubr function to obtain the boxplot of a numerical variable y.

Example: Boxplot of expression levels of gene NUSAP1. Option add="point"adds a point for each observation.

ggboxplot(data, y="NUSAP1", 
          add="point",
          color = "blue")

Option add=jitter moves the points randomly around the vertical line so that the observations are not overlapped.

Example: Boxplot of NUSAP1 values with jitter. As observed in the previous plots, the minimum expression value of NUSAP1 is an outlier. We can identify which observation is the outlier by using the option label.select = list(top.down = 1):

ggboxplot(data, y="NUSAP1", 
          color = "blue", add = "jitter", 
          label=rownames(data), 
          label.select = list(top.down = 1))

The size of the points and the jitter width can be modified with option add.params:

ggboxplot(data, y="NUSAP1", 
          color = "blue",
          add = "jitter", add.params = list(size = 1.2, jitter = 0.08), 
          label=rownames(data), 
          label.select = list(top.down = 1))

A multiple boxplot can be obtained by specifying a categorical variable x in addition to the numerical variable y.

Example: Boxplot of NUSAP1 expression levels according to tumour grade:

ggboxplot(data, x = "Grade", y = "NUSAP1",
               color = "Grade", palette ="hue",
               add = "jitter")

We can change the shape of the points with option shape:

ggboxplot(data, x = "Grade", y = "NUSAP1",
               color = "Grade", palette ="hue",
               add = "jitter", shape = "Grade")

Multiple boxplots of several numerical variables:

You can perform multiple boxplot of several numerical variables by specifying their name in y.

Example: Boxplots of NUSAP1 and DIAPH3 expression levels according to tumour grade. Option combine = TRUE puts both plots side-by-side:

 ggboxplot(data, x = "Grade", y = c("NUSAP1","DIAPH3"),
               combine = TRUE,
               color = "Grade", palette = "Paired",
               add = "point", shape = "Grade")

Multiple boxplots according to two categorial variables:

You can perform a multiple boxplot of a numerical variable according to two categorical variables, one is specified as x and the other with function color.

Example: Boxplot of NUSAP1 expression levels according to tumour grade and ER (estrogen receptor status):

 ggboxplot(data, x = "Grade", y = c("NUSAP1"),
               color = "ER", palette = "npg",
               add = "point", shape = "Grade")

Alternatively, you can perform a multiple boxplot of a numerical variable according to two categorical variables, one is specified as x and the other with function facet.by =.

Example: Boxplot of NUSAP1 expression levels according to tumour grade and ER (estrogen receptor status):

 ggboxplot(data, x = "Grade", y = c("NUSAP1"),
              facet.by = "ER",
               color = "Grade", palette = "jco",
               add = "point", shape = "Grade")

Add p-values:

stat_compare_means(): This ggpubr function provides a very simple way of adding mean comparison p-values to a boxplot. The default option implements non parametric tests: the Wilcoxon test (two categories) and Krukal-Wallis test (more than two categories).

You can learn more about adding p-values to ggplots here: http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/76-add-p-values-and-significance-levels-to-ggplots/

Example: Boxplot of NUSAP1 expression levels according to tumour grade and Kruskal-Wallis test p-value:

p<-ggboxplot(data, x = "Grade", y = "NUSAP1",
               color = "Grade", palette = "Paired",
               add = "point")

p + stat_compare_means()      # Adds global p-value, default Kruskal-wallis test

You can adjust the position of the label with options label.x and label.y. In this example we only adjunts the vertical position of the label:

p + stat_compare_means(label.y = 1)      # label.y = 1 adjusts the vertical position of the label

You can implement an ANOVA instead of the default Kruskal-Wallis test with option method:

p + stat_compare_means(method="anova",label.y = 1)

You can also add pairwise tests by specifying a list of the different pairs to be compared:

my_pairs <- list( c("Poorly diff", "Intermediate"), c("Poorly diff", "Well diff"), c("Intermediate", "Well diff") )

p + stat_compare_means(comparisons = my_pairs) # Adds pairwise tests p-values

You can display both, the global p-value and the pairwise comparisons:

p + stat_compare_means(method="t.test",comparisons = my_pairs)+         # pairwise p-values
  stat_compare_means(method="anova",label.y = 1)                        # global p-value

6 Empirical cumulative distribution fuction

For a value x, the empirical cumulative distribution function of a numerical variable provides the proportion of observations with a value lower or equal to x.

6.1 Base R ecdf

ecdf(): Base R function to obtain the empirical cumulative distribution function of a numerical variable.

Example: Empirical cumulative distribution function of NUSAP1 levels:

f<-ecdf(data$NUSAP1)
plot(f)

6.2 Ecdf plots with ggpubr

ggecdf(): ggpubr function to obtain the empirical cumulative distribution function of a numerical variable.

Example: Empirical cumulative distribution function of NUSAP1 levels according to Grade:

ggecdf(data, x = "NUSAP1",
         color = "Grade", 
         linetype = "Grade",
         palette = "Dark2")

7 Scatter plots

Scatter plots provide a graphical representation of the possible relationship between two numerical variables.

7.1 Base R scatter plots

plot(x,y): Base R function to obtain the scatter plot between x and y.

abline(): Adds a straight line to the current plot

Example: Scatter plot of NUSAP1 and DIAPH3 expression levels:

plot(data$NUSAP1, data$DIAPH3)
abline(lm(data$DIAPH3 ~ data$NUSAP1))

7.2 Scatter plots with ggpubr

ggscatter(): ggpubr function to obtain the scatter plot between two numerical variables.

Example: Scatter plot of NUSAP1 and DIAPH3 expression levels with regression line:

ggscatter(data, x = "NUSAP1", y = "DIAPH3",
          add = "reg.line")
## `geom_smooth()` using formula 'y ~ x'

There are several options to improve the scatter plot. For instance, you can add confidence intervals, change the color of the line and the confidence band and display the correlation coefficient as follows:

ggscatter(data, x = "NUSAP1", y = "DIAPH3",
          add = "reg.line",                                      # regression line
          conf.int = TRUE,                                       # confidence interval
          add.params = list(color = "coral1", fill = "bisque")   # color
          )+
  stat_cor(method = "pearson", label.x = .1, label.y = 0.8)      # correlation coefficient
## `geom_smooth()` using formula 'y ~ x'

Scatter plot according to a categorical variable

Example: Scatter plot of NUSAP1 and DIAPH3 expression levels with regression lines for each Grade:

ggscatter(data, x = "NUSAP1", y = "DIAPH3",
          add = "reg.line",                           # regression line
          color = "Grade", palette = "lancet",           # color by "Grade"
          fullrange = TRUE,                           # extending the regression line
          )+
  stat_cor(aes(color = Grade), label.x = -0.7)        # correlation coefficient
## `geom_smooth()` using formula 'y ~ x'

Example: Separate scatters plots of NUSAP1 and DIAPH3 expression levels with regression lines for each Grade:

ggscatter(data, x = "NUSAP1", y = "DIAPH3",
          add = "reg.line",                              # regression line
          color = "Grade", palette = "ucscgb",           # color by groups "Grade"
          facet.by = "Grade", 
          conf.int = TRUE) +
  stat_cor(aes(color = Grade), label.y = .6)
## `geom_smooth()` using formula 'y ~ x'

Scatter plot with ellipses

Example: Scatter plot of NUSAP1 and DIAPH3 expression levels with ellipses for each Grade:

ggscatter(data, x = "NUSAP1", y = "DIAPH3",
          color = "Grade", palette = "aaas",
          shape = "Grade",
          ellipse = TRUE)

8 Barplots

Barplots are mainly used for categorical variables. They represent either counts (the number of observations at each category) or proportions (relative frequency of observations at each category).

8.1 Base R Barplots

barplot(): R function to obtain a barplot from a frequency table.

Example: Table of counts (absolute frequencies) and barplot of Grade:

table(data$Grade)  # counts: absolute frequencies
## 
##  Poorly diff Intermediate    Well diff 
##           48           55           41
barplot(table(data$Grade))  # barplot of counts

Example: Table of proportions (relative frequencies) and barplot of Grade variables:

prop.table(table(data$Grade))  # proportions: relative frequencies
## 
##  Poorly diff Intermediate    Well diff 
##    0.3333333    0.3819444    0.2847222
barplot(prop.table(table(data$Grade)))  # barplot of proportions

The barplot can be improved by customizing several options as the x and y labels and the color the bars:

relfreq<-prop.table(table(data$Grade))  # proportions: relative frequencies

barplot(relfreq, xlab="Grade", ylab="relative frequency",
        col=c("aquamarine2", "coral1","darkturquoise"))

8.2 Barplots with ggpubr

ggbarplot(): ggpubr function to obtain a barplot from a data frame. It requires the table of counts or proportions to be transformed into a data frame.

8.2.1 Barplots for one variable

Example: Table of counts (absolute frequencies) and barplot of Grade:

freq<-table(data$Grade)  # counts: absolute frequencies
freq
## 
##  Poorly diff Intermediate    Well diff 
##           48           55           41
freq<-as.data.frame(freq) # transform the table of counts into a data frame
freq
##           Var1 Freq
## 1  Poorly diff   48
## 2 Intermediate   55
## 3    Well diff   41
colnames(freq)<-c("Grade","Freq")   # Rename the columns of the data frame  
freq
##          Grade Freq
## 1  Poorly diff   48
## 2 Intermediate   55
## 3    Well diff   41
ggbarplot(freq, x="Grade", y="Freq")  # barplot of counts

The plot can be customized with colored bars and labels (counts):

ggbarplot(freq, x="Grade", y="Freq",
          color = "Grade",
          label = TRUE)             # show labels (counts for each category)

Additional options are shown in this example:

ggbarplot(freq, x="Grade", y="Freq",
          color = "Grade", fill ="Grade",   # fill the bars
          label = TRUE,             # show labels (counts for each category)
          lab.pos = "in",  lab.col = "white")            # puts the labels in white inside the bars

8.2.2 Barplot of counts for two variables

Example: Table of counts (absolute frequencies) and barplot of Grade and event variables:

freq<-table(data$Grade, data$event)  # 2 by 2 table of counts: rows = Grade, columns = event 
freq
##               
##                no yes
##   Poorly diff  26  22
##   Intermediate 36  19
##   Well diff    34   7
freq<-as.data.frame(freq)                     # transform the table into a data frame
colnames(freq)<-c("Grade","event","Freq")     # rename the columns of the data frame
freq
##          Grade event Freq
## 1  Poorly diff    no   26
## 2 Intermediate    no   36
## 3    Well diff    no   34
## 4  Poorly diff   yes   22
## 5 Intermediate   yes   19
## 6    Well diff   yes    7
ggbarplot(freq, x="Grade", y="Freq",           
          color = "event",
          label = TRUE) 

8.2.3 Barplots of row proportions for two variables

The table of row proportions is obtained by adding ,1 to prop.table(). The sums of the proportions by rows are equal to 1.

The barplot is obtained by specifying x as the variable that defines de rows of the table and the other variable defines the color of the plot. You should check that the heights of the bars of the barplot are equal to 1.

Example: Table of row proportions and barplot of Grade and event variables:

freqbyrow<-prop.table(table(data$Grade, data$event),1) #  row proportions (Grade)
freqbyrow
##               
##                       no       yes
##   Poorly diff  0.5416667 0.4583333
##   Intermediate 0.6545455 0.3454545
##   Well diff    0.8292683 0.1707317
freqbyrow<-as.data.frame(freqbyrow)                  # transform the table into a data frame
colnames(freqbyrow)<-c("Grade","event","Freq")
freqbyrow
##          Grade event      Freq
## 1  Poorly diff    no 0.5416667
## 2 Intermediate    no 0.6545455
## 3    Well diff    no 0.8292683
## 4  Poorly diff   yes 0.4583333
## 5 Intermediate   yes 0.3454545
## 6    Well diff   yes 0.1707317
ggbarplot(freqbyrow, x="Grade", y="Freq",       # x = variable that defines the rows of the table     
          color = "event",                            
          label = TRUE,
          lab.nb.digits = 2)                    # number of digits for the proportions

8.2.4 Barplots of column proportions for two variables

The table of column proportions is obtained by adding ,2 to prop.table(). The sums of the proportions by columns are equal to 1.

The barplot is obtained by specifying x as the variable that defines de columns of the table and the other variable defines the color of the plot. You should check that the heights of the bars of the barplot are equal to 1.

Example: Table of column proportions and barplot of Grade and event variables:

freqbycol<-prop.table(table(data$Grade, data$event),2) # column proportions (event)
freqbycol
##               
##                       no       yes
##   Poorly diff  0.2708333 0.4583333
##   Intermediate 0.3750000 0.3958333
##   Well diff    0.3541667 0.1458333
freqbycol<-as.data.frame(freqbycol)
colnames(freqbycol)<-c("Grade","event","Freq")
freqbycol
##          Grade event      Freq
## 1  Poorly diff    no 0.2708333
## 2 Intermediate    no 0.3750000
## 3    Well diff    no 0.3541667
## 4  Poorly diff   yes 0.4583333
## 5 Intermediate   yes 0.3958333
## 6    Well diff   yes 0.1458333
ggbarplot(freqbycol, x="event", y="Freq",   # x = variable that defines the columns of the table
          color = "Grade",
          label = TRUE, 
          lab.nb.digits = 2)                # number of digits for the proportions

9 Plots for paired and longitudinal data

So far we have considered one observation per subject, but some analysis involve multiple measures per subject. These data are called paired data when there are two observations per subject, longitudinal data when the observations for each individual were taken at different points in time or, in general, repeated measures data.

The code that we will use requires the data to be in long format:

Wide format: the repeated measures of a subject are in the same row, each measure in a different column

Long format: each reapeated measure of a subject is in a different row, each subject has as many rows as multiple measurements

At the end of this document there is an example on how to convert a data frame from wide to long format using function pivot_longer() from package tidyr

9.1 Plots for paired data with ggpubr

Example: We generate a data frame with paired data: the body temperature was measured before and after the administration of an antipyretic drug for 9 inviduals

temp<-c(38, 36.7, 37.5, 38.1, 39.1, 37.1, 40, 39.2, 37.8, 38.3, 38.6, 36, 37.3, 36.4, 39.5, 37.5, 38.6, 39.1)
id<-as.factor(c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9))
time<-rep(c("before","after"),9)
data<-data.frame(id,time, temp)

Long format data frame:

head(data)
##   id   time temp
## 1  1 before 38.0
## 2  1  after 36.7
## 3  2 before 37.5
## 4  2  after 38.1
## 5  3 before 39.1
## 6  3  after 37.1

ggline(): ggpubr function to plot lines between repeated measures for the same subject:

ggline(data, x = "time", y = "temp", 
       color = "id", palette = "npg")

ggpaired(): this ggpubr function provides an alternative plot for paired data:

ggpaired(data, x = "time", y = "temp",
   color = "time", line.color = "gray", line.size = 0.4,
   palette = "npg")+
 stat_compare_means(paired = TRUE)

9.2 Plots for longitudinal data with ggpubr

Example: We generate a data frame with longitudinal data: the water ph was measured from January to April in 3 locations of a river. Note that not all locations (subjects) have the same number of measurements in all time points:

ph<-c(6.4, 6.5, 6.4, 6.8, 6.7, 6.6, 6.5, 6.2, 6.1, 6.6, 
      6.7, 6.5, 6.2, 6.2, 6.3, 7.4, 6.8, 6.7, 
      6.1, 6.8, 6.9, 7.9, 7.7, 7.6, 7, 7.3, 7.1,
       7.4, 7.2, 6.9, 8.1, 8.2, 8,7.8, 7.4, 8)

location<-c("A","A","A","B","B","B","C","C","C","C",
            "A","A","A","A","A","C","C","C",
            "A","A","A","B","B","B","C","C","C",
            "A","A","A","B","B","B","C","C","C")

month<-c(rep("Jan",10),rep("Feb",8),rep("Mar",9),rep("Apr",9))

data<-data.frame(ph, month,location)

Long format data:

head(data)
##    ph month location
## 1 6.4   Jan        A
## 2 6.5   Jan        A
## 3 6.4   Jan        A
## 4 6.8   Jan        B
## 5 6.7   Jan        B
## 6 6.6   Jan        B

ggline(): ggpubr function to plot lines between repeated measures for the same subject.

Option add="mean_se" provides the mean and the standard error of the different measures for each location at each time point:

ggline(data, x = "month", y = "ph", 
       color = "location", 
       add="mean_se", 
       palette = "aaas")

9.3 Transform a wide data set to a long format

We will use function pivot_longer() from package tidyr to convert data from wide to long format.

Example: We generate a data frame with paired data: the body temperature was measured before and after the administration of an antipyretic drug for 9 inviduals

temp_before<-c(38, 37.5, 39.1, 40, 37.8, 38.6, 37.3, 39.5, 38.6)
temp_after<-c(36.7, 38.1, 37.1, 39.2, 38.3, 36, 36.4, 37.5, 39.1)
id<-as.factor(1:9)

datawide<-data.frame(id,temp_before,temp_after)

Wide format data frame:

datawide
##   id temp_before temp_after
## 1  1        38.0       36.7
## 2  2        37.5       38.1
## 3  3        39.1       37.1
## 4  4        40.0       39.2
## 5  5        37.8       38.3
## 6  6        38.6       36.0
## 7  7        37.3       36.4
## 8  8        39.5       37.5
## 9  9        38.6       39.1

pivot_longer(): tidyr function to transform a wide data set to a long format

data<-pivot_longer(datawide, cols=(2:3))
data
## # A tibble: 18 x 3
##    id    name        value
##    <fct> <chr>       <dbl>
##  1 1     temp_before  38  
##  2 1     temp_after   36.7
##  3 2     temp_before  37.5
##  4 2     temp_after   38.1
##  5 3     temp_before  39.1
##  6 3     temp_after   37.1
##  7 4     temp_before  40  
##  8 4     temp_after   39.2
##  9 5     temp_before  37.8
## 10 5     temp_after   38.3
## 11 6     temp_before  38.6
## 12 6     temp_after   36  
## 13 7     temp_before  37.3
## 14 7     temp_after   36.4
## 15 8     temp_before  39.5
## 16 8     temp_after   37.5
## 17 9     temp_before  38.6
## 18 9     temp_after   39.1
colnames(data)<-c("id","time","temp")

Long format data frame:

data
## # A tibble: 18 x 3
##    id    time         temp
##    <fct> <chr>       <dbl>
##  1 1     temp_before  38  
##  2 1     temp_after   36.7
##  3 2     temp_before  37.5
##  4 2     temp_after   38.1
##  5 3     temp_before  39.1
##  6 3     temp_after   37.1
##  7 4     temp_before  40  
##  8 4     temp_after   39.2
##  9 5     temp_before  37.8
## 10 5     temp_after   38.3
## 11 6     temp_before  38.6
## 12 6     temp_after   36  
## 13 7     temp_before  37.3
## 14 7     temp_after   36.4
## 15 8     temp_before  39.5
## 16 8     temp_after   37.5
## 17 9     temp_before  38.6
## 18 9     temp_after   39.1