In this entry you will find examples of the most commonly used plots for data visualization: histograms, density plots, boxplots, ecd plots, scatter plots, barplots and plots for longitudinal data.
For every kind of plot I provide examples using base R code and using the ggpubr library, an R package that provides simple syntax functions to create ggplot2 plots.
Since I encourage the use of ggpubr package, I will not introduce all base R graphical options. I will present very simple base R plots that can be improved and customized using additional base R graphical parameters.
Base R graphical parameters cheatsheet: https://www.r-graph-gallery.com/6-graph-parameters-reminder.html
ggpubr package vignette: https://rpkgs.datanovia.com/ggpubr/index.html
Some of the examples in this entry have been adapted from http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/ and related entries.
library(RColorBrewer)
library(ggpubr)
library(penalized)
library(glmnet)
library(lattice)
library(ggsci)
library(tidyr)
We will use data nki70 from library “penalized”, a data frame of 144 lymph node positive breast cancer patients on metastasis-free survival, 5 clinical risk factors, and gene expression measurements of 70 genes found to be prognostic for metastasis-free survival in an earlier study. Use the following code to get the data:
data(nki70)
data<- na.omit(nki70) # na.omit removes all incomplete cases (rows) from the dataframe
data$event<-factor(data$event, levels=c(0,1), labels=c("no", "yes")) # event = indicator of metastasis
Look the first rows and columns:
head(data[,(1:8)])
## time event Diam N ER Grade Age TSPYL5
## 125 7.748118 no <=2cm 1-3 Positive Intermediate 50 -0.18752814
## 127 4.662560 yes <=2cm 1-3 Positive Well diff 42 0.15099047
## 128 8.739220 no >2cm 1-3 Positive Well diff 50 0.11695046
## 129 7.567420 no <=2cm 1-3 Positive Intermediate 43 0.10493318
## 130 7.296372 no <=2cm 1-3 Negative Poorly diff 47 0.30821656
## 132 6.718686 no <=2cm 1-3 Positive Intermediate 47 -0.09643536
Summary of the first 8 variables;
summary(data[,(1:8)])
## time event Diam N ER
## Min. : 0.05476 no :96 <=2cm:73 >=4: 38 Negative: 27
## 1st Qu.: 4.70568 yes:48 >2cm :71 1-3:106 Positive:117
## Median : 6.99521
## Mean : 7.35130
## 3rd Qu.: 9.98631
## Max. :17.65914
## Grade Age TSPYL5
## Poorly diff :48 Min. :26.00 Min. :-1.08275
## Intermediate:55 1st Qu.:41.00 1st Qu.:-0.33085
## Well diff :41 Median :45.00 Median :-0.08905
## Mean :44.31 Mean :-0.10853
## 3rd Qu.:49.00 3rd Qu.: 0.11728
## Max. :53.00 Max. : 0.60179
One important element of data visualization is the selection of colors. Here you will find a good description of the different ways you can specify a color in R: https://www.r-graph-gallery.com/ggplot2-color.html
Color palette in ggpubr plots: (extracted from function ggpar{ggpubr}
help) ggpubr allowed values include “grey” for grey color palettes; brewer palettes e.g. “RdBu”, “Blues”, …; or custom color palette e.g. c(“blue”, “red”); and scientific journal palettes from ggsci R package, e.g.: “npg”, “aaas”, “lancet”, “jco”, “ucscgb”, “uchicago”, “simpsons” and “rickandmorty”.
Histograms are used to visualize the variation of a numerical variable.
hist()
: base R function to create a histogram, with breaks
you can specify the number of intervals, with col
you specify the color of the plot.
Example: Histogram of expression values of gene NUSAP1:
hist(data$NUSAP1, breaks=20, col = c("cadetblue1"))
You can improve the plot by providing a title (main
) and removing the x label and y label:
hist(data$NUSAP1, breaks=20, col = c("cadetblue1"), main="NUSAP1 gene expression", xlab=NULL, ylab=NULL)
gghistogram()
: ggpubr function to create a histogram. Option add = "mean"
adds a vertical line at the mean value. You can use also add = "median"
or even add = c("mean","median")
. Option rug = TRUE
adds small segments on the x axis corresponding to each observation.
gghistogram(data, x="NUSAP1",
add = "mean", rug = TRUE,
fill = "cadetblue1",
bins = 20)
By simply adding the option color = "event"
we obtain two different histograms according to metastasis indicator (variable event):
gghistogram(data, x="NUSAP1",
add = "mean", rug = TRUE,
color = "event", fill = "event",
bins = 20,
palette = c("aquamarine2", "coral1"))
You can easily perform histograms of several variables by specifying their names in x
.
Example: Object p
creates two histograms, one for gene NUSAP1 and one for gene MMP9:
p<-gghistogram(data, x=c("NUSAP1","MMP9"),
add = "mean", rug = TRUE,
color = "event", fill = "event",
bins = 20,
palette = c("aquamarine2", "coral1"))
By specifying p$variablename
you obtain the histogram for each variable:
Histogram of NUSAP1:
p$NUSAP1
Histogram of MMP9:
p$MMP9
You can put both histograms side-by-side by adding the option combine = T
:
Example: Histograms for gene NUSAP1 and for gene MMP9 alongside each other:
gghistogram(data, x=c("NUSAP1","MMP9"),
add = "mean", rug = TRUE,
combine = T,
color = "event", fill = "event",
bins = 20,
palette = c("aquamarine2", "coral1"))
Function ggpar()
can be used to specify some graphical parameters outside the plot. This is useful to obtain different displays of the same plot.
Example: Histogram of gene NUSAP1 expression levels (p) with some graphical parameters (palette, main and caption) specified outside the plot with function ggpar()
:
p<-gghistogram(data, x="NUSAP1",
add = "mean", rug = TRUE,
combine = T,
color = "event", fill = "event",
bins = 20)
ggpar(p, palette = "jco", main="NUSAP1 expression levels", caption = "Figure 1: NUSAP1 expression levels according to event variable")
Density plots, as histograms, provide a visual representation of the variation of a numerical variable. A density plot is a smoothed version of a histogram.
density()
: Base R function that provides an estimation of the density function of a variable.
Example: Density plot for NUSAP1:
plot(density(data$NUSAP1))
You can improve the plot by, for instance, specifying a title, the color line and a vertical line corresponding to the mean expression values using function abline()
. Option lty
specifies the type of line and lty=3
corresponds to a dotted line:
plot(density(data$NUSAP1), main="NUSAP1 density", col="blue")
abline(v=mean(data$NUSAP1), col="blue", lty=3)
ggdensity()
: ggpubr function to create a density plot. Similar graphical options as in gghistogram
are available. For instance, option add = "mean"
or add = "median"
adds a vertical line with x value equal to the mean or the median, respectively. Option rug = TRUE
adds small segments on the x axis corresponding to each observation.
Example: Density plot for NUSAP1 with a vertical line corresponding to the median value:
ggdensity(data, x="NUSAP1",
add = "median", rug = TRUE,
fill = "cadetblue1")
By adding color = categorical_variable_name
you obtain the density plots according to the different categories.
Example: Density plot of NUSAP1 expression levels according to metastasis indicator (“event”):
ggdensity(data, x="NUSAP1",
add = "mean", rug = TRUE,
color = "event", fill = "event",
palette = c("aquamarine2", "coral1"))
You can perform density plots of different variables by specifying their names in parameter “x”:
Example: Density plots of NUSAP1 and MMP9 gene expression levels (p):
p<-ggdensity(data, x=c("NUSAP1","MMP9"),
add = "mean", rug = TRUE,
color = "event", fill = "event",
palette = c("aquamarine2", "coral1"))
You can call every density plot as p$variable_name
.
Density plot for NUSAP1 gene:
p$NUSAP1
Density plot for MMP9 gene:
p$MMP9
You can put the different density plots together by using the option combine = T
:
Example: Density plots of NUSAP1 and MMP9 gene expression levels (p) combined:
ggdensity(data, x=c("NUSAP1","MMP9"),
add = "mean", rug = TRUE,
combine = T,
color = "event", fill = "event",
palette = c("aquamarine2", "coral1"))
curve()
: Function to obtain the plot of the density function of a theoretical distribution on a range of values specified by xlim=c(a,b)
.
Example: Density function of a normal distribution with \(\mu =3\) and \(\sigma=0.6\):
curve(dnorm(x,3,0.6), xlim=c(1,5))
Example: You can add a second curve in the same plot by using add=T
. Here we add the curve of a normal distribution with \(\mu =2\) and \(\sigma=1\) to the previous plot and also vertical lines at the mean for both curves. Option lty=3
specifies the type of line:
curve(dnorm(x,3,0.6), xlim=c(1,5), col="purple")
curve(dnorm(x,2,1), add=T, col="brown1")
abline(v=3, col="purple", lty=3)
abline(v=2, col="brown1", lty=3)
polygon()
: this function can be used to color the area under a density curve. You should provide x
and y
, the coordinates of the vertices of the polygon. It is assumed that the polygon is to be closed by joining the last point to the first point.
Example: Density function of a F distribution with degrees of freedom 15 and 5. We use the function polygon()
to color the area under the curve for values of x between 1.5 and 3:
curve(df(x,15,5), xlim=c(0,5))
x2 <- seq(1.5,3,0.01)
y2 <- df(x2, 15, 5)
x2 = c(1.5,x2,3)
y2 = c(0,y2,0)
polygon(x2,y2, col="aliceblue", border=NA)
polygon(rep(1.5,length(y2)),y2, border=T)
y3<-seq(0,df(3,15,5),0.01)
polygon(rep(3,length(y3)),y3, border=T)
curve(df(x,15,5), xlim=c(0,5), add=T) # we add again the curve on top of the colored region
Example: Density function of a normal distribution with with \(\mu =0\) and \(\sigma=1\) with the right tail colored beyond 1.25:
curve(dnorm(x,0,1), xlim=c(-4,4))
x2 <- seq(1.25,4,0.01)
y2 <- dnorm(x2, 0, 1)
x2 = c(1.25,x2,4)
y2 = c(0,y2,0)
polygon(x2,y2, col="bisque", border=NA)
curve(dnorm(x,0,1), xlim=c(-4,4), add = TRUE)
A boxplot, also called box and whisker plot, is used to represent the variation of a continuous numerical variable. It displays the minimum, first quartile, median, third quartile, and maximum and also possible outliers: https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/box-whisker-plots/a/box-plot-review
boxplot()
: Base R function to obtain the boxplot of a numerical variable.
Example: Boxplot of expression levels of gene NUSAP1:
boxplot(data$NUSAP1, ylab="NUSAP1", col="darkseagreen1")
A multiple boxplot of a numerical variable y
according to the different groups of a categorical variable x
is obtained as boxplot(y~x)
.
Example: Boxplot of expression levels of gene NUSAP1 according to Grade:
boxplot(data$NUSAP1~data$Grade, col=c(2,3,4))
ggboxplot()
: ggpubr function to obtain the boxplot of a numerical variable y
.
Example: Boxplot of expression levels of gene NUSAP1. Option add="point"
adds a point for each observation.
ggboxplot(data, y="NUSAP1",
add="point",
color = "blue")
Option add=jitter
moves the points randomly around the vertical line so that the observations are not overlapped.
Example: Boxplot of NUSAP1 values with jitter
. As observed in the previous plots, the minimum expression value of NUSAP1 is an outlier. We can identify which observation is the outlier by using the option label.select = list(top.down = 1)
:
ggboxplot(data, y="NUSAP1",
color = "blue", add = "jitter",
label=rownames(data),
label.select = list(top.down = 1))
The size of the points and the jitter width can be modified with option add.params
:
ggboxplot(data, y="NUSAP1",
color = "blue",
add = "jitter", add.params = list(size = 1.2, jitter = 0.08),
label=rownames(data),
label.select = list(top.down = 1))
A multiple boxplot can be obtained by specifying a categorical variable x
in addition to the numerical variable y
.
Example: Boxplot of NUSAP1 expression levels according to tumour grade:
ggboxplot(data, x = "Grade", y = "NUSAP1",
color = "Grade", palette ="hue",
add = "jitter")
We can change the shape of the points with option shape
:
ggboxplot(data, x = "Grade", y = "NUSAP1",
color = "Grade", palette ="hue",
add = "jitter", shape = "Grade")
Multiple boxplots of several numerical variables:
You can perform multiple boxplot of several numerical variables by specifying their name in y
.
Example: Boxplots of NUSAP1 and DIAPH3 expression levels according to tumour grade. Option combine = TRUE
puts both plots side-by-side:
ggboxplot(data, x = "Grade", y = c("NUSAP1","DIAPH3"),
combine = TRUE,
color = "Grade", palette = "Paired",
add = "point", shape = "Grade")
Multiple boxplots according to two categorial variables:
You can perform a multiple boxplot of a numerical variable according to two categorical variables, one is specified as x
and the other with function color
.
Example: Boxplot of NUSAP1 expression levels according to tumour grade and ER (estrogen receptor status):
ggboxplot(data, x = "Grade", y = c("NUSAP1"),
color = "ER", palette = "npg",
add = "point", shape = "Grade")
Alternatively, you can perform a multiple boxplot of a numerical variable according to two categorical variables, one is specified as x
and the other with function facet.by =
.
Example: Boxplot of NUSAP1 expression levels according to tumour grade and ER (estrogen receptor status):
ggboxplot(data, x = "Grade", y = c("NUSAP1"),
facet.by = "ER",
color = "Grade", palette = "jco",
add = "point", shape = "Grade")
Add p-values:
stat_compare_means()
: This ggpubr function provides a very simple way of adding mean comparison p-values to a boxplot. The default option implements non parametric tests: the Wilcoxon test (two categories) and Krukal-Wallis test (more than two categories).
You can learn more about adding p-values to ggplots here: http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/76-add-p-values-and-significance-levels-to-ggplots/
Example: Boxplot of NUSAP1 expression levels according to tumour grade and Kruskal-Wallis test p-value:
p<-ggboxplot(data, x = "Grade", y = "NUSAP1",
color = "Grade", palette = "Paired",
add = "point")
p + stat_compare_means() # Adds global p-value, default Kruskal-wallis test
You can adjust the position of the label with options label.x
and label.y
. In this example we only adjunts the vertical position of the label:
p + stat_compare_means(label.y = 1) # label.y = 1 adjusts the vertical position of the label
You can implement an ANOVA instead of the default Kruskal-Wallis test with option method
:
p + stat_compare_means(method="anova",label.y = 1)
You can also add pairwise tests by specifying a list of the different pairs to be compared:
my_pairs <- list( c("Poorly diff", "Intermediate"), c("Poorly diff", "Well diff"), c("Intermediate", "Well diff") )
p + stat_compare_means(comparisons = my_pairs) # Adds pairwise tests p-values
You can display both, the global p-value and the pairwise comparisons:
p + stat_compare_means(method="t.test",comparisons = my_pairs)+ # pairwise p-values
stat_compare_means(method="anova",label.y = 1) # global p-value
For a value x, the empirical cumulative distribution function of a numerical variable provides the proportion of observations with a value lower or equal to x.
ecdf()
: Base R function to obtain the empirical cumulative distribution function of a numerical variable.
Example: Empirical cumulative distribution function of NUSAP1 levels:
f<-ecdf(data$NUSAP1)
plot(f)
ggecdf()
: ggpubr function to obtain the empirical cumulative distribution function of a numerical variable.
Example: Empirical cumulative distribution function of NUSAP1 levels according to Grade:
ggecdf(data, x = "NUSAP1",
color = "Grade",
linetype = "Grade",
palette = "Dark2")
Scatter plots provide a graphical representation of the possible relationship between two numerical variables.
plot(x,y)
: Base R function to obtain the scatter plot between x
and y
.
abline()
: Adds a straight line to the current plot
Example: Scatter plot of NUSAP1 and DIAPH3 expression levels:
plot(data$NUSAP1, data$DIAPH3)
abline(lm(data$DIAPH3 ~ data$NUSAP1))
ggscatter()
: ggpubr function to obtain the scatter plot between two numerical variables.
Example: Scatter plot of NUSAP1 and DIAPH3 expression levels with regression line:
ggscatter(data, x = "NUSAP1", y = "DIAPH3",
add = "reg.line")
## `geom_smooth()` using formula 'y ~ x'
There are several options to improve the scatter plot. For instance, you can add confidence intervals, change the color of the line and the confidence band and display the correlation coefficient as follows:
ggscatter(data, x = "NUSAP1", y = "DIAPH3",
add = "reg.line", # regression line
conf.int = TRUE, # confidence interval
add.params = list(color = "coral1", fill = "bisque") # color
)+
stat_cor(method = "pearson", label.x = .1, label.y = 0.8) # correlation coefficient
## `geom_smooth()` using formula 'y ~ x'
Scatter plot according to a categorical variable
Example: Scatter plot of NUSAP1 and DIAPH3 expression levels with regression lines for each Grade:
ggscatter(data, x = "NUSAP1", y = "DIAPH3",
add = "reg.line", # regression line
color = "Grade", palette = "lancet", # color by "Grade"
fullrange = TRUE, # extending the regression line
)+
stat_cor(aes(color = Grade), label.x = -0.7) # correlation coefficient
## `geom_smooth()` using formula 'y ~ x'
Example: Separate scatters plots of NUSAP1 and DIAPH3 expression levels with regression lines for each Grade:
ggscatter(data, x = "NUSAP1", y = "DIAPH3",
add = "reg.line", # regression line
color = "Grade", palette = "ucscgb", # color by groups "Grade"
facet.by = "Grade",
conf.int = TRUE) +
stat_cor(aes(color = Grade), label.y = .6)
## `geom_smooth()` using formula 'y ~ x'
Scatter plot with ellipses
Example: Scatter plot of NUSAP1 and DIAPH3 expression levels with ellipses for each Grade:
ggscatter(data, x = "NUSAP1", y = "DIAPH3",
color = "Grade", palette = "aaas",
shape = "Grade",
ellipse = TRUE)
Barplots are mainly used for categorical variables. They represent either counts (the number of observations at each category) or proportions (relative frequency of observations at each category).
barplot()
: R function to obtain a barplot from a frequency table.
Example: Table of counts (absolute frequencies) and barplot of Grade:
table(data$Grade) # counts: absolute frequencies
##
## Poorly diff Intermediate Well diff
## 48 55 41
barplot(table(data$Grade)) # barplot of counts
Example: Table of proportions (relative frequencies) and barplot of Grade variables:
prop.table(table(data$Grade)) # proportions: relative frequencies
##
## Poorly diff Intermediate Well diff
## 0.3333333 0.3819444 0.2847222
barplot(prop.table(table(data$Grade))) # barplot of proportions
The barplot can be improved by customizing several options as the x and y labels and the color the bars:
relfreq<-prop.table(table(data$Grade)) # proportions: relative frequencies
barplot(relfreq, xlab="Grade", ylab="relative frequency",
col=c("aquamarine2", "coral1","darkturquoise"))
ggbarplot()
: ggpubr function to obtain a barplot from a data frame. It requires the table of counts or proportions to be transformed into a data frame.
Example: Table of counts (absolute frequencies) and barplot of Grade:
freq<-table(data$Grade) # counts: absolute frequencies
freq
##
## Poorly diff Intermediate Well diff
## 48 55 41
freq<-as.data.frame(freq) # transform the table of counts into a data frame
freq
## Var1 Freq
## 1 Poorly diff 48
## 2 Intermediate 55
## 3 Well diff 41
colnames(freq)<-c("Grade","Freq") # Rename the columns of the data frame
freq
## Grade Freq
## 1 Poorly diff 48
## 2 Intermediate 55
## 3 Well diff 41
ggbarplot(freq, x="Grade", y="Freq") # barplot of counts
The plot can be customized with colored bars and labels (counts):
ggbarplot(freq, x="Grade", y="Freq",
color = "Grade",
label = TRUE) # show labels (counts for each category)
Additional options are shown in this example:
ggbarplot(freq, x="Grade", y="Freq",
color = "Grade", fill ="Grade", # fill the bars
label = TRUE, # show labels (counts for each category)
lab.pos = "in", lab.col = "white") # puts the labels in white inside the bars
Example: Table of counts (absolute frequencies) and barplot of Grade and event variables:
freq<-table(data$Grade, data$event) # 2 by 2 table of counts: rows = Grade, columns = event
freq
##
## no yes
## Poorly diff 26 22
## Intermediate 36 19
## Well diff 34 7
freq<-as.data.frame(freq) # transform the table into a data frame
colnames(freq)<-c("Grade","event","Freq") # rename the columns of the data frame
freq
## Grade event Freq
## 1 Poorly diff no 26
## 2 Intermediate no 36
## 3 Well diff no 34
## 4 Poorly diff yes 22
## 5 Intermediate yes 19
## 6 Well diff yes 7
ggbarplot(freq, x="Grade", y="Freq",
color = "event",
label = TRUE)
The table of row proportions is obtained by adding ,1
to prop.table()
. The sums of the proportions by rows are equal to 1.
The barplot is obtained by specifying x
as the variable that defines de rows of the table and the other variable defines the color
of the plot. You should check that the heights of the bars of the barplot are equal to 1.
Example: Table of row proportions and barplot of Grade and event variables:
freqbyrow<-prop.table(table(data$Grade, data$event),1) # row proportions (Grade)
freqbyrow
##
## no yes
## Poorly diff 0.5416667 0.4583333
## Intermediate 0.6545455 0.3454545
## Well diff 0.8292683 0.1707317
freqbyrow<-as.data.frame(freqbyrow) # transform the table into a data frame
colnames(freqbyrow)<-c("Grade","event","Freq")
freqbyrow
## Grade event Freq
## 1 Poorly diff no 0.5416667
## 2 Intermediate no 0.6545455
## 3 Well diff no 0.8292683
## 4 Poorly diff yes 0.4583333
## 5 Intermediate yes 0.3454545
## 6 Well diff yes 0.1707317
ggbarplot(freqbyrow, x="Grade", y="Freq", # x = variable that defines the rows of the table
color = "event",
label = TRUE,
lab.nb.digits = 2) # number of digits for the proportions
The table of column proportions is obtained by adding ,2
to prop.table()
. The sums of the proportions by columns are equal to 1.
The barplot is obtained by specifying x
as the variable that defines de columns of the table and the other variable defines the color
of the plot. You should check that the heights of the bars of the barplot are equal to 1.
Example: Table of column proportions and barplot of Grade and event variables:
freqbycol<-prop.table(table(data$Grade, data$event),2) # column proportions (event)
freqbycol
##
## no yes
## Poorly diff 0.2708333 0.4583333
## Intermediate 0.3750000 0.3958333
## Well diff 0.3541667 0.1458333
freqbycol<-as.data.frame(freqbycol)
colnames(freqbycol)<-c("Grade","event","Freq")
freqbycol
## Grade event Freq
## 1 Poorly diff no 0.2708333
## 2 Intermediate no 0.3750000
## 3 Well diff no 0.3541667
## 4 Poorly diff yes 0.4583333
## 5 Intermediate yes 0.3958333
## 6 Well diff yes 0.1458333
ggbarplot(freqbycol, x="event", y="Freq", # x = variable that defines the columns of the table
color = "Grade",
label = TRUE,
lab.nb.digits = 2) # number of digits for the proportions
So far we have considered one observation per subject, but some analysis involve multiple measures per subject. These data are called paired data when there are two observations per subject, longitudinal data when the observations for each individual were taken at different points in time or, in general, repeated measures data.
The code that we will use requires the data to be in long format:
Wide format: the repeated measures of a subject are in the same row, each measure in a different column
Long format: each reapeated measure of a subject is in a different row, each subject has as many rows as multiple measurements
At the end of this document there is an example on how to convert a data frame from wide to long format using function pivot_longer()
from package tidyr
Example: We generate a data frame with paired data: the body temperature was measured before and after the administration of an antipyretic drug for 9 inviduals
temp<-c(38, 36.7, 37.5, 38.1, 39.1, 37.1, 40, 39.2, 37.8, 38.3, 38.6, 36, 37.3, 36.4, 39.5, 37.5, 38.6, 39.1)
id<-as.factor(c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9))
time<-rep(c("before","after"),9)
data<-data.frame(id,time, temp)
Long format data frame:
head(data)
## id time temp
## 1 1 before 38.0
## 2 1 after 36.7
## 3 2 before 37.5
## 4 2 after 38.1
## 5 3 before 39.1
## 6 3 after 37.1
ggline()
: ggpubr function to plot lines between repeated measures for the same subject:
ggline(data, x = "time", y = "temp",
color = "id", palette = "npg")
ggpaired()
: this ggpubr function provides an alternative plot for paired data:
ggpaired(data, x = "time", y = "temp",
color = "time", line.color = "gray", line.size = 0.4,
palette = "npg")+
stat_compare_means(paired = TRUE)
Example: We generate a data frame with longitudinal data: the water ph was measured from January to April in 3 locations of a river. Note that not all locations (subjects) have the same number of measurements in all time points:
ph<-c(6.4, 6.5, 6.4, 6.8, 6.7, 6.6, 6.5, 6.2, 6.1, 6.6,
6.7, 6.5, 6.2, 6.2, 6.3, 7.4, 6.8, 6.7,
6.1, 6.8, 6.9, 7.9, 7.7, 7.6, 7, 7.3, 7.1,
7.4, 7.2, 6.9, 8.1, 8.2, 8,7.8, 7.4, 8)
location<-c("A","A","A","B","B","B","C","C","C","C",
"A","A","A","A","A","C","C","C",
"A","A","A","B","B","B","C","C","C",
"A","A","A","B","B","B","C","C","C")
month<-c(rep("Jan",10),rep("Feb",8),rep("Mar",9),rep("Apr",9))
data<-data.frame(ph, month,location)
Long format data:
head(data)
## ph month location
## 1 6.4 Jan A
## 2 6.5 Jan A
## 3 6.4 Jan A
## 4 6.8 Jan B
## 5 6.7 Jan B
## 6 6.6 Jan B
ggline()
: ggpubr function to plot lines between repeated measures for the same subject.
Option add="mean_se"
provides the mean and the standard error of the different measures for each location at each time point:
ggline(data, x = "month", y = "ph",
color = "location",
add="mean_se",
palette = "aaas")
We will use function pivot_longer()
from package tidyr to convert data from wide to long format.
Example: We generate a data frame with paired data: the body temperature was measured before and after the administration of an antipyretic drug for 9 inviduals
temp_before<-c(38, 37.5, 39.1, 40, 37.8, 38.6, 37.3, 39.5, 38.6)
temp_after<-c(36.7, 38.1, 37.1, 39.2, 38.3, 36, 36.4, 37.5, 39.1)
id<-as.factor(1:9)
datawide<-data.frame(id,temp_before,temp_after)
Wide format data frame:
datawide
## id temp_before temp_after
## 1 1 38.0 36.7
## 2 2 37.5 38.1
## 3 3 39.1 37.1
## 4 4 40.0 39.2
## 5 5 37.8 38.3
## 6 6 38.6 36.0
## 7 7 37.3 36.4
## 8 8 39.5 37.5
## 9 9 38.6 39.1
pivot_longer()
: tidyr function to transform a wide data set to a long format
data<-pivot_longer(datawide, cols=(2:3))
data
## # A tibble: 18 x 3
## id name value
## <fct> <chr> <dbl>
## 1 1 temp_before 38
## 2 1 temp_after 36.7
## 3 2 temp_before 37.5
## 4 2 temp_after 38.1
## 5 3 temp_before 39.1
## 6 3 temp_after 37.1
## 7 4 temp_before 40
## 8 4 temp_after 39.2
## 9 5 temp_before 37.8
## 10 5 temp_after 38.3
## 11 6 temp_before 38.6
## 12 6 temp_after 36
## 13 7 temp_before 37.3
## 14 7 temp_after 36.4
## 15 8 temp_before 39.5
## 16 8 temp_after 37.5
## 17 9 temp_before 38.6
## 18 9 temp_after 39.1
colnames(data)<-c("id","time","temp")
Long format data frame:
data
## # A tibble: 18 x 3
## id time temp
## <fct> <chr> <dbl>
## 1 1 temp_before 38
## 2 1 temp_after 36.7
## 3 2 temp_before 37.5
## 4 2 temp_after 38.1
## 5 3 temp_before 39.1
## 6 3 temp_after 37.1
## 7 4 temp_before 40
## 8 4 temp_after 39.2
## 9 5 temp_before 37.8
## 10 5 temp_after 38.3
## 11 6 temp_before 38.6
## 12 6 temp_after 36
## 13 7 temp_before 37.3
## 14 7 temp_after 36.4
## 15 8 temp_before 39.5
## 16 8 temp_after 37.5
## 17 9 temp_before 38.6
## 18 9 temp_after 39.1