An example of a characteristic/demograpghic table
We aim to automatically generate a characteristic/demograpghic table between on a subject of interest (usually capable of dichotomizing a cohort, such as a treatment or a gene expression) and multiple parameters (like age, BMI, gender and genetic background). This type of table is well populated in clinical/epidemiologic publication where our study subject has a complex correlation with other parameters; the table is capable of summarising how the study subject is distributed with other parameters and,
- if we use Pearson’s chi-square test, whether there is an association between the subject and each parameter OR
- if we report odds ratios, to what extent the odds in favor of a subgroup in parameter is assciated with a subgroup in our subject.
An example table (adapted from Table II, Jin et. al., 2017) is a typical characteristic table that summarises how a cohort with their HS6ST2 expression measured (the columns) is distributed along with various parameters (the rows); according to the P-value from Pearson’s chi-square test, we could realise that “local invasion”, “distance metastasis” and “TNM stage” are the three parameters that associate with HS6ST2 expression.
Simulating a dataset (skippable)
To regenerate the table automatically, we firstly simulate the raw data from the Table II. It basically reverses the contigenct tables back to counts and cbind them together.
library(tidyverse)
library(kableExtra)
countsToCases <- function(x, countcol = "Freq") {
# Get the row indices to pull from x
idx <- rep.int(seq_len(nrow(x)), x[[countcol]])
# Drop count column
x[[countcol]] <- NULL
# Get the rows from x
x[idx, ]
}
sex <- as.table(matrix(c(56L, 22L, 24L, 8L), ncol = 2,
dimnames = list(sex = c("Male", "Female"), expr = c("Positive", "Negative")))) %>%
as_tibble(n = "Freq") %>%
countsToCases(.)
age <- as.table(matrix(c(40L, 38L, 14L, 18L), ncol = 2, dimnames = list(age = c(">=60", "<60"), expr = c("Positive", "Negative")))) %>%
as_tibble(n = "Freq") %>%
countsToCases(.)
ts <- as.table(matrix(c(45L, 33L, 17L, 15L), ncol = 2, dimnames = list(size = c(">=50", "<50"), expr = c("Positive", "Negative")))) %>%
as_tibble(n = "Freq") %>%
countsToCases(.)
tl <- as.table(matrix(c(20L, 28L, 30L, 6L, 10L, 16L), ncol = 2, dimnames = list(loc = c("Fundus", "Body", "Antrum"), expr = c("Positive", "Negative")))) %>%
as_tibble(n = "Freq") %>%
countsToCases(.)
di <- as.table(matrix(c(32L, 46L, 12L, 20L), ncol = 2, dimnames = list(diff = c("Well or moderate", "Poor"), expr = c("Positive", "Negative")))) %>%
as_tibble(n = "Freq") %>%
countsToCases(.)
li <- as.table(matrix(c(14L, 64L, 12L, 20L), ncol = 2, dimnames = list(inv = c("T1-T2", "T3-T4"), expr = c("Positive", "Negative")))) %>%
as_tibble(n = "Freq") %>%
countsToCases(.)
nm <- as.table(matrix(c(58L, 20L, 23L, 9L), ncol = 2, dimnames = list(node = c("Yes", "No"), expr = c("Positive", "Negative")))) %>%
as_tibble(n = "Freq") %>%
countsToCases(.)
dm <- as.table(matrix(c(22L, 56L, 2L, 30L), ncol = 2, dimnames = list(meta = c("Yes", "No"), expr = c("Positive", "Negative")))) %>%
as_tibble(n = "Freq") %>%
countsToCases(.)
tst <- as.table(matrix(c(7L, 13L, 36L,22L, 6L, 9L, 15L, 2L), ncol = 2, dimnames = list(tnm = c("I", "II", "III","IV"), expr = c("Positive", "Negative")))) %>%
as_tibble(n = "Freq") %>%
countsToCases(.)
## checking the +/- expression column.
x <- cbind(sex[,2],
age[,2],
ts[,2],
tl[,2],
di[,2],
li[,2],
nm[,2],
dm[,2],
tst[,2])
k <- as.data.frame(expand.grid(1:ncol(x),1:ncol(x)))
## test whether HS6ST2 is the same to ensure no errors.
# all(sapply(1:nrow(k), function(n){
# all(identical(x[,k[n,1]], x[,k[n,2]]))
# }))
df <- bind_cols(sex[,1], age[,1], ts[,1], tl[,1], di[,1], li[,1], nm[,1], dm[,1], tst) %>%
mutate_at(.vars = c("sex"), list(~factor(., levels = c("Male", "Female")))) %>%
mutate_at(.vars = c("age"), list(~factor(., levels = c(">=60", "<60")))) %>%
mutate_at(.vars = c("size"), list(~factor(., levels = c(">=50", "<50")))) %>%
mutate_at(.vars = c("loc"), list(~factor(., levels = c("Fundus", "Body", "Antrum")))) %>%
mutate_at(.vars = c("diff"), list(~factor(., levels = c("Well or moderate", "Poor")))) %>%
mutate_at(.vars = c("inv"), list(~factor(., levels = c("T1-T2", "T3-T4")))) %>%
mutate_at(.vars = c("node"), list(~factor(., levels = c("Yes", "No")))) %>%
mutate_at(.vars = c("meta"), list(~factor(., levels = c("Yes", "No")))) %>%
mutate_at(.vars = c("tnm"), list(~factor(., levels = c("I", "II", "III","IV")))) %>%
mutate_at(.vars = c("expr"), list(~factor(., levels = c("Positive", "Negative"))))
colnames(df) <- c("Sex", "Age, years", "Tumor size, mm", "Tumor location", "Differentiation", "Local invasion", "Node metastasis", "Distant metastasis", "TNM stage", "HS6ST2 expression")
## head(df)
The actual code
The main weightlifting part of the code contains the following,
- obtain contingency tables between the study subject with a common parameter (
table(parameters, subject)
) to ensure the subject is the columns). - perform statistical test (fisher exact, chisq or kendall, etc).
- formatting on decimal place, row alignment and indentation.
do.call(rbind)
all the contigency table together.
## about using one decimal place through presenting (https://stackoverflow.com/a/12135122)
## wrt data presentation, package "[formattable](https://github.com/renkun-ken/formattable)" should be explored
specify_decimal <- function(x, k = 1) {
if (is.nan(x)) {
trimws(format(round(0, k), nsmall=k))
} else {
trimws(format(round(x, k), nsmall=k))}
}
mk_ftbl <- function(subj, cov,
statsTest = rep.int("chisq", length(cov)),
total_cases = c("seperate", "together"),
show_row_perc = c(FALSE, TRUE),
useNA = c("no", "always"),
kable_output = c("latex", "html")){
## statsTest should be length one or same length as `cov`
if (length(statsTest) == 1){
statsTest <- rep.int(statsTest, length(cov))
} else if (length(statsTest) != length(cov)) {
stop("statsTest should be length one or same length as `cov`")
}
l <- list()
for (ind_ncol in 1:ncol(cov)) {
## extract the current parameters
ind <- cov[[ind_ncol]]
ind_name <- names(cov[,ind_ncol])
## perform stat test
if (statsTest[ind_ncol] == "chisq"){
pval <- chisq.test(subj, as.factor(ind), correct = FALSE)$p.value # no continuity correction.
} else if (statsTest == "fisher"){
pval <- fisher.test(subj, as.factor(ind))$p.value
} else if (statsTest == "kendall"){
pval <- Kendall::Kendall(subj, ind)$sl
} else {stop("statsTest is not an instance within the current scope. You can add your own test.")}
pval <- specify_decimal(pval, 3)
## construct contingency table
a <- table(ind, subj, useNA = useNA)
if (useNA == "always"){
## in useNA == "always", change the last row to "unavailable", get rid of last column since subject should be cleaned with no NAs.
rownames(a)[nrow(a)] <- "Unavailable"
a <- a[,-ncol(a)]
rn <- rownames(a)
} else if (useNA == "no") {
rn <- rownames(a)
} else {
stop("useNA can only be \"always\" or \"no\".")
}
a <- as.tibble(as.data.frame.matrix(a)) %>%
rownames_to_column(., var = "rn")
a$rn <- rn
## formatting total cases and percentages
perc_tbl <- t(apply(a,1, function(r){
new_row <- c()
a <- as.numeric(r[2])
b <- as.numeric(r[3])
## should percentage be concatinated with values?
if (show_row_perc){
new_row[2] <- paste0(a, " (", specify_decimal(a/(a+b)*100), ")")
new_row[3] <- paste0(b, " (", specify_decimal(b/(a+b)*100), ")")
} else {
new_row[2] <- a
new_row[3] <- b
}
## how to display total cases
if (total_cases == "seperate") {
new_row[2:length(new_row)+1] <- new_row[2:length(new_row)] # shift the vector to right by 1.
new_row[1] <- r[1]
new_row[2] <- a+b
final_colnames <<- c("Parameter", "Cases (n)", levels(subj), "P-value") # value used outside of apply should be global assigned.
} else if (total_cases == "together") {
new_row[1] <- paste0(r[1], " (n=", a+b, ")")
final_colnames <<- c("Parameter", levels(subj), "P-value")
} else (stop("total_cases should be either \"together\" or \"seperate\""))
return(new_row)
})) %>%
as.data.frame.matrix(., stringsAsFactors = F) %>%
as.tibble()
## formatting paste ind_name and pval to perc_tbl
perc_tbl[,ncol(perc_tbl)+1] <- as.character(NA) # a slot for pval
perc_tbl[nrow(perc_tbl)+1,] <- as.character(NA) # a slot for header
perc_tbl <- perc_tbl[c(nrow(perc_tbl), 1:(nrow(perc_tbl)-1)),]
perc_tbl[1, c(ncol(perc_tbl))] <- pval
perc_tbl[1, 1] <- ind_name
l[[ind_ncol]] <- perc_tbl
}
ftab <<- do.call(rbind,l)
colnames(ftab) <- final_colnames
ftab[is.na(ftab)] <- ""
## kable output format (row level indentation)
row_indent <- which(ftab[[2]] != "")
fs <- 15
if (kable_output == "latex") {
kable(ftab, format = "latex", booktabs = TRUE, linesep = "") %>%
kable_styling(font_size = fs) %>%
add_indent(row_indent)
} else if (kable_output == "html"){
html <- kable(ftab, format = "html", align = c("l", rep.int("c", length(ftab)))) %>%
kable_styling(bootstrap_options = c("striped", "hover"), position = "center", font_size = fs) %>%
add_indent(row_indent)
if (total_cases == "seperate") {
html %>% add_header_above(c(" ", " ", "HS6ST2\n expression" = 2, " "))
} else if (total_cases == "together") {
html %>% add_header_above(c(" ", "HS6ST2\n expression" = 2, " "))
} else (stop("total_cases should be either \"together\" or \"seperate\""))
} else {
stop ("Only \"latex\" and \"html\" format is considered here.")
}
}
The following table looks mostly likely to Table II
Parameter | Cases (n) | Positive | Negative | P-value |
---|---|---|---|---|
Sex | 0.732 | |||
Male | 80 | 56 | 24 | |
Female | 30 | 22 | 8 | |
Age, years | 0.473 | |||
>=60 | 54 | 40 | 14 | |
<60 | 56 | 38 | 18 | |
Tumor size, mm | 0.661 | |||
>=50 | 62 | 45 | 17 | |
<50 | 48 | 33 | 15 | |
Tumor location | 0.517 | |||
Fundus | 26 | 20 | 6 | |
Body | 38 | 28 | 10 | |
Antrum | 46 | 30 | 16 | |
Differentiation | 0.732 | |||
Well or moderate | 44 | 32 | 12 | |
Poor | 66 | 46 | 20 | |
Local invasion | 0.028 | |||
T1-T2 | 26 | 14 | 12 | |
T3-T4 | 84 | 64 | 20 | |
Node metastasis | 0.788 | |||
Yes | 81 | 58 | 23 | |
No | 29 | 20 | 9 | |
Distant metastasis | 0.011 | |||
Yes | 24 | 22 | 2 | |
No | 86 | 56 | 30 | |
TNM stage | 0.039 | |||
I | 13 | 7 | 6 | |
II | 22 | 13 | 9 | |
III | 51 | 36 | 15 | |
IV | 24 | 22 | 2 |
A format that I used before.
Parameter | Positive | Negative | P-value |
---|---|---|---|
Sex | 0.732 | ||
Male (n=80) | 56 (70.0) | 24 (30.0) | |
Female (n=30) | 22 (73.3) | 8 (26.7) | |
Unavailable (n=0) | 0 (0.0) | 0 (0.0) | |
Age, years | 0.473 | ||
>=60 (n=54) | 40 (74.1) | 14 (25.9) | |
<60 (n=56) | 38 (67.9) | 18 (32.1) | |
Unavailable (n=0) | 0 (0.0) | 0 (0.0) | |
Tumor size, mm | 0.661 | ||
>=50 (n=62) | 45 (72.6) | 17 (27.4) | |
<50 (n=48) | 33 (68.8) | 15 (31.2) | |
Unavailable (n=0) | 0 (0.0) | 0 (0.0) | |
Tumor location | 0.517 | ||
Fundus (n=26) | 20 (76.9) | 6 (23.1) | |
Body (n=38) | 28 (73.7) | 10 (26.3) | |
Antrum (n=46) | 30 (65.2) | 16 (34.8) | |
Unavailable (n=0) | 0 (0.0) | 0 (0.0) | |
Differentiation | 0.732 | ||
Well or moderate (n=44) | 32 (72.7) | 12 (27.3) | |
Poor (n=66) | 46 (69.7) | 20 (30.3) | |
Unavailable (n=0) | 0 (0.0) | 0 (0.0) | |
Local invasion | 0.028 | ||
T1-T2 (n=26) | 14 (53.8) | 12 (46.2) | |
T3-T4 (n=84) | 64 (76.2) | 20 (23.8) | |
Unavailable (n=0) | 0 (0.0) | 0 (0.0) | |
Node metastasis | 0.788 | ||
Yes (n=81) | 58 (71.6) | 23 (28.4) | |
No (n=29) | 20 (69.0) | 9 (31.0) | |
Unavailable (n=0) | 0 (0.0) | 0 (0.0) | |
Distant metastasis | 0.011 | ||
Yes (n=24) | 22 (91.7) | 2 (8.3) | |
No (n=86) | 56 (65.1) | 30 (34.9) | |
Unavailable (n=0) | 0 (0.0) | 0 (0.0) | |
TNM stage | 0.039 | ||
I (n=13) | 7 (53.8) | 6 (46.2) | |
II (n=22) | 13 (59.1) | 9 (40.9) | |
III (n=51) | 36 (70.6) | 15 (29.4) | |
IV (n=24) | 22 (91.7) | 2 (8.3) | |
Unavailable (n=0) | 0 (0.0) | 0 (0.0) |