An example of a characteristic/demograpghic table

We aim to automatically generate a characteristic/demograpghic table between on a subject of interest (usually capable of dichotomizing a cohort, such as a treatment or a gene expression) and multiple parameters (like age, BMI, gender and genetic background). This type of table is well populated in clinical/epidemiologic publication where our study subject has a complex correlation with other parameters; the table is capable of summarising how the study subject is distributed with other parameters and,

if we use Pearson’s chi-square test, whether there is an association between the subject and each parameter OR
if we report odds ratios, to what extent the odds in favor of a subgroup in parameter is assciated with a subgroup in our subject.

An example table (adapted from Table II, Jin et. al., 2017) is a typical characteristic table that summarises how a cohort with their HS6ST2 expression measured (the columns) is distributed along with various parameters (the rows); according to the P-value from Pearson’s chi-square test, we could realise that “local invasion”, “distance metastasis” and “TNM stage” are the three parameters that associate with HS6ST2 expression.

Simulating a dataset (skippable)

To regenerate the table automatically, we firstly simulate the raw data from the Table II. It basically reverses the contigenct tables back to counts and cbind them together.

library(tidyverse)
library(kableExtra)


countsToCases <- function(x, countcol = "Freq") {
    # Get the row indices to pull from x
    idx <- rep.int(seq_len(nrow(x)), x[[countcol]])
    
    # Drop count column
    x[[countcol]] <- NULL

    # Get the rows from x
    x[idx, ]
}

sex <- as.table(matrix(c(56L, 22L, 24L, 8L), ncol = 2, 
                       dimnames = list(sex = c("Male", "Female"), expr = c("Positive", "Negative")))) %>% 
  as_tibble(n = "Freq") %>%
  countsToCases(.)

age <- as.table(matrix(c(40L, 38L, 14L, 18L), ncol = 2, dimnames = list(age = c(">=60", "<60"), expr = c("Positive", "Negative")))) %>% 
  as_tibble(n = "Freq") %>%
  countsToCases(.)

ts <- as.table(matrix(c(45L, 33L, 17L, 15L), ncol = 2, dimnames = list(size = c(">=50", "<50"), expr = c("Positive", "Negative")))) %>% 
  as_tibble(n = "Freq") %>%
  countsToCases(.)

tl <- as.table(matrix(c(20L, 28L, 30L, 6L, 10L, 16L), ncol = 2, dimnames = list(loc = c("Fundus", "Body", "Antrum"), expr = c("Positive", "Negative")))) %>% 
  as_tibble(n = "Freq") %>%
  countsToCases(.)

di <- as.table(matrix(c(32L, 46L, 12L, 20L), ncol = 2, dimnames = list(diff = c("Well or moderate", "Poor"), expr = c("Positive", "Negative")))) %>% 
  as_tibble(n = "Freq") %>%
  countsToCases(.)

li <- as.table(matrix(c(14L, 64L, 12L, 20L), ncol = 2, dimnames = list(inv = c("T1-T2", "T3-T4"), expr = c("Positive", "Negative")))) %>% 
  as_tibble(n = "Freq") %>%
  countsToCases(.)

nm <- as.table(matrix(c(58L, 20L, 23L, 9L), ncol = 2, dimnames = list(node = c("Yes", "No"), expr = c("Positive", "Negative")))) %>% 
  as_tibble(n = "Freq") %>%
  countsToCases(.)

dm <- as.table(matrix(c(22L, 56L, 2L, 30L), ncol = 2, dimnames = list(meta = c("Yes", "No"), expr = c("Positive", "Negative")))) %>% 
  as_tibble(n = "Freq") %>%
  countsToCases(.)

tst <- as.table(matrix(c(7L, 13L, 36L,22L, 6L, 9L, 15L, 2L), ncol = 2, dimnames = list(tnm = c("I", "II", "III","IV"), expr = c("Positive", "Negative")))) %>% 
  as_tibble(n = "Freq") %>%
  countsToCases(.)


## checking the +/- expression column. 
x <- cbind(sex[,2],
      age[,2],
      ts[,2],
      tl[,2],
      di[,2],
      li[,2],
      nm[,2],
      dm[,2],
      tst[,2])
k <- as.data.frame(expand.grid(1:ncol(x),1:ncol(x))) 

## test whether HS6ST2 is the same to ensure no errors.
# all(sapply(1:nrow(k), function(n){
#   all(identical(x[,k[n,1]], x[,k[n,2]]))
# }))


df <- bind_cols(sex[,1], age[,1], ts[,1], tl[,1], di[,1], li[,1], nm[,1], dm[,1], tst) %>% 
  mutate_at(.vars = c("sex"), list(~factor(., levels = c("Male", "Female")))) %>%
  mutate_at(.vars = c("age"), list(~factor(., levels = c(">=60", "<60")))) %>%
  mutate_at(.vars = c("size"), list(~factor(., levels = c(">=50", "<50")))) %>%
  mutate_at(.vars = c("loc"), list(~factor(., levels = c("Fundus", "Body", "Antrum")))) %>%
  mutate_at(.vars = c("diff"), list(~factor(., levels = c("Well or moderate", "Poor")))) %>%
  mutate_at(.vars = c("inv"), list(~factor(., levels = c("T1-T2", "T3-T4")))) %>%
  mutate_at(.vars = c("node"), list(~factor(., levels = c("Yes", "No")))) %>%
  mutate_at(.vars = c("meta"), list(~factor(., levels = c("Yes", "No")))) %>%
  mutate_at(.vars = c("tnm"), list(~factor(., levels = c("I", "II", "III","IV")))) %>%
  mutate_at(.vars = c("expr"), list(~factor(., levels = c("Positive", "Negative"))))

colnames(df) <- c("Sex", "Age, years", "Tumor size, mm", "Tumor location", "Differentiation", "Local invasion", "Node metastasis", "Distant metastasis", "TNM stage", "HS6ST2 expression")

## head(df)

The actual code

The main weightlifting part of the code contains the following,

obtain contingency tables between the study subject with a common parameter (table(parameters, subject)) to ensure the subject is the columns).
perform statistical test (fisher exact, chisq or kendall, etc).
formatting on decimal place, row alignment and indentation.
do.call(rbind) all the contigency table together.

## about using one decimal place through presenting (https://stackoverflow.com/a/12135122)
## wrt data presentation, package "[formattable](https://github.com/renkun-ken/formattable)" should be explored 
specify_decimal <- function(x, k = 1) {
  if (is.nan(x)) {
    trimws(format(round(0, k), nsmall=k))
  } else {
    trimws(format(round(x, k), nsmall=k))}
}

mk_ftbl <- function(subj, cov, 
                    statsTest = rep.int("chisq", length(cov)), 
                    total_cases = c("seperate", "together"),
                    show_row_perc = c(FALSE, TRUE), 
                    useNA = c("no", "always"),
                    kable_output = c("latex", "html")){
  ## statsTest should be length one or same length as `cov`
  if (length(statsTest) == 1){
    statsTest  <- rep.int(statsTest, length(cov))
  } else if (length(statsTest) != length(cov)) {
    stop("statsTest should be length one or same length as `cov`")
  }
  

    l <- list()
  
  for (ind_ncol in 1:ncol(cov)) {
    ## extract the current parameters
    ind <- cov[[ind_ncol]]
    ind_name <- names(cov[,ind_ncol])
    
    ## perform stat test  
    if (statsTest[ind_ncol] == "chisq"){
      pval <- chisq.test(subj, as.factor(ind), correct = FALSE)$p.value # no continuity correction. 
    } else if (statsTest == "fisher"){
      pval <- fisher.test(subj, as.factor(ind))$p.value
    } else if (statsTest == "kendall"){
      pval <- Kendall::Kendall(subj, ind)$sl
    } else {stop("statsTest is not an instance within the current scope. You can add your own test.")}
    pval <- specify_decimal(pval, 3) 
    
    ## construct contingency table
    a <- table(ind, subj, useNA = useNA)
    if (useNA == "always"){
      ## in useNA == "always", change the last row to "unavailable", get rid of last column since subject should be cleaned with no NAs. 
      rownames(a)[nrow(a)] <- "Unavailable" 
      a <- a[,-ncol(a)] 
      rn <- rownames(a)
    } else if (useNA == "no") {
      rn <- rownames(a)
    } else {
      stop("useNA can only be \"always\" or \"no\".")
    }
    a <- as.tibble(as.data.frame.matrix(a)) %>%
      rownames_to_column(., var = "rn")
    a$rn <- rn
    
    ## formatting total cases and percentages
    perc_tbl <- t(apply(a,1, function(r){
      new_row <- c()
      a <- as.numeric(r[2])
      b <- as.numeric(r[3])
      
      ## should percentage be concatinated with values?
      if (show_row_perc){
        new_row[2] <- paste0(a, " (", specify_decimal(a/(a+b)*100), ")")
        new_row[3] <- paste0(b, " (", specify_decimal(b/(a+b)*100), ")")
      } else {
        new_row[2] <- a 
        new_row[3] <- b
      }
      
      ## how to display total cases
      if (total_cases == "seperate") {
        new_row[2:length(new_row)+1] <- new_row[2:length(new_row)] # shift the vector to right by 1. 
        new_row[1] <- r[1]
        new_row[2] <- a+b
        final_colnames <<- c("Parameter", "Cases (n)", levels(subj), "P-value") # value used outside of apply should be global assigned.  
      } else if (total_cases == "together") {
        new_row[1] <- paste0(r[1], " (n=", a+b, ")")
        final_colnames <<- c("Parameter",  levels(subj), "P-value")
      } else (stop("total_cases should be either \"together\" or \"seperate\""))
      return(new_row)
    })) %>% 
      as.data.frame.matrix(., stringsAsFactors = F) %>%
      as.tibble()
    
    ## formatting paste ind_name and pval to perc_tbl
    perc_tbl[,ncol(perc_tbl)+1] <- as.character(NA) # a slot for pval
    perc_tbl[nrow(perc_tbl)+1,] <- as.character(NA) # a slot for header
    perc_tbl <- perc_tbl[c(nrow(perc_tbl), 1:(nrow(perc_tbl)-1)),] 
    perc_tbl[1, c(ncol(perc_tbl))] <- pval
    perc_tbl[1, 1] <- ind_name
    l[[ind_ncol]] <- perc_tbl
  }
  
  ftab <<- do.call(rbind,l)
  colnames(ftab) <- final_colnames
  ftab[is.na(ftab)] <- ""
  
  ## kable output format (row level indentation)
  row_indent <- which(ftab[[2]] != "")
  fs <- 15
  
  if (kable_output == "latex") {
    kable(ftab, format = "latex", booktabs = TRUE, linesep = "") %>% 
      kable_styling(font_size = fs) %>%
      add_indent(row_indent)
  } else if (kable_output == "html"){
    html <- kable(ftab, format = "html", align = c("l", rep.int("c", length(ftab)))) %>% 
      kable_styling(bootstrap_options = c("striped", "hover"), position = "center", font_size = fs) %>%
      add_indent(row_indent)
    
    if (total_cases == "seperate") {
      html %>% add_header_above(c(" ", " ", "HS6ST2\n expression" = 2, " "))
    } else if (total_cases == "together") {
      html %>% add_header_above(c(" ", "HS6ST2\n expression" = 2, " "))
    } else (stop("total_cases should be either \"together\" or \"seperate\""))
    
  } else {
    stop ("Only \"latex\" and \"html\" format is considered here.")
  }
}

The following table looks mostly likely to Table II

		HS6ST2 expression
Parameter	Cases (n)	Positive	Negative	P-value
Sex				0.732
Male	80	56	24
Female	30	22	8
Age, years				0.473
>=60	54	40	14
<60	56	38	18
Tumor size, mm				0.661
>=50	62	45	17
<50	48	33	15
Tumor location				0.517
Fundus	26	20	6
Body	38	28	10
Antrum	46	30	16
Differentiation				0.732
Well or moderate	44	32	12
Poor	66	46	20
Local invasion				0.028
T1-T2	26	14	12
T3-T4	84	64	20
Node metastasis				0.788
Yes	81	58	23
No	29	20	9
Distant metastasis				0.011
Yes	24	22	2
No	86	56	30
TNM stage				0.039
I	13	7	6
II	22	13	9
III	51	36	15
IV	24	22	2

A format that I used before.

	HS6ST2 expression
Parameter	Positive	Negative	P-value
Sex			0.732
Male (n=80)	56 (70.0)	24 (30.0)
Female (n=30)	22 (73.3)	8 (26.7)
Unavailable (n=0)	0 (0.0)	0 (0.0)
Age, years			0.473
>=60 (n=54)	40 (74.1)	14 (25.9)
<60 (n=56)	38 (67.9)	18 (32.1)
Unavailable (n=0)	0 (0.0)	0 (0.0)
Tumor size, mm			0.661
>=50 (n=62)	45 (72.6)	17 (27.4)
<50 (n=48)	33 (68.8)	15 (31.2)
Unavailable (n=0)	0 (0.0)	0 (0.0)
Tumor location			0.517
Fundus (n=26)	20 (76.9)	6 (23.1)
Body (n=38)	28 (73.7)	10 (26.3)
Antrum (n=46)	30 (65.2)	16 (34.8)
Unavailable (n=0)	0 (0.0)	0 (0.0)
Differentiation			0.732
Well or moderate (n=44)	32 (72.7)	12 (27.3)
Poor (n=66)	46 (69.7)	20 (30.3)
Unavailable (n=0)	0 (0.0)	0 (0.0)
Local invasion			0.028
T1-T2 (n=26)	14 (53.8)	12 (46.2)
T3-T4 (n=84)	64 (76.2)	20 (23.8)
Unavailable (n=0)	0 (0.0)	0 (0.0)
Node metastasis			0.788
Yes (n=81)	58 (71.6)	23 (28.4)
No (n=29)	20 (69.0)	9 (31.0)
Unavailable (n=0)	0 (0.0)	0 (0.0)
Distant metastasis			0.011
Yes (n=24)	22 (91.7)	2 (8.3)
No (n=86)	56 (65.1)	30 (34.9)
Unavailable (n=0)	0 (0.0)	0 (0.0)
TNM stage			0.039
I (n=13)	7 (53.8)	6 (46.2)
II (n=22)	13 (59.1)	9 (40.9)
III (n=51)	36 (70.6)	15 (29.4)
IV (n=24)	22 (91.7)	2 (8.3)
Unavailable (n=0)	0 (0.0)	0 (0.0)

Automating demographic table generation

2018/03/21

An example of a characteristic/demograpghic table

Simulating a dataset (skippable)

The actual code

The following table looks mostly likely to Table II

A format that I used before.