Automating demographic table generation

2018/03/21

An example of a characteristic/demograpghic table

We aim to automatically generate a characteristic/demograpghic table between on a subject of interest (usually capable of dichotomizing a cohort, such as a treatment or a gene expression) and multiple parameters (like age, BMI, gender and genetic background). This type of table is well populated in clinical/epidemiologic publication where our study subject has a complex correlation with other parameters; the table is capable of summarising how the study subject is distributed with other parameters and,

An example table (adapted from Table II, Jin et. al., 2017) is a typical characteristic table that summarises how a cohort with their HS6ST2 expression measured (the columns) is distributed along with various parameters (the rows); according to the P-value from Pearson’s chi-square test, we could realise that “local invasion”, “distance metastasis” and “TNM stage” are the three parameters that associate with HS6ST2 expression.

Simulating a dataset (skippable)

To regenerate the table automatically, we firstly simulate the raw data from the Table II. It basically reverses the contigenct tables back to counts and cbind them together.

library(tidyverse)
library(kableExtra)


countsToCases <- function(x, countcol = "Freq") {
    # Get the row indices to pull from x
    idx <- rep.int(seq_len(nrow(x)), x[[countcol]])
    
    # Drop count column
    x[[countcol]] <- NULL

    # Get the rows from x
    x[idx, ]
}

sex <- as.table(matrix(c(56L, 22L, 24L, 8L), ncol = 2, 
                       dimnames = list(sex = c("Male", "Female"), expr = c("Positive", "Negative")))) %>% 
  as_tibble(n = "Freq") %>%
  countsToCases(.)

age <- as.table(matrix(c(40L, 38L, 14L, 18L), ncol = 2, dimnames = list(age = c(">=60", "<60"), expr = c("Positive", "Negative")))) %>% 
  as_tibble(n = "Freq") %>%
  countsToCases(.)

ts <- as.table(matrix(c(45L, 33L, 17L, 15L), ncol = 2, dimnames = list(size = c(">=50", "<50"), expr = c("Positive", "Negative")))) %>% 
  as_tibble(n = "Freq") %>%
  countsToCases(.)

tl <- as.table(matrix(c(20L, 28L, 30L, 6L, 10L, 16L), ncol = 2, dimnames = list(loc = c("Fundus", "Body", "Antrum"), expr = c("Positive", "Negative")))) %>% 
  as_tibble(n = "Freq") %>%
  countsToCases(.)

di <- as.table(matrix(c(32L, 46L, 12L, 20L), ncol = 2, dimnames = list(diff = c("Well or moderate", "Poor"), expr = c("Positive", "Negative")))) %>% 
  as_tibble(n = "Freq") %>%
  countsToCases(.)

li <- as.table(matrix(c(14L, 64L, 12L, 20L), ncol = 2, dimnames = list(inv = c("T1-T2", "T3-T4"), expr = c("Positive", "Negative")))) %>% 
  as_tibble(n = "Freq") %>%
  countsToCases(.)

nm <- as.table(matrix(c(58L, 20L, 23L, 9L), ncol = 2, dimnames = list(node = c("Yes", "No"), expr = c("Positive", "Negative")))) %>% 
  as_tibble(n = "Freq") %>%
  countsToCases(.)

dm <- as.table(matrix(c(22L, 56L, 2L, 30L), ncol = 2, dimnames = list(meta = c("Yes", "No"), expr = c("Positive", "Negative")))) %>% 
  as_tibble(n = "Freq") %>%
  countsToCases(.)

tst <- as.table(matrix(c(7L, 13L, 36L,22L, 6L, 9L, 15L, 2L), ncol = 2, dimnames = list(tnm = c("I", "II", "III","IV"), expr = c("Positive", "Negative")))) %>% 
  as_tibble(n = "Freq") %>%
  countsToCases(.)


## checking the +/- expression column. 
x <- cbind(sex[,2],
      age[,2],
      ts[,2],
      tl[,2],
      di[,2],
      li[,2],
      nm[,2],
      dm[,2],
      tst[,2])
k <- as.data.frame(expand.grid(1:ncol(x),1:ncol(x))) 

## test whether HS6ST2 is the same to ensure no errors.
# all(sapply(1:nrow(k), function(n){
#   all(identical(x[,k[n,1]], x[,k[n,2]]))
# }))


df <- bind_cols(sex[,1], age[,1], ts[,1], tl[,1], di[,1], li[,1], nm[,1], dm[,1], tst) %>% 
  mutate_at(.vars = c("sex"), list(~factor(., levels = c("Male", "Female")))) %>%
  mutate_at(.vars = c("age"), list(~factor(., levels = c(">=60", "<60")))) %>%
  mutate_at(.vars = c("size"), list(~factor(., levels = c(">=50", "<50")))) %>%
  mutate_at(.vars = c("loc"), list(~factor(., levels = c("Fundus", "Body", "Antrum")))) %>%
  mutate_at(.vars = c("diff"), list(~factor(., levels = c("Well or moderate", "Poor")))) %>%
  mutate_at(.vars = c("inv"), list(~factor(., levels = c("T1-T2", "T3-T4")))) %>%
  mutate_at(.vars = c("node"), list(~factor(., levels = c("Yes", "No")))) %>%
  mutate_at(.vars = c("meta"), list(~factor(., levels = c("Yes", "No")))) %>%
  mutate_at(.vars = c("tnm"), list(~factor(., levels = c("I", "II", "III","IV")))) %>%
  mutate_at(.vars = c("expr"), list(~factor(., levels = c("Positive", "Negative"))))

colnames(df) <- c("Sex", "Age, years", "Tumor size, mm", "Tumor location", "Differentiation", "Local invasion", "Node metastasis", "Distant metastasis", "TNM stage", "HS6ST2 expression")

## head(df)

The actual code

The main weightlifting part of the code contains the following,

  1. obtain contingency tables between the study subject with a common parameter (table(parameters, subject)) to ensure the subject is the columns).
  2. perform statistical test (fisher exact, chisq or kendall, etc).
  3. formatting on decimal place, row alignment and indentation.
  4. do.call(rbind) all the contigency table together.
## about using one decimal place through presenting (https://stackoverflow.com/a/12135122)
## wrt data presentation, package "[formattable](https://github.com/renkun-ken/formattable)" should be explored 
specify_decimal <- function(x, k = 1) {
  if (is.nan(x)) {
    trimws(format(round(0, k), nsmall=k))
  } else {
    trimws(format(round(x, k), nsmall=k))}
}
mk_ftbl <- function(subj, cov, 
                    statsTest = rep.int("chisq", length(cov)), 
                    total_cases = c("seperate", "together"),
                    show_row_perc = c(FALSE, TRUE), 
                    useNA = c("no", "always"),
                    kable_output = c("latex", "html")){
  ## statsTest should be length one or same length as `cov`
  if (length(statsTest) == 1){
    statsTest  <- rep.int(statsTest, length(cov))
  } else if (length(statsTest) != length(cov)) {
    stop("statsTest should be length one or same length as `cov`")
  }
  

    l <- list()
  
  for (ind_ncol in 1:ncol(cov)) {
    ## extract the current parameters
    ind <- cov[[ind_ncol]]
    ind_name <- names(cov[,ind_ncol])
    
    ## perform stat test  
    if (statsTest[ind_ncol] == "chisq"){
      pval <- chisq.test(subj, as.factor(ind), correct = FALSE)$p.value # no continuity correction. 
    } else if (statsTest == "fisher"){
      pval <- fisher.test(subj, as.factor(ind))$p.value
    } else if (statsTest == "kendall"){
      pval <- Kendall::Kendall(subj, ind)$sl
    } else {stop("statsTest is not an instance within the current scope. You can add your own test.")}
    pval <- specify_decimal(pval, 3) 
    
    ## construct contingency table
    a <- table(ind, subj, useNA = useNA)
    if (useNA == "always"){
      ## in useNA == "always", change the last row to "unavailable", get rid of last column since subject should be cleaned with no NAs. 
      rownames(a)[nrow(a)] <- "Unavailable" 
      a <- a[,-ncol(a)] 
      rn <- rownames(a)
    } else if (useNA == "no") {
      rn <- rownames(a)
    } else {
      stop("useNA can only be \"always\" or \"no\".")
    }
    a <- as.tibble(as.data.frame.matrix(a)) %>%
      rownames_to_column(., var = "rn")
    a$rn <- rn
    
    ## formatting total cases and percentages
    perc_tbl <- t(apply(a,1, function(r){
      new_row <- c()
      a <- as.numeric(r[2])
      b <- as.numeric(r[3])
      
      ## should percentage be concatinated with values?
      if (show_row_perc){
        new_row[2] <- paste0(a, " (", specify_decimal(a/(a+b)*100), ")")
        new_row[3] <- paste0(b, " (", specify_decimal(b/(a+b)*100), ")")
      } else {
        new_row[2] <- a 
        new_row[3] <- b
      }
      
      ## how to display total cases
      if (total_cases == "seperate") {
        new_row[2:length(new_row)+1] <- new_row[2:length(new_row)] # shift the vector to right by 1. 
        new_row[1] <- r[1]
        new_row[2] <- a+b
        final_colnames <<- c("Parameter", "Cases (n)", levels(subj), "P-value") # value used outside of apply should be global assigned.  
      } else if (total_cases == "together") {
        new_row[1] <- paste0(r[1], " (n=", a+b, ")")
        final_colnames <<- c("Parameter",  levels(subj), "P-value")
      } else (stop("total_cases should be either \"together\" or \"seperate\""))
      return(new_row)
    })) %>% 
      as.data.frame.matrix(., stringsAsFactors = F) %>%
      as.tibble()
    
    ## formatting paste ind_name and pval to perc_tbl
    perc_tbl[,ncol(perc_tbl)+1] <- as.character(NA) # a slot for pval
    perc_tbl[nrow(perc_tbl)+1,] <- as.character(NA) # a slot for header
    perc_tbl <- perc_tbl[c(nrow(perc_tbl), 1:(nrow(perc_tbl)-1)),] 
    perc_tbl[1, c(ncol(perc_tbl))] <- pval
    perc_tbl[1, 1] <- ind_name
    l[[ind_ncol]] <- perc_tbl
  }
  
  ftab <<- do.call(rbind,l)
  colnames(ftab) <- final_colnames
  ftab[is.na(ftab)] <- ""
  
  ## kable output format (row level indentation)
  row_indent <- which(ftab[[2]] != "")
  fs <- 15
  
  if (kable_output == "latex") {
    kable(ftab, format = "latex", booktabs = TRUE, linesep = "") %>% 
      kable_styling(font_size = fs) %>%
      add_indent(row_indent)
  } else if (kable_output == "html"){
    html <- kable(ftab, format = "html", align = c("l", rep.int("c", length(ftab)))) %>% 
      kable_styling(bootstrap_options = c("striped", "hover"), position = "center", font_size = fs) %>%
      add_indent(row_indent)
    
    if (total_cases == "seperate") {
      html %>% add_header_above(c(" ", " ", "HS6ST2\n expression" = 2, " "))
    } else if (total_cases == "together") {
      html %>% add_header_above(c(" ", "HS6ST2\n expression" = 2, " "))
    } else (stop("total_cases should be either \"together\" or \"seperate\""))
    
  } else {
    stop ("Only \"latex\" and \"html\" format is considered here.")
  }
}

The following table looks mostly likely to Table II

HS6ST2
expression
Parameter Cases (n) Positive Negative P-value
Sex 0.732
Male 80 56 24
Female 30 22 8
Age, years 0.473
>=60 54 40 14
<60 56 38 18
Tumor size, mm 0.661
>=50 62 45 17
<50 48 33 15
Tumor location 0.517
Fundus 26 20 6
Body 38 28 10
Antrum 46 30 16
Differentiation 0.732
Well or moderate 44 32 12
Poor 66 46 20
Local invasion 0.028
T1-T2 26 14 12
T3-T4 84 64 20
Node metastasis 0.788
Yes 81 58 23
No 29 20 9
Distant metastasis 0.011
Yes 24 22 2
No 86 56 30
TNM stage 0.039
I 13 7 6
II 22 13 9
III 51 36 15
IV 24 22 2

A format that I used before.

HS6ST2
expression
Parameter Positive Negative P-value
Sex 0.732
Male (n=80) 56 (70.0) 24 (30.0)
Female (n=30) 22 (73.3) 8 (26.7)
Unavailable (n=0) 0 (0.0) 0 (0.0)
Age, years 0.473
>=60 (n=54) 40 (74.1) 14 (25.9)
<60 (n=56) 38 (67.9) 18 (32.1)
Unavailable (n=0) 0 (0.0) 0 (0.0)
Tumor size, mm 0.661
>=50 (n=62) 45 (72.6) 17 (27.4)
<50 (n=48) 33 (68.8) 15 (31.2)
Unavailable (n=0) 0 (0.0) 0 (0.0)
Tumor location 0.517
Fundus (n=26) 20 (76.9) 6 (23.1)
Body (n=38) 28 (73.7) 10 (26.3)
Antrum (n=46) 30 (65.2) 16 (34.8)
Unavailable (n=0) 0 (0.0) 0 (0.0)
Differentiation 0.732
Well or moderate (n=44) 32 (72.7) 12 (27.3)
Poor (n=66) 46 (69.7) 20 (30.3)
Unavailable (n=0) 0 (0.0) 0 (0.0)
Local invasion 0.028
T1-T2 (n=26) 14 (53.8) 12 (46.2)
T3-T4 (n=84) 64 (76.2) 20 (23.8)
Unavailable (n=0) 0 (0.0) 0 (0.0)
Node metastasis 0.788
Yes (n=81) 58 (71.6) 23 (28.4)
No (n=29) 20 (69.0) 9 (31.0)
Unavailable (n=0) 0 (0.0) 0 (0.0)
Distant metastasis 0.011
Yes (n=24) 22 (91.7) 2 (8.3)
No (n=86) 56 (65.1) 30 (34.9)
Unavailable (n=0) 0 (0.0) 0 (0.0)
TNM stage 0.039
I (n=13) 7 (53.8) 6 (46.2)
II (n=22) 13 (59.1) 9 (40.9)
III (n=51) 36 (70.6) 15 (29.4)
IV (n=24) 22 (91.7) 2 (8.3)
Unavailable (n=0) 0 (0.0) 0 (0.0)