Package walkthrough: sleuth

2018/04/28

The kallisto-sleuth workflow for dealing with sequencing data

Needless to say, the kallisto-sleuth workflow opens up a new era in dealing with sequencing data; it dramatically scales down the computational complexity, to the point where sequencing data can be handled on a PC. I find this article a good starting point for understanding the mechanism of kallisto.

As the downstream of kallisto, sleuth has been optimised to deal with kallisto output. An attractive concept in sleuth is “variance decoupling”, that is, to “decouple biological variance from inferential variance.” Inferential variance, according to the online methods, consists of “errors in measurement and subsequent inference” and can be obtained through bootstrapping in kallisto; it is eliminated, or decoupled away, before shrinkage (and later added back when estimating the fixed-effect parameter). This article walks through the R code that realises this decoupling process.
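
In symbols, the decoupling amounts to subtracting the bootstrap-derived inferential variance from the total variance of each transcript and keeping the remainder as the (raw) biological variance:

\[ \sigma^2_{biological} = \sigma^2_{total} - \sigma^2_{inferential} \]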

Dataset used and method to “open up packages”

Kallisto pre-processing

We obtained the data as described in the “Get data” section of “Differential gene expression using Kallisto and Degust”, and processed it with the following bash commands.

cd /your_dirs_to_Ecoli_kallisto_files

kallisto index -i transcripts.idx Ecoli_transcripts.fasta

kallisto quant -b 100 -i transcripts.idx -o LB1 --single -l 500 -s 50 LB1.fastq.gz
kallisto quant -b 100 -i transcripts.idx -o LB2 --single -l 500 -s 50 LB2.fastq.gz
kallisto quant -b 100 -i transcripts.idx -o LB3 --single -l 500 -s 50 LB3.fastq.gz
kallisto quant -b 100 -i transcripts.idx -o MG1 --single -l 500 -s 50 MG1.fastq.gz
kallisto quant -b 100 -i transcripts.idx -o MG2 --single -l 500 -s 50 MG2.fastq.gz
kallisto quant -b 100 -i transcripts.idx -o MG3 --single -l 500 -s 50 MG3.fastq.gz

Method to open up the sleuth package

The aforementioned bash script yielded a dataset with a typical “3 treatment vs 3 control” experimental design. According to the official manual, analysing it requires four lines of code built from three functions (the unmodified workflow is sketched after this list):

  • one sleuth_prep call performs filtering (by default, keeping transcripts with at least 5 estimated counts in at least 47% of the samples) and sample normalisation.
  • two sleuth_fit calls, one on the full model (with treatment) and one on the null model (without treatment), each decouple the inferential variance and use the remaining (biological) variance for ordinary least squares estimation of the coefficients.
  • one sleuth_lrt call performs a likelihood ratio test between the full and null models to see which model better accounts for the data.
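
For reference, the unmodified four-line workflow from the sleuth manual looks roughly like this (the object names s2c and so are illustrative):

so <- sleuth_prep(s2c, extra_bootstrap_summary = TRUE)
so <- sleuth_fit(so, ~condition, 'full')
so <- sleuth_fit(so, ~1, 'reduced')
so <- sleuth_lrt(so, 'reduced', 'full')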

There are various intermediates wrapped within these functions. To understand them, we need a way to examine the input and output of each intermediate step. We modified the three functions as summarised below (a sketch of the modification pattern follows this list):

  • adding sleuth::: to reach internal functions of the sleuth library; by doing this, we do not have to touch the rest of the package.
  • assigning the intermediate outputs we might be interested in to global variables with <<-. A three-letter code is prefixed to each variable’s original name:

    • “PPP_” for variables in the sleuth_prep function.
    • “FFF_” for variables in sleuth_fit on the full model.
    • “RRR_” for variables in sleuth_fit on the reduced model.
    • “LLL_” for variables in the sleuth_lrt function.
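
A minimal sketch of that pattern, using sleuth_fit as an example (a hypothetical excerpt, not the actual body of sleuth::sleuth_fit; the me_model_by_row argument list is assumed for illustration only):

FFF_sleuth_fit <- function(obj, formula, fit_name, ...) {
  ## ... body copied from sleuth::sleuth_fit, with internal helpers reached via ':::' ...
  mes <- sleuth:::me_model_by_row(obj, X, obj$bs_summary)  # hypothetical call, shown for the ':::' prefix
  FFF_mes <<- mes                                          # record the intermediate in the global environment
  ## ... rest of the copied body unchanged ...
}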

The modified code is stored in “sleuth_internal_structure.R” and will be sourced in and run as a replacement for the normal code. As a result, all the intermediate outputs are “recorded”.

## setwd(paste0(getwd(), "/content/package")) ## needed to run this code chunk alone instead of Rmd compilation. 
if (packageVersion("sleuth") != "0.29.0"){
    devtools::install_github("pachterlab/sleuth@v0.29.0")
}

library(tidyverse)
library(sleuth)
library(kableExtra)

source("../../static/sleuth/sleuth_internal_structure.R")

## read the experiment design matrix
t <- read.table("../../static/sleuth/tbl.txt", header = T)
t$path <- as.character(t$path)

obj <<- test <- PPP_sleuth_prep(t, extra_bootstrap_summary = TRUE, num_cores = 1)
## Warning: `select_()` is deprecated as of dplyr 0.7.0.
## Please use `select()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
test <- FFF_sleuth_fit(test, ~condition, 'full')
test <- RRR_sleuth_fit(test, ~1, 'reduced')
test <- LLL_sleuth_lrt(test, 'reduced', 'full')

The walkthrough

options(knitr.table.format = "html", kableExtra.auto_format = TRUE, kableExtra.font_size = 8)
fs <- 8

The sleuth_prep output is a list of 19 elements; the observed counts and bootstrap information are stored in bs_summary.

str(obj$bs_summary)
## List of 2
##  $ obs_counts: num [1:1481, 1:6] 4.62 3.13 3.53 2.46 5.6 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:1481] "b0002" "b0003" "b0004" "b0006" ...
##   .. ..$ : chr [1:6] "SRR493366" "SRR493367" "SRR493368" "SRR493369" ...
##  $ sigma_q_sq: Named num [1:1481] 0.02105 0.07362 0.03354 0.049 0.00279 ...
##   ..- attr(*, "names")= chr [1:1481] "b0002" "b0003" "b0004" "b0006" ...

Within sleuth_fit, the function sleuth:::me_model, wrapped in sleuth:::me_model_by_row, eliminates the inferential variance from the total variance. The inferential variance (variable sigma_q_sq) is obtained as the row means of the bootstrap summaries in bs_summary.
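
The row-means step, schematically (a sketch rather than the exact sleuth code; bs_var stands for a hypothetical transcripts-by-samples matrix of per-sample bootstrap variances):

## average each transcript's bootstrap variance across samples
sigma_q_sq <- rowMeans(bs_var)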

## modified from sleuth:::me_model and sleuth:::me_model_by_row
## X is the design matrix; y holds one transcript's observed counts across samples
me_model <- function(X, y, sigma_q_sq) {
  n <- nrow(X)
  degrees_free <- n - ncol(X)
  ols_fit <- lm.fit(X, y)
  rss <- sum(ols_fit$residuals^2) # residual sum of squares

  ## (raw) biological_var = total_var - inferential_var
  sigma_sq <- rss / degrees_free - sigma_q_sq

  ## collecting results
  mean_obs <- mean(y)
  var_obs <- var(y)
  list(ols_fit = ols_fit, b1 = ols_fit$coefficients[2], rss = rss,
       sigma_sq = sigma_sq, sigma_q_sq = sigma_q_sq, mean_obs = mean_obs,
       var_obs = var_obs)
}

## apply me_model to every transcript (row of the observed count matrix)
me_model_by_row <- lapply(1:nrow(bs_summary$obs_counts), function(i) {
  me_model(X, bs_summary$obs_counts[i, ], bs_summary$sigma_q_sq[i])
})

Subsequently, sigma_sq_pmax is obtained from the sigma_sq of each gene as sigma_sq_pmax = pmax(sigma_sq, 0). This prevents negative estimates of the count variance.

head(FFF_mes_df) %>%
  kable(.) %>%
  kable_styling(font_size = fs) 
      rss  sigma_sq sigma_q_sq mean_obs   var_obs target_id sigma_sq_pmax
0.1104198 0.0065574  0.0210476 4.084207 0.2834129     b0002     0.0065574
0.6804381 0.0964892  0.0736203 2.803340 0.2939586     b0003     0.0964892
0.1362586 0.0005256  0.0335390 3.376731 0.0424544     b0004     0.0005256
0.2719853 0.0189926  0.0490038 3.071661 0.2606999     b0006     0.0189926
0.0323607 0.0052969  0.0027933 5.873750 0.0943577     b0008     0.0052969
0.0722622 0.0166392  0.0014264 6.367659 0.1211020     b0014     0.0166392

sleuth:::sliding_window_grouping (which uses ecdf to window transcripts by mean expression and to flag those falling within the interquartile range, the iqr column below) and sleuth:::shrink_df then perform a loess fit to shrink the biological variance (sketched after the tables below).

head(FFF_swg) %>%
  kable(., "html") %>%
  kable_styling(font_size = fs)
      rss  sigma_sq sigma_q_sq  mean_obs  var_obs target_id sigma_sq_pmax   iqr
1.0236176 0.1164558  0.1394486 1.1483599 2.748282     b0071     0.1164558  TRUE
1.6284312 0.1188251  0.2882827 1.1150755 1.040309     b0128     0.1188251  TRUE
3.2163447 0.5418898  0.2621964 1.2032913 1.017009     b0325     0.5418898  TRUE
1.3087485 0.0864052  0.2407820 1.1389325 1.376002     b0859     0.0864052  TRUE
0.7502072 0.0037614  0.1837905 0.6699952 1.378772     b0897     0.0037614 FALSE
3.7595103 0.7443768  0.1955008 1.1707549 1.019893     b1199     0.7443768 FALSE
head(FFF_shrink1) %>%
  kable(., "html") %>%
  kable_styling(font_size = fs)
      rss  sigma_sq sigma_q_sq  mean_obs  var_obs target_id sigma_sq_pmax   iqr    shrink
1.0236176 0.1164558  0.1394486 1.1483599 2.748282     b0071     0.1164558  TRUE 0.7539531
1.6284312 0.1188251  0.2882827 1.1150755 1.040309     b0128     0.1188251  TRUE 0.7663570
3.2163447 0.5418898  0.2621964 1.2032913 1.017009     b0325     0.5418898  TRUE 0.7339655
1.3087485 0.0864052  0.2407820 1.1389325 1.376002     b0859     0.0864052  TRUE 0.7574440
0.7502072 0.0037614  0.1837905 0.6699952 1.378772     b0897     0.0037614 FALSE        NA
3.7595103 0.7443768  0.1955008 1.1707549 1.019893     b1199     0.7443768 FALSE 0.7457314
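
A minimal sketch of what this shrinkage amounts to, using the columns shown above (the exact sleuth internals differ in detail; the fourth-root scale is implied by the power of 4 applied in the next step):

## fit a loess curve of the fourth root of the raw biological variance against mean
## expression using only the within-IQR transcripts, then predict for all transcripts
lo_fit <- loess(sqrt(sqrt(sigma_sq_pmax)) ~ mean_obs, data = subset(FFF_swg, iqr))
FFF_swg$shrink <- predict(lo_fit, newdata = FFF_swg)  # still on the fourth-root scale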

A power of 4 is then applied to the shrink variable, bringing it back from the fourth-root scale to the variance scale and producing smooth_sigma_sq.

head(FFF_shrink_select) %>%
  kable(., "html") %>%
  kable_styling(font_size = fs)
      rss  sigma_sq sigma_q_sq  mean_obs  var_obs target_id sigma_sq_pmax   iqr smooth_sigma_sq
1.0236176 0.1164558  0.1394486 1.1483599 2.748282     b0071     0.1164558  TRUE       0.3231301
1.6284312 0.1188251  0.2882827 1.1150755 1.040309     b0128     0.1188251  TRUE       0.3449249
3.2163447 0.5418898  0.2621964 1.2032913 1.017009     b0325     0.5418898  TRUE       0.2902034
1.3087485 0.0864052  0.2407820 1.1389325 1.376002     b0859     0.0864052  TRUE       0.3291563
0.7502072 0.0037614  0.1837905 0.6699952 1.378772     b0897     0.0037614 FALSE              NA
3.7595103 0.7443768  0.1955008 1.1707549 1.019893     b1199     0.7443768 FALSE       0.3092642

For each transcript, the larger of the raw estimate (sigma_sq_pmax) and the shrunken estimate (smooth_sigma_sq) is kept as smooth_sigma_sq_pmax.
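
In code, the last two steps amount to roughly the following (using the FFF_shrink1 columns shown above; a sketch rather than the exact sleuth internals):

## back-transform the loess fit to the variance scale, then keep the larger of the
## raw and shrunken biological variance estimates
smooth_sigma_sq      <- FFF_shrink1$shrink^4
smooth_sigma_sq_pmax <- pmax(FFF_shrink1$sigma_sq_pmax, smooth_sigma_sq)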

head(FFF_shrink_mutate) %>%
  kable(., "html") %>%
  kable_styling(font_size = fs)
      rss  sigma_sq sigma_q_sq  mean_obs  var_obs target_id sigma_sq_pmax   iqr smooth_sigma_sq smooth_sigma_sq_pmax
1.0236176 0.1164558  0.1394486 1.1483599 2.748282     b0071     0.1164558  TRUE       0.3231301            0.3231301
1.6284312 0.1188251  0.2882827 1.1150755 1.040309     b0128     0.1188251  TRUE       0.3449249            0.3449249
3.2163447 0.5418898  0.2621964 1.2032913 1.017009     b0325     0.5418898  TRUE       0.2902034            0.5418898
1.3087485 0.0864052  0.2407820 1.1389325 1.376002     b0859     0.0864052  TRUE       0.3291563            0.3291563
0.7502072 0.0037614  0.1837905 0.6699952 1.378772     b0897     0.0037614 FALSE              NA                   NA
3.7595103 0.7443768  0.1955008 1.1707549 1.019893     b1199     0.7443768 FALSE       0.3092642            0.7443768

The fixed-effect beta is estimated with sleuth:::covar_beta using the total variance, which is the shrunken biological variance smooth_sigma_sq_pmax plus the inferential variance sigma_q_sq.
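
For a single transcript with design matrix X, the role of the total variance here can be sketched with the classic least-squares covariance of the coefficient estimates (the actual sleuth:::covar_beta may differ in detail):

## total variance = shrunken biological variance + inferential variance
total_var  <- smooth_sigma_sq_pmax + sigma_q_sq
## covariance of the OLS coefficient estimates under that variance
covar_beta <- total_var * solve(t(X) %*% X)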

In the later part, the likelihood of each model is calculated roughly as follows (simplified from the sleuth_lrt internals):

all_likelihood <- sapply(1:nrow(bs_summary$obs_counts), function(i) {
    cur_mu <- fitted.values          # fitted values from transcript i's OLS fit
    obs <- residuals + cur_mu        # reconstruct the observed counts for transcript i
    cur_summary <- obj$fits[[which_model]]$summary
    cur_var <- cur_summary$smooth_sigma_sq_pmax[i] + cur_summary$sigma_q_sq[i]  # total variance
    sum(dnorm(obs, mean = cur_mu, sd = sqrt(cur_var), log = TRUE)) ## normal distribution!!
})

… with the test statistic \[2 \times (l_{full} - l_{null})\] The p-value is obtained from pchisq and the q-value from p.adjust(pval, method = "BH").
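
A sketch of that final step, assuming l_full and l_null hold the summed log-likelihoods of the two models and that the degrees of freedom equal the difference in the number of model parameters (2 versus 1 here):

test_stat <- 2 * (l_full - l_null)
pval <- pchisq(test_stat, df = 1, lower.tail = FALSE)  # chi-squared test of the likelihood ratio
qval <- p.adjust(pval, method = "BH")                  # Benjamini-Hochberg adjustment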