kallisto-sleuth workflow in dealing with sequencing data

Needless to say, kallisto-sleuth workflow opens up a new era in dealing with sequencing data; it dramatically scales down the computational complexity to the point which sequencing can be handled by a PC. I find this article is a good startpoint to understand the mechanism of kallisto

As the downstream of kallisto, sleuth has been optimised to deal with kallisto output. An attactive concept from sleuth is “variance decoupling”, to “decouple biological variance from inferential variance.” Inferential variance, according to online method, consists of “errors in mesurement and subsequent inference”, which can be obtained through bootstrapping from kallisto; inferential variance is eliminated, or decoupled away before shrinkage (and later added back when estimating the fixed-effect parameter). This articles walks through R code that realises this decoupling process.

Dataset used and method to “open up packages”

Kallisto pre-processing

We obtained the data as descripted in “Get data” section from “Differential gene expression using Kallisto and Degust”. We processed using the following bash code.

cd /your_dirs_to_Ecoli_kallisto_files

kallisto index -i transcripts.idx Ecoli_transcripts.fasta

kallisto quant -b 100 -i transcripts.idx -o LB1 --single -l 500 -s 50 LB1.fastq.gz
kallisto quant -b 100 -i transcripts.idx -o LB2 --single -l 500 -s 50 LB2.fastq.gz
kallisto quant -b 100 -i transcripts.idx -o LB3 --single -l 500 -s 50 LB3.fastq.gz
kallisto quant -b 100 -i transcripts.idx -o MG1 --single -l 500 -s 50 MG1.fastq.gz
kallisto quant -b 100 -i transcripts.idx -o MG2 --single -l 500 -s 50 MG2.fastq.gz
kallisto quant -b 100 -i transcripts.idx -o MG3 --single -l 500 -s 50 MG3.fastq.gz

Method to open up sleuth packages.

The aforementioned bash script yielded a dataset that is a typical “3 treatment Vs 3 control” experimental design. According to the official manual, dealing with it requires 4 lines with 3 functions

one sleuth_prep performs filtering (default as <5 count in 47% samples) and sample normalisation.
two sleuth_fit perform, on full model (with treatment) and null model (without treatment) respectively, inferential variance decoupling and using the variance left (biological variance) to perform odinary lease sq estimation of coefficient.
one sleuth_lrt perform likelihood ratio test between full and null model to see which model is likely to account for reality.

There are vaious intermediates that are wrapped wihin these functions. To understand these function, we have to find a way to examine the input/output of each intermediate steps. We modified these three functions which are summarised as follows,

adding sleuth::: to seek internal functions from sleuth library. By doing this, we do not have to touch the rest of packages.
assign the intermediate outputs, which we might be interested in, to global variables using <<-. A three-letter code was prefixed to the variables’ original names; these are
- “PPP_” for variable in sleuth_prep function.
- “FFF_” for variable in sleuth_fit on full model.
- “RRR_” for varaible in sleuth_fit on reduced model.
- “LLL_” for variable in sleuth_lrt function.

The modified code is stored “sleuth_internal_structure.R” and will be sourced in and run as a replacemnt of normal code. As a result, all the intermediate outputs will be “recorded”.

## setwd(paste0(getwd(), "/content/package")) ## needed to run this code chunk alone instead of Rmd compilation. 
if (packageVersion("sleuth") != "0.29.0"){
    devtools::install_github("pachterlab/sleuth@v0.29.0")
}

library(tidyverse)
library(sleuth)
library(kableExtra)

source("../../static/sleuth/sleuth_internal_structure.R")

## read the experiment design matrix
t <- read.table("../../static/sleuth/tbl.txt", header = T)
t$path <- as.character(t$path)

obj <<- test <- PPP_sleuth_prep(t, extra_bootstrap_summary = TRUE, num_cores = 1)

## Warning: `select_()` is deprecated as of dplyr 0.7.0.
## Please use `select()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

test <- FFF_sleuth_fit(test, ~condition, 'full')
test <- RRR_sleuth_fit(test, ~1, 'reduced')
test <- LLL_sleuth_lrt(test, 'reduced', 'full')

The walkthrough

options(knitr.table.format = "html", kableExtra.auto_format = TRUE, kableExtra.font_size = 8)
fs <- 8

sleuth_prep output is a list of 19; the observed count and bootstrapping information is stored in bs_summary.

str(obj$bs_summary)

## List of 2
##  $ obs_counts: num [1:1481, 1:6] 4.62 3.13 3.53 2.46 5.6 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:1481] "b0002" "b0003" "b0004" "b0006" ...
##   .. ..$ : chr [1:6] "SRR493366" "SRR493367" "SRR493368" "SRR493369" ...
##  $ sigma_q_sq: Named num [1:1481] 0.02105 0.07362 0.03354 0.049 0.00279 ...
##   ..- attr(*, "names")= chr [1:1481] "b0002" "b0003" "b0004" "b0006" ...

Within sleuth_fit, the function sleuth:::me_model wrapped in sleuth:::me_model_by_row elimiates inferential variance from total variance. Inferrential variance (variable sigma_q_sq) is obtained from rowMean of the bs_summary.

## modified from me_model and me_model_by_row
me_model_by_row <- lapply(1:nrow(bs_summary$obs_counts), function(i) {
  function(X, bs_summary$obs_counts[i, ], bs_summary$sigma_q_sq[i]) {
    n <- nrow(X)
    degrees_free <- n - ncol(X)
    ols_fit <- lm.fit(X, y)
    rss <- sum(ols_fit$residuals^2) # residual sum sq
    
    ## (raw) biological_var = total_var - inferential_var
    sigma_sq <- rss/(degrees_free) - sigma_q_sq 
    
    ## collectin result 
    mean_obs <- mean(y)
    var_obs <- var(y)
    list(ols_fit = ols_fit, b1 = ols_fit$coefficients[2], rss = rss, 
         sigma_sq = sigma_sq, sigma_q_sq = sigma_q_sq, mean_obs = mean_obs, 
         var_obs = var_obs)
    }
})

Subsequently, sigma_sq_pmax is obtained from sigma_sq of each gene by sigma_sq_pmax = pmax(sigma_sq,0). This is to prevent negative variance of count.

head(FFF_mes_df) %>%
  kable(.) %>%
  kable_styling(font_size = fs)

rss	sigma_sq	sigma_q_sq	mean_obs	var_obs	target_id	sigma_sq_pmax
0.1104198	0.0065574	0.0210476	4.084207	0.2834129	b0002	0.0065574
0.6804381	0.0964892	0.0736203	2.803340	0.2939586	b0003	0.0964892
0.1362586	0.0005256	0.0335390	3.376731	0.0424544	b0004	0.0005256
0.2719853	0.0189926	0.0490038	3.071661	0.2606999	b0006	0.0189926
0.0323607	0.0052969	0.0027933	5.873750	0.0943577	b0008	0.0052969
0.0722622	0.0166392	0.0014264	6.367659	0.1211020	b0014	0.0166392

sleuth:::sliding_window_grouping (uses ecdf) and sleuth:::shrink_df perform loess prediction using biological variance.

head(FFF_swg) %>%
  kable(., "html") %>%
  kable_styling(font_size = fs)

rss	sigma_sq	sigma_q_sq	mean_obs	var_obs	target_id	sigma_sq_pmax	iqr
1.0236176	0.1164558	0.1394486	1.1483599	2.748282	b0071	0.1164558	TRUE
1.6284312	0.1188251	0.2882827	1.1150755	1.040309	b0128	0.1188251	TRUE
3.2163447	0.5418898	0.2621964	1.2032913	1.017009	b0325	0.5418898	TRUE
1.3087485	0.0864052	0.2407820	1.1389325	1.376002	b0859	0.0864052	TRUE
0.7502072	0.0037614	0.1837905	0.6699952	1.378772	b0897	0.0037614	FALSE
3.7595103	0.7443768	0.1955008	1.1707549	1.019893	b1199	0.7443768	FALSE

head(FFF_shrink1) %>%
  kable(., "html") %>%
  kable_styling(font_size = fs)

rss	sigma_sq	sigma_q_sq	mean_obs	var_obs	target_id	sigma_sq_pmax	iqr	shrink
1.0236176	0.1164558	0.1394486	1.1483599	2.748282	b0071	0.1164558	TRUE	0.7539531
1.6284312	0.1188251	0.2882827	1.1150755	1.040309	b0128	0.1188251	TRUE	0.7663570
3.2163447	0.5418898	0.2621964	1.2032913	1.017009	b0325	0.5418898	TRUE	0.7339655
1.3087485	0.0864052	0.2407820	1.1389325	1.376002	b0859	0.0864052	TRUE	0.7574440
0.7502072	0.0037614	0.1837905	0.6699952	1.378772	b0897	0.0037614	FALSE	NA
3.7595103	0.7443768	0.1955008	1.1707549	1.019893	b1199	0.7443768	FALSE	0.7457314

A power of 4 is performed on shrink variable to produce smooth_sigma_sq.

head(FFF_shrink_select) %>%
  kable(., "html") %>%
  kable_styling(font_size = fs)

rss	sigma_sq	sigma_q_sq	mean_obs	var_obs	target_id	sigma_sq_pmax	iqr	smooth_sigma_sq
1.0236176	0.1164558	0.1394486	1.1483599	2.748282	b0071	0.1164558	TRUE	0.3231301
1.6284312	0.1188251	0.2882827	1.1150755	1.040309	b0128	0.1188251	TRUE	0.3449249
3.2163447	0.5418898	0.2621964	1.2032913	1.017009	b0325	0.5418898	TRUE	0.2902034
1.3087485	0.0864052	0.2407820	1.1389325	1.376002	b0859	0.0864052	TRUE	0.3291563
0.7502072	0.0037614	0.1837905	0.6699952	1.378772	b0897	0.0037614	FALSE	NA
3.7595103	0.7443768	0.1955008	1.1707549	1.019893	b1199	0.7443768	FALSE	0.3092642

The maximun of variance was obtained between raw and shrunken estimate.

head(FFF_shrink_mutate) %>%
  kable(., "html") %>%
  kable_styling(font_size = fs)

rss	sigma_sq	sigma_q_sq	mean_obs	var_obs	target_id	sigma_sq_pmax	iqr	smooth_sigma_sq	smooth_sigma_sq_pmax
1.0236176	0.1164558	0.1394486	1.1483599	2.748282	b0071	0.1164558	TRUE	0.3231301	0.3231301
1.6284312	0.1188251	0.2882827	1.1150755	1.040309	b0128	0.1188251	TRUE	0.3449249	0.3449249
3.2163447	0.5418898	0.2621964	1.2032913	1.017009	b0325	0.5418898	TRUE	0.2902034	0.5418898
1.3087485	0.0864052	0.2407820	1.1389325	1.376002	b0859	0.0864052	TRUE	0.3291563	0.3291563
0.7502072	0.0037614	0.1837905	0.6699952	1.378772	b0897	0.0037614	FALSE	NA	NA
3.7595103	0.7443768	0.1955008	1.1707549	1.019893	b1199	0.7443768	FALSE	0.3092642	0.7443768

The fixed-effect beta is estimated with sleuth:::covar_beta using total vairance, which is the shrunken biological variance smooth_sigma_sq_pmax plus inference variance sigma_q_sq.

In the later part, the likelihood of each model is calculated by

all_likelihood <- sapply(nrow, function(i) {
    cur_mu <- fitted.values
    obs <- residuals + cur_mu
    cur_summary <- obj$fits[[which_model]]$summary
    cur_var <- smooth_sigma_sq_pmax + sigma_q_sq
    sum(dnorm(obs, mean = cur_mu, sd = sqrt(cur_var),log = TRUE)) ## normal distribution!!
})

… with test stats as $2 \times (l_{f u l l} - l_{n u l l})$ P-value is from pchisq and q-value from p.adjust(pval, method = "BH").

Package walkthrough: sleuth

2018/04/28

kallisto-sleuth workflow in dealing with sequencing data

Dataset used and method to “open up packages”

Kallisto pre-processing

Method to open up sleuth packages.

The walkthrough