kallisto-sleuth workflow in dealing with sequencing data
Needless to say, kallisto-sleuth workflow opens up a new era in dealing with sequencing data; it dramatically scales down the computational complexity to the point which sequencing can be handled by a PC. I find this article is a good startpoint to understand the mechanism of kallisto
As the downstream of kallisto, sleuth has been optimised to deal with kallisto output. An attactive concept from sleuth is “variance decoupling”, to “decouple biological variance from inferential variance.” Inferential variance, according to online method, consists of “errors in mesurement and subsequent inference”, which can be obtained through bootstrapping from kallisto; inferential variance is eliminated, or decoupled away before shrinkage (and later added back when estimating the fixed-effect parameter). This articles walks through R code that realises this decoupling process.
Dataset used and method to “open up packages”
Kallisto pre-processing
We obtained the data as descripted in “Get data” section from “Differential gene expression using Kallisto and Degust”. We processed using the following bash code.
cd /your_dirs_to_Ecoli_kallisto_files
kallisto index -i transcripts.idx Ecoli_transcripts.fasta
kallisto quant -b 100 -i transcripts.idx -o LB1 --single -l 500 -s 50 LB1.fastq.gz
kallisto quant -b 100 -i transcripts.idx -o LB2 --single -l 500 -s 50 LB2.fastq.gz
kallisto quant -b 100 -i transcripts.idx -o LB3 --single -l 500 -s 50 LB3.fastq.gz
kallisto quant -b 100 -i transcripts.idx -o MG1 --single -l 500 -s 50 MG1.fastq.gz
kallisto quant -b 100 -i transcripts.idx -o MG2 --single -l 500 -s 50 MG2.fastq.gz
kallisto quant -b 100 -i transcripts.idx -o MG3 --single -l 500 -s 50 MG3.fastq.gz
Method to open up sleuth packages.
The aforementioned bash script yielded a dataset that is a typical “3 treatment Vs 3 control” experimental design. According to the official manual, dealing with it requires 4 lines with 3 functions
- one
sleuth_prep
performs filtering (default as <5 count in 47% samples) and sample normalisation. - two
sleuth_fit
perform, on full model (with treatment) and null model (without treatment) respectively, inferential variance decoupling and using the variance left (biological variance) to perform odinary lease sq estimation of coefficient. - one
sleuth_lrt
perform likelihood ratio test between full and null model to see which model is likely to account for reality.
There are vaious intermediates that are wrapped wihin these functions. To understand these function, we have to find a way to examine the input/output of each intermediate steps. We modified these three functions which are summarised as follows,
- adding
sleuth:::
to seek internal functions from sleuth library. By doing this, we do not have to touch the rest of packages. assign the intermediate outputs, which we might be interested in, to global variables using
<<-
. A three-letter code was prefixed to the variables’ original names; these are- “PPP_” for variable in
sleuth_prep
function. - “FFF_” for variable in
sleuth_fit
on full model. - “RRR_” for varaible in
sleuth_fit
on reduced model. - “LLL_” for variable in
sleuth_lrt
function.
- “PPP_” for variable in
The modified code is stored “sleuth_internal_structure.R” and will be sourced in and run as a replacemnt of normal code. As a result, all the intermediate outputs will be “recorded”.
## setwd(paste0(getwd(), "/content/package")) ## needed to run this code chunk alone instead of Rmd compilation.
if (packageVersion("sleuth") != "0.29.0"){
devtools::install_github("pachterlab/sleuth@v0.29.0")
}
library(tidyverse)
library(sleuth)
library(kableExtra)
source("../../static/sleuth/sleuth_internal_structure.R")
## read the experiment design matrix
t <- read.table("../../static/sleuth/tbl.txt", header = T)
t$path <- as.character(t$path)
obj <<- test <- PPP_sleuth_prep(t, extra_bootstrap_summary = TRUE, num_cores = 1)
## Warning: `select_()` is deprecated as of dplyr 0.7.0.
## Please use `select()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
test <- FFF_sleuth_fit(test, ~condition, 'full')
test <- RRR_sleuth_fit(test, ~1, 'reduced')
test <- LLL_sleuth_lrt(test, 'reduced', 'full')
The walkthrough
options(knitr.table.format = "html", kableExtra.auto_format = TRUE, kableExtra.font_size = 8)
fs <- 8
sleuth_prep
output is a list of 19; the observed count and bootstrapping information is stored in bs_summary
.
str(obj$bs_summary)
## List of 2
## $ obs_counts: num [1:1481, 1:6] 4.62 3.13 3.53 2.46 5.6 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:1481] "b0002" "b0003" "b0004" "b0006" ...
## .. ..$ : chr [1:6] "SRR493366" "SRR493367" "SRR493368" "SRR493369" ...
## $ sigma_q_sq: Named num [1:1481] 0.02105 0.07362 0.03354 0.049 0.00279 ...
## ..- attr(*, "names")= chr [1:1481] "b0002" "b0003" "b0004" "b0006" ...
Within sleuth_fit
, the function sleuth:::me_model
wrapped in sleuth:::me_model_by_row
elimiates inferential variance from total variance. Inferrential variance (variable sigma_q_sq
) is obtained from rowMean
of the bs_summary
.
## modified from me_model and me_model_by_row
me_model_by_row <- lapply(1:nrow(bs_summary$obs_counts), function(i) {
function(X, bs_summary$obs_counts[i, ], bs_summary$sigma_q_sq[i]) {
n <- nrow(X)
degrees_free <- n - ncol(X)
ols_fit <- lm.fit(X, y)
rss <- sum(ols_fit$residuals^2) # residual sum sq
## (raw) biological_var = total_var - inferential_var
sigma_sq <- rss/(degrees_free) - sigma_q_sq
## collectin result
mean_obs <- mean(y)
var_obs <- var(y)
list(ols_fit = ols_fit, b1 = ols_fit$coefficients[2], rss = rss,
sigma_sq = sigma_sq, sigma_q_sq = sigma_q_sq, mean_obs = mean_obs,
var_obs = var_obs)
}
})
Subsequently, sigma_sq_pmax
is obtained from sigma_sq
of each gene by sigma_sq_pmax = pmax(sigma_sq,0)
. This is to prevent negative variance of count.
head(FFF_mes_df) %>%
kable(.) %>%
kable_styling(font_size = fs)
rss | sigma_sq | sigma_q_sq | mean_obs | var_obs | target_id | sigma_sq_pmax |
---|---|---|---|---|---|---|
0.1104198 | 0.0065574 | 0.0210476 | 4.084207 | 0.2834129 | b0002 | 0.0065574 |
0.6804381 | 0.0964892 | 0.0736203 | 2.803340 | 0.2939586 | b0003 | 0.0964892 |
0.1362586 | 0.0005256 | 0.0335390 | 3.376731 | 0.0424544 | b0004 | 0.0005256 |
0.2719853 | 0.0189926 | 0.0490038 | 3.071661 | 0.2606999 | b0006 | 0.0189926 |
0.0323607 | 0.0052969 | 0.0027933 | 5.873750 | 0.0943577 | b0008 | 0.0052969 |
0.0722622 | 0.0166392 | 0.0014264 | 6.367659 | 0.1211020 | b0014 | 0.0166392 |
sleuth:::sliding_window_grouping
(uses ecdf
) and sleuth:::shrink_df
perform loess prediction using biological variance.
head(FFF_swg) %>%
kable(., "html") %>%
kable_styling(font_size = fs)
rss | sigma_sq | sigma_q_sq | mean_obs | var_obs | target_id | sigma_sq_pmax | iqr |
---|---|---|---|---|---|---|---|
1.0236176 | 0.1164558 | 0.1394486 | 1.1483599 | 2.748282 | b0071 | 0.1164558 | TRUE |
1.6284312 | 0.1188251 | 0.2882827 | 1.1150755 | 1.040309 | b0128 | 0.1188251 | TRUE |
3.2163447 | 0.5418898 | 0.2621964 | 1.2032913 | 1.017009 | b0325 | 0.5418898 | TRUE |
1.3087485 | 0.0864052 | 0.2407820 | 1.1389325 | 1.376002 | b0859 | 0.0864052 | TRUE |
0.7502072 | 0.0037614 | 0.1837905 | 0.6699952 | 1.378772 | b0897 | 0.0037614 | FALSE |
3.7595103 | 0.7443768 | 0.1955008 | 1.1707549 | 1.019893 | b1199 | 0.7443768 | FALSE |
head(FFF_shrink1) %>%
kable(., "html") %>%
kable_styling(font_size = fs)
rss | sigma_sq | sigma_q_sq | mean_obs | var_obs | target_id | sigma_sq_pmax | iqr | shrink |
---|---|---|---|---|---|---|---|---|
1.0236176 | 0.1164558 | 0.1394486 | 1.1483599 | 2.748282 | b0071 | 0.1164558 | TRUE | 0.7539531 |
1.6284312 | 0.1188251 | 0.2882827 | 1.1150755 | 1.040309 | b0128 | 0.1188251 | TRUE | 0.7663570 |
3.2163447 | 0.5418898 | 0.2621964 | 1.2032913 | 1.017009 | b0325 | 0.5418898 | TRUE | 0.7339655 |
1.3087485 | 0.0864052 | 0.2407820 | 1.1389325 | 1.376002 | b0859 | 0.0864052 | TRUE | 0.7574440 |
0.7502072 | 0.0037614 | 0.1837905 | 0.6699952 | 1.378772 | b0897 | 0.0037614 | FALSE | NA |
3.7595103 | 0.7443768 | 0.1955008 | 1.1707549 | 1.019893 | b1199 | 0.7443768 | FALSE | 0.7457314 |
A power of 4 is performed on shrink
variable to produce smooth_sigma_sq
.
head(FFF_shrink_select) %>%
kable(., "html") %>%
kable_styling(font_size = fs)
rss | sigma_sq | sigma_q_sq | mean_obs | var_obs | target_id | sigma_sq_pmax | iqr | smooth_sigma_sq |
---|---|---|---|---|---|---|---|---|
1.0236176 | 0.1164558 | 0.1394486 | 1.1483599 | 2.748282 | b0071 | 0.1164558 | TRUE | 0.3231301 |
1.6284312 | 0.1188251 | 0.2882827 | 1.1150755 | 1.040309 | b0128 | 0.1188251 | TRUE | 0.3449249 |
3.2163447 | 0.5418898 | 0.2621964 | 1.2032913 | 1.017009 | b0325 | 0.5418898 | TRUE | 0.2902034 |
1.3087485 | 0.0864052 | 0.2407820 | 1.1389325 | 1.376002 | b0859 | 0.0864052 | TRUE | 0.3291563 |
0.7502072 | 0.0037614 | 0.1837905 | 0.6699952 | 1.378772 | b0897 | 0.0037614 | FALSE | NA |
3.7595103 | 0.7443768 | 0.1955008 | 1.1707549 | 1.019893 | b1199 | 0.7443768 | FALSE | 0.3092642 |
The maximun of variance was obtained between raw and shrunken estimate.
head(FFF_shrink_mutate) %>%
kable(., "html") %>%
kable_styling(font_size = fs)
rss | sigma_sq | sigma_q_sq | mean_obs | var_obs | target_id | sigma_sq_pmax | iqr | smooth_sigma_sq | smooth_sigma_sq_pmax |
---|---|---|---|---|---|---|---|---|---|
1.0236176 | 0.1164558 | 0.1394486 | 1.1483599 | 2.748282 | b0071 | 0.1164558 | TRUE | 0.3231301 | 0.3231301 |
1.6284312 | 0.1188251 | 0.2882827 | 1.1150755 | 1.040309 | b0128 | 0.1188251 | TRUE | 0.3449249 | 0.3449249 |
3.2163447 | 0.5418898 | 0.2621964 | 1.2032913 | 1.017009 | b0325 | 0.5418898 | TRUE | 0.2902034 | 0.5418898 |
1.3087485 | 0.0864052 | 0.2407820 | 1.1389325 | 1.376002 | b0859 | 0.0864052 | TRUE | 0.3291563 | 0.3291563 |
0.7502072 | 0.0037614 | 0.1837905 | 0.6699952 | 1.378772 | b0897 | 0.0037614 | FALSE | NA | NA |
3.7595103 | 0.7443768 | 0.1955008 | 1.1707549 | 1.019893 | b1199 | 0.7443768 | FALSE | 0.3092642 | 0.7443768 |
The fixed-effect beta is estimated with sleuth:::covar_beta
using total vairance, which is the shrunken biological variance smooth_sigma_sq_pmax
plus inference variance sigma_q_sq
.
In the later part, the likelihood of each model is calculated by
all_likelihood <- sapply(nrow, function(i) {
cur_mu <- fitted.values
obs <- residuals + cur_mu
cur_summary <- obj$fits[[which_model]]$summary
cur_var <- smooth_sigma_sq_pmax + sigma_q_sq
sum(dnorm(obs, mean = cur_mu, sd = sqrt(cur_var),log = TRUE)) ## normal distribution!!
})
… with test stats as
\[2 \times (l_{full}- l_{null})\]
P-value is from pchisq
and q-value from p.adjust(pval, method = "BH")
.