Containerising a Pipeline

Reference: https://pawseysc.github.io/containers-bioinformatics-workshop/3.pipeline/index.html

$ cd /data
$ git clone https://github.com/PawseySC/containers-bioinformatics-workshop.git
$ cd containers-bioinformatics-workshop
$ export WORK=$(pwd)

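Goal: port a small RNA sequencing pipeline (see figure) to containers

  • The three tools used
    • salmon: fast transcript quantification from RNA-seq data, via salmon quant
    • fastqc: quality control for high-throughput sequencing data
    • multiqc: aggregates the analysis results of multiple samples into a single report

Skills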

  • search for container images on web registries

  • download images with singularity pull <IMAGE>

  • execute commands in containers through singularity exec <IMAGE> <CMD> <ARGS>

  • bind mount additional host directories using either

    • execution flag -B/--bind
    $ singularity shell -B /opt,/data:/mnt /tmp/Centos7-ompi.img
    • environment variable SINGULARITY_BINDPATH
    $ export SINGULARITY_BINDPATH="/opt,/data:/mnt"
    $ singularity shell /tmp/Centos7-ompi.img

    Bind /opt on the host to /opt in the container and /data on the host to /mnt in the container.
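To make the reading of a bind specification concrete, here is a minimal sketch that splits the string into host and container paths. It only illustrates the src[:dest] convention described above and is not how Singularity itself parses the value; describe_binds is a hypothetical helper name.

```shell
#!/bin/bash
# Illustration only: describe_binds is a hypothetical helper, not part of Singularity.
# It splits a comma-separated bind specification into "host -> container" pairs.
describe_binds() {
    local spec="$1" entry src dest
    local IFS=','                    # split the specification on commas
    for entry in $spec; do
        src="${entry%%:*}"           # text before the first colon: host path
        if [ "$entry" = "$src" ]; then
            dest="$src"              # no colon: same path inside the container
        else
            dest="${entry#*:}"       # text after the first colon: container path
            dest="${dest%%:*}"       # drop a trailing option such as :ro, if any
        fi
        echo "host $src -> container $dest"
    done
}

describe_binds "/opt,/data:/mnt"
# host /opt -> container /opt
# host /data -> container /mnt
```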

pipeline

Directory layout

[huangsisi@login01 pipeline]$ tree
.
├── data
│   ├── clean_outputs.sh
│   ├── ggal_gut_1.fq
│   ├── ggal_gut_2.fq
│   ├── images.sh
│   ├── original_pipe.sh
│   ├── solutions
│   │   ├── pipe.1.sh
│   │   ├── pipe.2.sh
│   │   └── pipe.3.sh
│   └── wrappers
│       ├── fastqc
│       ├── multiqc
│       └── salmon
└── reference
    └── ggal_1_48850000_49020000.Ggal71.500bpflank.fa

Key steps

  • step 1: salmon index
salmon index -t ggal_1_48850000_49020000.Ggal71.500bpflank.fa -i out_index &>log_index
  • step 2: salmon quant
salmon quant --libType=U -i ../reference/out_index -1 ggal_gut_1.fq -2 ggal_gut_2.fq -o ggal_gut &>log_quant
  • step 3: fastqc quality control
fastqc -o out_fastqc -f fastq -q ggal_gut_1.fq ggal_gut_2.fq &>log_fq
  • step 4: multiqc multiple quality control (note the use of symbolic links here, a trick worth learning)
mkdir out_multiqc
cd out_multiqc
ln -s ../ggal_gut .
ln -s ../out_fastqc .
# ONLY CHANGE THE NEXT LINE - EXECUTION LINE
multiqc -v . &>../log_mq
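The symbolic link trick in step 4 can be tried in isolation. The sketch below rebuilds the same layout in a throwaway directory (demo_dir is a made-up name, and the quant.sf file is just an empty placeholder) to show that the links make both result directories visible from inside out_multiqc, which is the directory multiqc is pointed at.

```shell
#!/bin/bash
# Sketch: reproduce the step 4 layout in a temporary directory.
# demo_dir and the empty quant.sf file are placeholders for illustration.
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/ggal_gut" "$demo_dir/out_fastqc" "$demo_dir/out_multiqc"
touch "$demo_dir/ggal_gut/quant.sf"

(
    # Work in a subshell so the caller's working directory is left untouched
    cd "$demo_dir/out_multiqc" || exit 1
    ln -s ../ggal_gut .      # link resolves relative to out_multiqc
    ln -s ../out_fastqc .
    ls -l                    # both entries show up as links pointing one level up
    ls ggal_gut              # the placeholder is reachable through the link
)
```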

Packages

  • salmon 1.2.1
  • fastqc 0.11.9
  • multiqc 1.9

Find and pull Packages

  • Taking salmon as the example: find a container image for Salmon using the web registry Quay, at https://quay.io. We could have gone directly to the BioContainers home page, https://biocontainers.pro, but its user interface is a bit less friendly right now.
    • Type salmon into the Search box at the top right and press Enter
    • In the results list, find biocontainers/salmon and click it
    • Click Tags on the left and find the most recent 1.2.1 tag, which is 1.2.1--hf69c8f4_0
    • Click Fetch on the right, choose Pull by Tag, and copy quay.io/biocontainers/salmon:1.2.1--hf69c8f4_0
  • Pull the container image for Salmon
$ singularity pull docker://quay.io/biocontainers/salmon:1.2.1--hf69c8f4_0
  • Find and pull images for FastQC and MultiQC
$ singularity pull docker://quay.io/biocontainers/fastqc:0.11.9--0
$ singularity pull docker://quay.io/biocontainers/multiqc:1.9--pyh9f0ad1d_0
  • The current directory now holds the 3 images, along with the inputs and scripts for the pipeline

  • Test a downloaded image
singularity exec ./multiqc_1.9--pyh9f0ad1d_0.sif multiqc --help
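The .sif file names used in these notes follow from the pulled URI: the image name and the tag joined by an underscore. As a small sketch of that mapping, the hypothetical helper sif_name below derives the file name with plain parameter expansion. It assumes the URI always carries a tag and does not call singularity at all.

```shell
#!/bin/bash
# sif_name is a hypothetical helper: it mirrors the <name>_<tag>.sif pattern
# seen in the file names above. Assumes the URI ends in <name>:<tag>.
sif_name() {
    local uri="$1"
    local ref="${uri##*/}"      # strip registry and namespace, e.g. salmon:1.2.1--hf69c8f4_0
    local name="${ref%%:*}"     # text before the colon
    local tag="${ref#*:}"       # text after the colon
    echo "${name}_${tag}.sif"
}

for uri in \
    docker://quay.io/biocontainers/salmon:1.2.1--hf69c8f4_0 \
    docker://quay.io/biocontainers/fastqc:0.11.9--0 \
    docker://quay.io/biocontainers/multiqc:1.9--pyh9f0ad1d_0
do
    sif_name "$uri"
done
# salmon_1.2.1--hf69c8f4_0.sif
# fastqc_0.11.9--0.sif
# multiqc_1.9--pyh9f0ad1d_0.sif
```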

Containerise the pipeline

first pass

Modify the four commands from the key steps above to run through singularity exec. Note that bind mounts may need absolute paths.

  • step 1: salmon index

The current directory, reference/, is mounted in the container by default, so no bind option is needed here.

singularity exec \
../data/salmon_1.2.1--hf69c8f4_0.sif \
salmon index -t ggal_1_48850000_49020000.Ggal71.500bpflank.fa -i out_index &>log_index
  • step 2: salmon quant

Bind mount the other directory, ../reference, because the index lives outside the current directory.

singularity exec \
-B /share/home/jianglab/huangsisi/usr/singularity/data/containers-bioinformatics-workshop/exercises/pipeline/reference \
./salmon_1.2.1--hf69c8f4_0.sif \
salmon quant --libType=U -i ../reference/out_index -1 ggal_gut_1.fq -2 ggal_gut_2.fq -o ggal_gut &>log_quant
  • step 3: fastqc quality control
singularity exec \
./fastqc_0.11.9--0.sif \
fastqc -o out_fastqc -f fastq -q ggal_gut_1.fq ggal_gut_2.fq &>log_fq
  • step 4: multiqc multiple quality control
mkdir out_multiqc
cd out_multiqc
ln -s ../ggal_gut .
ln -s ../out_fastqc .
# ONLY CHANGE THE NEXT LINE - EXECUTION LINE
singularity exec \
-B /share/home/jianglab/huangsisi/usr/singularity/data/containers-bioinformatics-workshop/exercises/pipeline/data \
../multiqc_1.9--pyh9f0ad1d_0.sif \
multiqc -v . &>../log_mq
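The long absolute paths passed to -B above do not have to be typed by hand. A minimal sketch, assuming the directory already exists: resolve the relative path with cd and pwd, then pass the result to -B. The helper abs_dir and the throwaway demo_root layout are hypothetical, and the final line only prints the command it would run.

```shell
#!/bin/bash
# abs_dir is a hypothetical helper: print the absolute path of an existing directory.
abs_dir() {
    ( cd "$1" 2>/dev/null && pwd )   # subshell keeps the caller's directory unchanged
}

# Throwaway layout that mirrors pipeline/data and pipeline/reference
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/pipeline/data" "$demo_root/pipeline/reference"

# Resolve ../reference as seen from the data directory
ref_dir="$(cd "$demo_root/pipeline/data" && abs_dir ../reference)"

# Only printed here; the real call would replace the placeholders
echo "would run: singularity exec -B $ref_dir <IMAGE> <CMD> <ARGS>"
```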

Make a backup copy, edit it, then run it

$ cp original_pipe.sh pipe.1.sh
$ vi pipe.1.sh
$ ./pipe.1.sh

second pass

$ ./clean_outputs.sh
$ cp pipe.1.sh pipe.2.sh
  • Container image paths

After the #!/bin/bash line, define a variable for each container image, using absolute paths

WORK=/share/home/jianglab/huangsisi/usr/singularity/data/containers-bioinformatics-workshop
salmon_image="$WORK/exercises/pipeline/data/salmon_1.2.1--hf69c8f4_0.sif"
fastqc_image="$WORK/exercises/pipeline/data/fastqc_0.11.9--0.sif"
multiqc_image="$WORK/exercises/pipeline/data/multiqc_1.9--pyh9f0ad1d_0.sif"
  • Bind mounted directories

Use the variable SINGULARITY_BINDPATH

export SINGULARITY_BINDPATH="$WORK/exercises/pipeline"

Script skeleton

#!/bin/bash

WORK=/share/home/jianglab/huangsisi/usr/singularity/data/containers-bioinformatics-workshop

salmon_image="$WORK/exercises/pipeline/data/salmon_1.2.1--hf69c8f4_0.sif"
fastqc_image="$WORK/exercises/pipeline/data/fastqc_0.11.9--0.sif"
multiqc_image="$WORK/exercises/pipeline/data/multiqc_1.9--pyh9f0ad1d_0.sif"

export SINGULARITY_BINDPATH="$WORK/exercises/pipeline"
[..]

singularity exec \
$salmon_image \
[..]

singularity exec \
$salmon_image \
[..]

singularity exec \
$fastqc_image \
[..]

singularity exec \
$multiqc_image \
[..]
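One optional addition to a skeleton like this, not part of the workshop script, is to check that every image file exists before the first step runs, so a wrong path fails immediately instead of halfway through. The sketch below uses a hypothetical helper, require_files, and demonstrates it on empty placeholder files in a temporary directory.

```shell
#!/bin/bash
# require_files is a hypothetical helper: it reports every missing path on stderr
# and returns non-zero if at least one is missing.
require_files() {
    local missing=0 f
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "missing image: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Demonstration with placeholders: only the salmon image "exists" here
demo_dir="$(mktemp -d)"
touch "$demo_dir/salmon_1.2.1--hf69c8f4_0.sif"

if require_files "$demo_dir/salmon_1.2.1--hf69c8f4_0.sif" "$demo_dir/fastqc_0.11.9--0.sif"; then
    echo "all images present"
else
    echo "some images are missing"
fi
```

In the skeleton the call would take "$salmon_image" "$fastqc_image" "$multiqc_image" and be followed by || exit 1.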

Test

$ ./pipe.2.sh

Streamlining the user experience of containers

Note that the commands all share a similar form

singularity exec $image <COMMAND> <ARGUMENTS>

So each command can be wrapped in a script like this

#!/bin/bash

image="$WORK/exercises/pipeline/data/salmon_1.2.1--hf69c8f4_0.sif"

cmd="salmon"
args="$@"

singularity exec $image $cmd $args

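args is set to $@, which in bash stands for the arguments passed to the script being called: anything added after the script name is passed on to args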
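The wrapper as written works for this pipeline because none of its arguments contain spaces. As a sketch of the difference, the two hypothetical functions below forward arguments either through a string variable (the wrapper's approach) or through "$@" directly. count_args stands in for the containerised command and simply prints how many arguments it received.

```shell
#!/bin/bash
# count_args is a hypothetical stand-in for the command inside the container.
count_args() { echo "$#"; }

# Same pattern as the wrapper: collect the arguments in one string, then expand it
via_string() {
    local args="$@"      # all arguments joined into a single string
    count_args $args     # unquoted expansion re-splits on whitespace
}

# Alternative: forward the arguments untouched
via_quoted() {
    count_args "$@"      # each original argument stays one argument
}

via_string -o "my output"   # prints 3: -o, my, output
via_quoted -o "my output"   # prints 2: -o, "my output"
```

If paths with spaces are ever a concern, the last line of the wrapper could become singularity exec "$image" "$cmd" "$@".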

Put a wrapper for each of the three tools in the wrappers directory, then

  • bind mount paths with SINGULARITY_BINDPATH
  • include wrappers directory in the value of bash variable PATH
PATH="$WORK/exercises/pipeline/data/wrappers:$PATH"
export SINGULARITY_BINDPATH="$WORK/exercises/pipeline"
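Prepending the wrappers directory to PATH is what lets the pipeline call salmon, fastqc and multiqc by their bare names. The effect can be checked without any container, as in this sketch, where demo_tool is a made-up wrapper name and demo_wrappers a temporary directory:

```shell
#!/bin/bash
# Sketch: a directory placed at the front of PATH wins command lookup.
# demo_wrappers and demo_tool are made-up names for illustration.
demo_wrappers="$(mktemp -d)"

cat > "$demo_wrappers/demo_tool" <<'EOF'
#!/bin/bash
echo "wrapper called with: $*"
EOF
chmod +x "$demo_wrappers/demo_tool"

PATH="$demo_wrappers:$PATH"

command -v demo_tool            # prints the path inside demo_wrappers
demo_tool quant --libType=U     # prints: wrapper called with: quant --libType=U
```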

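Remember to define the WORK variable in each wrapper script, or export it from the pipeline script: a variable that is assigned but not exported is not visible to the wrappers, which run as child processes.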
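Whether a wrapper can see WORK from the calling script comes down to whether the variable was exported. A quick experiment, using the made-up name DEMO_WORK so it cannot clash with a real variable:

```shell
#!/bin/bash
# Sketch: a child process only sees variables that have been exported.
# DEMO_WORK is a made-up name for illustration.
unset DEMO_WORK
DEMO_WORK=/some/path                                        # assigned, not exported

unexported="$(bash -c 'echo "${DEMO_WORK:-unset}"')"        # child does not see it
export DEMO_WORK
exported="$(bash -c 'echo "${DEMO_WORK:-unset}"')"          # now the child sees it

echo "before export: $unexported"   # before export: unset
echo "after export:  $exported"     # after export:  /some/path
```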

Putting it all together and running it :)

#!/bin/bash

echo "Pipeline started..."
WORK=/share/home/jianglab/huangsisi/usr/singularity/data/containers-bioinformatics-workshop
PATH="$WORK/exercises/pipeline/data/wrappers:$PATH"
export SINGULARITY_BINDPATH="$WORK/exercises/pipeline"

# step 1
cd ../reference
salmon index -t ggal_1_48850000_49020000.Ggal71.500bpflank.fa -i out_index &>log_index
cd ../data
echo " indexing completed"

# step 2
salmon quant --libType=U -i ../reference/out_index -1 ggal_gut_1.fq -2 ggal_gut_2.fq -o ggal_gut &>log_quant
echo " quantification completed"

# step 3
mkdir out_fastqc
fastqc -o out_fastqc -f fastq -q ggal_gut_1.fq ggal_gut_2.fq &>log_fq
echo " quality control completed"

# step 4
mkdir out_multiqc
cd out_multiqc
ln -s ../ggal_gut .
ln -s ../out_fastqc .
multiqc -v . &>../log_mq
cd ..
echo " multiple quality control completed"

echo "Pipeline finished!"
$ cat wrappers/salmon
#!/bin/bash

WORK=/share/home/jianglab/huangsisi/usr/singularity/data/containers-bioinformatics-workshop
image="$WORK/exercises/pipeline/data/salmon_1.2.1--hf69c8f4_0.sif"

cmd="salmon"
args="$@"

singularity exec $image $cmd $args
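Since the three wrappers differ only in the command and the image file, they could also be generated from one template. The following is a sketch under stated assumptions, not the workshop's method: make_wrapper and out_dir are hypothetical names, the output goes to a temporary directory rather than the real wrappers directory, and the generated scripts expect WORK to be set and exported by the caller instead of hardcoding it.

```shell
#!/bin/bash
# Sketch: generate one wrapper per tool from a shared template.
# out_dir stands in for the wrappers directory; make_wrapper is a hypothetical helper.
out_dir="$(mktemp -d)"

make_wrapper() {
    local cmd="$1" image_file="$2"
    # Unquoted heredoc: $cmd and $image_file expand now,
    # escaped \$ sequences are written literally and expand when the wrapper runs.
    cat > "$out_dir/$cmd" <<EOF
#!/bin/bash
# Generated wrapper: runs $cmd inside its container image.
: "\${WORK:?WORK must be set and exported}"
image="\$WORK/exercises/pipeline/data/$image_file"
singularity exec "\$image" $cmd "\$@"
EOF
    chmod +x "$out_dir/$cmd"
}

make_wrapper salmon  salmon_1.2.1--hf69c8f4_0.sif
make_wrapper fastqc  fastqc_0.11.9--0.sif
make_wrapper multiqc multiqc_1.9--pyh9f0ad1d_0.sif

cat "$out_dir/salmon"    # inspect one generated wrapper
```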