Reference: https://pawseysc.github.io/containers-bioinformatics-workshop/3.pipeline/index.html
$ cd /data
$ git clone https://github.com/PawseySC/containers-bioinformatics-workshop.git
$ cd containers-bioinformatics-workshop
$ export WORK=$(pwd)
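Goal: port a small RNA sequencing pipeline (see figure) to containers
The three tools used
salmon: fast transcript quantification from RNA-seq data, via salmon quant
fastqc: quality control for high-throughput sequencing data
multiqc: merges the analysis results of multiple samples into a single report
Skills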
search for container images on web registries
download images with singularity pull <IMAGE>
execute commands in containers through singularity exec <IMAGE> <CMD> <ARGS>
bind mount additional host directories using either
the execution flag -B
$ singularity shell -B /opt,/data:/mnt /tmp/Centos7-ompi.img
or the environment variable SINGULARITY_BINDPATH
$ export SINGULARITY_BINDPATH="/opt,/data:/mnt"
$ singularity shell /tmp/Centos7-ompi.img
Both forms bind /opt on the host to /opt in the container, and /data on the host to /mnt in the container.
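To make the source:destination mapping concrete, here is a minimal sketch that only parses a bind specification and prints the resulting mapping. It does not call Singularity at all, and the function name explain_binds is hypothetical, introduced purely for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print how a bind spec maps host paths to container paths.
# It only parses the string; Singularity itself is not invoked.
explain_binds() {
  local spec="$1" entry src dest
  local IFS=','
  for entry in $spec; do
    src="${entry%%:*}"            # part before the first colon
    case "$entry" in
      *:*)
        dest="${entry#*:}"        # part after the first colon
        dest="${dest%%:*}"        # drop an optional trailing :ro / :rw
        ;;
      *)
        dest="$src"               # no destination given: same path in the container
        ;;
    esac
    echo "host $src -> container $dest"
  done
}

explain_binds "/opt,/data:/mnt"
```

For the spec used above this prints `host /opt -> container /opt` followed by `host /data -> container /mnt`.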
Pipeline directory
[huangsisi@login01 pipeline]$ tree
.
├── data
│   ├── clean_outputs.sh
│   ├── ggal_gut_1.fq
│   ├── ggal_gut_2.fq
│   ├── images.sh
│   ├── original_pipe.sh
│   ├── solutions
│   │   ├── pipe.1.sh
│   │   ├── pipe.2.sh
│   │   └── pipe.3.sh
│   └── wrappers
│       ├── fastqc
│       ├── multiqc
│       └── salmon
└── reference
    └── ggal_1_48850000_49020000.Ggal71.500bpflank.fa
Key steps
step 1: salmon index, run in reference/
salmon index -t ggal_1_48850000_49020000.Ggal71.500bpflank.fa -i out_index &>log_index
step 2: salmon quant, run in data/
salmon quant --libType=U -i ../reference/out_index -1 ggal_gut_1.fq -2 ggal_gut_2.fq -o ggal_gut &>log_quant
step 3: fastqc quality control
fastqc -o out_fastqc -f fastq -q ggal_gut_1.fq ggal_gut_2.fq &>log_fq
step 4: multiqc multiple quality control (the use of symbolic links here is worth learning)
mkdir out_multiqc
cd out_multiqc
ln -s ../ggal_gut .
ln -s ../out_fastqc .

multiqc -v . &>../log_mq
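The symbolic-link trick in step 4 can be tried in isolation. The sketch below rebuilds the same layout in a throwaway directory created with mktemp (an assumption made for the demo; no real pipeline data is involved) and shows that the relative links resolve to the sibling directories.

```shell
#!/bin/bash
# Recreate the directory layout in a throwaway directory (empty folders only).
demo=$(mktemp -d)
mkdir -p "$demo/ggal_gut" "$demo/out_fastqc" "$demo/out_multiqc"

# Link the two result directories into out_multiqc, as in step 4.
# A relative link target is resolved from the directory that holds the link.
ln -s ../ggal_gut "$demo/out_multiqc/ggal_gut"
ln -s ../out_fastqc "$demo/out_multiqc/out_fastqc"

# Both entries show up as links pointing at the sibling directories,
# so nothing is copied or moved.
ls -l "$demo/out_multiqc"
```

Running multiqc on `.` inside out_multiqc then sees both result sets through the links, which is what the step above relies on.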
Packages
salmon 1.2.1
fastqc 0.11.9
multiqc 1.9
Find and pull Packages
Taking salmon as an example: find a container image for Salmon using the web registry Quay, at https://quay.io. We could have gone directly to the BioContainers home page, https://biocontainers.pro, however its user interface is a bit less friendly right now.
In the top right corner, search for salmon and press Enter
In the results list, find biocontainers/salmon and click it
Click Tags on the left and find the most recent build of version 1.2.1, which is 1.2.1--hf69c8f4_0
Click Fetch on the right, choose Pull by Tag, and copy quay.io/biocontainers/salmon:1.2.1--hf69c8f4_0
Pull the container image for Salmon
$ singularity pull docker://quay.io/biocontainers/salmon:1.2.1--hf69c8f4_0
Find and pull images for FastQC and MultiQC
$ singularity pull docker://quay.io/biocontainers/fastqc:0.11.9--0
$ singularity pull docker://quay.io/biocontainers/multiqc:1.9--pyh9f0ad1d_0
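The pulled files are referred to later by names such as salmon_1.2.1--hf69c8f4_0.sif. As a rough sketch, the name can be derived from the image URI as shown below. The pattern is inferred from the file names used in this note and assumes the URI carries an explicit tag; sif_name is a hypothetical helper, not a Singularity command.

```shell
#!/bin/bash
# Hypothetical helper: derive the .sif file name used in this note
# (<image name>_<tag>.sif) from a docker:// URI with an explicit tag.
sif_name() {
  local uri="$1"
  local last="${uri##*/}"     # e.g. salmon:1.2.1--hf69c8f4_0
  local name="${last%%:*}"    # e.g. salmon
  local tag="${last#*:}"      # e.g. 1.2.1--hf69c8f4_0
  echo "${name}_${tag}.sif"
}

for uri in \
  docker://quay.io/biocontainers/salmon:1.2.1--hf69c8f4_0 \
  docker://quay.io/biocontainers/fastqc:0.11.9--0 \
  docker://quay.io/biocontainers/multiqc:1.9--pyh9f0ad1d_0
do
  sif_name "$uri"
done
```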
The current directory now contains the 3 images, together with the inputs and scripts for the pipeline
singularity exec ./multiqc_1.9--pyh9f0ad1d_0.sif multiqc --help
Containerise the pipeline
first pass
Prepend singularity exec to the four key-step commands above; note that bind mounts may require absolute paths
step 1: the current directory, reference/, is mounted by default in the container
singularity exec \
  ../data/salmon_1.2.1--hf69c8f4_0.sif \
  salmon index -t ggal_1_48850000_49020000.Ggal71.500bpflank.fa -i out_index &>log_index
step 2: bind mount the other directory, ../reference
singularity exec \
  -B /share/home/jianglab/huangsisi/usr/singularity/data/containers-bioinformatics-workshop/exercises/pipeline/reference \
  ./salmon_1.2.1--hf69c8f4_0.sif \
  salmon quant --libType=U -i ../reference/out_index -1 ggal_gut_1.fq -2 ggal_gut_2.fq -o ggal_gut &>log_quant
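Since the bind path above is typed out in full, one possible convenience is to compute the absolute path from the relative one. This is only a sketch: abs_dir is a hypothetical helper, the script is assumed to be started from data/, and the final line just echoes the command rather than running Singularity.

```shell
#!/bin/bash
# Hypothetical helper: turn a relative directory into an absolute path.
abs_dir() {
  (cd "$1" && pwd)   # subshell, so the caller's working directory is unchanged
}

# Assumption: started from data/, so the reference directory is ../reference.
# Outside the workshop tree this falls back to "." so the demo still runs.
ref_dir=$(abs_dir ../reference 2>/dev/null || abs_dir .)

# Show the bind option that would be passed; nothing is executed here.
echo "singularity exec -B $ref_dir ./salmon_1.2.1--hf69c8f4_0.sif salmon quant [..]"
```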
step 3: fastqc quality control
singularity exec \
  ./fastqc_0.11.9--0.sif \
  fastqc -o out_fastqc -f fastq -q ggal_gut_1.fq ggal_gut_2.fq &>log_fq
step 4: multiqc multiple quality control
mkdir out_multiqc
cd out_multiqc
ln -s ../ggal_gut .
ln -s ../out_fastqc .

singularity exec \
  -B /share/home/jianglab/huangsisi/usr/singularity/data/containers-bioinformatics-workshop/exercises/pipeline/data \
  ../multiqc_1.9--pyh9f0ad1d_0.sif \
  multiqc -v . &>../log_mq
Make a backup copy of the original script, edit the copy, then run it
$ cp original_pipe.sh pipe.1.sh
$ vi pipe.1.sh
$ ./pipe.1.sh
second pass
$ ./clean_outputs.sh
$ cp pipe.1.sh pipe.2.sh
After #!/bin/bash, define a variable for each container image, using absolute paths
WORK=/share/home/jianglab/huangsisi/usr/singularity/data/containers-bioinformatics-workshop
salmon_image="$WORK/exercises/pipeline/data/salmon_1.2.1--hf69c8f4_0.sif"
fastqc_image="$WORK/exercises/pipeline/data/fastqc_0.11.9--0.sif"
multiqc_image="$WORK/exercises/pipeline/data/multiqc_1.9--pyh9f0ad1d_0.sif"
Use the variable SINGULARITY_BINDPATH
export SINGULARITY_BINDPATH="$WORK/exercises/pipeline"
Script skeleton
#!/bin/bash

WORK=/share/home/jianglab/huangsisi/usr/singularity/data/containers-bioinformatics-workshop
salmon_image="$WORK/exercises/pipeline/data/salmon_1.2.1--hf69c8f4_0.sif"
fastqc_image="$WORK/exercises/pipeline/data/fastqc_0.11.9--0.sif"
multiqc_image="$WORK/exercises/pipeline/data/multiqc_1.9--pyh9f0ad1d_0.sif"

export SINGULARITY_BINDPATH="$WORK/exercises/pipeline"

[..]
singularity exec \
  $salmon_image \
[..]
singularity exec \
  $salmon_image \
[..]
singularity exec \
  $fastqc_image \
[..]
singularity exec \
  $multiqc_image \
[..]
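Because the image variables hold absolute paths, a typo in any of them only surfaces when the corresponding step runs. A small pre-flight check, sketched below, could catch that earlier. check_images is a hypothetical helper and not part of the workshop scripts.

```shell
#!/bin/bash
# Hypothetical pre-flight check: report every image path that is not a
# readable file. Returns non-zero if at least one image is missing.
check_images() {
  local missing=0 image
  for image in "$@"; do
    if [ ! -r "$image" ]; then
      echo "missing image: $image" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Possible use in the pipeline script, after the image variables are defined:
#   check_images "$salmon_image" "$fastqc_image" "$multiqc_image" || exit 1
```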
Test it
Streamlining the user experience of containers
Notice that the commands all share a similar form
singularity exec $image <COMMAND> <ARGUMENTS>
so each command can be packaged in a wrapper script like this
#!/bin/bash

image="$WORK/exercises/pipeline/data/salmon_1.2.1--hf69c8f4_0.sif"
cmd="salmon"
args="$@"

singularity exec $image $cmd $args
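args is set to $@, which in bash syntax stands for the arguments of the script being invoked; any arguments appended after the script name are passed on to args.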
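One side effect of collecting $@ in a single string is that arguments containing spaces get split when the string is expanded unquoted. The sketch below illustrates this with two hypothetical functions, flattened and preserved, that count the arguments they forward. It does not touch Singularity, and none of the pipeline commands in this note pass arguments with spaces, so the wrapper works as written for them.

```shell
#!/bin/bash
# Print how many separate arguments were received.
count_args() { echo "$#"; }

# Same pattern as the wrapper: store the arguments in one string,
# then expand that string unquoted.
flattened() {
  local args
  args="$@"          # all arguments joined into a single string
  count_args $args   # unquoted on purpose: split again on whitespace
}

# Alternative: forward the arguments untouched.
preserved() {
  count_args "$@"
}

flattened -o "my results"   # prints 3: -o, my, results
preserved -o "my results"   # prints 2: -o, "my results"
```

If paths with spaces ever matter, ending the wrapper with `singularity exec "$image" "$cmd" "$@"` would keep each argument intact.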
Put a wrapper for each of the three tools into the wrappers directory, then
bind mount paths with SINGULARITY_BINDPATH
include the wrappers directory in the value of the bash variable PATH
PATH="$WORK/exercises/pipeline/data/wrappers:$PATH"
export SINGULARITY_BINDPATH="$WORK/exercises/pipeline"
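Remember to define the WORK variable in each wrapper script (or export it from the calling script): each wrapper runs as its own process and only sees variables that were exported.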
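As an alternative to writing the three wrappers by hand, they could be generated in a loop. The sketch below writes them into a throwaway directory (root, created with mktemp, stands in for $WORK in this demo) and bakes the absolute image path into each file, so the generated wrappers do not need WORK at run time. make_wrapper is a hypothetical helper, and the generated wrappers forward "$@" directly instead of going through args.

```shell
#!/bin/bash
# Stand-in for $WORK so the demo does not overwrite the real wrappers.
root=$(mktemp -d)
wrappers="$root/exercises/pipeline/data/wrappers"
mkdir -p "$wrappers"

# Hypothetical helper: write one executable wrapper for a tool/image pair.
make_wrapper() {
  local tool="$1" sif="$2"
  # $root, $sif and $tool are expanded now; \$image and \$@ are escaped,
  # so they stay literal in the file and are expanded when the wrapper runs.
  cat > "$wrappers/$tool" <<EOF
#!/bin/bash
image="$root/exercises/pipeline/data/$sif"
singularity exec "\$image" $tool "\$@"
EOF
  chmod +x "$wrappers/$tool"
}

make_wrapper salmon  salmon_1.2.1--hf69c8f4_0.sif
make_wrapper fastqc  fastqc_0.11.9--0.sif
make_wrapper multiqc multiqc_1.9--pyh9f0ad1d_0.sif

cat "$wrappers/salmon"
```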
Putting it all together and running :)
#!/bin/bash

echo "Pipeline started..."

WORK=/share/home/jianglab/huangsisi/usr/singularity/data/containers-bioinformatics-workshop
PATH="$WORK/exercises/pipeline/data/wrappers:$PATH"
export SINGULARITY_BINDPATH="$WORK/exercises/pipeline"

cd ../reference
salmon index -t ggal_1_48850000_49020000.Ggal71.500bpflank.fa -i out_index &>log_index
cd ../data
echo " indexing completed"

salmon quant --libType=U -i ../reference/out_index -1 ggal_gut_1.fq -2 ggal_gut_2.fq -o ggal_gut &>log_quant
echo " quantification completed"

mkdir out_fastqc
fastqc -o out_fastqc -f fastq -q ggal_gut_1.fq ggal_gut_2.fq &>log_fq
echo " quality control completed"

mkdir out_multiqc
cd out_multiqc
ln -s ../ggal_gut .
ln -s ../out_fastqc .
multiqc -v . &>../log_mq
cd ..
echo " multiple quality control completed"

echo "Pipeline finished!"
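The script above prints each "completed" message whether or not the preceding command succeeded, since the exit status is never checked. One way to tie the message to the outcome is sketched below. run_step is a hypothetical helper and not part of the workshop material.

```shell
#!/bin/bash
# Hypothetical helper: run one pipeline step, send its output to a log file,
# and report success or failure based on the command's exit status.
run_step() {
  local label="$1" log="$2"
  shift 2                      # the remaining arguments are the command to run
  if "$@" >"$log" 2>&1; then
    echo " $label completed"
  else
    echo " $label FAILED, see $log" >&2
    return 1
  fi
}

# Possible use in the pipeline script (same commands as above):
#   run_step "indexing" log_index \
#     salmon index -t ggal_1_48850000_49020000.Ggal71.500bpflank.fa -i out_index || exit 1
```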
$ cat wrappers/salmon

WORK=/share/home/jianglab/huangsisi/usr/singularity/data/containers-bioinformatics-workshop
image="$WORK/exercises/pipeline/data/salmon_1.2.1--hf69c8f4_0.sif"
cmd="salmon"
args="$@"

singularity exec $image $cmd $args