Sentieon

2017-4-10

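Sentieon is a suite of software tools for analyzing DNA sequence data; its main function is identifying mutations. Its main strengths are very high speed and high accuracy.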

Manual Quick Start
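The Manual describes how to use the software package in detail. The Quick Start mainly covers setting up a Sentieon system, which has already been done on the cluster.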

Directory containing the related files

Export the required variables before use [Necessary Before Quick Start]

export LICSRVR_HOST=mgt01
export LICSRVR_PORT=8990
export LOG_FILE=/Share/home/zhangqf/lipan/sentieon/log/1.log
export LICENSE_DIR=/Share/home/zhangqf/lipan/sentieon/license
export LICENSE_FILE=Tsinghua_University_Qiangfeng_Zhang_cluster_eval.lic
export BIN=/Share/home/zhangqf/lipan/sentieon/sentieon-genomics-201611.03
export SENTIEON_LICENSE=$LICSRVR_HOST:$LICSRVR_PORT
export SENTIEON_INSTALL_DIR=/Share/home/zhangqf/lipan/sentieon/sentieon-genomics-201611.03
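
Because every later command depends on these variables, it can help to confirm they are set before going further. The following is a minimal sketch, not part of the Sentieon package: the function name `check_sentieon_env` is hypothetical, and the list of variables it checks is simply the subset exported above that the user-facing commands rely on.

```shell
# Hypothetical helper: report any required variable that is unset or empty.
# Returns 0 when all variables are set, 1 otherwise.
check_sentieon_env() {
    missing=0
    for name in LICSRVR_HOST LICSRVR_PORT SENTIEON_LICENSE SENTIEON_INSTALL_DIR; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return $missing
}
```

A possible use is `check_sentieon_env || echo "export the variables above first"`, run once in each new shell session.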

Configuring the license server [administrators only; regular users do not need to do this]

Run the following on mgt01

# + Start
$BIN/libexec/licsrvr --start --log $LOG_FILE $LICENSE_DIR/$LICENSE_FILE

# + Stop
$BIN/libexec/licsrvr --stop --log $LOG_FILE $LICENSE_DIR/$LICENSE_FILE

# + Check
$BIN/libexec/licclnt ping -s $LICSRVR_HOST:$LICSRVR_PORT

Quick Start [Run It Step by Step]

Run the commands below one at a time, and consult the Manual to understand what each one does

# ******************************************
# 0. User settings 
# ******************************************

# Number of threads. You can get the number of cpu cores by running nproc  
nt=16

# Sample name and read group name. 
# It is important to assign meaningful names in actual cases.
# It is particularly important to assign a different read group name 
# to each lane when you need to combine results from different lanes later on.
group="read_group_name"
sample="sample_name"

# Sequencing platform. Only Illumina supported.
pl="ILLUMINA" #platform

# Change this to the directory that holds your own quick start files
DIR=/Share/home/zhangqf/lipan/sentieon/quick_start

# Input paired-end Illumina fastq files 
fastq_1=$DIR/1.fastq.gz
fastq_2=$DIR/2.fastq.gz

# Fasta reference file
fasta=$DIR/reference/ucsc.hg19_chr22.fasta

# SNP known sites
dbsnp=$DIR/reference/dbsnp_135.hg19_chr22.vcf
db_1000G=$DIR/reference/1000G_phase1.snps.high_confidence.hg19_chr22.sites.vcf
db_mills=$DIR/reference/Mills_and_1000G_gold_standard.indels.hg19_chr22.sites.vcf
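
Since DIR has to be edited by each user, a wrong path is an easy mistake to make at this point. As a minimal sketch (the helper name `require_files` is hypothetical and not part of Sentieon), the inputs can be checked before any tool is run:

```shell
# Hypothetical helper: verify that every path given exists and is non-empty.
# Returns 0 if all files are present, 1 if any is missing or empty.
require_files() {
    status=0
    for path in "$@"; do
        if [ ! -s "$path" ]; then
            echo "missing or empty: $path" >&2
            status=1
        fi
    done
    return $status
}
```

For example: `require_files "$fastq_1" "$fastq_2" "$fasta" "$dbsnp" "$db_1000G" "$db_mills" || echo "fix the paths above before continuing"`.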



# ******************************************
# 1. Mapping reads with BWA-MEM. Output coordinate-sorted BAM file.
# ******************************************
$SENTIEON_INSTALL_DIR/bin/bwa mem \
    -M \
    -R "@RG\tID:$group\tSM:$sample\tPL:$pl" \
    -t $nt $fasta $fastq_1 $fastq_2 > smoke_sorted.raw.sam

$SENTIEON_INSTALL_DIR/bin/sentieon util sort -o smoke_sorted.bam -t $nt --sam2bam -i smoke_sorted.raw.sam
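
The two commands above write an uncompressed SAM file to disk before sorting it. A possible variant, sketched here under assumptions, pipes the aligner straight into the sorter so that the intermediate file is never written. It assumes that `util sort` reads SAM from standard input when given `-i -`; check this against the Manual for your release before relying on it. The function name `map_and_sort` is hypothetical.

```shell
# Hypothetical wrapper: align with BWA-MEM and sort in one pipeline.
# Uses the variables from the user settings: group, sample, pl, nt, fasta, fastq_1, fastq_2.
# ASSUMPTION: "util sort -i -" reads SAM from stdin (verify in the Manual).
map_and_sort() {
    out_bam=$1
    "$SENTIEON_INSTALL_DIR/bin/bwa" mem \
        -M \
        -R "@RG\tID:$group\tSM:$sample\tPL:$pl" \
        -t "$nt" "$fasta" "$fastq_1" "$fastq_2" \
    | "$SENTIEON_INSTALL_DIR/bin/sentieon" util sort \
        -o "$out_bam" -t "$nt" --sam2bam -i -
}
```

Calling `map_and_sort smoke_sorted.bam` would then stand in for both commands above. Note that in a plain pipeline the exit status is that of the last command, so a failure in `bwa mem` may go unnoticed unless the output is inspected.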


# ******************************************
# 2. Metrics
#
# Produces Picard standard metrics of the input sequence and alignment. 
# ******************************************
$SENTIEON_INSTALL_DIR/bin/sentieon driver \
    -r $fasta \
    -t $nt \
    -i smoke_sorted.bam \
    --algo MeanQualityByCycle smoke_mq_metrics.txt \
    --algo QualDistribution smoke_qd_metrics.txt \
    --algo GCBias \
        --summary smoke_gc_summary.txt smoke_gc_metrics.txt \
    --algo AlignmentStat smoke_aln_metrics.txt \
    --algo InsertSizeMetricAlgo smoke_is_metrics.txt

$SENTIEON_INSTALL_DIR/bin/sentieon plot metrics \
    -o smoke_metrics_report.pdf \
    gc=smoke_gc_metrics.txt \
    qd=smoke_qd_metrics.txt \
    mq=smoke_mq_metrics.txt \
    isize=smoke_is_metrics.txt


# ******************************************
# 3. Remove Duplicate Reads
#
# This step marks and removes duplicates.
# If you only want to mark duplicates without removing them, drop the --rmdup option from the second command.
# ******************************************

# + collects read information that will be used for removing duplicate reads
$SENTIEON_INSTALL_DIR/bin/sentieon driver \
    -t $nt \
    -i smoke_sorted.bam \
    --algo LocusCollector \
        --fun score_info smoke_score.txt

# + removes the duplicate reads
$SENTIEON_INSTALL_DIR/bin/sentieon driver  \
    -t $nt \
    -i smoke_sorted.bam \
    --algo Dedup \
        --rmdup \
        --score_info smoke_score.txt \
        --metrics smoke_dedup_metrics.txt smoke_deduped.bam
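
The note at the top of this step says that marking without removal is done by leaving out --rmdup. A minimal sketch of that variant follows; the function name `mark_duplicates_only` and the output file name passed to it are hypothetical, and everything else is copied from the command above.

```shell
# Hypothetical wrapper: same Dedup call as above, but without --rmdup,
# so duplicates are marked in the output BAM rather than removed.
mark_duplicates_only() {
    out_bam=$1   # illustrative, e.g. smoke_marked.bam
    "$SENTIEON_INSTALL_DIR/bin/sentieon" driver \
        -t "$nt" \
        -i smoke_sorted.bam \
        --algo Dedup \
            --score_info smoke_score.txt \
            --metrics smoke_dedup_metrics.txt "$out_bam"
}
```

If this variant is used, the later steps would take its output in place of smoke_deduped.bam.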


# ******************************************
# 5. Indel realigner
#
# This step produces InDel-realigned BAM file.
# ******************************************
$SENTIEON_INSTALL_DIR/bin/sentieon driver \
    -r $fasta  \
    -t $nt \
    -i smoke_deduped.bam \
    --algo Realigner \
        -k $db_1000G \
        -k $db_mills \
        smoke_realigned.bam

# ******************************************
# 6. BQSR--Base Quality Score Recalibration
#
# NOTES: 
# 1. The -k options pointing to known SNP sites are optional, but recommended. 
# 2. Applying the ReadWriter algo is optional; the HC step that follows can apply the calibration table on the fly.
# ******************************************

# + calculates the recalibration table necessary to do the BQSR
$SENTIEON_INSTALL_DIR/bin/sentieon driver \
    -r $fasta  \
    -t $nt \
    -i smoke_realigned.bam \
    --algo QualCal \
        -k $dbsnp \
        -k $db_1000G \
        -k $db_mills \
        smoke_recal_data.table

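# + applies the recalibration table, computes the post-recalibration table, and writes the recalibrated BAM
$SENTIEON_INSTALL_DIR/bin/sentieon driver \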
    -r $fasta  \
    -t $nt \
    -i smoke_realigned.bam \
    -q smoke_recal_data.table \
    --algo QualCal \
        -k $dbsnp \
        -k $db_1000G \
        -k $db_mills smoke_recal_data.table.post \
    --algo ReadWriter smoke_recaled.bam # THE CALIBRATED BAM

# + generates the csv data used to plot the BQSR report
$SENTIEON_INSTALL_DIR/bin/sentieon driver \
    -t $nt \
    --algo QualCal \
        --plot \
        --before smoke_recal_data.table \
        --after smoke_recal_data.table.post \
        smoke_recal_data.csv

$SENTIEON_INSTALL_DIR/bin/sentieon plot bqsr \
    -o smoke_bqsrreport.pdf smoke_recal_data.csv


# ******************************************
# 7b. HC Variant caller
#
# Notes:
# 1. The HC variant caller accepts either a BAM file that has not been BQSR-calibrated together with the calibration table,
#    as shown in the example below, or a calibrated BAM on its own. 
# 2. In the former case, the HC variant caller applies the calibration table on the fly and needs both smoke_realigned.bam and smoke_recal_data.table as input.
# 3. In the latter case, pass only the calibrated BAM (here, smoke_recaled.bam). 
# 4. DO NOT PASS THE CALIBRATION TABLE TOGETHER WITH THE CALIBRATED BAM; otherwise the calibration table will be applied twice.
# ******************************************
$SENTIEON_INSTALL_DIR/bin/sentieon driver \
    -r $fasta  \
    -t $nt \
    -i smoke_realigned.bam \
    -q smoke_recal_data.table \
    --algo Haplotyper \
        -d $dbsnp smoke_output-hc.vcf
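
The notes above describe a second way to run the caller: on the calibrated BAM alone, with no -q option. A minimal sketch of that case is given below. The function name `call_hc_on_calibrated_bam` and the output file name passed to it are hypothetical; the remaining options are those of the command above, with the input changed to smoke_recaled.bam and -q dropped.

```shell
# Hypothetical wrapper: run Haplotyper on the already calibrated BAM.
# No -q option is given, so the calibration table is not applied a second time.
call_hc_on_calibrated_bam() {
    out_vcf=$1   # illustrative, e.g. smoke_output-hc-recaled.vcf
    "$SENTIEON_INSTALL_DIR/bin/sentieon" driver \
        -r "$fasta" \
        -t "$nt" \
        -i smoke_recaled.bam \
        --algo Haplotyper \
            -d "$dbsnp" "$out_vcf"
}
```

This only works if the ReadWriter algo was included in the BQSR step, since that is what produces smoke_recaled.bam.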


# ******************************************
# 7c. HC Variant caller generating GVCF
# ******************************************
$SENTIEON_INSTALL_DIR/bin/sentieon driver \
    -r $fasta  \
    -t $nt \
    -i smoke_realigned.bam \
    -q smoke_recal_data.table \
    --algo Haplotyper \
        -d $dbsnp \
        --emit_mode gvcf smoke_output-hc.g.vcf