@lakesea · 2018-04-26

SNP calling for soybean data

While waiting for the assembly, I am going to do SNP calling for all 800+ soybean samples, using a GATK + samtools SNP calling pipeline.

Data quality control

Use Trimmomatic for the quality control first.

  java -jar trimmomatic-0.33.jar PE s_1_1_sequence.txt.gz s_1_2_sequence.txt.gz \
    lane1_forward_paired.fq.gz lane1_forward_unpaired.fq.gz lane1_reverse_paired.fq.gz \
    lane1_reverse_unpaired.fq.gz ILLUMINACLIP:all-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:80
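
With 800+ samples, the trimming step can be wrapped in a simple batch loop. Below is a minimal sketch, assuming every pair of raw files follows the <sample>_1_sequence.txt.gz / <sample>_2_sequence.txt.gz naming pattern and that all-PE.fa holds the adapter sequences:

  # Hypothetical batch loop over all paired-end samples in the current directory
  for r1 in *_1_sequence.txt.gz; do
    sample=${r1%_1_sequence.txt.gz}
    r2=${sample}_2_sequence.txt.gz
    java -jar trimmomatic-0.33.jar PE ${r1} ${r2} \
      ${sample}_forward_paired.fq.gz ${sample}_forward_unpaired.fq.gz \
      ${sample}_reverse_paired.fq.gz ${sample}_reverse_unpaired.fq.gz \
      ILLUMINACLIP:all-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:80
  done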

Mapping

BWA is employed for the mapping:

  bwa mem -M -R '@RG\tID:sample_1\tLB:sample_1\tPL:ILLUMINA\tPM:HISEQ\tSM:sample_1' -t 5 ref input_1 input_2 > aligned_reads.sam
  # Need to provide the -M flag to BWA; this tells it to mark split reads as secondary, which is needed for GATK variant calling / Picard support.
  # Read group info is provided with the -R flag. This information is key for downstream GATK functionality; GATK will not work without a read group tag.
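
Because every sample needs its own read group, the mapping command can also be generated per sample. A minimal sketch, assuming one pair of trimmed files per sample and that the sample name is reused for the ID, LB and SM fields (the reference file name here is an assumption):

  REF=Wm82.fasta   # assumed reference FASTA, indexed beforehand with: bwa index Wm82.fasta
  for r1 in *_forward_paired.fq.gz; do
    sample=${r1%_forward_paired.fq.gz}
    r2=${sample}_reverse_paired.fq.gz
    bwa mem -M \
      -R "@RG\tID:${sample}\tLB:${sample}\tPL:ILLUMINA\tPM:HISEQ\tSM:${sample}" \
      -t 5 ${REF} ${r1} ${r2} > ${sample}.aligned_reads.sam
  done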

Sam2Bam and sort it

In this step, the output will be a coordinate-sorted BAM file.

  java -jar picard.jar SortSam INPUT=aligned_reads.sam OUTPUT=sorted_reads.bam SORT_ORDER=coordinate

Optional steps: Collect Alignment & Insert Size Metrics

Getting basic metrics:

  java -jar picard.jar CollectAlignmentSummaryMetrics R=ref I=sorted_reads.bam O=alignment_metrics.txt
  java -jar picard.jar CollectInsertSizeMetrics INPUT=sorted_reads.bam OUTPUT=insert_metrics.txt HISTOGRAM_FILE=insert_size_histogram.pdf
  samtools depth -a sorted_reads.bam > depth_out.txt
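
The per-base depth file can get large; a quick way to reduce it to a single mean-coverage number is sketched below (assuming the standard three-column output of samtools depth):

  # Column 3 of `samtools depth -a` output is the depth at each position
  awk '{sum += $3} END {if (NR > 0) print "mean depth:", sum/NR}' depth_out.txt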

Mark Duplicates

  java -jar picard.jar MarkDuplicates INPUT=sorted_reads.bam OUTPUT=dedup_reads.bam METRICS_FILE=metrics.txt

Merging bam files

Merge multiple BAM files into one BAM file if an individual has multiple libraries.

  # For each sample prefix (the part of the file name before the first "_"), print a
  # quoted samtools merge command if there are several BAMs, or a quoted mv command
  # if there is only one. The loop only prints the commands; it does not run them.
  for i in `ls --color=no *.bam | cut -d'_' -f1 | sort -u`; do
    echo
    files=`ls --color=no ${i}*.bam`
    if [ $(echo "$files" | wc -l) -gt 1 ]; then
      mergecmd=`echo -n samtools merge -@ 9 ${i}.merge.bam $files`
      echo -n "'${mergecmd}'"
    else
      renamecmd=`echo -n mv $files ${i}.merge.bam`
      echo -n "'${renamecmd}'"
    fi
  done
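
One way to actually run the printed commands, as a sketch: redirect the loop's output into a file (merge_cmds.txt is a hypothetical name) and strip the quoting before feeding it to bash.

  # Drop blank lines, remove the wrapping single quotes, and execute each command
  grep . merge_cmds.txt | tr -d "'" | bash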

Mark Duplicates again for the merged bam files

  # The input should now be the merged BAM from the previous step (file names follow the ${i}.merge.bam convention above)
  java -jar picard.jar MarkDuplicates INPUT=sample.merge.bam OUTPUT=sample.merge.dedup.bam METRICS_FILE=merge_metrics.txt

Build BAM Index and fa index

  java -jar picard.jar BuildBamIndex INPUT=dedup_reads.bam
  samtools faidx genome.fasta
  java -jar picard.jar CreateSequenceDictionary R=genome.fasta O=genome.dict

AddOrReplaceReadGroups

In this step, you need to modify the RG (read group) ID in each BAM file so that GATK knows which individual each BAM file belongs to.

  for i in `cat bamfile.list`; do
    new=`echo ${i} | cut -f1 -d "."`
    echo "picard AddOrReplaceReadGroups I=/scratch/pawsey0149/hhu/soy_bean/pangenome/results/3.SNPcalling/7.realginment/SRR/step2out/${i} O=/scratch/pawsey0149/hhu/soy_bean/pangenome/results/3.SNPcalling/7.realginment/SRR/step2out_rg/${i} RGID=${new} RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=${new}; samtools index /scratch/pawsey0149/hhu/soy_bean/pangenome/results/3.SNPcalling/7.realginment/SRR/step2out_rg/${i}"
  done

Indel realigner

In this step, each BAM file is processed individually.

  java -jar GenomeAnalysisTK.jar -T IndelRealigner -R genome.fasta -targetIntervals realn.intervals -I V1.bam -o V1.realn.bam
  java -jar GenomeAnalysisTK.jar -T IndelRealigner -R genome.fasta -targetIntervals realn.intervals -I V2.bam -o V2.realn.bam
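
The realn.intervals file used above is not created anywhere in this note; it is assumed to come from RealignerTargetCreator, roughly as in this sketch (the intervals can be built per sample or once from several BAMs):

  java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R genome.fasta -I V1.bam -o realn.intervals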

Merge all gVCF files
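
The per-sample g.vcf files combined below are assumed to come from HaplotypeCaller run in GVCF mode on each realigned BAM, roughly as in this sketch:

  java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R genome.fasta -I V1.realn.bam --emitRefConfidence GVCF -o V1.g.vcf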

  samples=$(find . | sed 's/.\///' | grep -E 'g.vcf$' | sed 's/^/--variant /')
  java -Xmx500g -jar /home/hhu/biosoft/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T GenotypeGVCFs -R /scratch/pawsey0149/hhu/soy_bean/GBS/ref/william/pchr_Wm82_QC11/Wm82.fasta -nt 30 -o variants.gatk.raw1.vcf $(echo $samples)

Separate each BAM file based on Chr

  #!/bin/bash
  # Usage: <script> <per-chromosome FASTA file name> <BAM file name>
  Chr_position=$1
  Bam_file=$2
  echo "start ${Chr_position}_${Bam_file}"
  # Take the sequence name from the per-chromosome FASTA header and extract only the reads mapped to that chromosome
  echo `grep ">" /scratch/pawsey0149/hhu/soy_bean/GBS/ref/william/pchr_Wm82_QC11/${Chr_position} | cut -f2 -d ">"| cut -f1 -d " "` | xargs samtools view -b /scratch/pawsey0149/hhu/soy_bean/GBS/data/HN_HP/results/4.Realignment/step2out/${Bam_file} > /scratch/pawsey0149/hhu/soy_bean/GBS/data/HN_HP/results/4.Realignment/subbam/output/${Chr_position}_${Bam_file}
  echo done
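
Assuming the script above is saved as split_by_chr.sh (a hypothetical name) and that there is one FASTA file per chromosome in the reference directory, it can be driven like this sketch:

  # Hypothetical driver: one call per chromosome FASTA x BAM file
  for chr in Gm01.fasta Gm02.fasta; do     # assumed per-chromosome FASTA names; extend to all 20 chromosomes
    for bam in `cat bamfile.list`; do      # the same BAM file list used earlier
      bash split_by_chr.sh ${chr} ${bam}
    done
  done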

Using samtools to call SNPs

  samtools mpileup -t AD,ADF,ADR,DP,SP -ug -b Gm01.bam.list | bcftools call -vm > Gm01_variants.samtools.raw1.vcf
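
The Gm01.bam.list file is assumed to simply list the Gm01 sub-BAMs produced in the previous step, one path per line, for example:

  # Hypothetical: collect every Gm01 sub-BAM into the list consumed by mpileup
  ls /scratch/pawsey0149/hhu/soy_bean/GBS/data/HN_HP/results/4.Realignment/subbam/output/Gm01_*.bam > Gm01.bam.list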

Merge the VCFs called on the different chromosomes

  # CatVariants is invoked via -cp plus the full class name, not via -jar
  java -cp ~/biosoft/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants -R /scratch/pawsey0149/hhu/soy_bean/GBS/ref/william/pchr_Wm82_QC11/Wm82.fasta -V Gm01_variants.samtools.raw1.vcf -V Gm02_variants.samtools.raw1.vcf -V Gm03_variants.samtools.raw1.vcf -V Gm04_variants.samtools.raw1.vcf -V Gm05_variants.samtools.raw1.vcf -V Gm06_variants.samtools.raw1.vcf -V Gm07_variants.samtools.raw1.vcf -V Gm08_variants.samtools.raw1.vcf -V Gm09_variants.samtools.raw1.vcf -V Gm10_variants.samtools.raw1.vcf -V Gm11_variants.samtools.raw1.vcf -V Gm12_variants.samtools.raw1.vcf -V Gm13_variants.samtools.raw1.vcf -V Gm14_variants.samtools.raw1.vcf -V Gm15_variants.samtools.raw1.vcf -V Gm16_variants.samtools.raw1.vcf -V Gm17_variants.samtools.raw1.vcf -V Gm18_variants.samtools.raw1.vcf -V Gm19_variants.samtools.raw1.vcf -V Gm20_variants.samtools.raw1.vcf -V GmXX_variants.samtools.raw1.vcf -out variants.samtools.raw1.vcf -assumeSorted

Merge results from GATK and samtools

  java -Xmx120g -jar /home/hhu/biosoft/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T SelectVariants -R /scratch/pawsey0149/hhu/soy_bean/GBS/ref/william/pchr_Wm82_QC11/Wm82.fasta -nt 28 --variant /scratch/pawsey0149/hhu/soy_bean/GBS/data/HN_HP/results/5.1snpcalling_gatk/hapo_out/variants.gatk.raw1.vcf --concordance /scratch/pawsey0149/hhu/soy_bean/GBS/data/HN_HP/results/5.1snpcalling_samtools/vcfout/variants.samtools.raw1.vcf -o variants.concordance.raw1.vcf

Filtering and getting the high-coverage SNPs:

You can change the parameters based on your requirements.

  java -Xmx120g -jar /home/hhu/biosoft/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T VariantFiltration -R /scratch/pawsey0149/hhu/soy_bean/GBS/ref/william/pchr_Wm82_QC11/Wm82.fasta -V /scratch/pawsey0149/hhu/soy_bean/GBS/data/HN_HP/results/5.1snpcalling_merge/variants.concordance.raw1.vcf -filterName FilterQual --filterExpression "QUAL < 60.0" --filterName FilterQD --filterExpression "QD < 20.0" --filterName FilterFS --filterExpression "FS > 13.0" --filterName FilterMQ --filterExpression "MQ < 30.0" --filterName FilterMQRankSum --filterExpression "MQRankSum < -1.65" --filterName FilterReadPosRankSum --filterExpression "ReadPosRankSum < -1.65" -o variants.concordance.flt1.vcf 2> variants.concordance.flt1.log
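
Note that VariantFiltration only annotates the FILTER column; it does not remove failing records. If a file with only the passing sites is wanted, one option is SelectVariants with --excludeFiltered, sketched here:

  java -Xmx120g -jar /home/hhu/biosoft/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T SelectVariants -R /scratch/pawsey0149/hhu/soy_bean/GBS/ref/william/pchr_Wm82_QC11/Wm82.fasta -V variants.concordance.flt1.vcf --excludeFiltered -o variants.concordance.pass1.vcf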

Extract SNPs and INDELs from the filtered VCF file

  java -jar /opt/biosoft/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R genome.fasta -V variants.vcf -selectType SNP -o SNPs.vcf
  java -jar /opt/biosoft/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R genome.fasta -V variants.vcf -selectType INDEL -o INDELs.vcf