@lakesea 2020-10-08T06:00:31.000000Z 字数 8721 阅读 413

Chicken annotation method record

项目

RNA-seq evidences

The RNA-seq libs were selected:

SRR8692168.sra
SRR8692152.sra
SRR8633200.sra
SRR7126367.sra
SRR6378268.sra
SRR6378275.sra
SRR6378282.sra
SRR6116784.sra
SRR4929869.sra
SRR3176315.sra
SRR1531072.sra
SRR1531044.sra
SRR5786000.sra
SRR6884529.sra
SRR6911324.sra
SRR7072169.sra
SRR9137587.sra
SRR9109599.sra
SRR5228682.sra
SRR2079878.sra
SRR2079857.sra
SRR8216284.sra
SRR9179250.sra
SRR8902636.sra
SRR8216279.sra
SRR7072170.sra
SRR5822167.sra
SRR3970455.sra
SRR1299039.sra
SRR7184316.sra
SRR8985031.sra
SRR8985046.sra
SRR7779699.sra
SRR7503967.sra
SRR5190400.sra
SRR4030880.sra
SRR5828096.sra
SRR5828100.sra
SRR5839742.sra
SRR5839746.sra
SRR6830245.sra
SRR6830234.sra
SRR6830225.sra
SRR6830201.sra
SRR6830216.sra
SRR4933174.sra
SRR5190502.sra
SRR5190512.sra
SRR7072163.sra
SRR7009040.sra
SRR5340698.sra
SRR1299037.sra
SRR3965980.sra
SRR6380258.sra
SRR5823222.sra
SRR5823204.sra
SRR5823188.sra
SRR2032312.sra
SRR2032322.sra
SRR2032333.sra
SRR5119412.sra
SRR7900202.sra
SRR7900205.sra
SRR7900209.sra
SRR7127919.sra
SRR7127901.sra
SRR6076932.sra
SRR4045466.sra
SRR7127910.sra
SRR5824317.sra
SRR5824320.sra
SRR5824325.sra
SRR9179219.sra
SRR9179228.sra
SRR9179247.sra
SRR1289906.sra
SRR1290657.sra
SRR1290680.sra
SRR5004517
SRR5004746
SRR5007601
SRR5007738
ERR569571
ERR569564
SRR6357659.sra
SRR6357662.sra
SRR6116750.sra
SRR6116762.sra
SRR6116778.sra
ERR597330.sra
ERR597353.sra
SRR5190404.sra
SRR8927005.sra
SRR8927016.sra
SRR7963328.sra
SRR7963310.sra
SRR7904023.sra
SRR2005777.sra
SRR4101252.sra
SRR9211413.sra
SRR8490105.sra
SRR7779717.sra
SRR7110440.sra
SRR5089019.sra
SRR1769980.sra
SRR6076921.sra
SRR4478735.sra
SRR4478689.sra
SRR4478643.sra
SRR4478667.sra
SRR8356628.sra
SRR7779731.sra
SRR5190474.sra
SRR4096123.sra
SRR2516426.sra
SRR1821184.sra
SRR1821188.sra

RNA-seq align to pangenome and run stringtie:

fastp -i SRR9179250_1.fastq -I SRR9179250_2.fastq -o SRR9179250.clean_1.fastq -O SRR9179250.clean_2.fastq -w 4
hisat2 -x /scratch/pawsey0149/hhu/mainwork/chi_pan/ref/pan_nonref.36.gt500.round2.fa -1 SRR9179250.clean_1.fastq -2 SRR9179250.clean_2.fastq -p 4 |samtools view -@ 3 -Sb - | samtools sort -@ 3 -o /scratch/pawsey0149/hhu/mainwork/chi_pan/rna_seq/summary_srr_id/select/fastq/pair/bam/SRR9179250.sort.bam
samtools sort -@ 28  all.merge.bam -o all.sort.bam
samtools index -@ 28 all.sort.bam
/scratch/pawsey0149/hhu/tool/stringtie-2.1.4.Linux_x86_64/stringtie -p 16 -o merged.gtf /scratch/pawsey0149/hhu/mainwork/chi_pan/rna_seq/summary_srr_id/select/fastq/pair/bam/all.sort.bam

assembly the transcript and convert it into transcript cds fasta:

util/cufflinks_gtf_genome_to_cdna_fasta.pl merged.gtf input/chi_masked.fa > transcripts.fasta
util/cufflinks_gtf_to_alignment_gff3.pl merged.gtf > transcripts.gff3
TransDecoder.LongOrfs -t transcripts.fasta
TransDecoder.Predict -t transcripts.fasta
util/cdna_alignment_orf_to_genome_orf.pl \
     transcripts.fasta.transdecoder.gff3 \
     transcripts.gff3 \
     transcripts.fasta > transcripts.fasta.transdecoder.genome.gff3

protein evidences

remove the highly similar sequences from the protein sequences.

cat Aves_Nchicken.fasta Chicken.fasta human.fasta > subseq.fasta
~/biosoft/cd-hit-v4.6.8-2017-0621/cd-hit -i subseq.fasta -o subseq.dedup -M 6000

SNAP

Generate based on the previous notes, which resulted from the first round of the MAKER annotation.

Gene prediction based on BRAKER

BRAKER can used to generate the gene prdiction gene model.This step will genereate the augustus and genemark gene prediction model, which can be used for maker annotation.

/group/pawsey0149/hhu/bioconda/miniconda2/bin/perl /group/pawsey0149/groupEes=chi_pan_36 --bam=all.sort.bam --cores 16

Maker

All the evidence is intergrated by Maker.

#-----Genome (these are always required)
genome=/ws/hhu/soybean_data/braker/chi_maker/nonref.36.gt500.round2.fa #genome sequence (fasta file or fasta embeded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic
#-----Re-annotation Using MAKER Derived GFF3
maker_gff= #MAKER derived GFF3 file
est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no
#-----EST Evidence (for best results provide a file for at least one)
est=/ws/hhu/soybean_data/braker/chi_maker/final.transcript1s.fasta.transdecoder.cds.red.fa #set of ESTs or assembled mRNA-seq in fasta format
altest= #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closly relate species in GFF3 format
#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=/ws/hhu/soybean_data/braker/chi_maker/subseq.dedup.fasta  #protein sequence file in fasta format (i.e. from mutiple oransisms)
protein_gff=  #aligned protein homology evidence from an external GFF3 file
#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=simple #select a model organism for RepBase masking in RepeatMasker
rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein=/ws/groupEnv/appbio/app/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff=/ws/hhu/soybean_data/braker/chi_maker/full_mask.complex.reformat.gff3#pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
#-----Gene Prediction
snaphmm=/ws/hhu/soybean_data/braker/chi_maker/Pan_rnd1.zff.length50_aed0.25.hmm #SNAP HMM file
gmhmm= #GeneMark HMM file
augustus_species=chi_pan_36 #Augustus gene prediction species model
fgenesh_par_file= #FGENESH parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no
snoscan_rrna= #rRNA file to have Snoscan find snoRNAs
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
#-----Other Annotation Feature Types (features MAKER doesn't recognize)
other_gff= #extra features to pass-through to final MAKER generated GFF3 file
#-----External Application Behavior Options
alt_peptide=C #amino acid used to replace non-standard amino acids in BLAST databases
cpus=1 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)
#-----MAKER Behavior Options
max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases memory usage)
min_contig=1 #skip genome contigs below this length (under 10kb are often useless)
pred_flank=200 #flank for extending evidence clusters sent to gene predictors
pred_stats=0 #report AED and QI statistics for all predictions as well as models
AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)
min_protein=0 #require at least this many amino acids in predicted proteins
alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no
always_complete=0 #extra steps to force start and stop codons, 1 = yes, 0 = no
map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no
keep_preds=0 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1)
split_hit=10000 #length for the splitting of hits (expected max intron size for evidence alignments)
single_exon=0 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no
single_length=250 #min length required for single exon ESTs if 'single_exon is enabled'
correct_est_fusion=0 #limits use of ESTs in annotation to avoid fusion genes
tries=2 #number of times to try a contig if there is a failure for some reason
clean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = no
clean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no
TMP=/ws/hhu/soybean_data/braker/chi_maker/tmp/ #specify a directory other than the system default temporary directory for temporary files

Filter the redundant sequences

First extract thpse proteins with AA length longer than 50AA:

perl remove_small.pl 49 chi_pan.all.maker.augustus_masked.proteins.fasta > chi_pan.all.maker.proteins_50aa.fasta

script:

#!/usr/bin/perl
#remove small sequences from fasta file
use strict;
use warnings;
my $minlen = shift or die "Error: `minlen` parameter not provided\n";
{
   local $/=">";
   while(<>) {
       chomp;
       next unless /\w/;
       s/>$//gs;
       my @chunk = split /\n/;
       my $header = shift @chunk;
       my $seqlen = length join "", @chunk;
       print ">$_" if($seqlen >= $minlen);
   }
   local $/="\n";
}

Using cd-hits with 90% cut-off:

cd-hit-2d -i Gallus_gallus.GRCg6a.pep.all.fa -i2 chi_pan.all.maker.proteins_50aa.fasta -o chi_pan.all.maker.proteins_50aa_rd_0.9.fa -c 0.9

Based on the filtered set of protein file, removed those filtred genes from the gff file.

rename the annotation gff file with newID

# create naming table (there are additional options for naming beyond defaults)
maker_map_ids --prefix PanGallus_Gene --justify 5  chi_pan.all.gff > rename.map
# replace names in GFF files
map_gff_ids rename.map chi_pan.all.gff
grep "maker" chi_pan.all.gff >chi.pan.gff
# replace names in FASTA headers
map_fasta_ids rename.map chi_pan.all.maker.proteins_50aa_rd_0.9.fa