[关闭]
@lakesea 2020-10-08T06:00:31.000000Z 字数 8721 阅读 390

Chicken annotation method record

项目


RNA-seq evidences

The RNA-seq libs were selected:

  1. SRR8692168.sra
  2. SRR8692152.sra
  3. SRR8633200.sra
  4. SRR7126367.sra
  5. SRR6378268.sra
  6. SRR6378275.sra
  7. SRR6378282.sra
  8. SRR6116784.sra
  9. SRR4929869.sra
  10. SRR3176315.sra
  11. SRR1531072.sra
  12. SRR1531044.sra
  13. SRR5786000.sra
  14. SRR6884529.sra
  15. SRR6911324.sra
  16. SRR7072169.sra
  17. SRR9137587.sra
  18. SRR9109599.sra
  19. SRR5228682.sra
  20. SRR2079878.sra
  21. SRR2079857.sra
  22. SRR8216284.sra
  23. SRR9179250.sra
  24. SRR8902636.sra
  25. SRR8216279.sra
  26. SRR7072170.sra
  27. SRR5822167.sra
  28. SRR3970455.sra
  29. SRR1299039.sra
  30. SRR7184316.sra
  31. SRR8985031.sra
  32. SRR8985046.sra
  33. SRR7779699.sra
  34. SRR7503967.sra
  35. SRR5190400.sra
  36. SRR4030880.sra
  37. SRR5828096.sra
  38. SRR5828100.sra
  39. SRR5839742.sra
  40. SRR5839746.sra
  41. SRR6830245.sra
  42. SRR6830234.sra
  43. SRR6830225.sra
  44. SRR6830201.sra
  45. SRR6830216.sra
  46. SRR4933174.sra
  47. SRR5190502.sra
  48. SRR5190512.sra
  49. SRR7072163.sra
  50. SRR7009040.sra
  51. SRR5340698.sra
  52. SRR1299037.sra
  53. SRR3965980.sra
  54. SRR6380258.sra
  55. SRR5823222.sra
  56. SRR5823204.sra
  57. SRR5823188.sra
  58. SRR2032312.sra
  59. SRR2032322.sra
  60. SRR2032333.sra
  61. SRR5119412.sra
  62. SRR7900202.sra
  63. SRR7900205.sra
  64. SRR7900209.sra
  65. SRR7127919.sra
  66. SRR7127901.sra
  67. SRR6076932.sra
  68. SRR4045466.sra
  69. SRR7127910.sra
  70. SRR5824317.sra
  71. SRR5824320.sra
  72. SRR5824325.sra
  73. SRR9179219.sra
  74. SRR9179228.sra
  75. SRR9179247.sra
  76. SRR1289906.sra
  77. SRR1290657.sra
  78. SRR1290680.sra
  79. SRR5004517
  80. SRR5004746
  81. SRR5007601
  82. SRR5007738
  83. ERR569571
  84. ERR569564
  85. SRR6357659.sra
  86. SRR6357662.sra
  87. SRR6116750.sra
  88. SRR6116762.sra
  89. SRR6116778.sra
  90. ERR597330.sra
  91. ERR597353.sra
  92. SRR5190404.sra
  93. SRR8927005.sra
  94. SRR8927016.sra
  95. SRR7963328.sra
  96. SRR7963310.sra
  97. SRR7904023.sra
  98. SRR2005777.sra
  99. SRR4101252.sra
  100. SRR9211413.sra
  101. SRR8490105.sra
  102. SRR7779717.sra
  103. SRR7110440.sra
  104. SRR5089019.sra
  105. SRR1769980.sra
  106. SRR6076921.sra
  107. SRR4478735.sra
  108. SRR4478689.sra
  109. SRR4478643.sra
  110. SRR4478667.sra
  111. SRR8356628.sra
  112. SRR7779731.sra
  113. SRR5190474.sra
  114. SRR4096123.sra
  115. SRR2516426.sra
  116. SRR1821184.sra
  117. SRR1821188.sra

RNA-seq align to pangenome and run stringtie:

  1. fastp -i SRR9179250_1.fastq -I SRR9179250_2.fastq -o SRR9179250.clean_1.fastq -O SRR9179250.clean_2.fastq -w 4
  2. hisat2 -x /scratch/pawsey0149/hhu/mainwork/chi_pan/ref/pan_nonref.36.gt500.round2.fa -1 SRR9179250.clean_1.fastq -2 SRR9179250.clean_2.fastq -p 4 |samtools view -@ 3 -Sb - | samtools sort -@ 3 -o /scratch/pawsey0149/hhu/mainwork/chi_pan/rna_seq/summary_srr_id/select/fastq/pair/bam/SRR9179250.sort.bam
  3. samtools sort -@ 28 all.merge.bam -o all.sort.bam
  4. samtools index -@ 28 all.sort.bam
  5. /scratch/pawsey0149/hhu/tool/stringtie-2.1.4.Linux_x86_64/stringtie -p 16 -o merged.gtf /scratch/pawsey0149/hhu/mainwork/chi_pan/rna_seq/summary_srr_id/select/fastq/pair/bam/all.sort.bam

assembly the transcript and convert it into transcript cds fasta:

  1. util/cufflinks_gtf_genome_to_cdna_fasta.pl merged.gtf input/chi_masked.fa > transcripts.fasta
  2. util/cufflinks_gtf_to_alignment_gff3.pl merged.gtf > transcripts.gff3
  3. TransDecoder.LongOrfs -t transcripts.fasta
  4. TransDecoder.Predict -t transcripts.fasta
  5. util/cdna_alignment_orf_to_genome_orf.pl \
  6. transcripts.fasta.transdecoder.gff3 \
  7. transcripts.gff3 \
  8. transcripts.fasta > transcripts.fasta.transdecoder.genome.gff3

protein evidences

remove the highly similar sequences from the protein sequences.

  1. cat Aves_Nchicken.fasta Chicken.fasta human.fasta > subseq.fasta
  2. ~/biosoft/cd-hit-v4.6.8-2017-0621/cd-hit -i subseq.fasta -o subseq.dedup -M 6000

SNAP

Generate based on the previous notes, which resulted from the first round of the MAKER annotation.

Gene prediction based on BRAKER

BRAKER can used to generate the gene prdiction gene model.This step will genereate the augustus and genemark gene prediction model, which can be used for maker annotation.

  1. /group/pawsey0149/hhu/bioconda/miniconda2/bin/perl /group/pawsey0149/groupEes=chi_pan_36 --bam=all.sort.bam --cores 16

Maker

All the evidence is intergrated by Maker.

  1. #-----Genome (these are always required)
  2. genome=/ws/hhu/soybean_data/braker/chi_maker/nonref.36.gt500.round2.fa #genome sequence (fasta file or fasta embeded in GFF3 file)
  3. organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic
  4. #-----Re-annotation Using MAKER Derived GFF3
  5. maker_gff= #MAKER derived GFF3 file
  6. est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
  7. altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
  8. protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
  9. rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
  10. model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
  11. pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
  12. other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no
  13. #-----EST Evidence (for best results provide a file for at least one)
  14. est=/ws/hhu/soybean_data/braker/chi_maker/final.transcript1s.fasta.transdecoder.cds.red.fa #set of ESTs or assembled mRNA-seq in fasta format
  15. altest= #EST/cDNA sequence file in fasta format from an alternate organism
  16. est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
  17. altest_gff= #aligned ESTs from a closly relate species in GFF3 format
  18. #-----Protein Homology Evidence (for best results provide a file for at least one)
  19. protein=/ws/hhu/soybean_data/braker/chi_maker/subseq.dedup.fasta #protein sequence file in fasta format (i.e. from mutiple oransisms)
  20. protein_gff= #aligned protein homology evidence from an external GFF3 file
  21. #-----Repeat Masking (leave values blank to skip repeat masking)
  22. model_org=simple #select a model organism for RepBase masking in RepeatMasker
  23. rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
  24. repeat_protein=/ws/groupEnv/appbio/app/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner
  25. rm_gff=/ws/hhu/soybean_data/braker/chi_maker/full_mask.complex.reformat.gff3#pre-identified repeat elements from an external GFF3 file
  26. prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
  27. softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
  28. #-----Gene Prediction
  29. snaphmm=/ws/hhu/soybean_data/braker/chi_maker/Pan_rnd1.zff.length50_aed0.25.hmm #SNAP HMM file
  30. gmhmm= #GeneMark HMM file
  31. augustus_species=chi_pan_36 #Augustus gene prediction species model
  32. fgenesh_par_file= #FGENESH parameter file
  33. pred_gff= #ab-initio predictions from an external GFF3 file
  34. model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
  35. est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
  36. protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
  37. trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no
  38. snoscan_rrna= #rRNA file to have Snoscan find snoRNAs
  39. unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
  40. #-----Other Annotation Feature Types (features MAKER doesn't recognize)
  41. other_gff= #extra features to pass-through to final MAKER generated GFF3 file
  42. #-----External Application Behavior Options
  43. alt_peptide=C #amino acid used to replace non-standard amino acids in BLAST databases
  44. cpus=1 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)
  45. #-----MAKER Behavior Options
  46. max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases memory usage)
  47. min_contig=1 #skip genome contigs below this length (under 10kb are often useless)
  48. pred_flank=200 #flank for extending evidence clusters sent to gene predictors
  49. pred_stats=0 #report AED and QI statistics for all predictions as well as models
  50. AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)
  51. min_protein=0 #require at least this many amino acids in predicted proteins
  52. alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no
  53. always_complete=0 #extra steps to force start and stop codons, 1 = yes, 0 = no
  54. map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no
  55. keep_preds=0 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1)
  56. split_hit=10000 #length for the splitting of hits (expected max intron size for evidence alignments)
  57. single_exon=0 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no
  58. single_length=250 #min length required for single exon ESTs if 'single_exon is enabled'
  59. correct_est_fusion=0 #limits use of ESTs in annotation to avoid fusion genes
  60. tries=2 #number of times to try a contig if there is a failure for some reason
  61. clean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = no
  62. clean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no
  63. TMP=/ws/hhu/soybean_data/braker/chi_maker/tmp/ #specify a directory other than the system default temporary directory for temporary files

Filter the redundant sequences

First extract thpse proteins with AA length longer than 50AA:

  1. perl remove_small.pl 49 chi_pan.all.maker.augustus_masked.proteins.fasta > chi_pan.all.maker.proteins_50aa.fasta

script:

  1. #!/usr/bin/perl
  2. #remove small sequences from fasta file
  3. use strict;
  4. use warnings;
  5. my $minlen = shift or die "Error: `minlen` parameter not provided\n";
  6. {
  7. local $/=">";
  8. while(<>) {
  9. chomp;
  10. next unless /\w/;
  11. s/>$//gs;
  12. my @chunk = split /\n/;
  13. my $header = shift @chunk;
  14. my $seqlen = length join "", @chunk;
  15. print ">$_" if($seqlen >= $minlen);
  16. }
  17. local $/="\n";
  18. }

Using cd-hits with 90% cut-off:

  1. cd-hit-2d -i Gallus_gallus.GRCg6a.pep.all.fa -i2 chi_pan.all.maker.proteins_50aa.fasta -o chi_pan.all.maker.proteins_50aa_rd_0.9.fa -c 0.9

Based on the filtered set of protein file, removed those filtred genes from the gff file.

rename the annotation gff file with newID

  1. # create naming table (there are additional options for naming beyond defaults)
  2. maker_map_ids --prefix PanGallus_Gene --justify 5 chi_pan.all.gff > rename.map
  3. # replace names in GFF files
  4. map_gff_ids rename.map chi_pan.all.gff
  5. grep "maker" chi_pan.all.gff >chi.pan.gff
  6. # replace names in FASTA headers
  7. map_fasta_ids rename.map chi_pan.all.maker.proteins_50aa_rd_0.9.fa
添加新批注
在作者公开此批注前,只有你和作者可见。
回复批注