[关闭]
@lakesea 2020-10-29T07:46:33.000000Z 字数 2609 阅读 1603

How to extract the longest transcript/protein from gff3

how_to


Use this script:

  1. ###get_the_longest_transcripts.py
  2. #!/usr/bin/env python3
  3. # v0.1:fixing the problem that the feature of exon/mRNA is before gene
  4. # v0.2: Use CDS to decide which transcript is primary
  5. # v0.3: Fix the bug that if the CDS_parent include '\r' , the lst_len will be 0
  6. # GFF must have CDS feature
  7. # GFF must have ID and Parent in column 9
  8. import re
  9. import fileinput
  10. from collections import defaultdict
  11. gene_dict = defaultdict(list)
  12. tx_pos_dict = defaultdict(list)
  13. CDS_dict = defaultdict(list)
  14. for line in fileinput.input():
  15. if line.startswith("#"):
  16. continue
  17. content = line.split("\t")
  18. if len(content) <= 8:
  19. continue
  20. #print(content)
  21. '''
  22. if content[2] == 'gene':
  23. gene_id = re.search(r'ID=(.*?)[;\n]', content[8]).group(1)
  24. #print(gene_id)
  25. if gene_id not in gene_dict:
  26. gene_dict[gene_id] = []
  27. '''
  28. if content[2] == "transcript" or content[2] == "mRNA":
  29. # use regual expression to extract the gene ID
  30. # match the pattern ID=xxxxx; or ID=xxxxx
  31. tx_id = re.search(r'ID=(.*?)[;\n]',content[8]).group(1)
  32. tx_parent = re.search(r'Parent=(.*?)[;\n]',content[8]).group(1)
  33. tx_parent = tx_parent.strip() # remove the 'r' and '\n'
  34. # if the parent of transcript is not in the gene_dict, create it rather than append
  35. if tx_parent in gene_dict:
  36. gene_dict[tx_parent].append(tx_id)
  37. else:
  38. gene_dict[tx_parent] = [tx_id]
  39. tx_pos_dict[tx_id] = [content[0],content[3], content[4], content[6] ]
  40. # GFF must have CDS feature
  41. if content[2] == 'CDS':
  42. width = int(content[4]) - int(content[3])
  43. CDS_parent = re.search(r'Parent=(.*?)[;\n]',content[8]).group(1)
  44. CDS_parent = CDS_parent.strip() # strip the '\r' and '\n'
  45. CDS_dict[CDS_parent].append(width)
  46. for gene, txs in gene_dict.items():
  47. tmp = 0
  48. for tx in txs:
  49. tx_len = sum(CDS_dict[tx])
  50. if tx_len > tmp:
  51. lst_tx = tx
  52. tmp = tx_len
  53. tx_chrom = tx_pos_dict[lst_tx][0]
  54. tx_start = tx_pos_dict[lst_tx][1]
  55. tx_end = tx_pos_dict[lst_tx][2]
  56. tx_strand = tx_pos_dict[lst_tx][3]
  57. print("{gene}\t{tx}\t{chrom}\t{start}\t{end}\t{strand}".format(gene=gene,tx=lst_tx,chrom=tx_chrom,start=tx_start,end=tx_end, strand=tx_strand))

get the longest transcript ID:

  1. zcat GWHACEE00000000.gff.gz|python3 /scratch/pawsey0149/hhu/train/GeneFamily20200803/script/get_longest_transcript.py|cut -f2

Transfer the protein id into original one:

  1. for i in `cat file.list`;do cut -f 19 ${i}.feature |grep -v ID > ${i}.rename.list;done
  2. for i in `cat file.list`;do awk 'NR==FNR{names[NR]=$0; next} /^>/{$1=">"names[++c]}1' ${i}.rename.list ${i}.Protein.faa > newname_${i}.Protein.fa;done
  3. for i in `cat file.list`;do ~/biosoft/seqtk/seqtk subseq newname_${i}.Protein.fa ${i}.longest_transcript.id.list >${i}.longest_pro.fa;done

blast to find the oil gene:

  1. blastp -db AT.oil.pro.fa -max_target_seqs 2 -outfmt "6 qseqid sseqid pident length qlen slen mismatch gapopen gaps qstart qend sstart send stitle evalue bitscore qcovs" -num_threads 24 -evalue 1e-10 -query Wm82.longest_pro.fa -out Wm82_vs_AToil.blastp.out
添加新批注
在作者公开此批注前,只有你和作者可见。
回复批注