@Rumia
2017-03-10T08:57:39.000000Z
A FASTQ file normally uses four lines per sequence.
A FASTQ file containing a single sequence might look like this:
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
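The four-line layout can be read with a few lines of Python. This is a minimal sketch, assuming every record occupies exactly four lines (no wrapped sequences); `FastqRecord` and `read_fastq` are names made up for this illustration and do not belong to any tool discussed here.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # header line without the leading '@'
    sequence: str  # bases
    quality: str   # one quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream that uses four lines per record."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        header = header.rstrip("\n")
        if not header:          # tolerate blank lines between records
            continue
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield FastqRecord(header[1:], sequence, quality)


# The single-record example from the text above.
EXAMPLE = (
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)

records = list(read_fastq(io.StringIO(EXAMPLE)))
print(records[0].name, len(records[0].sequence))
```

Separating the name, sequence and quality fields this way mirrors the three streams that the compressors below handle independently.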
The character '!' represents the lowest quality while '~' is the highest. Here are the quality value characters in left-to-right increasing order of quality (ASCII):
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
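To make the ordering concrete, the sketch below converts quality characters to numeric scores. It assumes the Phred+33 (Sanger) encoding, where the score is the ASCII code minus 33 and the error probability is 10^(-Q/10); some older Illumina files use an offset of 64 instead, so the offset is a parameter. The function names are hypothetical.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert a quality string to numeric scores (assumes Phred+offset)."""
    scores = [ord(ch) - offset for ch in quality]
    if any(score < 0 for score in scores):
        raise ValueError("character below the assumed offset; wrong encoding?")
    return scores


def error_probability(score: int) -> float:
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


# '!' is the lowest printable quality character and '~' the highest.
lowest, highest = phred_scores("!~")
print(lowest, highest)
print(error_probability(20))
```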
Example:
@HISEQ:210:H9L41ADXX:1:1215:6810:37655
CTCCAGCACCAAAAAAGAAAAAAAAAAAAAGAAAAGAAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAG
+
?@?BDEADFDFAFGEGAEGG9FFE@@0;55',6>>3;5;;;A>A<55<?B><??C?<?AABA322(3<@??BBBAB?BC?@8CB@AC<(2>ABBC22898
fqz_comp v4.6. Author James Bonfield, 2011-2013
The range coder is derived from Eugene Shelwien.

To compress:
    fqz_comp [options] [input_file [output_file]]

    -Q <num>   Perform lossy compression with all quality values
               being within 'num' distance from their original value.
               Default is lossless, i.e. "-q 0"
    -s <level> Sequence compression level. 1-9 [Def. 3]
               Specifying '+' on the end (eg -s5+) will use
               models of multiple sizes for improved compression.
    -b         Use both strands in sequence hash table.
    -e         Extra seq compression: 16-bit vs 8-bit counters.
    -q <level> Quality compression level. 1-3 [Def. 2]
    -n <level> Name compression level. 1-2 [Def. 2]
    -P         Disable multi-threading
    -X         Disable generation/verification of check sums
    -S         SOLiD format

To decompress:
    fqz_comp -d < foo.fqz > foo.fastq
    or fqz_comp -d foo.fqz foo.fastq
The -s, -q and -n arguments presumably influence the compression ratio, but the help text does not say by how much. After running a series of tests, the results were plotted and are shown in the figures below:
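A test series of this kind could be scripted roughly as follows. This is only a sketch, not the script used for the figures: it assumes `fqz_comp` is on the PATH, uses only the `-s`, `-q` and `-n` ranges listed in the help text, and the helper names, the input file name and the output file naming scheme are invented for illustration.

```python
import itertools
import os
import subprocess
from typing import List, Tuple


def build_commands(input_path: str, output_dir: str) -> List[Tuple[List[str], str]]:
    """Return (argv, output_path) for every -s/-q/-n combination in the help text."""
    commands = []
    for s, q, n in itertools.product(range(1, 10), range(1, 4), range(1, 3)):
        out = os.path.join(output_dir, f"s{s}_q{q}_n{n}.fqz")
        argv = ["fqz_comp", f"-s{s}", f"-q{q}", f"-n{n}", input_path, out]
        commands.append((argv, out))
    return commands


def run_sweep(input_path: str, output_dir: str) -> List[Tuple[str, float]]:
    """Run every combination and report compressed size / original size."""
    original_size = os.path.getsize(input_path)
    results = []
    for argv, out in build_commands(input_path, output_dir):
        subprocess.run(argv, check=True)  # requires fqz_comp to be installed
        results.append((" ".join(argv[1:4]), os.path.getsize(out) / original_size))
    return results


# Only build the command list here; run_sweep() needs the real tool and data.
commands = build_commands("reads.fastq", "out")
print(len(commands), commands[0][0])
```

Calling `run_sweep("ENCFF002EXT.fastq", "out")` on a machine with fqz_comp installed would produce one ratio per parameter combination, which can then be plotted.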
Result of a simple test:
[root@localhost fastq]# time fqz_comp -s6 -n1 -q3 -e -b ENCFF002EXT.fastq out.fqz
Names 71134666 -> 3367344 (0.047)
Bases 117600900 -> 12974412 (0.110)
Quals 117600900 -> 28828099 (0.245)

real 0m9.370s
user 0m16.791s
sys 0m0.478s

[root@localhost fastq]# time fqz_comp -d out.fqz out.revert.fq

real 0m12.860s
user 0m20.844s
sys 0m0.502s
ENCFF002EXT.fastq is 299 MB and the compressed file out.fqz is about 44 MB, so the compression ratio is roughly 1/7. Compression takes about 9.37 s, which gives a compression speed of about 32 MB/s. Decompression is slower: it takes about 12.86 s, or roughly 23 MB/s.
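The arithmetic behind these figures can be reproduced from the numbers printed by fqz_comp. The sketch below takes the per-stream byte counts and wall-clock times from the test output and the 299 MB file size quoted in the text; the variable names are made up for this example.

```python
# (original bytes, compressed bytes) per stream, copied from the fqz_comp output.
streams = {
    "Names": (71134666, 3367344),
    "Bases": (117600900, 12974412),
    "Quals": (117600900, 28828099),
}

# Per-stream ratio, the value fqz_comp prints in parentheses.
stream_ratios = {name: comp / orig for name, (orig, comp) in streams.items()}

# Sum of the three compressed streams, in bytes.
compressed_total = sum(comp for _, comp in streams.values())

# Fraction of the compressed output taken up by quality values.
quality_share = streams["Quals"][1] / compressed_total

file_size_mb = 299.0         # size of ENCFF002EXT.fastq as quoted in the text
compress_seconds = 9.370     # 'real' time of the compression run
decompress_seconds = 12.860  # 'real' time of the decompression run

compress_speed = file_size_mb / compress_seconds
decompress_speed = file_size_mb / decompress_seconds

for name, ratio in stream_ratios.items():
    print(f"{name}: {ratio:.3f}")
print(f"compressed streams: {compressed_total / 2**20:.1f} MiB")
print(f"quality share of output: {quality_share:.0%}")
print(f"compress: {compress_speed:.1f} MB/s, decompress: {decompress_speed:.1f} MB/s")
```

The quality stream accounts for well over half of the compressed output, which is why lossy quality compression is attractive for tools in this area.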
Like fqzcomp, scalce splits a .fastq file into three streams (names, sequences and qualities) and processes each one separately. Unlike fqzcomp, scalce focuses on lossy compression of the qualities.
Result of a simple test:
[root@localhost fastq]# scalce ENCFF002EXT.fastq -o out
SCALCE 2.8 [pthreads; available cores=5]
Buffer size: 131072K, bucket storage size: 4194304K
Allocating pool... OK!
Preprocessing FASTQ files ...
Paired end #1, quality offset: 33
read length: 100
OK
Using 4 threads...
Done with file ENCFF002EXT.fastq, 1176009 reads found
Created temp file 0
Merging results ... 1
Generating 1 scalce file(s) from temp #0
compression: pigz
shrink factor for 0: 1
Written 234255 cores
Read bit size: paired end 0 = 0.77
Quality bit size: paired end 0 = 2.32
Cleaning ...
Statistics:
Total number of reads: 1176009
Read length: first end 100
Unbucketed reads count: 35, bucketed percentage 100.00
Lossy percentage: 0
Time elapsed: 00:00:12
Compression time: 00:00:04
Original size: 298.76M, new size: 52.63M, compression factor: 5.68
Done!
scalce compresses the original file into three files, out_1.scalcen, out_1.scalceq and out_1.scalcer (names, qualities and reads respectively):
-rw-------. 1 root root 9.3M Mar 10 18:38 out_1.scalcen
-rw-------. 1 root root 11M Mar 10 18:38 out_1.scalcer
-rw-r--r--. 1 root root 33M Mar 10 18:38 out_1.scalceq
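These file sizes can be cross-checked against the scalce log. The sketch below assumes that "Read bit size" and "Quality bit size" are average bits per base and per quality value (an interpretation, not something the log states) and that the sizes in the listing are in MiB; all numbers are copied from the output above.

```python
total_reads = 1176009  # "Total number of reads"
read_length = 100      # "Read length: first end 100"
symbols = total_reads * read_length  # bases, and equally many quality values


def expected_mib(bits_per_symbol: float) -> float:
    """Estimated stream size in MiB for a given average bit cost per symbol."""
    return symbols * bits_per_symbol / 8 / 2**20


reads_mib = expected_mib(0.77)    # "Read bit size: paired end 0 = 0.77"
quals_mib = expected_mib(2.32)    # "Quality bit size: paired end 0 = 2.32"
compression_factor = 298.76 / 52.63  # "Original size" / "new size"

print(f"expected .scalcer size: {reads_mib:.1f} MiB")
print(f"expected .scalceq size: {quals_mib:.1f} MiB")
print(f"compression factor: {compression_factor:.2f}")
```

Under these assumptions the estimates land close to the 11M and 33M shown for out_1.scalcer and out_1.scalceq, and the recomputed factor matches the 5.68 that scalce reports, slightly lower than the roughly 7x obtained with fqz_comp on the same file.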