0.2.1_tech_preview, change readme to english version
Gen Li committed Mar 25, 2017
1 parent dbc0e05, commit 5919e4d
Showing 4 changed files with 286 additions and 98 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,45 +1,41 @@ | ||
# GTX Compressor (直压上云技术预览版) | ||
# GTX Compressor (Technical Preview Version) | ||
|
||
Powered by GTXLab of Genetalks. | ||
|
||
0.2技术预览版本下载地址: https://github.com/Genetalks/gtz/archive/0.2_tech_preview.tar.gz | ||
Technical preview download URL: https://github.com/Genetalks/gtz/archive/0.2.1_tech_preview.tar.gz | ||
|
||
## 系统简介 | ||
[中文说明](https://github.com/Genetalks/gtz/blob/master/README_chs.md "Markdown"). | ||
|
||
GTX Compressor是Genetalks公司GTX Lab实验室开发的面向大型数据(数GB甚至数TB数据,尤其是生物信息数据)上云,而量身定制的复杂通用数据压缩打包系统,可以对任意基因测序数据以及数据目录进行高压缩率的快速打包,形成单个压缩数据文件,以方便存储档与远程传输、校验。区别于以往的压缩工具,GT Compressor系统着力于**高压缩率,高速率,方便的数据抽取**。 | ||
## System Overview | ||
|
||
GTX Compressor可以在AWS C4.8xlarge机器(或同配置服务器),**以超过114MB/s的速度,将接近200GB大小的33个质量数的FASTQ文件(NA12878_1.fastq),在13分钟内压缩到原大小的19%**,而对于X10等只有**7个质量数的FASTQ数据,其压缩率更可以达到5.5%**。 | ||
GTX Compressor is a FASTQ compressor that can also be used as a generic data compression system. It is developed by the GTX Lab of Genetalks and is designed for compressing large-scale data (several GB or even several TB, especially bioinformatics data) directly to the cloud. GTX Compressor can rapidly pack any gene sequencing files and directories with a high compression ratio into a single compressed data file, which simplifies data storage, remote transmission, and verification. Unlike earlier compression tools, GTX Compressor focuses on **high compression ratio, high speed, and convenient data extraction**. | ||
|
||
** GTX Compressor提供“直压上云”功能 **。考虑商业使用时,用户不仅需要将测序产生的海量数据存储于本地,更迫切地寻求将数据快速稳定传输至云端的能力。 GTX Compressor的数据压缩引擎允许用户直接将fastq文件压缩存储到亚马逊AWS平台或者阿里云OSS平台,并保持与本地压缩相同的压缩速度与压缩效率。普通100Mbits Intenet线路,可以在短短30分钟内稳定地将200GB Fastq文件的直压上云。 | ||
On an AWS R4.8xlarge machine (or a server of the same configuration), GTX Compressor compresses a FASTQ file with 33 distinct quality values (NA12878_1.fastq), approximately 200GB in size, to 19% of its original size in less than 13 minutes, at a speed of more than 256MB/s. For FASTQ data produced by X10, which has only **7 distinct quality values, the compressed size can reach 5.5% of the original.** | ||
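The speed quoted above follows from the file size and the elapsed time. The snippet below is only a back-of-the-envelope check, not part of gtz: the function name `throughput_mb_s` is hypothetical, and it assumes decimal units (1GB = 1000MB).

```shell
#!/bin/sh
# throughput_mb_s SIZE_GB MINUTES
# Hypothetical helper: average input throughput in MB/s,
# assuming 1 GB = 1000 MB.
throughput_mb_s() {
    awk -v size_gb="$1" -v minutes="$2" 'BEGIN {
        printf "%.1f\n", size_gb * 1000 / (minutes * 60)
    }'
}

throughput_mb_s 200 13   # prints 256.4
```

Compressing 200GB in 13 minutes therefore averages about 256MB/s under these assumptions.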
|
||
## 系统亮点 | ||
**GTX Compressor provides a "direct compression to the cloud" function.** For commercial use, users not only need to store the massive data generated by gene sequencing locally, but also need to transfer the data to the cloud quickly and reliably. GTX Compressor can compress FASTQ files and concurrently transfer the compressed data to the Amazon AWS S3 platform or the Ali cloud OSS platform, with the same compression speed and compression ratio as local compression. Over an ordinary 100Mbit Internet line, GTX Compressor can compress a 200GB FASTQ file directly to the cloud in just 30 minutes. | ||
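The upload time depends mainly on how small the compressed data is. The sketch below estimates it from the original size, the compressed fraction, and the line rate. It is plain arithmetic and does not call gtz: the function name `upload_minutes` is hypothetical, and it assumes decimal units (1GB = 1000MB, 8 bits per byte) and a fully utilised link.

```shell
#!/bin/sh
# upload_minutes SIZE_GB RATIO_PERCENT LINK_MBIT
# Hypothetical helper: minutes needed to send the compressed data,
# assuming the link is the only bottleneck and runs at full rate.
upload_minutes() {
    awk -v size_gb="$1" -v ratio="$2" -v mbit="$3" 'BEGIN {
        compressed_mb = size_gb * 1000 * ratio / 100   # size after compression
        mb_per_sec = mbit / 8                          # Mbit/s -> MB/s
        printf "%.1f\n", compressed_mb / mb_per_sec / 60
    }'
}

upload_minutes 200 5.5 100   # prints 14.7
upload_minutes 200 19 100    # prints 50.7
```

Under these assumptions a 100Mbit line carries about 22.5GB in 30 minutes, so the 30-minute figure fits a 200GB input that compresses to roughly 11% or less.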
|
||
该数据打包压缩系统的特点: | ||
## System highlights | ||
|
||
- **高压缩比:** 采用Context Model压缩技术,配合多种优化的预测模型,平衡系统并发度与内存资源消耗后,能达到极高的压缩率。对FASTQ文件,压缩率最高可达5.58%。 | ||
GTX Compressor system features: | ||
|
||
- **高性能:** GTX compressor充分发挥了CPU的并发性以及新型Haswell CPU体系结构与AVX2、BMI2等指令集的计算能力,使得在普通服务器上的压缩速度,最高能够以接近114MB/s的输入流量输入数据并压缩完毕。 | ||
- **High compression ratio:** The system uses Context Model compression technology with a variety of optimized prediction models, and balances system concurrency against memory consumption, thus achieving a high compression ratio. For FASTQ files, the compressed size can be as low as 5.58% of the original. | ||
|
||
- **高速直压上云:** GTX compressor支持直压上云和从云端直接解压下载功能。普通的20核服务器,通过百兆Intenet线路,可以在短短30分钟内稳定地将200GB Fastq文件的直压上云。 | ||
- **High performance:** GTX Compressor fully exploits CPU concurrency, the new Haswell CPU architecture, and the computing power of instruction sets such as AVX2 and BMI2, so it achieves high compression speed even on a common server, with a throughput of 114MB/s for the whole process of compression and transmission. | ||
|
||
- **High-speed direct compression to the cloud:** GTX Compressor supports direct compression to the cloud and direct decompression from the cloud. On a common 20-core server with a 100Mbit Internet line, GTX Compressor can compress a 200GB FASTQ file directly to the cloud in only 30 minutes. | ||
|
||
## System environment requirements | ||
- 64-bit Linux system (CentOS 6.5 or above, or Ubuntu 12.04 or above; 64-bit Ubuntu 14.04 or above is recommended) | ||
- A host with 4 or more cores and at least 8GB of memory (to achieve maximum concurrency, a host with 32 cores and 64GB of memory, or the same configuration as the AWS C4.8xlarge machine, is recommended) | ||
|
||
## 系统环境要求 | ||
|
||
- 64位 Linux 系统(CentOS 6.5以上或Ubuntu 12.04以上,推荐Ububtu 14.04及以上64位操作系统) | ||
## Installation Instructions | ||
GTX Compressor works straight after unpacking and does not depend on any other library on the system. | ||
The download package contains two tar.gz packages, one for Ubuntu and one for CentOS. Choose the package for your system and extract it; the gtz command is then available in the extracted gtz_0.1_ubuntu_tech_preview or gtz_0.1_centos_tech_preview directory. | ||
|
||
- 4核以上,最小8GB内存的主机系统(若要达到最大并发性,推荐32核 64GB内存,或与AWS C4.8xlarge机器相同配置) | ||
|
||
## 安装说明 | ||
本系统采用开包即用的打包原则,不依赖当前系统其他任何库。 | ||
|
||
下载包内包含ubuntu版本和centos版本的两个tar.gz的包。选择对应tar.gz的包,解压后,gtz命令就在当前解压的gtz_0.1_ubuntu_tech_preview目录或gtz_0.1_centos_tech_preview目录中,直接使用即可。 | ||
|
||
|
||
## 命令行说明 | ||
|
||
执行 ./gtz -h,输出命令行帮助说明。 | ||
## Command line instructions | ||
|
||
Run ./gtz -h to print the command line help. | ||
|
||
``` | ||
USAGE: | ||
|
@@ -48,140 +44,145 @@ USAGE: | |
<string>] [-s <string>] [-c] [-n <string>] [-l <string>] [-i] | ||
[-d] [--delete] [-a] [-g <number>] [-o <string>] [--] [--version] | ||
[-h] <file names> ... | ||
``` | ||
|
||
通用选项说明: | ||
|
||
- -h:输出以上命令行帮助信息 | ||
- \-\-version:输出gt_compress程序的版本号 | ||
- \-\-access-key-id : 指定云平台用户ID | ||
- \-\-secret-access-key: 指定云平台用户密钥 | ||
- \-\-endpoint : 指定阿里云OSS平台的访问域名和数据中心 | ||
General options: | ||
- -h: Outputs the above command line help information | ||
- \-\-version: Outputs the version number of the gt_compress program | ||
- \-\-access-key-id: Specifies the cloud platform user ID | ||
- \-\-secret-access-key: Specifies the cloud platform user key | ||
- \-\-endpoint: Specifies the access domain name and data center of the Ali cloud OSS platform | ||
|
||
压缩选项说明: | ||
Compression options: | ||
- -f, \-\-force: Forcibly deletes the object within the container | ||
- \-\-timeout: Specifies the upload timeout threshold | ||
- -i: Adds an index during compression, which is mainly used to quickly retrieve a section of a fastq file from the compressed file; this option may reduce the compression speed | ||
- -a: Append mode; the content compressed in this run is appended to the existing compressed file | ||
- -g: Speeds up compression by grouping. The more groups, the more CPU and memory are needed and the faster the compression. If this value is not specified, the program automatically selects the optimal value based on CPU and memory. | ||
- -o: Specifies the compressed file name. When not specified, the default is out.gtz | ||
- file_name: The file or directory to be compressed. If not specified, data is read from standard input. | ||
|
||
- -f, \-\-force : 强制删除容器内的object | ||
- \-\-timeout : 指定上传超时阀值 | ||
- -i:压缩时增加索引,主要用于在压缩文件中快速检索fastq文件的某段内容,该选项会降低压缩速度 | ||
- -a:追加模式,本次压缩的内容会追加到压缩文件中 | ||
- -g:分组加速压缩,分组越多,需要的cpu和内存越多,压缩速度越快。不指定该值时,程序会根据cpu和内存自动选择最优值 | ||
- -o:指定压缩文件名,不指定时,默认为out.gtz | ||
- file_name:需要压缩的文件或目录, 若不指定,则从标准输入中读入数据 | ||
Decompression options: | ||
- -d, \-\-decode: Decompression mode, required | ||
- \-\-list: Lists all compressed file names in the archive; used together with the -d parameter | ||
- -e, \-\-extract: Decompresses and extracts the specified target files (file names separated by ":") from the compressed file. Must be used together with the -d parameter | ||
- -f, \-\-force: Forcibly deletes the object within the container | ||
- \-\-timeout: Specifies the download timeout value | ||
- -c, \-\-stdout: Outputs to the console (standard output) | ||
- file_name: The file to be decompressed | ||
|
||
### Examples: | ||
|
||
解压选项说明: | ||
Configure environment variables: | ||
|
||
- -d,\-\-decode : 解压模式 | ||
--list : 列出压缩包中所有的压缩文件名,与-d参数一起使用 | ||
-e, --extract : 解压压缩包中指定的压缩文件,文件名之间用冒号:分割,与-d参数一起使用 | ||
- -f, \-\-force : 强制删除容器内的object | ||
- \-\-timeout : 指定下载超时阀值 | ||
- -c,\-\-stdout : 解压数据输出至标准输出 | ||
- -o:指定输出文件名,使用-n或-l时需要指定该选项,否则不需要该选项 | ||
- file_name:需要压缩的文件, 若不指定,则从标准输入中读入数据 | ||
export access_key_id=xxxxxx | ||
|
||
export secret_access_key=xxxxxx | ||
|
||
### 示例: | ||
export endpoint=xxxxxx (Only set when transferring to OSS) | ||
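Before starting a long compression run, it can help to confirm that the variables above are set for the chosen target. The following is a minimal sketch, not part of gtz: the function name `check_gtz_env` is hypothetical, and it only inspects the three variables named above based on the `oss://` or `s3://` prefix of the output path.

```shell
#!/bin/sh
# check_gtz_env TARGET
# Hypothetical pre-flight check: returns 0 when the environment variables
# needed for TARGET are set, 1 otherwise. Local paths need none of them.
check_gtz_env() {
    target="$1"
    case "$target" in
        oss://*|s3://*)
            if [ -z "${access_key_id:-}" ] || [ -z "${secret_access_key:-}" ]; then
                echo "access_key_id and secret_access_key must be set for $target" >&2
                return 1
            fi
            ;;
    esac
    case "$target" in
        oss://*)
            # endpoint is only required for Ali cloud OSS targets
            if [ -z "${endpoint:-}" ]; then
                echo "endpoint must be set for OSS targets" >&2
                return 1
            fi
            ;;
    esac
    return 0
}

# Usage (assumes ./gtz exists in the current directory):
#   check_gtz_env oss://gtz/out.gtz && ./gtz -o oss://gtz/out.gtz source.fastq
```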
|
||
配置环境变量: | ||
### Compression examples | ||
|
||
export access_key_id=xxxxxx | ||
Direct compression to Ali OSS: | ||
|
||
export secret_access_key=xxxxxx | ||
./gtz -o oss://gtz/out.gtz source.fastq | ||
|
||
or | ||
|
||
export endpoint=xxxxxx (该环境变量只有上传至OSS时才需设置) | ||
zcat source.fastq.gz | ./gtz -o oss://gt-compress/out.gtz | ||
|
||
### 压缩举例 | ||
Direct compression to AWS S3 | ||
|
||
直压阿里OSS: | ||
./gtz -o s3://gtz/out.gtz source.fastq | ||
|
||
./gtz -o oss://gtz/out.gtz source.fastq | ||
or: | ||
|
||
或者 | ||
# zcat 通过管道将fastq的数据送入gtz加压,zcat解压出来的fastq数据流在 out.gtz 中将以stdin这个文件名存在 | ||
zcat source.fastq.gz | ./gtz -o oss://gt-compress/out.gtz | ||
zcat source.fastq.gz | ./gtz -o s3://gt-compress/out.gtz | ||
|
||
直压AWS S3: | ||
Direct compression locally | ||
|
||
./gtz -o s3://gtz/out.gtz source.fastq | ||
./gtz -o gtz/out.gtz source.fastq | ||
|
||
或者: | ||
# zcat 通过管道将fastq的数据送入gtz加压,zcat解压出来的fastq数据流在 out.gtz 中将以stdin这个文件名存在 | ||
zcat source.fastq.gz | ./gtz -o s3://gt-compress/out.gtz | ||
or: | ||
|
||
压缩到本地 | ||
zcat source.fastq.gz | ./gtz -o gtz/out.gtz | ||
|
||
./gtz -o gtz/out.gtz source.fastq | ||
|
||
或者 | ||
# zcat 通过管道将fastq的数据送入gtz加压,zcat解压出来的fastq数据流在 out.gtz 中将以stdin这个文件名存在 | ||
zcat source.fastq.gz | ./gtz -o gtz/out.gtz | ||
### Add files to the compressed package | ||
|
||
### 追加文件进压缩包 | ||
./gtz -a -o oss://gtz/out.gtz /A/source2.fastq # -a selects append mode | ||
|
||
./gtz -a -o oss://gtz/out.gtz /A/source2.fastq # -a 指当前是追加模式 | ||
./gtz -a -o s3://gtz/out.gtz /A/source2.fastq # -a selects append mode | ||
|
||
./gtz -a -o s3://gtz/out.gtz /A/source2.fastq # -a 指当前是追加模式 | ||
./gtz -a -o gtz/out.gtz /A/source2.fastq # -a selects append mode | ||
|
||
./gtz -a -o gtz/out.gtz /A/source2.fastq # -a 指当前是追加模式 | ||
|
||
### 查看压缩包里包含的文件 | ||
### View the files contained in the compressed gtz file | ||
|
||
./gtz_0.2.0_ubuntu_release/gtz --list -d oss://gtz/out.gtz | ||
|
||
./gtz_0.2.0_ubuntu_release/gtz --list -d s3://gtz/out.gtz | ||
|
||
./gtz_0.2.0_ubuntu_release/gtz --list -d gtz/out.gtz | ||
|
||
### 解压举例 | ||
|
||
从阿里 OSS 解压: | ||
### Decompression examples | ||
|
||
|
||
Direct decompression from Ali OSS | ||
|
||
./gtz -d oss://gtz/out.gtz | ||
|
||
或者 单独抽取几个文件: | ||
# -e 代表抽取文件,后面要抽取的文件名称间,用 ":" 隔开 | ||
Decompress several files separately: | ||
|
||
# -e selects the files to extract, separated by ":" | ||
./gtz -e source.fastq:/A/source2.fastq -d oss://gtz/out.gtz | ||
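When the list of files to extract is built by a script, the names have to be joined with ":" before they are passed to -e. The helper below is a small sketch for that step: the function name `join_extract_list` is hypothetical, and rejecting names that already contain ":" is an assumption made here because such a name could not be told apart from two separate entries.

```shell
#!/bin/sh
# join_extract_list NAME...
# Hypothetical helper: joins file names with ":" for use with gtz -e.
join_extract_list() {
    list=""
    for name in "$@"; do
        case "$name" in
            *:*)
                echo "file name contains ':': $name" >&2
                return 1
                ;;
        esac
        if [ -z "$list" ]; then
            list="$name"
        else
            list="$list:$name"
        fi
    done
    printf '%s\n' "$list"
}

join_extract_list source.fastq /A/source2.fastq   # prints source.fastq:/A/source2.fastq

# Usage (assumes ./gtz exists in the current directory):
#   ./gtz -e "$(join_extract_list source.fastq /A/source2.fastq)" -d oss://gtz/out.gtz
```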
|
||
或者某个文件到管道: | ||
# -c 代表输出到console, -e 代表抽取其中的某个文件 | ||
Decompress a target file to a pipe: | ||
|
||
# -c writes the output to the console (standard output); -e selects the file to extract. | ||
./gtz -c -e source.fastq -d oss://gtz/out.gtz > myfile.txt | ||
或者 | ||
|
||
or | ||
|
||
./gtz -c -e source.fastq -d oss://gtz/out.gtz | gzip -c > source.gz | ||
|
||
从AWS S3 解压: | ||
|
||
Direct decompression from AWS S3 | ||
|
||
./gtz -d s3://gtz/out.gtz | ||
|
||
或者 单独抽取几个文件: | ||
# -e 代表抽取文件,后面要抽取的文件名称间,用 ":" 隔开 | ||
Decompress several files separately: | ||
|
||
# -e selects the files to extract, separated by ":" | ||
./gtz -e source.fastq:/A/source2.fastq -d s3://gtz/out.gtz | ||
|
||
或者某个文件到管道: | ||
# -c 代表输出到console, -e 代表抽取其中的某个文件 | ||
Decompress a target file to a pipe: | ||
|
||
# -c writes the output to the console (standard output); -e selects the file to extract. | ||
./gtz -c -e source.fastq -d s3://gtz/out.gtz > myfile.txt | ||
或者 | ||
or | ||
./gtz -c -e source.fastq -d s3://gtz/out.gtz | gzip -c > source.gz | ||
|
||
从本地文件: | ||
Direct decompression locally | ||
|
||
./gtz -d ./gtz/out.gtz | ||
|
||
或者 单独抽取几个文件: | ||
# -e 代表抽取文件,后面要抽取的文件名称间,用 ":" 隔开 | ||
Decompress several files separately: | ||
|
||
# -e selects the files to extract, separated by ":" | ||
./gtz -e source.fastq:/A/source2.fastq -d gtz/out.gtz | ||
|
||
或者某个文件到管道: | ||
# -c 代表输出到console, -e 代表抽取其中的某个文件 | ||
Decompress a target file to a pipe: | ||
|
||
# -c writes the output to the console (standard output); -e selects the file to extract. | ||
./gtz -c -e source.fastq -d gtz/out.gtz > myfile.txt | ||
或者 | ||
or | ||
./gtz -c -e source.fastq -d gtz/out.gtz | gzip -c > myfastq.gz | ||
|
||
|
||
## Contact us | ||
|
||
If you have any questions, feel free to contact: [email protected], or open an issue on GitHub. | ||
|
||
## 联系我们 | ||
|
||
使用中有任何问题请联系: [email protected] |