That is README on delite process which will help with use 'sra_delite.sh' The delite process is three stage process : 1) unpacking original KAR archive, which always includes downloading it from remote repository 2) editing resulting database, which could include rename columns and change metadata 3) packing modified KAR archive with/without reduced data, testing resulting KAR archive with VDB-DIFF program It is should note here, that user could perform testing of resulting KAR archive separately from stage #3/ Contents: I. Script requirements, environment and configuring. II. Script command line. III. Script configuration file IV. Unpacking original KAR archive V. Editing resulting database V|. Exporting data V|-|. Testing data VII. Status VIII. Physical requirements (important, read it) IX. Error codes X. Limitations: what to expect and what to not XI. Building, installing, and running from scratch XII. Building, installing, and running from docker image I. Script requirements, environment and configuring. ============================================================================= First step in script execution is parsing command line arguments. Some of arguments will be interpreted imediately, for configuration. Other arguments will be treated after environment will be set and script will start working. Generally, script will have large comment sections, which will describe which part of processing it belongs. You may navigate throug these section by searching line "###<<>>###" Script is self-configurable. That means it keeps all necessary information for executing. User could change internal settings in two ways: provide configuration file in standard place, so script will pick it up, or user can use '--config' parameter in command script to set appropriate config file. The format of config file will be described later Script presumes that it is starts from directory where following standard VDB utilities are located : kar+ kar+meta vdb-lock vdb-unlock vdb-validate prefetch vdb-diff vdb-dump make-read-filter If one of these utilities does not exists, or permissions for execution for that utility are missed, script will exit with error message. You may alter location of VDB utilities by exporting DELITE_BIN_DIR environment variable which points to directory with utilities. DELITE_BIN_DIR variable has higher priority than local utilities, however there will be a message from script that it is using ateternate bin directory. User also can set that variable in configuration file. Some utilities from that list may create temporary files, usually they do it in directory /tmp, or at location defined by environment variable TEMPDIR. User can change that behaviour by setting value in configuration file: USE_OWN_TEMPDIR=1 in that case, script will ask utilities to put temporary data at the same directory where delite process is performed. Script will require two mandatory parameters path to source KAR archive, which could be accession, and path to directory, which will be used as working directory. Script will/may create different directories in working dir. II. Script command line ============================================================================= The script command line: sra_delite.sh action [ options ] or sra_delite.sh -h | --help Action is a word which defines a procedure which script will follow. There is list of possible actions : import - script will download and/or unpack archive to working directory delite - script will perform DELITE on database content export - script will create 'delited' KAR archive test - script will test 'delited' KAR archive status - script will report some status, or whatever. Options could be different for each action, many of them will be discussed later. There is a list of options. -h | --help - prints help message. --accession - accession String, for 'import' action only, there should be defined one of '--accession' or '--archive' --archive - path to SRA archive String, for 'import' action only, there should be defined one of '--accession' or '--archive' --target - path to directory, where script will put it's output. String, mandatory. --config - path to existing configuration file. String, optional. --schema - path to directory with schemas to use for delite String, mandatory for 'delite' action only. --force - flag to force process, does not matter what --skiptest - flag to skip using vdb-diff to test resulting archive III. Script configuration file ============================================================================= Script configuration file is a simple text file. Script can recognize following entries in configuration file : commentaries, empty strings, translations, excludes and environment variable definitions. There is example of configuration file : ### Standard configuration file. ### '#'# character in beginning of line is treated as a commentary ### Schema traslations ### Syntax: translate SCHEMA_NAME OLD_VER NEW_VER translate NCBI:SRA:GenericFastq:consensus_nanopore 1.0 2.0 translate NCBI:SRA:GenericFastq:sequence 1.0 2.0 translate NCBI:SRA:GenericFastq:sequence_log_odds 1.0 2 translate NCBI:SRA:GenericFastq:sequence_nanopore 1.0 2.0 translate NCBI:SRA:GenericFastq:sequence_no_name 1.0 2.0 ### Columns to drop ### Syntax: exclude COLUMN_NAME exclude QUALITY exclude QUALITY2 ### Rejected platforms ### Syntax: reject PLATFORM_NAME ["optional platform description"] ## reject SRA_PLATFORM_454 "454 architecture run" reject SRA_PLATFORM_ABSOLID "colorspace run" ### Columns to skip during vdb-diff testing ### Syntax: diff-exclude COLUMN_NAME diff-exclude CLIPPED_QUALITY diff-exclude SAM_QUALITY ### Environment definition section. # DELITE_BIN_DIR=/place/with/precompiled/files # USE_OWN_TEMPDIR=1 Each line of configuration file is a separate entty. Script does not support multiline entries. Each line of Configuration file started with character '#' will be treated as a commentary Empty line is a line which may contain 'space', 'tab', and/or 'newline' symbols. Because script is using internal bas 'read' command, it will automatically trim these character. Line started with word 'translate' will be treated as translation definition. It should contain exactly four words, and it's format is: Line started with word 'exclude' will be treated as definition of column to drop during delite process. It should contain exactly two words, and it's format is: Line started with word 'reject' will be treated as definition of architecture which should not be delited. For now it is only colorspace platform OBSOLID, That line should contain two or three parameters, and format is: ["commentary which will be shown in messages"] Line started with word 'diff-exclude" will be treated as a name of column which will be excluded from testing by 'vdb-diff' command. It should contain exactly two words, and it's format is: Line, which consist from single word and contains symbol '=' inside will be treated as environment variable definition. It's format is: = NOTE, SPACES ARE NOT ALLOWED IN ENVIRONMENT VARIABLE DEFINITION If user does not like to keep config files, he may edit script directly, the configuration part of script is exactly the same as configuration file, and it is located in bash function 'print_config_to_stdout()' User can provide configuration file in two ways: with argument '--config CONFIG', or put default config file 'sra_delite.kfg' in the directory with script. At the first, script will try to load user defined config, next it will try to load default config, and if there is none of such it will use hardcoded parameters. IV. Unpacking original KAR archive ============================================================================= Action 'import' is responsible for unpacking original KAR archive. That action requires at least two parameters, and it's syntax is following: sra_delite.sh import [ --force ] [ --config CONFIG ] --accession ACCESSION --target TARGET or sra_delite.sh import [ --force ] [ --config CONFIG ] --archive PATH --target TARGET The flag --force is optional. If TARGET directory exists, script will reject to work unless that flag is provided. In that case the old TARGET directory and all it's content will be destroyed. The ACCESSION parameter is accession name on existing KAR archive, which will be resolved, and downloaded. The PAT parameter is path to existing downloaded KAR archive The TARGET parameter is a reference to directory, which will be created by script, and the content of SOURCE KAR archive will be unpacked into it's subdirectory 'orig', so full path of that objec will be TARGET/orig The downloading process is performed by 'prefetch' utility, which will resolve and download local copy of KAR archive, as a references which archive depends on. The unpacking process is performed by kar+ utility. V. Editing resulting database ============================================================================= Action 'delite' is responsible for editing unpacked database. That action requires at least two parameters, and it's syntax is following: sra_delite.sh delite [ --config CONFIG ] --schema SCHEMA --target TARGET --schema parameter of that script shows path to directory where VDB schemas are stored. It is mandatory parameter. If You does not know where schemas are located, try to use vdb-config utility, or ask senpai. --target parameter is the same as target parameter for action 'import'. During execution that 'action' script will scan TARGET/orig directory and perform following: Rename all "QUALITY" columns to "ORIGINAL_QUALITY" For all tables and databases, which includes columns configured fof dropping, will be updated schema, if there exists translation for that schema version. V|. Exporting data ============================================================================= Action 'export' will export delited data into KAR archive and test result. There is syntax for that command: sra_delite.sh export [ --config CONFIG ] --target TARGET [--force] [--skiptest] By default that command will create KAR archive with name "TARGET/new.kar". That archive will have modified schemas and all columns, listed in configuration, will be dropped from archive. In regular mode, if there already exists KAR archive, script will report error and will exit. To force script work and overwrite files, user should use '--force' option The resulting "TARGET/new.kar" archive will be tested in two ways. First test will be done by 'vdb-validate' utility. That test will check structure of archive, and consistency of schemas, it could take several minutes. The second test will be done by 'vdb-dump' utility. That test will perform record-by-record data comparation for both original and new KAR archives. It is longest test and can take more than several minutes. User can skip testing by adding flag '--skiptest'. Even if user skipped test, he cound test resulting data later. V|-|. Testing data ============================================================================= Action 'test' will test exported delited KAR archive. There is syntax for that command: sra_delite.sh test [ --config CONFIG ] --target TARGET User should be ready that it could be lengtly procedure. It was introduced for the case if there will be two different queues for deliting and testing results. V|I. Status ============================================================================= Action 'status' will display status report on targeted directory. Syntax of that command is : sra_delite.sh status --target TARGET V|II. Physical requirements ============================================================================= The delite process is quite lightweight. All utilities used does not require more than 1GB(usually less than 200MB) of virtual memory, and uses not more than 50MB(usually less than 20MB) of resident memory. However, testing process could be both long and consume a lot of memory, our tests shows up to 10G resident memory for some runs. The delite process is very sencitive to disk space. In default case it will require 3X disk space than size of original SRA archive. User could estimate necessary disk space by this formula : REQUIRED_SIZE= 3 * ORIGINAL_KAR_SIZE IX. Error codes ============================================================================= If script failed and it will return some error code. There are three types of errors, which script could return: 100-122 - invalid setup or configuration 80-89 - invalid data to process or processed 60-73 - vdb/sra/kdv program exited with non-zero return code There is a list of all error codes which script can return. 100 - invalid orguments, missed parameters 101 - invalid ACCESSION format 102 - parameter configuration file defined, but file does not exist 103 - error in configuration file 104 - can not stat necessary executable or has not permissions to execute 105 - can not stat file or directory 106 - directory or file exist 107 - can not delete file or directory 108 - can not copy file or directory 109 - can not create file or directory 110 - can not rename file or directory 80 - unsupported architecture 81 - try to process rejected run 82 - try to process delited run 83 - run object has not QUALITY column 84 - can not find schema for table 85 - run check failed 86 - try to kar undelited run 87 - corrupted input data 88 - invalid output KAR archive 60 - prefetch failed 61 - 'ln -s' failed 62 - kar+ failed 63 - kar+meta failed 64 - make_read_filter failed 65 - vdb-unlock failed 66 - vdb-lock failed 67 - vdb-validate failed 68 - vdb-diff failed 69 - signal handled X. Limitations: what to expect and what to not ============================================================================= Not every run can be delited. There are many limitations and exceptions. First it should be noted here that run can be delited only if it has QUALITY column. That means, that unpacked run should have one of these two : RUNXXX/col/QUALITY /* run is a SEQUENCE table */ or RUNXXX/tbl/SEQUENCE/col/QUALITY /* run is a database */ After delite process downloaded kar archive with run and unpacked it, the first thing scritp does, is checking of existance of QUALITY directory. If script is unable to locate that directory, it will exit with return code 83. Next condition which run should comply with is: there should be substutute schema defined for table which contains quality column. User should remember that there are exists runs with two defined quality columns: SEQUENCE/col/QUALITY and CONSENSUS/col/QUALITY. If run contains two quality columns, there should exists substitute for schema for both. Colorspace runs can not be delited, and it is second check which script is doing before starting delite process. If run PLATFORM type is SRA_PLATFORM_ABSOLID then run is colorspace, and script will exit with return code 80. User can configure list of platforms to reject in configuration file. Next two check are necessary to recognize non ABSOLID colorspace runs : script will check if run contains column CS_NATIVE, or it has physical column CSREAD. If run do have, script will fall with return code 80. There are two old type Illumina tables which script will not process, because they aren't supported : NCBI:SRA:Illumina:tbl:q4 and NCBI:SRA:Illumina:tbl:q1:v2. In the case if that type of table was detected, script will exit with exit code 80. TRACE type archives can not be delited and script will exit with code 80. Delite process itself consists of several steps, during which different utility could be involved to modify metadata, schema, and/or build read filter. After each step, script will try to read QUALITY from freshly editet run, and if qualities are unreadable, script will exit with code 85 there will be After that, script will run standard checks as for: vdb-validate and vdb-diff to be sure that delite process finished correctly XI. Building, installing, and running from scratch ============================================================================= For working, delite process needed ten binaries, which are listed in first chapter of that README file. User could use precompiled binaries, which he could download from NCBI place , or build binaries by itself. To build binaries, user should download, install, configure, and build 'ngs', 'ncbi-vdb', and 'sra-tools' packages. In commands of shell it will look like : # mkdir DELITE # cd DELITE # git clone https://github.com/ncbi/ngc.git # git pull # cd ngc # ./configure # cd ngc-sdk # make # cd ../ngc-java # make # cd ../.. # git clone https://github.com/ncbi/ncb-vdb.git # git pull # cd ncbi-vdb # ./configure # ./make # cd .. # git clone https://github.com/ncbi/sra-tools.git # git pull # cd sra-tools # ./configure # make After that user may pick up new binaries and use them for delite process More details about building these packages user can read in NCBI documentation Delite process is performed by 'sra_delite.sh' script, which is a part of sra-tools package, and located in directory: sra-tools/tools/kar There is also example of 'sra_delite.kfg' file, which user can use if he need to modify delite process. Delite is needed valid schemas for correct processing. If user does not have access to schemas, he could use as source of schemas content of directory: ncbi-vdb/interfaces By default sra_delite.sh script will look for a binaries in the same directory where it is located. The simplest way to install all necessary files for delite is to create directory, where user could put sra_delite.sh, and builded binaries. In the cases, user had to use precompiled binaries already installed somewhere, it should copy sra_delite.kfg file to directory with script and add here a line: DELITE_BIN_DIR=/place/with/precompiled/files User may also want to copy directory ncbi-vdb/interfaces somewhere close to that directory. Now, when everything is ready for delite processing, user could start delite processing. In simple it will look like that: # some_locateion/sra_delite.sh import --accession SRR000001 --target target_directory # some_locateion/sra_delite.sh delite --schema path_to_schemas --target target_directory # some_locateion/sra_delite.sh export --target target_directory More details about delite process user can find in previous parts of that README file XII. Building, installing, and running from docker image on AWS cloud ============================================================================= To build and run delite from docker container user should first to start an instance on AWS. After that user should login to AWS instance and check that docker and git are present, by running commands 'docker' and 'git'. If 'docker' command is not found, user should install it. There is how to do that. sudo yum update -y sudo amazon-linux-extras install docker sudo service docker start sudo usermod -a -G docker ec2-user docker info ## if after that command will appear message "Can not connect" ## user should reboot host, and start docker again User could find good source of information on Docker here: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/docker-basics.html To install git, user should do that command: sudo yum install git Now everything is ready for delite process. First step user should download docker file, or copy it. The simplest way to download docker file is clone sra-toolkit package from GITHUB: git clone -b engineering https://github.com/ncbi/sra-tools.git User could copy docker file to his working directory, and delete sra-toolkit package after. cp sra-toolkit/build/docker/Dockerfile.delite . rm -rf sra-tools Now user should build docker image: docker build -f Dockerfile.delite -t sratoolkit:delite . Next, user should create output directory for delited runs and everything is ready for deliting: mkdir ~/output The delite commands for docker case will looks like that: docker run -v ~/output/:/output:rw --rm sratoolkit:delite sra_delite.sh import --accession SRR000001 --target /output/SRR000001 docker run -v ~/output/:/output:rw --rm sratoolkit:delite sra_delite.sh delite --target /output/SRR000001 --schema /etc/ncbi/schema docker run -v ~/output/:/output:rw --rm sratoolkit:delite sra_delite.sh export --target /output/SRR000001 docker run -v ~/output/:/output:rw --rm sratoolkit:delite vdb-dump -R 1 /output/SRR000001/new.kar To simplify delite process there is script "delite_docker.sh", which embeds all delite commands, which could safe time on typing. User can pass several accessions as a prameter to script and they will be delited sequentially: docker run -v ~/output/:/output:rw --rm sratoolkit:delite delite_docker.sh SRR000001 SRR000002 In the case of error, script will return error code of failed sra_delite.sh command. If user want to separate delite process and testing results, he can use "delite_docker.sh" command with "--skiptest" flag, and perform testing after with "delite_test_docker.sh" command: docker run -v ~/output/:/output:rw --rm sratoolkit:delite delite_docker.sh --skiptest SRR000001 SRR000002 docker run -v ~/output/:/output:rw --rm sratoolkit:delite delite_test_docker.sh SRR000001 SRR000002 If user want to delite already downloaded KAR archive, he can use "delite_docker_local.sh" command It's syntax is following: delite_test_docker.sh [--skiptest] path_to_kar_archive output_directory There is an example of that command: docker run -v ~/output/:/output:rw -v ~/input/:/input:rw --rm sratoolkit:delite delite_docker_local.sh /input/SRR000001.sra /output/OUTPUT Script will read data from /input/SRR00001.sra file and write them to /output/OUTPUT directory NOTE: if there are results of previous delite process for accession, script will exit with error message. User is responsible for deleting these before calling script. AWS users should remember that there could be problem with certificate. To avoid that problem, user should add following mount to docker command : -v /etc/pki:/etc/pki:ro -v /etc/ssl:/etc/ssl:ro For example: docker run -v ~/output/:/output:rw -v /etc/pki:/etc/pki:ro -v /etc/ssl:/etc/ssl:ro --rm sratoolkit:delite delite_docker.sh SRR000001 This command will allow to mount certificates from computer to docker image. ENJOY