Class picard.sam.PositionBasedDownsampleSamTest

20 tests, 0 failures, 0 ignored
Duration: 1.622s
Success rate: 100%

Tests

Test	Duration	Result
TestBuilder	0.083s	passed
TestInvalidArguments[0](-1.0)	0.010s	passed
TestInvalidArguments[1](-1.0E-5)	0.009s	passed
TestInvalidArguments[2](-5.0)	0.008s	passed
TestInvalidArguments[3](1.00001)	0.008s	passed
TestInvalidArguments[4](5.0)	0.008s	passed
TestInvalidArguments[5](50.0)	0.008s	passed
TestInvalidArguments[6](1.7976931348623157E308)	0.008s	passed
TestInvalidArguments[7](Infinity)	0.008s	passed
TestInvalidArguments[8](-Infinity)	0.008s	passed
TestInvalidTwice[0](true)	0.172s	passed
TestInvalidTwice[1](false)	0.128s	passed
testDownsampleSingleTile[0](0.3)	0.115s	passed
testDownsampleSingleTile[1](0.4)	0.116s	passed
testDownsampleSingleTile[2](0.5)	0.124s	passed
testDownsampleSingleTile[3](0.6)	0.147s	passed
testDownsampleSingleTile[4](0.7)	0.151s	passed
testDownsampleSingleTile[5](0.7999999999999999)	0.160s	passed
testDownsampleSingleTile[6](0.8999999999999999)	0.173s	passed
testDownsampleSingleTile[7](0.9999999999999999)	0.178s	passed
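
The TestInvalidArguments cases above all pass a FRACTION outside the range 0 to 1, including the infinities. A minimal sketch of the kind of range check they exercise might look like the following. The class and method names are hypothetical, and treating both endpoints as valid is an assumption, since the report only says "between 0 and 1".

```java
// Hypothetical helper, not Picard's actual code: it mirrors the range check
// implied by the message "FRACTION must be a value between 0 and 1".
final class FractionCheck {
    /**
     * Returns true only for values in the closed interval [0, 1].
     * Assumption: the endpoints 0 and 1 are accepted.
     */
    static boolean isValidFraction(final double fraction) {
        // NaN fails both comparisons, so it is rejected along with the infinities.
        return fraction >= 0.0 && fraction <= 1.0;
    }
}
```

Under this check every parameter listed for TestInvalidArguments is rejected, while the values used by testDownsampleSingleTile (0.3 up to 0.9999999999999999) are accepted.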

Standard output

No errors found
No errors found
No errors found
No errors found
No errors found
No errors found
No errors found
No errors found
No errors found

Standard error

WARNING 2025-04-24 13:04:11 ValidateSamFile NM validation cannot be performed without the reference. All other validations will still occur.
USAGE: PositionBasedDownsampleSam [arguments]

Summary
Class to downsample a SAM/BAM/CRAM file based on the position of the read in a flowcell. As with DownsampleSam, all the
reads with the same queryname are either kept or dropped as a unit.

Details

The downsampling is not random, and there is no random seed. It is determined entirely by the position of each read
within its tile. Specifically, the tool draws an ellipse that covers a FRACTION of the tile's total area and of all the
edges of the tile, and uses this region to decide whether to keep or drop each record. Since reads with the same name
(mates, secondary and supplementary alignments) have the same position, the decision is the same for all of them.

The main concern addressed by this method is that, because of "optical duplicates", random downsampling can produce a
result with a different optical duplicate rate, and therefore a different estimated library size (when running
MarkDuplicates). This method keeps physically close reads together, so that optical duplicates are kept or dropped as a
group, except for reads near the boundary of the ellipse.

By default the program expects read names to have 5 or 7 fields separated by colons (:), and it takes the last two as
the x and y coordinates of the read within the tile from which it was sequenced. See DEFAULT_READ_NAME_REGEX for more
detail.

The program traverses the INPUT twice: first to find the size of each tile, then to perform the downsampling.

Downsampling invalidates the duplicate flag, because reads that were duplicates before downsampling may no longer be
duplicates afterwards. The default setting therefore also removes the duplicate information.
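
The two-pass idea described above can be illustrated with a deliberately simplified sketch. It is not Picard's implementation: the class, method, and field names are hypothetical, the ellipse is simply centred on each tile, and the sketch is only meaningful while that ellipse fits inside the tile (FRACTION no larger than about π/4). Picard's actual geometry differs in its details.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified illustration only; all names here are hypothetical.
final class TileDownsampleSketch {
    /** Bounding box of the read coordinates seen on one tile. */
    static final class Bounds {
        int minX = Integer.MAX_VALUE, maxX = Integer.MIN_VALUE;
        int minY = Integer.MAX_VALUE, maxY = Integer.MIN_VALUE;

        void add(final int x, final int y) {
            minX = Math.min(minX, x);
            maxX = Math.max(maxX, x);
            minY = Math.min(minY, y);
            maxY = Math.max(maxY, y);
        }
    }

    /** Pass 1: record the extent of every tile. Each read is {tile, x, y}. */
    static Map<Integer, Bounds> measureTiles(final int[][] reads) {
        final Map<Integer, Bounds> tiles = new HashMap<>();
        for (final int[] r : reads) {
            tiles.computeIfAbsent(r[0], t -> new Bounds()).add(r[1], r[2]);
        }
        return tiles;
    }

    /**
     * Pass 2: keep a read if it lies inside an ellipse centred on its tile whose
     * area is `fraction` of the tile's bounding box.
     * Assumption of this sketch: fraction <= PI / 4, so the ellipse fits in the box.
     */
    static boolean keep(final Bounds b, final int x, final int y, final double fraction) {
        // Normalise both axes to [-0.5, 0.5] so the ellipse becomes a circle.
        final double width = Math.max(1, b.maxX - b.minX);
        final double height = Math.max(1, b.maxY - b.minY);
        final double nx = (x - b.minX) / width - 0.5;
        final double ny = (y - b.minY) / height - 0.5;
        // A circle of area `fraction` in the unit square has radius^2 = fraction / PI.
        return nx * nx + ny * ny <= fraction / Math.PI;
    }
}
```

Because the decision depends only on the tile and the x and y coordinates, every record sharing a read name gets the same answer, and repeating the run on the same input gives the same output.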

Example

java -jar picard.jar PositionBasedDownsampleSam \
I=input.bam \
O=downsampled.bam \
FRACTION=0.1

Caveats

Note 1: This method is technology and read-name dependent. If the read names do not have coordinate information
embedded in them, or if your BAM contains reads from multiple technologies (flowcell versions, sequencing machines), this
will not work properly. It has been designed to work with Illumina technology and read names. Consider modifying
READ_NAME_REGEX in other cases.

Note 2: The code has been designed to simulate, as accurately as possible, sequencing less, not to produce an
exact downsampled fraction (use DownsampleSam for that). In particular, since the reads may be distributed
unevenly within the lanes/tiles, the resulting downsampling percentage will not be determined exactly by the input
argument FRACTION.

Note 3: Consider running MarkDuplicates after downsampling in order to "expose" the duplicates whose
representative has been downsampled away.

Note 4: The downsampling assumes a uniform distribution of reads in the flowcell. Input already downsampled with
PositionBasedDownsampleSam violates this assumption. To guard against such input, PositionBasedDownsampleSam always
places a PG record in the header of its output, and aborts whenever it finds such a PG record in its input.
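
The guard in Note 4 can be sketched as follows. This is an illustration under stated assumptions, not the tool's code: it scans raw header text instead of a parsed SAM header, it assumes the PG record is recognised by its PN (program name) field, and all class and method names are hypothetical. The `allowMultiple` flag stands in for ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS.

```java
import java.util.List;

// Illustrative only: real code would inspect the parsed SAM header, not raw text.
final class PgGuardSketch {
    // Assumed program name, taken from the tool name shown in the usage message.
    static final String PROGRAM_NAME = "PositionBasedDownsampleSam";

    /** Returns true if any @PG header line carries PN:PositionBasedDownsampleSam. */
    static boolean alreadyDownsampled(final List<String> headerLines) {
        for (final String line : headerLines) {
            if (!line.startsWith("@PG\t")) {
                continue; // not a program record
            }
            for (final String field : line.split("\t")) {
                if (field.equals("PN:" + PROGRAM_NAME)) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Throws unless the input is fresh or the caller explicitly overrides the check. */
    static void checkInput(final List<String> headerLines, final boolean allowMultiple) {
        if (alreadyDownsampled(headerLines) && !allowMultiple) {
            throw new IllegalStateException("Input was already downsampled by " + PROGRAM_NAME);
        }
    }
}
```

The two TestInvalidTwice cases in the table above, parameterised with true and false, appear to exercise this second-pass behaviour.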
Version:null

Required Arguments:

--FRACTION,-F <Double> The (approximate) fraction of reads to be kept, between 0 and 1. Required.

--INPUT,-I <File> The input SAM/BAM/CRAM file to downsample. Required.

--OUTPUT,-O <File> The output, downsampled, SAM/BAM/CRAM file. Required.

Optional Arguments:

--ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS <Boolean>
Allow downsampling again despite this being a bad idea with possibly unexpected results.
Default value: false. Possible values: {true, false}

--arguments_file <File> read one or more arguments files and add them to the command line. This argument may be
specified 0 or more times. Default value: null.

--COMPRESSION_LEVEL <Integer> Compression level for all compressed files created (e.g. BAM and VCF). Default value: 5.

--CREATE_INDEX <Boolean> Whether to create an index when writing VCF or coordinate sorted BAM output. Default
value: false. Possible values: {true, false}

--CREATE_MD5_FILE <Boolean> Whether to create an MD5 digest for any BAM or FASTQ files created. Default value:
false. Possible values: {true, false}

--help,-h <Boolean> display the help message. Default value: false. Possible values: {true, false}

--MAX_RECORDS_IN_RAM <Integer> When writing files that need to be sorted, this will specify the number of records stored
in RAM before spilling to disk. Increasing this number reduces the number of file handles
needed to sort the file, and increases the amount of RAM needed. Default value: 100.

--QUIET <Boolean> Whether to suppress job-summary info on System.err. Default value: false. Possible
values: {true, false}

--READ_NAME_REGEX <String> Use these regular expressions to parse read names in the input SAM file. Read names are
parsed to extract three variables: tile/region, x coordinate and y coordinate. The x and y
coordinates are used to determine the downsample decision. Set this option to null to
disable optical duplicate detection, e.g. for RNA-seq. The regular expression should
contain three capture groups for the three variables, in order. It must match the entire
read name. Note that if the default regex is specified, a regex match is not actually
done, but instead the read name is split on colons (:). For 5 element names, the 3rd, 4th
and 5th elements are assumed to be tile, x and y values. For 7 element names (CASAVA 1.8),
the 5th, 6th, and 7th elements are assumed to be tile, x and y values. Default value:
<optimized capture of last three ':' separated fields as numeric values>.

--REFERENCE_SEQUENCE,-R <PicardHtsPath>
Reference sequence file. Default value: null.

--REMOVE_DUPLICATE_INFORMATION <Boolean>
Determines whether the duplicate tag should be reset since the downsampling requires
re-marking duplicates. Default value: true. Possible values: {true, false}

--STOP_AFTER <Long> Stop after processing N reads, mainly for debugging. Default value: null.

--TMP_DIR <File> One or more directories with space available to be used by this program for temporary
storage of working files. This argument may be specified 0 or more times. Default value:
null.

--USE_JDK_DEFLATER,-use_jdk_deflater <Boolean>
Use the JDK Deflater instead of the Intel Deflater for writing compressed output. Default
value: false. Possible values: {true, false}

--USE_JDK_INFLATER,-use_jdk_inflater <Boolean>
Use the JDK Inflater instead of the Intel Inflater for reading compressed input. Default
value: false. Possible values: {true, false}

--VALIDATION_STRINGENCY <ValidationStringency>
Validation stringency for all SAM files read by this program. Setting stringency to
SILENT can improve performance when processing a BAM file in which variable-length data
(read, qualities, tags) do not otherwise need to be decoded. Default value: STRICT.
Possible values: {STRICT, LENIENT, SILENT}

--VERBOSITY <LogLevel> Control verbosity of logging. Default value: INFO. Possible values: {ERROR, WARNING,
INFO, DEBUG}

--version <Boolean> display the version number for this tool. Default value: false. Possible values: {true,
false}

Advanced Arguments:

--showHidden <Boolean> display hidden arguments. Default value: false. Possible values: {true, false}
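
The default read-name handling described under READ_NAME_REGEX (split on colons, then take the last three fields of a 5 or 7 element name as tile, x and y) can be sketched as below. The helper is hypothetical and is not Picard's optimised parser. Returning null for names that do not fit is a choice made for this sketch, not documented behaviour.

```java
// Hypothetical helper; Picard's own parser is optimised and may differ in details.
final class ReadNameSketch {
    /**
     * Returns {tile, x, y} from the last three ':'-separated fields of a
     * 5 or 7 element read name, or null if the name does not fit that shape.
     */
    static int[] parseTileXY(final String readName) {
        // Limit -1 keeps trailing empty fields so the element count is exact.
        final String[] fields = readName.split(":", -1);
        if (fields.length != 5 && fields.length != 7) {
            return null;
        }
        try {
            final int n = fields.length;
            return new int[] {
                Integer.parseInt(fields[n - 3]), // tile: 3rd of 5, or 5th of 7
                Integer.parseInt(fields[n - 2]), // x coordinate
                Integer.parseInt(fields[n - 1])  // y coordinate
            };
        } catch (NumberFormatException e) {
            return null; // non-numeric tile or coordinate
        }
    }
}
```

For a made-up 7 element name such as `M01234:25:000000000-ABCDE:1:1101:15589:1332`, this yields tile 1101, x 15589 and y 1332.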

FRACTION must be a value between 0 and 1, found: -1.0
USAGE: PositionBasedDownsampleSam [arguments]