Class picard.sam.PositionBasedDownsampleSamTest

20 tests, 0 failures, 0 ignored, 1.622s duration, 100% successful

Tests

Test Duration Result
TestBuilder 0.083s passed
TestInvalidArguments[0](-1.0) 0.010s passed
TestInvalidArguments[1](-1.0E-5) 0.009s passed
TestInvalidArguments[2](-5.0) 0.008s passed
TestInvalidArguments[3](1.00001) 0.008s passed
TestInvalidArguments[4](5.0) 0.008s passed
TestInvalidArguments[5](50.0) 0.008s passed
TestInvalidArguments[6](1.7976931348623157E308) 0.008s passed
TestInvalidArguments[7](Infinity) 0.008s passed
TestInvalidArguments[8](-Infinity) 0.008s passed
TestInvalidTwice[0](true) 0.172s passed
TestInvalidTwice[1](false) 0.128s passed
testDownsampleSingleTile[0](0.3) 0.115s passed
testDownsampleSingleTile[1](0.4) 0.116s passed
testDownsampleSingleTile[2](0.5) 0.124s passed
testDownsampleSingleTile[3](0.6) 0.147s passed
testDownsampleSingleTile[4](0.7) 0.151s passed
testDownsampleSingleTile[5](0.7999999999999999) 0.160s passed
testDownsampleSingleTile[6](0.8999999999999999) 0.173s passed
testDownsampleSingleTile[7](0.9999999999999999) 0.178s passed
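
The parameter values 0.7999999999999999, 0.8999999999999999 and 0.9999999999999999 in the testDownsampleSingleTile rows look like accumulated binary floating-point rounding rather than hand-written constants. A minimal Python sketch, assuming (the report does not confirm this) that the data provider starts at 0.3 and repeatedly adds 0.1; the function name and its parameters are hypothetical:

```python
def stepped_fractions(start=0.3, step=0.1, count=8):
    """Build test fractions by repeated addition, as a data-provider loop might.

    start, step and count are assumptions chosen to match the eight
    testDownsampleSingleTile rows; the real provider may differ.
    """
    values = []
    current = start
    for _ in range(count):
        values.append(current)
        current += step  # IEEE 754 rounding error accumulates here
    return values


fractions = stepped_fractions()
print(fractions)
```

Under that assumption the last three values fall one unit in the last place below 0.8, 0.9 and 1.0, which would explain why the largest tested fraction stays inside the valid range.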

Standard output

No errors found
No errors found
No errors found
No errors found
No errors found
No errors found
No errors found
No errors found
No errors found

Standard error

WARNING	2025-04-24 13:04:11	ValidateSamFile	NM validation cannot be performed without the reference. All other validations will still occur.
USAGE: PositionBasedDownsampleSam [arguments]

<h3>Summary</h3>
Class to downsample a SAM/BAM/CRAM file based on the position of the read in a flowcell. As with DownsampleSam, all the
reads with the same queryname are either kept or dropped as a unit.

<h3>Details</h3>
The downsampling is _not_ random (and there is no random seed). It is deterministically determined by the position of
each read within its tile. Specifically, it draws an ellipse that covers a FRACTION of the total tile's area and of all
the edges of the tile. It uses this area to determine whether to keep or drop the record. Since reads with the same name
have the same position (mates, secondary and supplemental alignments), the decision will be the same for all of them.
The main concern of this downsampling method is that due to "optical duplicates" downsampling randomly can create a
result that has a different optical duplicate rate, and therefore a different estimated library size (when running
MarkDuplicates). This method keeps (physically) close reads together, so that (except for reads near the boundary of the
circle) optical duplicates are kept or dropped as a group. By default the program expects the read names to have 5 or 7
fields separated by colons (:) and it takes the last two to indicate the x and y coordinates of the reads within the
tile whence it was sequenced. See DEFAULT_READ_NAME_REGEX for more detail. The program traverses the INPUT twice: first
to find out the size of each of the tiles, and next to perform the downsampling. Downsampling invalidates the duplicate
flag because duplicate reads before downsampling may not all remain duplicated after downsampling. Thus, the default
setting also removes the duplicate information. 
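
The two-pass, position-based decision described in the Details paragraph can be sketched in Python as follows. This is a simplified illustration, not Picard's implementation: it assumes a single ellipse centred on the tile, and the names tile_bounds and keep_read are hypothetical. The centred ellipse only covers exactly FRACTION of the tile while it fits inside the tile, that is for fractions up to pi/4 (about 0.785).

```python
import math


def tile_bounds(reads):
    """First pass: record the coordinate range seen on each tile.

    reads is an iterable of (tile, x, y) tuples; the result maps each tile
    to (min_x, max_x, min_y, max_y).
    """
    bounds = {}
    for tile, x, y in reads:
        if tile in bounds:
            min_x, max_x, min_y, max_y = bounds[tile]
            bounds[tile] = (min(min_x, x), max(max_x, x),
                            min(min_y, y), max(max_y, y))
        else:
            bounds[tile] = (x, x, y, y)
    return bounds


def keep_read(x, y, bounds, fraction):
    """Second pass: keep a read if it lies inside the centred ellipse.

    The decision depends only on the position, so every record sharing a
    read name (and therefore a position) gets the same answer.
    """
    if fraction <= 0:
        return False
    min_x, max_x, min_y, max_y = bounds
    width = max(max_x - min_x, 1)
    height = max(max_y - min_y, 1)
    centre_x = (min_x + max_x) / 2
    centre_y = (min_y + max_y) / 2
    # Semi-axes a, b keep the tile's aspect ratio and satisfy
    # pi * a * b = fraction * width * height.
    scale = math.sqrt(fraction / math.pi)
    semi_x = scale * width
    semi_y = scale * height
    return ((x - centre_x) / semi_x) ** 2 + ((y - centre_y) / semi_y) ** 2 <= 1


# Synthetic, evenly spread reads on one tile (made-up data).
reads = [(1101, x, y) for x in range(100) for y in range(100)]
bounds = tile_bounds(reads)
kept = [r for r in reads if keep_read(r[1], r[2], bounds[r[0]], 0.5)]
print(len(kept) / len(reads))
```

With evenly spread reads the kept share lands close to the requested fraction; with unevenly spread reads it will drift, which is the behaviour Note 2 below warns about.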

Example

java -jar picard.jar PositionBasedDownsampleSam \
I=input.bam \
O=downsampled.bam \
FRACTION=0.1

Caveats

Note 1: This method is <b>technology and read-name dependent</b>. If the read-names do not have coordinate information
embedded in them, or if your BAM contains reads from multiple technologies (flowcell versions, sequencing machines) this
will not work properly. It has been designed to work with Illumina technology and read names. Consider modifying {@link
#READ_NAME_REGEX} in other cases. 

Note 2: The code has been designed to simulate, as accurately as possible, sequencing less, <b>not</b> for getting an
exact downsampled fraction (Use {@link DownsampleSam} for that.) In particular, since the reads may be distributed
non-evenly within the lanes/tiles, the resulting downsampling percentage will not be accurately determined by the input
argument FRACTION. 

Note 3: Consider running {@link MarkDuplicates} after downsampling in order to "expose" the duplicates whose
representative has been downsampled away.

Note 4: The downsampling assumes a uniform distribution of reads in the flowcell. Input already downsampled with
PositionBasedDownsampleSam violates this assumption. To guard against such input, PositionBasedDownsampleSam always
places a PG record in the header of its output, and aborts whenever it finds such a PG record in its input.
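
The guard described in Note 4 can be illustrated with a small Python sketch over raw SAM header lines. It assumes the previous run is recognised by the PN (program name) field of an @PG line; whether Picard matches on PN, ID or something else is not stated here, and both function names are hypothetical.

```python
PROGRAM_NAME = "PositionBasedDownsampleSam"


def program_names(header_lines):
    """Collect the PN (program name) field of every @PG header line."""
    names = []
    for line in header_lines:
        if not line.startswith("@PG"):
            continue
        for field in line.rstrip("\n").split("\t")[1:]:
            if field.startswith("PN:"):
                names.append(field[3:])
    return names


def check_not_already_downsampled(header_lines, allow_multiple=False):
    """Raise ValueError if the header records an earlier position-based run.

    allow_multiple plays the role of
    ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS in this sketch.
    """
    if PROGRAM_NAME in program_names(header_lines) and not allow_multiple:
        raise ValueError(
            "input was already downsampled with " + PROGRAM_NAME)


# Made-up header of a file that has not been downsampled before.
fresh_header = ["@HD\tVN:1.6\tSO:coordinate", "@PG\tID:bwa\tPN:bwa"]
check_not_already_downsampled(fresh_header)  # passes silently
```
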
Version:null


Required Arguments:

--FRACTION,-F <Double>        The (approximate) fraction of reads to be kept, between 0 and 1.  Required. 

--INPUT,-I <File>             The input SAM/BAM/CRAM file to downsample.  Required. 

--OUTPUT,-O <File>            The output, downsampled, SAM/BAM/CRAM file.  Required. 


Optional Arguments:

--ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS <Boolean>
                              Allow downsampling again despite this being a bad idea with possibly unexpected results. 
                              Default value: false. Possible values: {true, false} 

--arguments_file <File>       read one or more arguments files and add them to the command line  This argument may be
                              specified 0 or more times. Default value: null. 

--COMPRESSION_LEVEL <Integer> Compression level for all compressed files created (e.g. BAM and VCF).  Default value: 5. 

--CREATE_INDEX <Boolean>      Whether to create an index when writing VCF or coordinate sorted BAM output.  Default
                              value: false. Possible values: {true, false} 

--CREATE_MD5_FILE <Boolean>   Whether to create an MD5 digest for any BAM or FASTQ files created.    Default value:
                              false. Possible values: {true, false} 

--help,-h <Boolean>           display the help message  Default value: false. Possible values: {true, false} 

--MAX_RECORDS_IN_RAM <Integer>When writing files that need to be sorted, this will specify the number of records stored
                              in RAM before spilling to disk. Increasing this number reduces the number of file handles
                              needed to sort the file, and increases the amount of RAM needed.  Default value: 100. 

--QUIET <Boolean>             Whether to suppress job-summary info on System.err.  Default value: false. Possible
                              values: {true, false} 

--READ_NAME_REGEX <String>    Use these regular expressions to parse read names in the input SAM file. Read names are
                              parsed to extract three variables: tile/region, x coordinate and y coordinate. The x and y
                              coordinates are used to determine the downsample decision. Set this option to null to
                              disable optical duplicate detection, e.g. for RNA-seq. The regular expression should
                              contain three capture groups for the three variables, in order. It must match the entire
                              read name. Note that if the default regex is specified, a regex match is not actually
                              done, but instead the read name is split on colons (:). For 5 element names, the 3rd, 4th
                              and 5th elements are assumed to be tile, x and y values. For 7 element names (CASAVA 1.8),
                              the 5th, 6th, and 7th elements are assumed to be tile, x and y values.  Default value:
                              <optimized capture of last three ':' separated fields as numeric values>. 
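
The default colon-splitting behaviour described for READ_NAME_REGEX can be sketched in Python. In both the 5-element and the 7-element layout the tile, x and y values are the last three fields, so one slice covers both cases. The function name and the example read names are made up for illustration, and this is not Picard's parser.

```python
def parse_tile_x_y(read_name):
    """Split a read name on ':' and return (tile, x, y) as integers.

    5 fields: the 3rd, 4th and 5th are tile, x, y.
    7 fields: the 5th, 6th and 7th are tile, x, y.
    """
    fields = read_name.split(":")
    if len(fields) not in (5, 7):
        raise ValueError(
            "expected 5 or 7 colon-separated fields, got %d" % len(fields))
    tile, x, y = (int(value) for value in fields[-3:])
    return tile, x, y


# Hypothetical read names in the two supported layouts.
print(parse_tile_x_y("RUN1:2:1101:10003:23460"))
print(parse_tile_x_y("INSTR:42:FLOWCELL:1:1101:15589:1332"))
```
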

--REFERENCE_SEQUENCE,-R <PicardHtsPath>
                              Reference sequence file.  Default value: null. 

--REMOVE_DUPLICATE_INFORMATION <Boolean>
                              Determines whether the duplicate tag should be reset since the downsampling requires
                              re-marking duplicates.  Default value: true. Possible values: {true, false} 
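
Resetting the duplicate information amounts to clearing one bit of the SAM FLAG field: 0x400 marks a PCR or optical duplicate in the SAM specification. A minimal Python sketch, with a hypothetical helper name:

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit: PCR or optical duplicate


def clear_duplicate_flag(flag):
    """Return the SAM FLAG value with the duplicate bit switched off."""
    return flag & ~DUPLICATE_FLAG


# 1123 = 99 (paired, proper pair, mate reverse, first in pair) + 1024 (duplicate)
print(clear_duplicate_flag(1123))
```

All other flag bits are left untouched, so a later MarkDuplicates run can set the bit again where it still applies.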

--STOP_AFTER <Long>           Stop after processing N reads, mainly for debugging.  Default value: null. 

--TMP_DIR <File>              One or more directories with space available to be used by this program for temporary
                              storage of working files  This argument may be specified 0 or more times. Default value:
                              null. 

--USE_JDK_DEFLATER,-use_jdk_deflater <Boolean>
                              Use the JDK Deflater instead of the Intel Deflater for writing compressed output  Default
                              value: false. Possible values: {true, false} 

--USE_JDK_INFLATER,-use_jdk_inflater <Boolean>
                              Use the JDK Inflater instead of the Intel Inflater for reading compressed input  Default
                              value: false. Possible values: {true, false} 

--VALIDATION_STRINGENCY <ValidationStringency>
                              Validation stringency for all SAM files read by this program.  Setting stringency to
                              SILENT can improve performance when processing a BAM file in which variable-length data
                              (read, qualities, tags) do not otherwise need to be decoded.  Default value: STRICT.
                              Possible values: {STRICT, LENIENT, SILENT} 

--VERBOSITY <LogLevel>        Control verbosity of logging.  Default value: INFO. Possible values: {ERROR, WARNING,
                              INFO, DEBUG} 

--version <Boolean>           display the version number for this tool  Default value: false. Possible values: {true,
                              false} 


Advanced Arguments:

--showHidden <Boolean>        display hidden arguments  Default value: false. Possible values: {true, false} 

FRACTION must be a value between 0 and 1, found: -1.0
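
The range check behind this message can be sketched in Python using the parameter values listed for TestInvalidArguments. Treating both bounds as inclusive is an assumption; the report only shows that values outside [0, 1] are rejected and that 0.9999999999999999 is accepted. The function name is hypothetical.

```python
def is_valid_fraction(fraction):
    """Return True if fraction lies in [0, 1] (inclusive bounds assumed).

    Infinite values and NaN fail the comparison and are rejected.
    """
    return 0.0 <= fraction <= 1.0


# The nine values exercised by TestInvalidArguments in this report.
invalid = [-1.0, -1.0e-5, -5.0, 1.00001, 5.0, 50.0,
           1.7976931348623157e308, float("inf"), float("-inf")]
print([is_valid_fraction(value) for value in invalid])
```
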
FRACTION must be a value between 0 and 1, found: -1.0E-5
FRACTION must be a value between 0 and 1, found: -5.0
                              value: false. Possible values: {true, false} 

--VALIDATION_STRINGENCY <ValidationStringency>
                              Validation stringency for all SAM files read by this program.  Setting stringency to
                              SILENT can improve performance when processing a BAM file in which variable-length data
                              (read, qualities, tags) do not otherwise need to be decoded.  Default value: STRICT.
                              Possible values: {STRICT, LENIENT, SILENT} 

--VERBOSITY <LogLevel>        Control verbosity of logging.  Default value: INFO. Possible values: {ERROR, WARNING,
                              INFO, DEBUG} 

--version <Boolean>           display the version number for this tool  Default value: false. Possible values: {true,
                              false} 


Advanced Arguments:

--showHidden <Boolean>        display hidden arguments  Default value: false. Possible values: {true, false} 

FRACTION must be a value between 0 and 1, found: 1.00001
FRACTION must be a value between 0 and 1, found: 5.0
FRACTION must be a value between 0 and 1, found: 50.0
FRACTION must be a value between 0 and 1, found: 1.7976931348623157E308
USAGE: PositionBasedDownsampleSam [arguments]

<h3>Summary</h3>
Class to downsample a SAM/BAM/CRAM file based on the position of the read in a flowcell. As with DownsampleSam, all the
reads with the same queryname are either kept or dropped as a unit.

<h3>Details</h3>
The downsampling is _not_ random (and there is no random seed). It is deterministically determined by the position of
each read within its tile. Specifically, it draws an ellipse that covers a FRACTION of the total tile's area and of all
the edges of the tile. It uses this area to determine whether to keep or drop the record. Since reads with the same name
have the same position (mates, secondary and supplemental alignments), the decision will be the same for all of them.
The main concern of this downsampling method is that due to "optical duplicates" downsampling randomly can create a
result that has a different optical duplicate rate, and therefore a different estimated library size (when running
MarkDuplicates). This method keeps (physically) close read together, so that (except for reads near the boundary of the
circle) optical duplicates are kept or dropped as a group. By default the program expects the read names to have 5 or 7
fields separated by colons (:) and it takes the last two to indicate the x and y coordinates of the reads within the
tile whence it was sequenced. See DEFAULT_READ_NAME_REGEX for more detail. The program traverses the INPUT twice: first
to find out the size of each of the tiles, and next to perform the downsampling. Downsampling invalidates the duplicate
flag because duplicate reads before downsampling may not all remain duplicated after downsampling. Thus, the default
setting also removes the duplicate information. 

Example

java -jar picard.jar PositionBasedDownsampleSam \
I=input.bam \
O=downsampled.bam \
FRACTION=0.1

Caveats

Note 1: This method is <b>technology and read-name dependent</b>. If the read-names do not have coordinate information
embedded in them, or if your BAM contains reads from multiple technologies (flowcell versions, sequencing machines) this
will not work properly. It has been designed to work with Illumina technology and reads-names. Consider modifying {@link
#READ_NAME_REGEX} in other cases. 

Note 2: The code has been designed to simulate, as accurately as possible, sequencing less, <b>not</b> for getting an
exact downsampled fraction (Use {@link DownsampleSam} for that.) In particular, since the reads may be distributed
non-evenly within the lanes/tiles, the resulting downsampling percentage will not be accurately determined by the input
argument FRACTION. 

Note 3:Consider running {@link MarkDuplicates} after downsampling in order to "expose" the duplicates whose
representative has been downsampled away.

Note 4:The downsampling assumes a uniform distribution of reads in the flowcell. Input already downsampled with
PositionBasedDownsampleSam violates this assumption. To guard against such input, PositionBasedDownsampleSam always
places a PG record in the header of its output, and aborts whenever it finds such a PG record in its input.
Version:null


Required Arguments:

--FRACTION,-F <Double>        The (approximate) fraction of reads to be kept, between 0 and 1.  Required. 

--INPUT,-I <File>             The input SAM/BAM/CRAM file to downsample.  Required. 

--OUTPUT,-O <File>            The output, downsampled, SAM/BAM/CRAM file.  Required. 


Optional Arguments:

--ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS <Boolean>
                              Allow downsampling again despite this being a bad idea with possibly unexpected results. 
                              Default value: false. Possible values: {true, false} 

--arguments_file <File>       read one or more arguments files and add them to the command line  This argument may be
                              specified 0 or more times. Default value: null. 

--COMPRESSION_LEVEL <Integer> Compression level for all compressed files created (e.g. BAM and VCF).  Default value: 5. 

--CREATE_INDEX <Boolean>      Whether to create an index when writing VCF or coordinate sorted BAM output.  Default
                              value: false. Possible values: {true, false} 

--CREATE_MD5_FILE <Boolean>   Whether to create an MD5 digest for any BAM or FASTQ files created.    Default value:
                              false. Possible values: {true, false} 

--help,-h <Boolean>           display the help message  Default value: false. Possible values: {true, false} 

--MAX_RECORDS_IN_RAM <Integer>
                              When writing files that need to be sorted, this will specify the number of records stored
                              in RAM before spilling to disk. Increasing this number reduces the number of file handles
                              needed to sort the file, and increases the amount of RAM needed.  Default value: 100. 

--QUIET <Boolean>             Whether to suppress job-summary info on System.err.  Default value: false. Possible
                              values: {true, false} 

--READ_NAME_REGEX <String>    Use these regular expressions to parse read names in the input SAM file. Read names are
                              parsed to extract three variables: tile/region, x coordinate and y coordinate. The x and y
                              coordinates are used to determine the downsample decision. Set this option to null to
                              disable optical duplicate detection, e.g. for RNA-seq. The regular expression should
                              contain three capture groups for the three variables, in order. It must match the entire
                              read name. Note that if the default regex is specified, a regex match is not actually
                              done, but instead the read name is split on colons (:). For 5 element names, the 3rd, 4th
                              and 5th elements are assumed to be tile, x and y values. For 7 element names (CASAVA 1.8),
                              the 5th, 6th, and 7th elements are assumed to be tile, x and y values.  Default value:
                              <optimized capture of last three ':' separated fields as numeric values>. 

--REFERENCE_SEQUENCE,-R <PicardHtsPath>
                              Reference sequence file.  Default value: null. 

--REMOVE_DUPLICATE_INFORMATION <Boolean>
                              Determines whether the duplicate tag should be reset since the downsampling requires
                              re-marking duplicates.  Default value: true. Possible values: {true, false} 

--STOP_AFTER <Long>           Stop after processing N reads, mainly for debugging.  Default value: null. 

--TMP_DIR <File>              One or more directories with space available to be used by this program for temporary
                              storage of working files  This argument may be specified 0 or more times. Default value:
                              null. 

--USE_JDK_DEFLATER,-use_jdk_deflater <Boolean>
                              Use the JDK Deflater instead of the Intel Deflater for writing compressed output  Default
                              value: false. Possible values: {true, false} 

--USE_JDK_INFLATER,-use_jdk_inflater <Boolean>
                              Use the JDK Inflater instead of the Intel Inflater for reading compressed input  Default
                              value: false. Possible values: {true, false} 

--VALIDATION_STRINGENCY <ValidationStringency>
                              Validation stringency for all SAM files read by this program.  Setting stringency to
                              SILENT can improve performance when processing a BAM file in which variable-length data
                              (read, qualities, tags) do not otherwise need to be decoded.  Default value: STRICT.
                              Possible values: {STRICT, LENIENT, SILENT} 

--VERBOSITY <LogLevel>        Control verbosity of logging.  Default value: INFO. Possible values: {ERROR, WARNING,
                              INFO, DEBUG} 

--version <Boolean>           display the version number for this tool  Default value: false. Possible values: {true,
                              false} 


Advanced Arguments:

--showHidden <Boolean>        display hidden arguments  Default value: false. Possible values: {true, false} 
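
The default read-name handling described under READ_NAME_REGEX above (split on colons, then take tile, x and y from the 3rd to 5th fields of a 5-element name or the 5th to 7th fields of a 7-element name) can be sketched as follows. This is an illustration of the documented rule, not Picard's implementation. The function name parse_tile_xy is hypothetical, and the sketch assumes the three fields are plain integers.

```python
def parse_tile_xy(read_name):
    """Extract (tile, x, y) from a colon-separated read name.

    5-element names: fields 3, 4, 5 are tile, x, y.
    7-element names (CASAVA 1.8): fields 5, 6, 7 are tile, x, y.
    Assumes the three fields contain only digits.
    """
    fields = read_name.split(":")
    if len(fields) == 5:
        tile, x, y = fields[2:5]
    elif len(fields) == 7:
        tile, x, y = fields[4:7]
    else:
        raise ValueError(
            f"expected 5 or 7 colon-separated fields, found {len(fields)}: {read_name!r}"
        )
    return int(tile), int(x), int(y)


# Example read names are made up for illustration.
print(parse_tile_xy("HWUSI-EAS100R:6:73:941:1973"))
print(parse_tile_xy("INSTR:42:FLOWCELL:1:1101:1500:2000"))
```

Read names with a different layout would raise an error here. In the tool itself, the documented remedy for such names is to supply a custom READ_NAME_REGEX with three capture groups.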

FRACTION must be a value between 0 and 1, found: Infinity
FRACTION must be a value between 0 and 1, found: -Infinity
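
The argument check behind the message above can be sketched as a simple range test. The rejected values used below are the ones listed for TestInvalidArguments in this report. Whether the real tool accepts exactly 0 and exactly 1 is an assumption here, and validate_fraction is a hypothetical name.

```python
import math


def validate_fraction(fraction):
    """Raise ValueError unless `fraction` is a number in [0, 1].

    Assumption: both end points are accepted. NaN is rejected explicitly.
    """
    if math.isnan(fraction) or not 0.0 <= fraction <= 1.0:
        raise ValueError(
            f"FRACTION must be a value between 0 and 1, found: {fraction}"
        )
    return fraction


# Values exercised by TestInvalidArguments in this report.
invalid_values = [-1.0, -1.0e-5, -5.0, 1.00001, 5.0, 50.0,
                  1.7976931348623157e308, math.inf, -math.inf]
for value in invalid_values:
    try:
        validate_fraction(value)
    except ValueError as error:
        print(error)
```

Infinite values fail the same range comparison as ordinary out-of-range numbers, so they need no special case.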
[Thu Apr 24 13:04:11 EDT 2025] PositionBasedDownsampleSam --INPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam8241000921554732497.bam --OUTPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam17091562803111839686.bam --FRACTION 0.1 --REMOVE_DUPLICATE_INFORMATION true --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 100 --CREATE_INDEX false --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Apr 24 13:04:11 EDT 2025] Executing as root@lovelace on Linux 6.1.0-28-amd64 amd64; OpenJDK 64-Bit Server VM 17.0.13+11-Debian-2deb12u1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:null
INFO	2025-04-24 13:04:11	PositionBasedDownsampleSam	Checking to see if input file has been downsampled with this program before.
INFO	2025-04-24 13:04:11	PositionBasedDownsampleSam	Starting first pass. Examining read distribution in tiles.
INFO	2025-04-24 13:04:11	PositionBasedDownsampleSam	First pass done.
INFO	2025-04-24 13:04:11	PositionBasedDownsampleSam	Starting second pass. Outputting reads.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Second pass done.
WARNING	2025-04-24 13:04:12	PositionBasedDownsampleSam	You've requested FRACTION=0.100000, the resulting downsampling resulted in a rate of 0.069400.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Finished! Kept 1388 out of 20000 reads (P=0.0694000).
[Thu Apr 24 13:04:12 EDT 2025] picard.sam.PositionBasedDownsampleSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1073741824
[Thu Apr 24 13:04:12 EDT 2025] PositionBasedDownsampleSam --INPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam17091562803111839686.bam --OUTPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam7485707036275958588.bam --FRACTION 0.1 --ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS true --REMOVE_DUPLICATE_INFORMATION true --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 100 --CREATE_INDEX false --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Apr 24 13:04:12 EDT 2025] Executing as root@lovelace on Linux 6.1.0-28-amd64 amd64; OpenJDK 64-Bit Server VM 17.0.13+11-Debian-2deb12u1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:null
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Checking to see if input file has been downsampled with this program before.
WARNING	2025-04-24 13:04:12	PositionBasedDownsampleSam	Found previous Program Record that indicates that this file has been downsampled already with this program. Operation not supported! Previous PG: SAMProgramRecord{PN=PositionBasedDownsampleSam, CL=PositionBasedDownsampleSam --INPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam8241000921554732497.bam --OUTPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam17091562803111839686.bam --FRACTION 0.1 --REMOVE_DUPLICATE_INFORMATION true --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 100 --CREATE_INDEX false --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false, VN=Version:null}
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Starting first pass. Examining read distribution in tiles.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	First pass done.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Starting second pass. Outputting reads.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Second pass done.
WARNING	2025-04-24 13:04:12	PositionBasedDownsampleSam	You've requested FRACTION=0.100000, the resulting downsampling resulted in a rate of 0.998559.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Finished! Kept 1386 out of 1388 reads (P=0.998559).
[Thu Apr 24 13:04:12 EDT 2025] picard.sam.PositionBasedDownsampleSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1073741824
[Thu Apr 24 13:04:12 EDT 2025] PositionBasedDownsampleSam --INPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam8241000921554732497.bam --OUTPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam14680955743806128770.bam --FRACTION 0.1 --REMOVE_DUPLICATE_INFORMATION true --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 100 --CREATE_INDEX false --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Apr 24 13:04:12 EDT 2025] Executing as root@lovelace on Linux 6.1.0-28-amd64 amd64; OpenJDK 64-Bit Server VM 17.0.13+11-Debian-2deb12u1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:null
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Checking to see if input file has been downsampled with this program before.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Starting first pass. Examining read distribution in tiles.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	First pass done.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Starting second pass. Outputting reads.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Second pass done.
WARNING	2025-04-24 13:04:12	PositionBasedDownsampleSam	You've requested FRACTION=0.100000, the resulting downsampling resulted in a rate of 0.069400.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Finished! Kept 1388 out of 20000 reads (P=0.0694000).
[Thu Apr 24 13:04:12 EDT 2025] picard.sam.PositionBasedDownsampleSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1073741824
[Thu Apr 24 13:04:12 EDT 2025] PositionBasedDownsampleSam --INPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam14680955743806128770.bam --OUTPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam684682119236562693.bam --FRACTION 0.1 --REMOVE_DUPLICATE_INFORMATION true --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 100 --CREATE_INDEX false --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Apr 24 13:04:12 EDT 2025] Executing as root@lovelace on Linux 6.1.0-28-amd64 amd64; OpenJDK 64-Bit Server VM 17.0.13+11-Debian-2deb12u1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:null
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Checking to see if input file has been downsampled with this program before.
ERROR	2025-04-24 13:04:12	PositionBasedDownsampleSam	Found previous Program Record that indicates that this file has been downsampled already with this program. Operation not supported! Previous PG: SAMProgramRecord{PN=PositionBasedDownsampleSam, CL=PositionBasedDownsampleSam --INPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam8241000921554732497.bam --OUTPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam14680955743806128770.bam --FRACTION 0.1 --REMOVE_DUPLICATE_INFORMATION true --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 100 --CREATE_INDEX false --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false, VN=Version:null}
[Thu Apr 24 13:04:12 EDT 2025] picard.sam.PositionBasedDownsampleSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1073741824
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
[Thu Apr 24 13:04:12 EDT 2025] PositionBasedDownsampleSam --INPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam8241000921554732497.bam --OUTPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam11134014865551519153.bam --FRACTION 0.3 --CREATE_INDEX true --REMOVE_DUPLICATE_INFORMATION true --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 100 --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Apr 24 13:04:12 EDT 2025] Executing as root@lovelace on Linux 6.1.0-28-amd64 amd64; OpenJDK 64-Bit Server VM 17.0.13+11-Debian-2deb12u1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:null
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Checking to see if input file has been downsampled with this program before.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Starting first pass. Examining read distribution in tiles.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	First pass done.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Starting second pass. Outputting reads.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Second pass done.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Finished! Kept 5544 out of 20000 reads (P=0.277200).
[Thu Apr 24 13:04:12 EDT 2025] picard.sam.PositionBasedDownsampleSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1073741824
WARNING	2025-04-24 13:04:12	ValidateSamFile	NM validation cannot be performed without the reference. All other validations will still occur.
[Thu Apr 24 13:04:12 EDT 2025] PositionBasedDownsampleSam --INPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam8241000921554732497.bam --OUTPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam14804667383810650798.bam --FRACTION 0.4 --CREATE_INDEX true --REMOVE_DUPLICATE_INFORMATION true --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 100 --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Apr 24 13:04:12 EDT 2025] Executing as root@lovelace on Linux 6.1.0-28-amd64 amd64; OpenJDK 64-Bit Server VM 17.0.13+11-Debian-2deb12u1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:null
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Checking to see if input file has been downsampled with this program before.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Starting first pass. Examining read distribution in tiles.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	First pass done.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Starting second pass. Outputting reads.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Second pass done.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Finished! Kept 7652 out of 20000 reads (P=0.382600).
[Thu Apr 24 13:04:12 EDT 2025] picard.sam.PositionBasedDownsampleSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1073741824
WARNING	2025-04-24 13:04:12	ValidateSamFile	NM validation cannot be performed without the reference. All other validations will still occur.
[Thu Apr 24 13:04:12 EDT 2025] PositionBasedDownsampleSam --INPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam8241000921554732497.bam --OUTPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam8763103151718330851.bam --FRACTION 0.5 --CREATE_INDEX true --REMOVE_DUPLICATE_INFORMATION true --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 100 --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Apr 24 13:04:12 EDT 2025] Executing as root@lovelace on Linux 6.1.0-28-amd64 amd64; OpenJDK 64-Bit Server VM 17.0.13+11-Debian-2deb12u1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:null
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Checking to see if input file has been downsampled with this program before.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Starting first pass. Examining read distribution in tiles.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	First pass done.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Starting second pass. Outputting reads.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Second pass done.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Finished! Kept 9688 out of 20000 reads (P=0.484400).
[Thu Apr 24 13:04:12 EDT 2025] picard.sam.PositionBasedDownsampleSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1073741824
WARNING	2025-04-24 13:04:12	ValidateSamFile	NM validation cannot be performed without the reference. All other validations will still occur.
[Thu Apr 24 13:04:12 EDT 2025] PositionBasedDownsampleSam --INPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam8241000921554732497.bam --OUTPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam1555394520304180532.bam --FRACTION 0.6 --CREATE_INDEX true --REMOVE_DUPLICATE_INFORMATION true --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 100 --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Apr 24 13:04:12 EDT 2025] Executing as root@lovelace on Linux 6.1.0-28-amd64 amd64; OpenJDK 64-Bit Server VM 17.0.13+11-Debian-2deb12u1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:null
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Checking to see if input file has been downsampled with this program before.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Starting first pass. Examining read distribution in tiles.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	First pass done.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Starting second pass. Outputting reads.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Second pass done.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Finished! Kept 12348 out of 20000 reads (P=0.617400).
[Thu Apr 24 13:04:12 EDT 2025] picard.sam.PositionBasedDownsampleSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1073741824
WARNING	2025-04-24 13:04:12	ValidateSamFile	NM validation cannot be performed without the reference. All other validations will still occur.
[Thu Apr 24 13:04:12 EDT 2025] PositionBasedDownsampleSam --INPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam8241000921554732497.bam --OUTPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam3293118460970238390.bam --FRACTION 0.7 --CREATE_INDEX true --REMOVE_DUPLICATE_INFORMATION true --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 100 --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Apr 24 13:04:12 EDT 2025] Executing as root@lovelace on Linux 6.1.0-28-amd64 amd64; OpenJDK 64-Bit Server VM 17.0.13+11-Debian-2deb12u1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:null
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Checking to see if input file has been downsampled with this program before.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Starting first pass. Examining read distribution in tiles.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	First pass done.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Starting second pass. Outputting reads.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Second pass done.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Finished! Kept 14456 out of 20000 reads (P=0.722800).
[Thu Apr 24 13:04:12 EDT 2025] picard.sam.PositionBasedDownsampleSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1073741824
WARNING	2025-04-24 13:04:12	ValidateSamFile	NM validation cannot be performed without the reference. All other validations will still occur.
[Thu Apr 24 13:04:12 EDT 2025] PositionBasedDownsampleSam --INPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam8241000921554732497.bam --OUTPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam5465010466467136963.bam --FRACTION 0.7999999999999999 --CREATE_INDEX true --REMOVE_DUPLICATE_INFORMATION true --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 100 --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Apr 24 13:04:12 EDT 2025] Executing as root@lovelace on Linux 6.1.0-28-amd64 amd64; OpenJDK 64-Bit Server VM 17.0.13+11-Debian-2deb12u1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:null
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Checking to see if input file has been downsampled with this program before.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Starting first pass. Examining read distribution in tiles.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	First pass done.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Starting second pass. Outputting reads.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Second pass done.
INFO	2025-04-24 13:04:12	PositionBasedDownsampleSam	Finished! Kept 16642 out of 20000 reads (P=0.832100).
[Thu Apr 24 13:04:12 EDT 2025] picard.sam.PositionBasedDownsampleSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1073741824
WARNING	2025-04-24 13:04:12	ValidateSamFile	NM validation cannot be performed without the reference. All other validations will still occur.
[Thu Apr 24 13:04:13 EDT 2025] PositionBasedDownsampleSam --INPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam8241000921554732497.bam --OUTPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam8768151596868067586.bam --FRACTION 0.8999999999999999 --CREATE_INDEX true --REMOVE_DUPLICATE_INFORMATION true --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 100 --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Apr 24 13:04:13 EDT 2025] Executing as root@lovelace on Linux 6.1.0-28-amd64 amd64; OpenJDK 64-Bit Server VM 17.0.13+11-Debian-2deb12u1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:null
INFO	2025-04-24 13:04:13	PositionBasedDownsampleSam	Checking to see if input file has been downsampled with this program before.
INFO	2025-04-24 13:04:13	PositionBasedDownsampleSam	Starting first pass. Examining read distribution in tiles.
INFO	2025-04-24 13:04:13	PositionBasedDownsampleSam	First pass done.
INFO	2025-04-24 13:04:13	PositionBasedDownsampleSam	Starting second pass. Outputting reads.
INFO	2025-04-24 13:04:13	PositionBasedDownsampleSam	Second pass done.
INFO	2025-04-24 13:04:13	PositionBasedDownsampleSam	Finished! Kept 18612 out of 20000 reads (P=0.930600).
[Thu Apr 24 13:04:13 EDT 2025] picard.sam.PositionBasedDownsampleSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1073741824
WARNING	2025-04-24 13:04:13	ValidateSamFile	NM validation cannot be performed without the reference. All other validations will still occur.
[Thu Apr 24 13:04:13 EDT 2025] PositionBasedDownsampleSam --INPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam8241000921554732497.bam --OUTPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam11367191584493318599.bam --FRACTION 0.9999999999999999 --CREATE_INDEX true --REMOVE_DUPLICATE_INFORMATION true --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 100 --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Apr 24 13:04:13 EDT 2025] Executing as root@lovelace on Linux 6.1.0-28-amd64 amd64; OpenJDK 64-Bit Server VM 17.0.13+11-Debian-2deb12u1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:null
INFO	2025-04-24 13:04:13	PositionBasedDownsampleSam	Checking to see if input file has been downsampled with this program before.
INFO	2025-04-24 13:04:13	PositionBasedDownsampleSam	Starting first pass. Examining read distribution in tiles.
INFO	2025-04-24 13:04:13	PositionBasedDownsampleSam	First pass done.
INFO	2025-04-24 13:04:13	PositionBasedDownsampleSam	Starting second pass. Outputting reads.
INFO	2025-04-24 13:04:13	PositionBasedDownsampleSam	Second pass done.
INFO	2025-04-24 13:04:13	PositionBasedDownsampleSam	Finished! Kept 20000 out of 20000 reads (P=1.00000).
[Thu Apr 24 13:04:13 EDT 2025] picard.sam.PositionBasedDownsampleSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1073741824
WARNING	2025-04-24 13:04:13	ValidateSamFile	NM validation cannot be performed without the reference. All other validations will still occur.