WARNING 2025-04-24 13:04:11 ValidateSamFile NM validation cannot be performed without the reference. All other validations will still occur.
USAGE: PositionBasedDownsampleSam [arguments]
<h3>Summary</h3>
Class to downsample a SAM/BAM/CRAM file based on the position of the read in a flowcell. As with DownsampleSam, all the
reads with the same queryname are either kept or dropped as a unit.
<h3>Details</h3>
The downsampling is _not_ random (and there is no random seed). It is fully determined by the position of each read
within its tile. Specifically, it draws an ellipse that covers a FRACTION of the total tile's area and of all the edges
of the tile. It uses this area to determine whether to keep or drop the record. Since reads with the same name have the
same position (mates, secondary and supplementary alignments), the decision will be the same for all of them.
The main concern this downsampling method addresses is "optical duplicates": downsampling randomly can create a
result that has a different optical duplicate rate, and therefore a different estimated library size (when running
MarkDuplicates). This method keeps (physically) close reads together, so that (except for reads near the boundary of the
ellipse) optical duplicates are kept or dropped as a group. By default the program expects the read names to have 5 or 7
fields separated by colons (:) and it takes the last two to indicate the x and y coordinates of the read within the
tile in which it was sequenced. See DEFAULT_READ_NAME_REGEX for more detail. The program traverses the INPUT twice: first
to find out the size of each of the tiles, and next to perform the downsampling. Downsampling invalidates the duplicate
flag because reads marked as duplicates before downsampling may not all remain duplicates after downsampling. Thus, the
default setting also removes the duplicate information.
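The two-pass, position-based decision described above can be illustrated with a small sketch. This is not Picard's code: the function names `tile_bounds` and `keep_read` are hypothetical, the ellipse is assumed to be centered in the tile, and the special handling of tile edges mentioned in the text is not reproduced. With a centered ellipse the kept area matches FRACTION only while FRACTION is at most pi/4.

```python
import math


def tile_bounds(coords):
    """First pass (sketch): find the min/max x and y observed in one tile."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), max(xs), min(ys), max(ys)


def keep_read(x, y, bounds, fraction):
    """Second pass (sketch): keep a read if it lies inside a centered ellipse
    whose area is `fraction` of the tile's bounding box.

    Assumption: centered ellipse, no edge handling, fraction <= pi / 4.
    """
    min_x, max_x, min_y, max_y = bounds
    # Normalize to the unit square so the ellipse becomes a circle.
    u = (x - min_x) / (max_x - min_x) if max_x > min_x else 0.5
    v = (y - min_y) / (max_y - min_y) if max_y > min_y else 0.5
    # Circle area in the unit square: pi * r^2 = fraction.
    r_squared = fraction / math.pi
    return (u - 0.5) ** 2 + (v - 0.5) ** 2 <= r_squared


if __name__ == "__main__":
    bounds = tile_bounds([(0, 0), (100, 200), (50, 100)])
    for x, y in [(50, 100), (0, 0)]:
        print((x, y), keep_read(x, y, bounds, 0.1))
```

Because the decision depends only on (x, y) and the tile bounds, every record sharing a read name gets the same answer, and physically adjacent clusters tend to be kept or dropped together.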
Example
java -jar picard.jar PositionBasedDownsampleSam \
I=input.bam \
O=downsampled.bam \
FRACTION=0.1
Caveats
Note 1: This method is <b>technology and read-name dependent</b>. If the read names do not have coordinate information
embedded in them, or if your BAM contains reads from multiple technologies (flowcell versions, sequencing machines), this
will not work properly. It has been designed to work with Illumina technology and read names. Consider modifying {@link
#READ_NAME_REGEX} in other cases.
Note 2: The code has been designed to simulate, as accurately as possible, sequencing less, <b>not</b> to produce an
exact downsampled fraction (use {@link DownsampleSam} for that). In particular, since the reads may be distributed
unevenly within the lanes/tiles, the resulting downsampling percentage will not be accurately determined by the input
argument FRACTION.
Note 3: Consider running {@link MarkDuplicates} after downsampling in order to "expose" the duplicates whose
representative has been downsampled away.
Note 4: The downsampling assumes a uniform distribution of reads in the flowcell. Input already downsampled with
PositionBasedDownsampleSam violates this assumption. To guard against such input, PositionBasedDownsampleSam always
places a PG record in the header of its output, and aborts whenever it finds such a PG record in its input (see
ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS below for overriding this).
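The guard in Note 4 amounts to scanning the input header for an earlier PG record. The sketch below works on plain SAM header lines. The helper name `already_downsampled` is hypothetical, and matching on the `PN` tag is an assumption, since the exact fields Picard writes into its PG record are not shown in this output.

```python
def already_downsampled(header_lines, program_name="PositionBasedDownsampleSam"):
    """Return True if any @PG header line names `program_name` in its PN tag.

    Assumption: the earlier run is identified by PN; the real tool may
    match on different PG fields.
    """
    for line in header_lines:
        if not line.startswith("@PG"):
            continue
        # SAM header fields are tab-separated TAG:VALUE pairs.
        fields = line.rstrip("\n").split("\t")[1:]
        tags = dict(f.split(":", 1) for f in fields if ":" in f)
        if tags.get("PN") == program_name:
            return True
    return False


if __name__ == "__main__":
    header = [
        "@HD\tVN:1.6\tSO:coordinate",
        "@PG\tID:pbds\tPN:PositionBasedDownsampleSam",
    ]
    print(already_downsampled(header))
```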
Version:null
Required Arguments:
--FRACTION,-F <Double> The (approximate) fraction of reads to be kept, between 0 and 1. Required.
--INPUT,-I <File> The input SAM/BAM/CRAM file to downsample. Required.
--OUTPUT,-O <File> The output, downsampled, SAM/BAM/CRAM file. Required.
Optional Arguments:
--ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS <Boolean>
Allow downsampling again despite this being a bad idea with possibly unexpected results.
Default value: false. Possible values: {true, false}
--arguments_file <File> Read one or more arguments files and add them to the command line. This argument may be
specified 0 or more times. Default value: null.
--COMPRESSION_LEVEL <Integer> Compression level for all compressed files created (e.g. BAM and VCF). Default value: 5.
--CREATE_INDEX <Boolean> Whether to create an index when writing VCF or coordinate sorted BAM output. Default
value: false. Possible values: {true, false}
--CREATE_MD5_FILE <Boolean> Whether to create an MD5 digest for any BAM or FASTQ files created. Default value:
false. Possible values: {true, false}
--help,-h <Boolean> Display the help message. Default value: false. Possible values: {true, false}
--MAX_RECORDS_IN_RAM <Integer> When writing files that need to be sorted, this will specify the number of records stored
in RAM before spilling to disk. Increasing this number reduces the number of file handles
needed to sort the file, and increases the amount of RAM needed. Default value: 100.
--QUIET <Boolean> Whether to suppress job-summary info on System.err. Default value: false. Possible
values: {true, false}
--READ_NAME_REGEX <String> Use these regular expressions to parse read names in the input SAM file. Read names are
parsed to extract three variables: tile/region, x coordinate and y coordinate. The x and y
coordinates are used to determine the downsample decision. Set this option to null to
disable optical duplicate detection, e.g. for RNA-seq. The regular expression should
contain three capture groups for the three variables, in order. It must match the entire
read name. Note that if the default regex is specified, a regex match is not actually
done, but instead the read name is split on colons (:). For 5 element names, the 3rd, 4th
and 5th elements are assumed to be tile, x and y values. For 7 element names (CASAVA 1.8),
the 5th, 6th, and 7th elements are assumed to be tile, x and y values. Default value:
<optimized capture of last three ':' separated fields as numeric values>.
--REFERENCE_SEQUENCE,-R <PicardHtsPath>
Reference sequence file. Default value: null.
--REMOVE_DUPLICATE_INFORMATION <Boolean>
Determines whether the duplicate tag should be reset since the downsampling requires
re-marking duplicates. Default value: true. Possible values: {true, false}
--STOP_AFTER <Long> Stop after processing N reads, mainly for debugging. Default value: null.
--TMP_DIR <File> One or more directories with space available to be used by this program for temporary
storage of working files. This argument may be specified 0 or more times. Default value:
null.
--USE_JDK_DEFLATER,-use_jdk_deflater <Boolean>
Use the JDK Deflater instead of the Intel Deflater for writing compressed output. Default
value: false. Possible values: {true, false}
--USE_JDK_INFLATER,-use_jdk_inflater <Boolean>
Use the JDK Inflater instead of the Intel Inflater for reading compressed input. Default
value: false. Possible values: {true, false}
--VALIDATION_STRINGENCY <ValidationStringency>
Validation stringency for all SAM files read by this program. Setting stringency to
SILENT can improve performance when processing a BAM file in which variable-length data
(read, qualities, tags) do not otherwise need to be decoded. Default value: STRICT.
Possible values: {STRICT, LENIENT, SILENT}
--VERBOSITY <LogLevel> Control verbosity of logging. Default value: INFO. Possible values: {ERROR, WARNING,
INFO, DEBUG}
--version <Boolean> Display the version number for this tool. Default value: false. Possible values: {true,
false}
Advanced Arguments:
--showHidden <Boolean> Display hidden arguments. Default value: false. Possible values: {true, false}
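The default read-name handling described under READ_NAME_REGEX (split on colons, then take tile, x and y from the last three fields of a 5- or 7-element name) can be sketched as follows. The function `parse_read_name` and the sample read names are hypothetical and only illustrate the rule stated in the help text.

```python
def parse_read_name(name):
    """Split a read name on ':' and return (tile, x, y) as integers.

    5-element names: 3rd, 4th and 5th fields are tile, x, y.
    7-element names: 5th, 6th and 7th fields are tile, x, y.
    In both cases these are the last three fields.
    """
    fields = name.split(":")
    if len(fields) not in (5, 7):
        raise ValueError(f"expected 5 or 7 colon-separated fields, got {len(fields)}")
    tile, x, y = (int(f) for f in fields[-3:])
    return tile, x, y


if __name__ == "__main__":
    # Made-up read names in the two supported layouts.
    print(parse_read_name("M001:7:1101:1234:5678"))
    print(parse_read_name("INST:12:FCX:1:2205:100:200"))
```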
FRACTION must be a value between 0 and 1, found: -1.0
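The message above comes from a range check on FRACTION. A minimal sketch of such a check is shown below. The function `validate_fraction` is hypothetical, and treating 0 and 1 themselves as valid is an assumption, since this output only shows values outside the range being rejected.

```python
def validate_fraction(fraction):
    """Return an error message if `fraction` is outside [0, 1], else None.

    Assumption: both endpoints are accepted.
    """
    if not 0.0 <= fraction <= 1.0:
        return f"FRACTION must be a value between 0 and 1, found: {fraction}"
    return None


if __name__ == "__main__":
    for value in (-1.0, 0.1, 1.00001):
        print(value, validate_fraction(value))
```

Python formats very small floats differently from Java (for example, scientific notation is rendered differently), so the message text is only expected to match this log for simple values.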
FRACTION must be a value between 0 and 1, found: -1.0E-5
FRACTION must be a value between 0 and 1, found: -5.0
FRACTION must be a value between 0 and 1, found: 1.00001
USAGE: PositionBasedDownsampleSam [arguments]
<h3>Summary</h3>
Class to downsample a SAM/BAM/CRAM file based on the position of the read in a flowcell. As with DownsampleSam, all the
reads with the same queryname are either kept or dropped as a unit.
<h3>Details</h3>
The downsampling is _not_ random (and there is no random seed). It is deterministically determined by the position of
each read within its tile. Specifically, it draws an ellipse that covers a FRACTION of the total tile's area and of all
the edges of the tile. It uses this area to determine whether to keep or drop the record. Since reads with the same name
have the same position (mates, secondary and supplemental alignments), the decision will be the same for all of them.
The main concern of this downsampling method is that due to "optical duplicates" downsampling randomly can create a
result that has a different optical duplicate rate, and therefore a different estimated library size (when running
MarkDuplicates). This method keeps (physically) close read together, so that (except for reads near the boundary of the
circle) optical duplicates are kept or dropped as a group. By default the program expects the read names to have 5 or 7
fields separated by colons (:) and it takes the last two to indicate the x and y coordinates of the reads within the
tile whence it was sequenced. See DEFAULT_READ_NAME_REGEX for more detail. The program traverses the INPUT twice: first
to find out the size of each of the tiles, and next to perform the downsampling. Downsampling invalidates the duplicate
flag because duplicate reads before downsampling may not all remain duplicated after downsampling. Thus, the default
setting also removes the duplicate information.
Example
java -jar picard.jar PositionBasedDownsampleSam \
I=input.bam \
O=downsampled.bam \
FRACTION=0.1
Caveats
Note 1: This method is <b>technology and read-name dependent</b>. If the read names do not have coordinate information
embedded in them, or if your BAM contains reads from multiple technologies (flowcell versions, sequencing machines) this
will not work properly. It has been designed to work with Illumina technology and read names. Consider modifying
READ_NAME_REGEX in other cases.
Note 2: The code has been designed to simulate, as accurately as possible, sequencing less, <b>not</b> to produce an
exact downsampled fraction (use DownsampleSam for that). In particular, since the reads may be distributed
unevenly within the lanes/tiles, the resulting downsampling percentage will not be accurately determined by the input
argument FRACTION.
Note 3: Consider running MarkDuplicates after downsampling in order to "expose" the duplicates whose
representative has been downsampled away.
Note 4: The downsampling assumes a uniform distribution of reads in the flowcell. Input already downsampled with
PositionBasedDownsampleSam violates this assumption. To guard against such input, PositionBasedDownsampleSam always
places a PG record in the header of its output, and aborts whenever it finds such a PG record in its input.
Version:null
Required Arguments:
--FRACTION,-F <Double> The (approximate) fraction of reads to be kept, between 0 and 1. Required.
--INPUT,-I <File> The input SAM/BAM/CRAM file to downsample. Required.
--OUTPUT,-O <File> The output, downsampled, SAM/BAM/CRAM file. Required.
Optional Arguments:
--ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS <Boolean>
Allow downsampling again despite this being a bad idea with possibly unexpected results.
Default value: false. Possible values: {true, false}
--arguments_file <File> Read one or more arguments files and add them to the command line. This argument may be
specified 0 or more times. Default value: null.
--COMPRESSION_LEVEL <Integer> Compression level for all compressed files created (e.g. BAM and VCF). Default value: 5.
--CREATE_INDEX <Boolean> Whether to create an index when writing VCF or coordinate sorted BAM output. Default
value: false. Possible values: {true, false}
--CREATE_MD5_FILE <Boolean> Whether to create an MD5 digest for any BAM or FASTQ files created. Default value:
false. Possible values: {true, false}
--help,-h <Boolean> Display the help message. Default value: false. Possible values: {true, false}
--MAX_RECORDS_IN_RAM <Integer> When writing files that need to be sorted, this will specify the number of records stored
in RAM before spilling to disk. Increasing this number reduces the number of file handles
needed to sort the file, and increases the amount of RAM needed. Default value: 100.
--QUIET <Boolean> Whether to suppress job-summary info on System.err. Default value: false. Possible
values: {true, false}
--READ_NAME_REGEX <String> Use these regular expressions to parse read names in the input SAM file. Read names are
parsed to extract three variables: tile/region, x coordinate and y coordinate. The x and y
coordinates are used to determine the downsample decision. Set this option to null to
disable optical duplicate detection, e.g. for RNA-seq. The regular expression should
contain three capture groups for the three variables, in order. It must match the entire
read name. Note that if the default regex is specified, a regex match is not actually
done, but instead the read name is split on colons (:). For 5 element names, the 3rd, 4th
and 5th elements are assumed to be tile, x and y values. For 7 element names (CASAVA 1.8),
the 5th, 6th, and 7th elements are assumed to be tile, x and y values. Default value:
<optimized capture of last three ':' separated fields as numeric values>.
--REFERENCE_SEQUENCE,-R <PicardHtsPath>
Reference sequence file. Default value: null.
--REMOVE_DUPLICATE_INFORMATION <Boolean>
Determines whether the duplicate tag should be reset since the downsampling requires
re-marking duplicates. Default value: true. Possible values: {true, false}
--STOP_AFTER <Long> Stop after processing N reads, mainly for debugging. Default value: null.
--TMP_DIR <File> One or more directories with space available to be used by this program for temporary
storage of working files. This argument may be specified 0 or more times. Default value:
null.
--USE_JDK_DEFLATER,-use_jdk_deflater <Boolean>
Use the JDK Deflater instead of the Intel Deflater for writing compressed output. Default
value: false. Possible values: {true, false}
--USE_JDK_INFLATER,-use_jdk_inflater <Boolean>
Use the JDK Inflater instead of the Intel Inflater for reading compressed input. Default
value: false. Possible values: {true, false}
--VALIDATION_STRINGENCY <ValidationStringency>
Validation stringency for all SAM files read by this program. Setting stringency to
SILENT can improve performance when processing a BAM file in which variable-length data
(read, qualities, tags) do not otherwise need to be decoded. Default value: STRICT.
Possible values: {STRICT, LENIENT, SILENT}
--VERBOSITY <LogLevel> Control verbosity of logging. Default value: INFO. Possible values: {ERROR, WARNING,
INFO, DEBUG}
--version <Boolean> Display the version number for this tool. Default value: false. Possible values: {true,
false}
Advanced Arguments:
--showHidden <Boolean> Display hidden arguments. Default value: false. Possible values: {true, false}
FRACTION must be a value between 0 and 1, found: 5.0
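The rejection logged above amounts to a range check on FRACTION. A minimal Python sketch of such a check follows; the function name check_fraction is hypothetical, and treating both 0 and 1 as valid is an assumption, since the log only shows values outside the range being refused. Python also prints infinities as inf and -inf rather than in the Java spelling seen in this log.

```python
def check_fraction(value):
    """Return an error message if value is outside [0, 1], else None."""
    fraction = float(value)
    # NaN fails every comparison, so it is rejected here as well.
    if not (0.0 <= fraction <= 1.0):
        return f"FRACTION must be a value between 0 and 1, found: {fraction}"
    return None


# Two of the values exercised in this log, plus one accepted value.
for candidate in (5.0, float("-inf"), 0.1):
    message = check_fraction(candidate)
    print(candidate, "->", message if message else "accepted")
```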
FRACTION must be a value between 0 and 1, found: 50.0
FRACTION must be a value between 0 and 1, found: 1.7976931348623157E308
FRACTION must be a value between 0 and 1, found: Infinity
USAGE: PositionBasedDownsampleSam [arguments]
<h3>Summary</h3>
Class to downsample a SAM/BAM/CRAM file based on the position of the read in a flowcell. As with DownsampleSam, all the
reads with the same queryname are either kept or dropped as a unit.
<h3>Details</h3>
The downsampling is _not_ random (and there is no random seed). It is deterministically determined by the position of
each read within its tile. Specifically, it draws an ellipse that covers a FRACTION of the total tile's area and of all
the edges of the tile. It uses this area to determine whether to keep or drop the record. Since reads with the same name
have the same position (mates, secondary and supplemental alignments), the decision will be the same for all of them.
The main concern of this downsampling method is that due to "optical duplicates" downsampling randomly can create a
result that has a different optical duplicate rate, and therefore a different estimated library size (when running
MarkDuplicates). This method keeps (physically) close read together, so that (except for reads near the boundary of the
circle) optical duplicates are kept or dropped as a group. By default the program expects the read names to have 5 or 7
fields separated by colons (:) and it takes the last two to indicate the x and y coordinates of the reads within the
tile whence it was sequenced. See DEFAULT_READ_NAME_REGEX for more detail. The program traverses the INPUT twice: first
to find out the size of each of the tiles, and next to perform the downsampling. Downsampling invalidates the duplicate
flag because duplicate reads before downsampling may not all remain duplicated after downsampling. Thus, the default
setting also removes the duplicate information.
Example
java -jar picard.jar PositionBasedDownsampleSam \
I=input.bam \
O=downsampled.bam \
FRACTION=0.1
Caveats
Note 1: This method is <b>technology and read-name dependent</b>. If the read-names do not have coordinate information
embedded in them, or if your BAM contains reads from multiple technologies (flowcell versions, sequencing machines) this
will not work properly. It has been designed to work with Illumina technology and reads-names. Consider modifying {@link
#READ_NAME_REGEX} in other cases.
Note 2: The code has been designed to simulate, as accurately as possible, sequencing less, <b>not</b> for getting an
exact downsampled fraction (Use {@link DownsampleSam} for that.) In particular, since the reads may be distributed
non-evenly within the lanes/tiles, the resulting downsampling percentage will not be accurately determined by the input
argument FRACTION.
Note 3:Consider running {@link MarkDuplicates} after downsampling in order to "expose" the duplicates whose
representative has been downsampled away.
Note 4:The downsampling assumes a uniform distribution of reads in the flowcell. Input already downsampled with
PositionBasedDownsampleSam violates this assumption. To guard against such input, PositionBasedDownsampleSam always
places a PG record in the header of its output, and aborts whenever it finds such a PG record in its input.
Version:null
Required Arguments:
--FRACTION,-F <Double> The (approximate) fraction of reads to be kept, between 0 and 1. Required.
--INPUT,-I <File> The input SAM/BAM/CRAM file to downsample. Required.
--OUTPUT,-O <File> The output, downsampled, SAM/BAM/CRAM file. Required.
Optional Arguments:
--ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS <Boolean>
Allow downsampling again despite this being a bad idea with possibly unexpected results.
Default value: false. Possible values: {true, false}
--arguments_file <File> read one or more arguments files and add them to the command line This argument may be
specified 0 or more times. Default value: null.
--COMPRESSION_LEVEL <Integer> Compression level for all compressed files created (e.g. BAM and VCF). Default value: 5.
--CREATE_INDEX <Boolean> Whether to create an index when writing VCF or coordinate sorted BAM output. Default
value: false. Possible values: {true, false}
--CREATE_MD5_FILE <Boolean> Whether to create an MD5 digest for any BAM or FASTQ files created. Default value:
false. Possible values: {true, false}
--help,-h <Boolean> display the help message Default value: false. Possible values: {true, false}
--MAX_RECORDS_IN_RAM <Integer>When writing files that need to be sorted, this will specify the number of records stored
in RAM before spilling to disk. Increasing this number reduces the number of file handles
needed to sort the file, and increases the amount of RAM needed. Default value: 100.
--QUIET <Boolean> Whether to suppress job-summary info on System.err. Default value: false. Possible
values: {true, false}
--READ_NAME_REGEX <String> Use these regular expressions to parse read names in the input SAM file. Read names are
parsed to extract three variables: tile/region, x coordinate and y coordinate. The x and y
coordinates are used to determine the downsample decision. Set this option to null to
disable optical duplicate detection, e.g. for RNA-seq The regular expression should
contain three capture groups for the three variables, in order. It must match the entire
read name. Note that if the default regex is specified, a regex match is not actually
done, but instead the read name is split on colons (:). For 5 element names, the 3rd, 4th
and 5th elements are assumed to be tile, x and y values. For 7 element names (CASAVA 1.8),
the 5th, 6th, and 7th elements are assumed to be tile, x and y values. Default value:
<optimized capture of last three ':' separated fields as numeric values>.
--REFERENCE_SEQUENCE,-R <PicardHtsPath>
Reference sequence file. Default value: null.
--REMOVE_DUPLICATE_INFORMATION <Boolean>
Determines whether the duplicate tag should be reset since the downsampling requires
re-marking duplicates. Default value: true. Possible values: {true, false}
--STOP_AFTER <Long> Stop after processing N reads, mainly for debugging. Default value: null.
--TMP_DIR <File> One or more directories with space available to be used by this program for temporary
storage of working files. This argument may be specified 0 or more times. Default value:
null.
--USE_JDK_DEFLATER,-use_jdk_deflater <Boolean>
Use the JDK Deflater instead of the Intel Deflater for writing compressed output. Default
value: false. Possible values: {true, false}
--USE_JDK_INFLATER,-use_jdk_inflater <Boolean>
Use the JDK Inflater instead of the Intel Inflater for reading compressed input. Default
value: false. Possible values: {true, false}
--VALIDATION_STRINGENCY <ValidationStringency>
Validation stringency for all SAM files read by this program. Setting stringency to
SILENT can improve performance when processing a BAM file in which variable-length data
(read, qualities, tags) do not otherwise need to be decoded. Default value: STRICT.
Possible values: {STRICT, LENIENT, SILENT}
--VERBOSITY <LogLevel> Control verbosity of logging. Default value: INFO. Possible values: {ERROR, WARNING,
INFO, DEBUG}
--version <Boolean> display the version number for this tool. Default value: false. Possible values: {true,
false}
Advanced Arguments:
--showHidden <Boolean> display hidden arguments. Default value: false. Possible values: {true, false}
FRACTION must be a value between 0 and 1, found: -Infinity
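The READ_NAME_REGEX description above says that, with the default setting, read names are split on colons rather than matched against a regex. The following is a minimal Python sketch of that rule as the help text states it. It is not Picard's code: the function name parse_tile_xy is hypothetical, and returning None for names that do not fit is an assumption made for illustration.

```python
def parse_tile_xy(read_name):
    """Return (tile, x, y) from a colon-separated read name, or None.

    Illustrative only: follows the rule described in the help text,
    not Picard's actual implementation.
    """
    fields = read_name.split(":")
    if len(fields) == 5:
        # 5-element names: 3rd, 4th and 5th fields are tile, x, y
        tile, x, y = fields[2:5]
    elif len(fields) == 7:
        # 7-element names (CASAVA 1.8): 5th, 6th and 7th fields are tile, x, y
        tile, x, y = fields[4:7]
    else:
        # Assumption: any other field count is treated as unparseable
        return None
    try:
        return int(tile), int(x), int(y)
    except ValueError:
        # Assumption: non-numeric fields are treated as unparseable
        return None


# The read names below are made up for the example.
print(parse_tile_xy("M1:1:1101:1500:2000"))           # (1101, 1500, 2000)
print(parse_tile_xy("INST:42:FC1:1:1101:1500:2000"))  # (1101, 1500, 2000)
print(parse_tile_xy("read_without_coordinates"))      # None
```

Names that yield None in this sketch correspond to the case where a custom READ_NAME_REGEX with three capture groups would be needed.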
[Thu Apr 24 13:04:11 EDT 2025] PositionBasedDownsampleSam --INPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam8241000921554732497.bam --OUTPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam17091562803111839686.bam --FRACTION 0.1 --REMOVE_DUPLICATE_INFORMATION true --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 100 --CREATE_INDEX false --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Apr 24 13:04:11 EDT 2025] Executing as root@lovelace on Linux 6.1.0-28-amd64 amd64; OpenJDK 64-Bit Server VM 17.0.13+11-Debian-2deb12u1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:null
INFO 2025-04-24 13:04:11 PositionBasedDownsampleSam Checking to see if input file has been downsampled with this program before.
INFO 2025-04-24 13:04:11 PositionBasedDownsampleSam Starting first pass. Examining read distribution in tiles.
INFO 2025-04-24 13:04:11 PositionBasedDownsampleSam First pass done.
INFO 2025-04-24 13:04:11 PositionBasedDownsampleSam Starting second pass. Outputting reads.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Second pass done.
WARNING 2025-04-24 13:04:12 PositionBasedDownsampleSam You've requested FRACTION=0.100000, the resulting downsampling resulted in a rate of 0.069400.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Finished! Kept 1388 out of 20000 reads (P=0.0694000).
[Thu Apr 24 13:04:12 EDT 2025] picard.sam.PositionBasedDownsampleSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1073741824
[Thu Apr 24 13:04:12 EDT 2025] PositionBasedDownsampleSam --INPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam17091562803111839686.bam --OUTPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam7485707036275958588.bam --FRACTION 0.1 --ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS true --REMOVE_DUPLICATE_INFORMATION true --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 100 --CREATE_INDEX false --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Apr 24 13:04:12 EDT 2025] Executing as root@lovelace on Linux 6.1.0-28-amd64 amd64; OpenJDK 64-Bit Server VM 17.0.13+11-Debian-2deb12u1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:null
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Checking to see if input file has been downsampled with this program before.
WARNING 2025-04-24 13:04:12 PositionBasedDownsampleSam Found previous Program Record that indicates that this file has been downsampled already with this program. Operation not supported! Previous PG: SAMProgramRecord{PN=PositionBasedDownsampleSam, CL=PositionBasedDownsampleSam --INPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam8241000921554732497.bam --OUTPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam17091562803111839686.bam --FRACTION 0.1 --REMOVE_DUPLICATE_INFORMATION true --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 100 --CREATE_INDEX false --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false, VN=Version:null}
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Starting first pass. Examining read distribution in tiles.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam First pass done.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Starting second pass. Outputting reads.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Second pass done.
WARNING 2025-04-24 13:04:12 PositionBasedDownsampleSam You've requested FRACTION=0.100000, the resulting downsampling resulted in a rate of 0.998559.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Finished! Kept 1386 out of 1388 reads (P=0.998559).
[Thu Apr 24 13:04:12 EDT 2025] picard.sam.PositionBasedDownsampleSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1073741824
[Thu Apr 24 13:04:12 EDT 2025] PositionBasedDownsampleSam --INPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam8241000921554732497.bam --OUTPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam14680955743806128770.bam --FRACTION 0.1 --REMOVE_DUPLICATE_INFORMATION true --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 100 --CREATE_INDEX false --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Apr 24 13:04:12 EDT 2025] Executing as root@lovelace on Linux 6.1.0-28-amd64 amd64; OpenJDK 64-Bit Server VM 17.0.13+11-Debian-2deb12u1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:null
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Checking to see if input file has been downsampled with this program before.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Starting first pass. Examining read distribution in tiles.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam First pass done.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Starting second pass. Outputting reads.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Second pass done.
WARNING 2025-04-24 13:04:12 PositionBasedDownsampleSam You've requested FRACTION=0.100000, the resulting downsampling resulted in a rate of 0.069400.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Finished! Kept 1388 out of 20000 reads (P=0.0694000).
[Thu Apr 24 13:04:12 EDT 2025] picard.sam.PositionBasedDownsampleSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1073741824
[Thu Apr 24 13:04:12 EDT 2025] PositionBasedDownsampleSam --INPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam14680955743806128770.bam --OUTPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam684682119236562693.bam --FRACTION 0.1 --REMOVE_DUPLICATE_INFORMATION true --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 100 --CREATE_INDEX false --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Apr 24 13:04:12 EDT 2025] Executing as root@lovelace on Linux 6.1.0-28-amd64 amd64; OpenJDK 64-Bit Server VM 17.0.13+11-Debian-2deb12u1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:null
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Checking to see if input file has been downsampled with this program before.
ERROR 2025-04-24 13:04:12 PositionBasedDownsampleSam Found previous Program Record that indicates that this file has been downsampled already with this program. Operation not supported! Previous PG: SAMProgramRecord{PN=PositionBasedDownsampleSam, CL=PositionBasedDownsampleSam --INPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam8241000921554732497.bam --OUTPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam14680955743806128770.bam --FRACTION 0.1 --REMOVE_DUPLICATE_INFORMATION true --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 100 --CREATE_INDEX false --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false, VN=Version:null}
[Thu Apr 24 13:04:12 EDT 2025] picard.sam.PositionBasedDownsampleSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1073741824
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
[Thu Apr 24 13:04:12 EDT 2025] PositionBasedDownsampleSam --INPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam8241000921554732497.bam --OUTPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam11134014865551519153.bam --FRACTION 0.3 --CREATE_INDEX true --REMOVE_DUPLICATE_INFORMATION true --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 100 --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Apr 24 13:04:12 EDT 2025] Executing as root@lovelace on Linux 6.1.0-28-amd64 amd64; OpenJDK 64-Bit Server VM 17.0.13+11-Debian-2deb12u1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:null
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Checking to see if input file has been downsampled with this program before.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Starting first pass. Examining read distribution in tiles.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam First pass done.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Starting second pass. Outputting reads.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Second pass done.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Finished! Kept 5544 out of 20000 reads (P=0.277200).
[Thu Apr 24 13:04:12 EDT 2025] picard.sam.PositionBasedDownsampleSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1073741824
WARNING 2025-04-24 13:04:12 ValidateSamFile NM validation cannot be performed without the reference. All other validations will still occur.
[Thu Apr 24 13:04:12 EDT 2025] PositionBasedDownsampleSam --INPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam8241000921554732497.bam --OUTPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam14804667383810650798.bam --FRACTION 0.4 --CREATE_INDEX true --REMOVE_DUPLICATE_INFORMATION true --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 100 --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Apr 24 13:04:12 EDT 2025] Executing as root@lovelace on Linux 6.1.0-28-amd64 amd64; OpenJDK 64-Bit Server VM 17.0.13+11-Debian-2deb12u1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:null
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Checking to see if input file has been downsampled with this program before.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Starting first pass. Examining read distribution in tiles.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam First pass done.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Starting second pass. Outputting reads.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Second pass done.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Finished! Kept 7652 out of 20000 reads (P=0.382600).
[Thu Apr 24 13:04:12 EDT 2025] picard.sam.PositionBasedDownsampleSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1073741824
WARNING 2025-04-24 13:04:12 ValidateSamFile NM validation cannot be performed without the reference. All other validations will still occur.
[Thu Apr 24 13:04:12 EDT 2025] PositionBasedDownsampleSam --INPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam8241000921554732497.bam --OUTPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam8763103151718330851.bam --FRACTION 0.5 --CREATE_INDEX true --REMOVE_DUPLICATE_INFORMATION true --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 100 --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Apr 24 13:04:12 EDT 2025] Executing as root@lovelace on Linux 6.1.0-28-amd64 amd64; OpenJDK 64-Bit Server VM 17.0.13+11-Debian-2deb12u1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:null
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Checking to see if input file has been downsampled with this program before.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Starting first pass. Examining read distribution in tiles.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam First pass done.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Starting second pass. Outputting reads.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Second pass done.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Finished! Kept 9688 out of 20000 reads (P=0.484400).
[Thu Apr 24 13:04:12 EDT 2025] picard.sam.PositionBasedDownsampleSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1073741824
WARNING 2025-04-24 13:04:12 ValidateSamFile NM validation cannot be performed without the reference. All other validations will still occur.
[Thu Apr 24 13:04:12 EDT 2025] PositionBasedDownsampleSam --INPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam8241000921554732497.bam --OUTPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam1555394520304180532.bam --FRACTION 0.6 --CREATE_INDEX true --REMOVE_DUPLICATE_INFORMATION true --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 100 --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Apr 24 13:04:12 EDT 2025] Executing as root@lovelace on Linux 6.1.0-28-amd64 amd64; OpenJDK 64-Bit Server VM 17.0.13+11-Debian-2deb12u1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:null
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Checking to see if input file has been downsampled with this program before.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Starting first pass. Examining read distribution in tiles.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam First pass done.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Starting second pass. Outputting reads.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Second pass done.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Finished! Kept 12348 out of 20000 reads (P=0.617400).
[Thu Apr 24 13:04:12 EDT 2025] picard.sam.PositionBasedDownsampleSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1073741824
WARNING 2025-04-24 13:04:12 ValidateSamFile NM validation cannot be performed without the reference. All other validations will still occur.
[Thu Apr 24 13:04:12 EDT 2025] PositionBasedDownsampleSam --INPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam8241000921554732497.bam --OUTPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam3293118460970238390.bam --FRACTION 0.7 --CREATE_INDEX true --REMOVE_DUPLICATE_INFORMATION true --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 100 --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Apr 24 13:04:12 EDT 2025] Executing as root@lovelace on Linux 6.1.0-28-amd64 amd64; OpenJDK 64-Bit Server VM 17.0.13+11-Debian-2deb12u1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:null
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Checking to see if input file has been downsampled with this program before.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Starting first pass. Examining read distribution in tiles.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam First pass done.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Starting second pass. Outputting reads.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Second pass done.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Finished! Kept 14456 out of 20000 reads (P=0.722800).
[Thu Apr 24 13:04:12 EDT 2025] picard.sam.PositionBasedDownsampleSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1073741824
WARNING 2025-04-24 13:04:12 ValidateSamFile NM validation cannot be performed without the reference. All other validations will still occur.
[Thu Apr 24 13:04:12 EDT 2025] PositionBasedDownsampleSam --INPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam8241000921554732497.bam --OUTPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam5465010466467136963.bam --FRACTION 0.7999999999999999 --CREATE_INDEX true --REMOVE_DUPLICATE_INFORMATION true --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 100 --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Apr 24 13:04:12 EDT 2025] Executing as root@lovelace on Linux 6.1.0-28-amd64 amd64; OpenJDK 64-Bit Server VM 17.0.13+11-Debian-2deb12u1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:null
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Checking to see if input file has been downsampled with this program before.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Starting first pass. Examining read distribution in tiles.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam First pass done.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Starting second pass. Outputting reads.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Second pass done.
INFO 2025-04-24 13:04:12 PositionBasedDownsampleSam Finished! Kept 16642 out of 20000 reads (P=0.832100).
[Thu Apr 24 13:04:12 EDT 2025] picard.sam.PositionBasedDownsampleSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1073741824
WARNING 2025-04-24 13:04:12 ValidateSamFile NM validation cannot be performed without the reference. All other validations will still occur.
[Thu Apr 24 13:04:13 EDT 2025] PositionBasedDownsampleSam --INPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam8241000921554732497.bam --OUTPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam8768151596868067586.bam --FRACTION 0.8999999999999999 --CREATE_INDEX true --REMOVE_DUPLICATE_INFORMATION true --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 100 --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Apr 24 13:04:13 EDT 2025] Executing as root@lovelace on Linux 6.1.0-28-amd64 amd64; OpenJDK 64-Bit Server VM 17.0.13+11-Debian-2deb12u1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:null
INFO 2025-04-24 13:04:13 PositionBasedDownsampleSam Checking to see if input file has been downsampled with this program before.
INFO 2025-04-24 13:04:13 PositionBasedDownsampleSam Starting first pass. Examining read distribution in tiles.
INFO 2025-04-24 13:04:13 PositionBasedDownsampleSam First pass done.
INFO 2025-04-24 13:04:13 PositionBasedDownsampleSam Starting second pass. Outputting reads.
INFO 2025-04-24 13:04:13 PositionBasedDownsampleSam Second pass done.
INFO 2025-04-24 13:04:13 PositionBasedDownsampleSam Finished! Kept 18612 out of 20000 reads (P=0.930600).
[Thu Apr 24 13:04:13 EDT 2025] picard.sam.PositionBasedDownsampleSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1073741824
WARNING 2025-04-24 13:04:13 ValidateSamFile NM validation cannot be performed without the reference. All other validations will still occur.
[Thu Apr 24 13:04:13 EDT 2025] PositionBasedDownsampleSam --INPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam8241000921554732497.bam --OUTPUT /tmp/pds_test_PositionalDownsampling5160644569884654405/PositionalDownsampleSam11367191584493318599.bam --FRACTION 0.9999999999999999 --CREATE_INDEX true --REMOVE_DUPLICATE_INFORMATION true --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --ALLOW_MULTIPLE_DOWNSAMPLING_DESPITE_WARNINGS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 100 --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Apr 24 13:04:13 EDT 2025] Executing as root@lovelace on Linux 6.1.0-28-amd64 amd64; OpenJDK 64-Bit Server VM 17.0.13+11-Debian-2deb12u1; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:null
INFO 2025-04-24 13:04:13 PositionBasedDownsampleSam Checking to see if input file has been downsampled with this program before.
INFO 2025-04-24 13:04:13 PositionBasedDownsampleSam Starting first pass. Examining read distribution in tiles.
INFO 2025-04-24 13:04:13 PositionBasedDownsampleSam First pass done.
INFO 2025-04-24 13:04:13 PositionBasedDownsampleSam Starting second pass. Outputting reads.
INFO 2025-04-24 13:04:13 PositionBasedDownsampleSam Second pass done.
INFO 2025-04-24 13:04:13 PositionBasedDownsampleSam Finished! Kept 20000 out of 20000 reads (P=1.00000).
[Thu Apr 24 13:04:13 EDT 2025] picard.sam.PositionBasedDownsampleSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1073741824
WARNING 2025-04-24 13:04:13 ValidateSamFile NM validation cannot be performed without the reference. All other validations will still occur.