bamql_queries

LOGICAL CONNECTIVES
PREDICATES
EXAMPLES
SEE ALSO

BAMQL filters SAM or BAM files using a simple query language that is more expressive than the view options available in samtools(1).

The language consists of logical connectives, specifically and, or, not, and a ternary if, and predicates that match properties of the read. Expressions may be grouped using parentheses.

LOGICAL CONNECTIVES

The logical connectives are presented in order from lowest precedence to highest precedence.

VARIABLES
let name1 = expr1 [, name2 = expr2 ...] in body_expr

Computes the value of exprN and assigns it to nameN which can then be used in body_expr. The name may contain only alphanumeric characters or underscore.

bind str_expr using /regex/[i] in body_expr

Compute the value of str_expr, which must be a string, and then matches it against regex. If i is included, the match will be case-insensitive. If the expression matches, body_expr, which must be Boolean, will be evaluated. Any named subpatterns, as per pcrepattern(3), are available as variables. If the patterns name ends in _d, _i, _c, it will be converted to a floating point value, an integer value, or the first character as an integer.

For example, bind read_group using /C3BUK\.(?<x_i>\d+)/ in x_i > 2 will match any read group that starts with C3BUK. followed by a number greater than 2.

LOOPS
all name = expr1 , expr2 [ , expr3 ...] in body_expr

Evaluates body_expr with name set to each of expr1, expr2, and so on, and is satisfied if it is satisfied for every value.

any name = expr1 , expr2 [ , expr3 ...] in body_expr

Evaluates body_expr with name set to each of expr1, expr2, and so on, and is satisfied if it is satisfied for some value.

TERNARY IF
cond then then_expression else else_expression

If cond is satisfied, this expression will only be satisfied if then_expression is satisfied. If cond is not satisfied, then this expression will only be satisfied if else_expression is satisfied.

DISJUNCTION (OR)
expr | expr

Is satisfied if at least one operand is satisfied.

EXCLUSIVE DISJUNCTION (XOR)
expr ^ expr

Is satisfied if exactly one operand is satisfied, but not both.

CONJUNCTION (AND)
expr & expr

Is satisfied only if both operands are satisfied.

MATERIAL CONDITONAL (IMPLIES)
antecedent -> assertion

Is satisfied if either both the antecedent and assertion are statisfied or the antecedent is not satisfied. Consider the following:

position(500) -> nt_exact(500, G)

This will reject any reads where position 500 is not G, but keep reads which either have a G at position 500 or do not overlap with position 500.

COMPARISON AND MATCHING
expr == expr
expr != expr
expr < expr
expr > expr
expr <= expr
expr => expr

Compares values in the usual way. Note that only values of the same type can be compared. For strings, lexicographical order is used for less/greater than comparisons.

expr ~ /regex/[i]
expr : glob

Does string matching usin a PCRE regular expression or a glob. See pcrepattern(3) or glob(7) for details. If i is included, the match will be case-insensitive.

haystack \ needle

Check that the integer haystack contains all the bits set in needle. This is equivalent to the C/Java/C#/Python/PERL expression (haystack & needle) == needle.

NEGATION (NOT)
! expr

Is satisfied if expr is not satisfied.

LITERALS
’x

The integer value of character x.

Numeric floating point and integer values are also supported.

PREDICATES

The predicates are grouped based on function.

BAM FLAGS
These predicates match the sequences in the BAM file. For more details, see sam(5).

paired?

The read is paired in sequencing.

proper_pair?

The read is mapped in a proper pair; that is, the aligner thought the insert size implied by the paired-end reads seemed correct.

unmapped?

The query sequence itself is unmapped.

mate_unmapped?

The mate is unmapped.

mapped_to_reverse?

The read is mapped to the reverse strand.

mate_mapped_to_reverse?

The read’s mate is mapped to the reverse strand.

raw_flag(number)

If numeric BAM flags have been burned into your brain, you can check them directly by specifying number.

read1?

The read is the first read in a pair.

read2?

The read is the second read in a pair.

secondary?

The alignment is not primary.

failed_qc?

The read fails platform/vendor quality checks.

duplicate?

The read is either a PCR or an optical duplicate. That is, another read with the same sequence occurs at exactly the same position in the reference genome.

supplementary?

The alignment is supplementary.

flags

Returns the read flag as an integer.

MAPPING INFORMATION
begin

Returns the start position of the read on the chromosome, if it is mapped.

chr(glob)

Matches the chromosome name to which the read is mapped. Chromosome names should not start with chr, as that will be automatically checked. Moreover, some human chromosome have differing names, so both are checked. The known rules are:

X == 23
Y == 24
MT == M == 25

Also, case is ignored. Additionally, the chromosome is matched using wildcards from glob(7).

chr_name

Returns the chromosome name (stripped of chr) as a string. This can be used with other comparisons, but lacks all the equivalent chromosome magic that chr provides.

mapping_quality(probability)

Matches the read if the probability of error is less than probability. The mapping quality is approximated in the SAM format, so this will be imperfect. For clarity, setting the probability to 0 will be so stringent as to reject all reads, while setting it to 1 will be so liberal as to accept all reads.

end

Returns the final position of the read on the chromosome, if it is mapped.

insert_reversed

Satisfied if the insert size indicates that it is flipped relative to the reference. If the insert is in the direction of the reference or the insert size is not provided, this returns false.

insert_size

Returns the size of the region between a pair of mapped reads if the aligner has been able to determine it.

mate_chr(glob)

This works identically to chr, but on the chromosome of the mate pair, if one exists. If the mate is unmapped, this returns false.

mate_begin

Returns the start position of the mate pair on the chromosome, if it is mapped. Note that mate_end, while logical, cannot exist because the necessary information is not available.

mate_chr_name

Returns the chromosome name of the mate pair, if one exists, (stripped of chr) as a string. This can be used with other comparisons, but lacks all the equivalent chromosome magic that mate_chr provides.

split_pair?

Checks if both the reads in a mate pair are mapped, but to different chromosomes.

OTHER READ INFORMATION
read_group

Returns the read group.

aux_dbl(code)
aux_int(code)
aux_str(code)

Gets a piece of auxiliary data, if specified in the input. The code is the two symbol identifier for the auxiliary format. The result will be a float point number, integral number, or string for aux_dbl, aux_int, and aux_str, respectively. BAM files also have a character type which may be read as an integer.

POSITION
All of the position operations are inclusive: that means they take any reads with nucleotides in the desired range. This means that the start or end of a read can extend beyond the desired positions. BAM files allow reads to have position information while still being marked as unmapped. This operations ignore the official mapping status, and work solely on the position information. If this is undesirable, combine with & !unmapped?. Occasionally, the aligner produces reads which have a position, but no detailed mapping information (i.e., no CIGAR string). In this case, the end position of the read is assumed to be mapped with no insertions or deletions.

after(position)

Matches all sequences that cover the specified position or any higher position (more q-ward on the chromosome).

before(position)

Matches all sequences that cover the specified position or any lower position (more p-ward on the chromosome).

position(start, end)

Matches all sequences that cover the range of position from start to end.

SEQUENCE
nt(position, n)

Matches a read has nucleotide n at the provided position, relative to the chromosome. The nucleotide can be any IUPAC-style base (ACGTU, KMYR, BDHV, and N). The match is degenerate; that is, if the nucleotide specified is N, any base will match. It will reject unmapped reads and reads which do not contain the required position.

nt_exact(position, n)

Matches a read has nucleotide n at the provided position, relative to the chromosome. The nucleotide can be any IUPAC-style base (ACGTU, KMYR, BDHV, and N). The match is exact; that is, if the nucleotide specified is N, the base in the read must be N too. It will reject unmapped reads and reads which do not contain the required position.

MISCELLANEOUS
bed(file)

Reads a BED-formatted file and creates an expression that is satisfied if the read interesects any of the segments in the file. Note that this must load the BED file into memory, so do not use large BED files.

header

Returns the header string of the read.

max(expr1, expr2, ...)

Returns the largest value among the supplied arguments, which must be of the same type. If strings, this is the last string, lexicographically.

min(expr1, expr2, ...)

Returns the smallest value among the supplied arguments, which must be of the same type. If strings, this is the first string, lexicographically.

true

Always satisfied.

false

Never satisfied.

random(probability)

This chooses a uniform pseudo-random variable and is satisfied with frequency probability. This can be used to provide a random sub-sample of reads, keeping the proportion of total reads specified as the probability. The probability must be between 0 and 1 and can be specified using scientific notation. The random number chosen is selected using drand48(3) if one is inclined to care about such things.

EXAMPLES

Match sequences on chromosome 7 which are from the read group labelled RUN3:

chr(7) & read_group : RUN3

Sub-sample mitochondrial reads and all the reads that have matched to chromosomes starting with ug.

chr(M) & random(0.2) | chr(ug*)