filter
Keep variants matching an expression. The fast path reads raw VCF lines and only parses the fields referenced by the expression — matching records are written as raw bytes without re-serialization.
Synopsis
    vcfkit filter [OPTIONS] --expression <EXPR> [INPUT]
Options
| Flag | Description |
|---|---|
| -e, --expression <EXPR> | Filter expression (mutually exclusive with --ask) |
| -a, --ask <QUERY> | Natural-language query, translated via the Anthropic API |
| --yes | Skip confirmation when using --ask (for scripting) |
| --accept-low-confidence | Proceed even when translation confidence is below 50% |
| -o, --output <FILE> | Output file (default: stdout) |
| -v, --invert | Invert: keep records that do NOT match |
| -q, --quiet | Suppress progress bar and stats |
Examples
    # Rare variants
    vcfkit filter -e "INFO/AF < 0.01" input.vcf

    # High quality PASS variants
    vcfkit filter -e "QUAL > 30 && FILTER == 'PASS'" input.vcf

    # Substring match (contains)
    vcfkit filter -e "INFO/CSQ ~ 'missense'" input.vcf

    # Non-PASS variants (inverted filter)
    vcfkit filter -e "FILTER == 'PASS'" --invert input.vcf

    # Chromosome + position range
    vcfkit filter -e "CHROM == 'chr17' && POS >= 43044295 && POS <= 43125483" input.vcf

    # Compound expression
    vcfkit filter -e "INFO/AF < 0.05 && QUAL >= 50 && FILTER == 'PASS'" input.vcf > output.vcf

    # From stdin
    bcftools view input.bcf | vcfkit filter -e "INFO/DP > 10"
Expression language
Fields
| Field | Type | Notes |
|---|---|---|
| INFO/<key> | Per-header type | e.g., INFO/AF, INFO/DP, INFO/CSQ |
| FORMAT/<key> | Per-header type | First sample only |
| CHROM | String | e.g., 'chr1' |
| POS | Integer | 1-based |
| QUAL | Float | Missing (.) evaluates to false |
| FILTER | String | e.g., 'PASS' |
Operators
| Operator | Meaning |
|---|---|
| <, <=, >, >=, ==, != | Comparison |
| &&, \|\|, ! | Logical |
| ~ | Substring match (contains) |
| !~ | Substring non-match |
Literals
    42      # integer
    3.14    # float
    'chr1'  # string (single quotes)
Type coercion
Fields declared as Type=Float in the VCF header are parsed as f64 for numeric
comparisons, Type=Integer fields as i64, and Type=String fields as strings.
FILTER is also treated as a string.
A missing value (.) evaluates to false in every comparison.
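The coercion and missing-value rules above can be illustrated with a small Python sketch. This is not vcfkit's implementation (which uses f64 and i64); the names `PARSERS`, `OPS`, and `compare` are hypothetical and exist only for this example.

```python
import operator

# Hypothetical mapping from a VCF header Type to a Python parser.
PARSERS = {"Float": float, "Integer": int, "String": str}

# Comparison operators from the table in the Operators section.
OPS = {
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "==": operator.eq, "!=": operator.ne,
}

def compare(raw, header_type, op, literal):
    """Coerce a raw field value by its header type, then compare.

    A missing value (".") makes the comparison false for every operator,
    including "!=".
    """
    if raw == ".":
        return False
    value = PARSERS[header_type](raw)
    return OPS[op](value, literal)
```

For example, `compare("0.005", "Float", "<", 0.01)` is true, while `compare(".", "Float", "<", 0.01)` and `compare(".", "Float", "!=", 0.01)` are both false.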
Multi-allelic INFO fields
INFO fields with Number=A (one value per ALT allele) use any-element semantics:
INFO/AF < 0.01 matches if any ALT allele has AF < 0.01.
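Any-element semantics can be sketched in a few lines of Python. The helper `any_allele_matches` is a hypothetical name, and skipping per-allele missing values (`.`) is an assumption made here to stay consistent with the rule that missing values never match.

```python
def any_allele_matches(raw_value, predicate):
    """Return True if any comma-separated per-allele value satisfies predicate.

    Assumption: a per-allele "." is skipped, so it can never cause a match.
    """
    values = [v for v in raw_value.split(",") if v != "."]
    return any(predicate(float(v)) for v in values)

# The record matches as soon as one ALT allele is rare enough.
rare = lambda af: af < 0.01
print(any_allele_matches("0.05,0.003", rare))  # True
print(any_allele_matches("0.05,0.12", rare))   # False
```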
    INFO/AF=0.05,0.003 → INFO/AF < 0.01 matches (0.003 < 0.01)
    INFO/AF=0.05,0.12 → INFO/AF < 0.01 does not match
bcftools equivalence
    # vcfkit
    vcfkit filter -e "INFO/AF < 0.01 && FILTER == 'PASS'" input.vcf

    # bcftools
    bcftools view -i 'INFO/AF < 0.01 && FILTER == "PASS"' input.vcf
The expression syntax is similar. The key difference is quoting: vcfkit uses single quotes for string literals, while bcftools uses double quotes.
Natural-language filter (--ask)
Translate plain English into a filter expression via Anthropic’s Claude API:
    export ANTHROPIC_API_KEY=sk-ant-...

    # Interactive: shows the expression for review before running
    vcfkit filter --ask "rare PASS variants on chromosome 17" input.vcf

    # Non-interactive: skip confirmation
    vcfkit filter -a "rare PASS variants" --yes input.vcf

    # Override low-confidence gate
    vcfkit filter -a "missense variants in BRCA1" --yes --accept-low-confidence input.vcf
The LLM sees only:
- Your query text
- The VCF header schema (INFO/FORMAT field names, types, descriptions, contig names)
Variant data (CHROM, POS, REF, ALT, genotypes) never leaves your machine.
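To make the header-only boundary concrete, here is a rough Python sketch that collects a schema from the `##` meta lines and stops before the first variant record. It only illustrates the idea: vcfkit's own header parsing and request format are not shown in this document, and `header_schema` and the shape of the returned dictionary are hypothetical.

```python
import re

ID = re.compile(r'ID=([^,>]+)')
TYPE = re.compile(r'Type=([^,>]+)')
DESC = re.compile(r'Description="([^"]*)"')

def header_schema(lines):
    """Collect field names, types, descriptions, and contig names.

    Reading stops at the first line that does not start with "##",
    so no variant record is ever inspected.
    """
    schema = {"fields": [], "contigs": []}
    for line in lines:
        if not line.startswith("##"):
            break  # "#CHROM ..." column line: the meta header is over
        line = line.rstrip("\n")
        if line.startswith(("##INFO=<", "##FORMAT=<")):
            kind = line[2:line.index("=")]  # "INFO" or "FORMAT"
            desc = DESC.search(line)
            schema["fields"].append({
                "name": f"{kind}/{ID.search(line).group(1)}",
                "type": TYPE.search(line).group(1),
                "description": desc.group(1) if desc else "",
            })
        elif line.startswith("##contig=<"):
            schema["contigs"].append(ID.search(line).group(1))
    return schema
```

Because the loop breaks at the column line, positions, alleles, and genotypes from the record lines cannot end up in the schema.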
The translated expression is validated by vcfkit’s deterministic parser before it runs.
When confidence is below 50%, --yes is blocked; add --accept-low-confidence to proceed.
--ask requires ANTHROPIC_API_KEY and an input file path. Stdin is not supported, because
the header must be readable without consuming the variant stream.
--ask is not available in the browser demo, because ANTHROPIC_API_KEY cannot safely be exposed client-side.
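The low-confidence gate for non-interactive runs reduces to a single condition, sketched below in Python. The function name `yes_allowed` and the representation of confidence as a fraction between 0 and 1 are assumptions for illustration. The sketch covers only the --yes case; this document does not describe how interactive runs treat low confidence.

```python
LOW_CONFIDENCE = 0.50  # the 50% threshold stated in the text

def yes_allowed(confidence, accept_low_confidence=False):
    """Whether a --yes (non-interactive) run may proceed.

    Below the threshold, the run is blocked unless the caller
    explicitly opted in with --accept-low-confidence.
    """
    return confidence >= LOW_CONFIDENCE or accept_low_confidence
```

With this rule, a translation at 30% confidence is blocked under --yes alone and proceeds only when --accept-low-confidence is also given.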
Performance
On 1000 Genomes chr22 (1.1M records), vcfkit takes 422 ms versus 1,695 ms for bcftools (4.0× faster).
The fast path reads raw lines. For each line, it parses only the INFO fields referenced in the expression and skips all other fields. Matching records are written as raw bytes; non-matching records are discarded. The VCF header is parsed once with noodles to get INFO type metadata.
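The shape of this fast path can be sketched in Python. This is a simplified model, not vcfkit's code: `info_value` and `filter_lines` are hypothetical names, the predicate receives the raw string value, and passing header lines through unchanged is an assumption.

```python
def info_value(info_column, key):
    """Scan the raw INFO column for one key without building a full dict."""
    for entry in info_column.split(";"):
        name, _, value = entry.partition("=")
        if name == key:
            return value
    return None

def filter_lines(lines, key, predicate):
    """Yield raw lines whose INFO/<key> value satisfies predicate."""
    for line in lines:
        if line.startswith("#"):
            yield line  # assumption: header lines pass through unchanged
            continue
        # INFO is the 8th tab-separated column; do not split past it.
        columns = line.split("\t", 8)
        value = info_value(columns[7].rstrip("\n"), key)
        if value is not None and value != "." and predicate(value):
            yield line  # the original line, never re-serialized
```

A call such as `filter_lines(open("input.vcf"), "AF", lambda v: float(v) < 0.01)` would then stream matching lines without touching any other INFO key or sample column. Multi-valued fields would need the any-element handling described under "Multi-allelic INFO fields".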