filter

Keep variants matching an expression. The fast path reads raw VCF lines and only parses the fields referenced by the expression — matching records are written as raw bytes without re-serialization.

vcfkit filter [OPTIONS] --expression <EXPR> [INPUT]
Flag                      Description
-e, --expression <EXPR>   Filter expression (mutually exclusive with --ask)
-a, --ask <QUERY>         Natural-language query, translated via the Anthropic API
--yes                     Skip confirmation when using --ask (for scripting)
--accept-low-confidence   Proceed even when translation confidence is below 50%
-o, --output <FILE>       Output file (default: stdout)
-v, --invert              Invert: keep records that do NOT match
-q, --quiet               Suppress progress bar and stats
# Rare variants
vcfkit filter -e "INFO/AF < 0.01" input.vcf
# High quality PASS variants
vcfkit filter -e "QUAL > 30 && FILTER == 'PASS'" input.vcf
# Substring match (contains)
vcfkit filter -e "INFO/CSQ ~ 'missense'" input.vcf
# Non-PASS variants (inverted filter)
vcfkit filter -e "FILTER == 'PASS'" --invert input.vcf
# Chromosome + position range
vcfkit filter -e "CHROM == 'chr17' && POS >= 43044295 && POS <= 43125483" input.vcf
# Compound expression
vcfkit filter -e "INFO/AF < 0.05 && QUAL >= 50 && FILTER == 'PASS'" input.vcf > output.vcf
# From stdin
bcftools view input.bcf | vcfkit filter -e "INFO/DP > 10"
Field          Type              Notes
INFO/<key>     Per-header type   e.g., INFO/AF, INFO/DP, INFO/CSQ
FORMAT/<key>   Per-header type   First sample only
CHROM          String            e.g., 'chr1'
POS            Integer           1-based
QUAL           Float             Missing (.) evaluates to false
FILTER         String            e.g., 'PASS'
Operator                 Meaning
<, <=, >, >=, ==, !=     Comparison
&&, ||, !                Logical
~                        Substring match (contains)
!~                       Substring non-match
42 # integer
3.14 # float
'chr1' # string (single quotes)

Fields declared as Type=Float in the VCF header are parsed as f64 for numeric comparisons, and Type=Integer fields as i64. Type=String fields and the FILTER column are compared as strings. A missing value (.) evaluates to false in all comparisons.
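The typing rule can be sketched in a few lines of Python. This is an illustration only, not vcfkit's implementation: `compare`, `OPS`, and `PARSERS` are hypothetical names, and Python's `float` and `int` stand in for f64 and i64.

```python
import operator

# Hypothetical lookup tables; only the typing and missing-value rules come from the text.
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "==": operator.eq, "!=": operator.ne}
PARSERS = {"Float": float, "Integer": int, "String": str}

def compare(raw, header_type, op, literal):
    """Compare one raw VCF value against a literal using the header-declared type."""
    if raw == ".":
        return False  # missing value: false for every operator, including !=
    value = PARSERS[header_type](raw)
    return OPS[op](value, literal)
```

Under this rule, `compare(".", "Float", "!=", 0.01)` is false even though the operator is `!=`, which is what "false in all comparisons" implies.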

INFO fields with Number=A (one value per ALT allele) use any-element semantics: INFO/AF < 0.01 matches if any ALT allele has AF < 0.01.

INFO/AF=0.05,0.003 → INFO/AF < 0.01 matches (0.003 < 0.01)
INFO/AF=0.05,0.12 → INFO/AF < 0.01 does not match
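
A minimal Python sketch of the any-element rule, reproducing the two cases above. The helper name `any_alt_matches` is hypothetical, and skipping a missing element (`.`) inside the list is an assumption based on the missing-value rule rather than documented behavior.

```python
def any_alt_matches(raw_value, predicate):
    """Return True if any comma-separated per-ALT value satisfies the predicate."""
    for item in raw_value.split(","):
        if item == ".":
            continue  # assumption: a missing element never matches
        if predicate(float(item)):
            return True
    return False

print(any_alt_matches("0.05,0.003", lambda af: af < 0.01))  # True
print(any_alt_matches("0.05,0.12", lambda af: af < 0.01))   # False
```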
# vcfkit
vcfkit filter -e "INFO/AF < 0.01 && FILTER == 'PASS'" input.vcf
# bcftools
bcftools view -i 'INFO/AF < 0.01 && FILTER == "PASS"' input.vcf

The expression syntax is similar. The key difference is quoting: vcfkit uses single quotes for string literals, whereas bcftools uses double quotes.

Translate plain English into a filter expression via Anthropic’s Claude API:

export ANTHROPIC_API_KEY=sk-ant-...
# Interactive — shows expression for review before running
vcfkit filter --ask "rare PASS variants on chromosome 17" input.vcf
# Non-interactive — skip confirmation
vcfkit filter -a "rare PASS variants" --yes input.vcf
# Override low-confidence gate
vcfkit filter -a "missense variants in BRCA1" --yes --accept-low-confidence input.vcf

The LLM sees only:

  • Your query text
  • The VCF header schema (INFO/FORMAT field names, types, descriptions, contig names)

Variant data (CHROM, POS, REF, ALT, genotypes) never leaves your machine.
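
To make the "header only" boundary concrete, here is a rough Python sketch of collecting that schema from a VCF file. It is not vcfkit's code: `header_schema` and the shape of the returned dictionary are hypothetical. It stops at the `#CHROM` line, so no variant record is ever read.

```python
import re

def header_schema(lines):
    """Collect INFO/FORMAT definitions and contig names, stopping at the #CHROM line."""
    schema = {"INFO": [], "FORMAT": [], "contigs": []}
    for line in lines:
        if line.startswith("#CHROM"):
            break  # nothing after the column header line is read
        m = re.match(r"##(INFO|FORMAT)=<(.*)>\s*$", line)
        if m:
            kind, body = m.groups()
            field = {}
            for key in ("ID", "Type"):
                km = re.search(rf"(?:^|,){key}=([^,]+)", body)
                field[key] = km.group(1) if km else None
            dm = re.search(r'Description="([^"]*)"', body)
            field["Description"] = dm.group(1) if dm else ""
            schema[kind].append(field)
            continue
        m = re.match(r"##contig=<ID=([^,>]+)", line)
        if m:
            schema["contigs"].append(m.group(1))
    return schema
```

Calling `header_schema(open("input.vcf"))` on an uncompressed file would return field names, types, descriptions, and contig names, and nothing derived from the records themselves.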

The translated expression is validated by vcfkit’s deterministic parser before it runs. When confidence is below 50%, --yes is blocked — add --accept-low-confidence to proceed.
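
The interaction between the flags can be sketched as a small decision function. This is an assumption-laden illustration: `may_run` is a hypothetical helper, confidence is assumed to be a fraction between 0 and 1, and the behavior of the interactive prompt at low confidence is not documented here, so the sketch simply defers to the user's answer.

```python
def may_run(confidence, yes, accept_low_confidence, user_confirmed=False):
    """Decide whether a translated expression may run (hypothetical helper)."""
    if not yes:
        # Interactive mode: the user reviews the expression and decides.
        return user_confirmed
    if confidence < 0.5 and not accept_low_confidence:
        return False  # --yes alone is blocked below 50% confidence
    return True
```

For example, `may_run(0.4, yes=True, accept_low_confidence=False)` is false, while adding `accept_low_confidence=True` allows the run.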

--ask requires an input file path; stdin is not supported (the header must be readable without consuming the variant stream). It also requires the ANTHROPIC_API_KEY environment variable to be set.

--ask is not available in the browser demo, because ANTHROPIC_API_KEY cannot safely be exposed client-side.

On 1000 Genomes chr22 (1.1M records), vcfkit filter takes 422 ms versus 1,695 ms for bcftools (4.0× faster).

The fast path reads raw lines. For each line, it parses only the INFO fields referenced in the expression and skips all other fields. Matching records are written as raw bytes; non-matching records are discarded. The VCF header is parsed once with noodles to obtain INFO type metadata.
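
The idea can be sketched in Python for a single scalar numeric INFO key. This is not vcfkit's implementation (which parses the header with noodles): `info_value` and `filter_lines` are hypothetical names, the key is treated as a float without consulting the header, and copying header lines through to the output is an assumption.

```python
import io

def info_value(line, key):
    """Return the raw value of one INFO key from a raw VCF line, or None if absent."""
    info = line.rstrip(b"\n").split(b"\t", 8)[7]  # INFO is the 8th tab-separated column
    for entry in info.split(b";"):
        name, _, value = entry.partition(b"=")
        if name == key:
            return value
    return None

def filter_lines(src, dst, key, predicate):
    """Copy header lines and matching records from src to dst unchanged."""
    for line in src:
        if line.startswith(b"#"):
            dst.write(line)  # assumption: the header passes through untouched
            continue
        raw = info_value(line, key)
        if raw is not None and raw != b"." and predicate(float(raw.decode("ascii"))):
            dst.write(line)  # original bytes, never re-serialized
```

Only the INFO column is split, and only until the requested key is found; every other column stays an opaque run of bytes, which is where the saving over full record parsing comes from.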