Filtering strategies

Filtering is essential to maximize Mendelian Precision (MP) while retaining biologically relevant CNVs. MCNV2 provides flexible filtering options across multiple dimensions.

Overview

Goal of filtering:

  • Increase Mendelian Precision by removing false positives

  • Balance precision and sensitivity

Two-stage approach:

  1. MP Exploration — Apply broad filters to assess baseline quality

  2. Fine-tuning — Refine filters systematically to identify optimal thresholds

Key principles:

  • Calculate MP separately for deletions and duplications

  • Stratify by CNV size (small vs large CNVs have different quality profiles)

  • Consider quality metrics (Score, SNP, %overlap, NbreAlgos)

  • Consider gene constraint when evaluating MP

Filter categories

Size-based filters

CNV size (bp):

  • Minimum size: Exclude very small CNVs (e.g., <10 kb)

  • Maximum size: Optionally exclude very large CNVs (rare, may be artifacts)

Rationale:

Small CNVs (<30 kb) typically have lower Mendelian Precision due to:

  • Lower signal-to-noise ratio

  • Difficulty distinguishing from technical noise

  • Higher breakpoint uncertainty

Example: Restrict the analysis to CNVs ≥30 kb.

Important

Do not apply the same quality score threshold to all CNV sizes. Small CNVs require more stringent filtering than large CNVs. Always examine MP stratified by size before setting filters.
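
One way to examine MP stratified by size is sketched below in Python. It assumes that MP is the fraction of offspring CNVs flagged as inherited. The record fields `size_bp` and `inherited`, the bin edges, and the function names are illustrative and are not MCNV2's actual column names or API.

```python
from collections import defaultdict

# Hypothetical size bins in bp (lower bound inclusive, upper bound exclusive).
SIZE_BINS = [
    ("1-30kb", 1, 30_000),
    ("30-50kb", 30_000, 50_000),
    ("50-100kb", 50_000, 100_000),
    ("100-200kb", 100_000, 200_000),
    (">200kb", 200_000, float("inf")),
]

def size_bin(size_bp):
    """Return the label of the bin containing size_bp, or None if no bin matches."""
    for label, low, high in SIZE_BINS:
        if low <= size_bp < high:
            return label
    return None

def mp_by_size(cnvs):
    """Return {bin label: (MP, n)}, with MP assumed to be inherited / total."""
    counts = defaultdict(lambda: [0, 0])  # label -> [inherited, total]
    for cnv in cnvs:
        label = size_bin(cnv["size_bp"])
        if label is None:
            continue
        counts[label][1] += 1
        if cnv["inherited"]:
            counts[label][0] += 1
    return {label: (inh / total, total) for label, (inh, total) in counts.items()}

# Toy records, not real data.
cnvs = [
    {"size_bp": 12_000, "inherited": False},
    {"size_bp": 25_000, "inherited": True},
    {"size_bp": 80_000, "inherited": True},
    {"size_bp": 90_000, "inherited": True},
    {"size_bp": 250_000, "inherited": True},
]
result = mp_by_size(cnvs)
for label, (mp, n) in result.items():
    print(f"{label}: MP = {mp:.0%} (n = {n})")
```

Bins with no CNVs are simply absent from the result, which avoids reporting an MP computed from zero calls.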

Quality score filters

Available metrics:

  • Score — Caller-specific quality score (higher = more confident)

  • SNP — Number of SNP probes supporting the CNV (array data)

  • %overlap — Reciprocal overlap percentage between CNVs from different algorithms

  • NbreAlgos — Number of algorithms that detected the CNV (1, 2, 3, etc.)

Rationale: Higher quality scores correlate with higher Mendelian Precision. The relationship is often non-linear, with MP plateauing at a certain threshold.

Typical strategy:

  1. Plot MP versus quality score threshold

  2. Observe the trade-off between quality and quantity

  3. Select the quality score that meets your requirements

Example:

Deletions 50-100kb:

Score ≥10  → MP = 60% (n = 800)
Score ≥50  → MP = 75% (n = 500)
Score ≥100 → MP = 85% (n = 300)
Score ≥150 → MP = 91% (n = 150)
Score ≥200 → MP = 92% (n = 145)

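Optimal threshold: 150. Raising the threshold from 150 to 200 gains only one percentage point of MP (91% → 92%), whereas lowering it to 100 costs six points, so 150 is the best trade-off between MP (quality) and CNV count (quantity).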
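
A table like the one above can be produced with a simple threshold sweep. The Python sketch below assumes MP is the fraction of retained CNVs flagged as inherited. The field names `Score` and `inherited` and both function names are hypothetical, and the records are toy values.

```python
def mendelian_precision(cnvs):
    """MP assumed to be inherited / total; None when no CNV is retained."""
    if not cnvs:
        return None
    return sum(1 for c in cnvs if c["inherited"]) / len(cnvs)

def sweep_thresholds(cnvs, thresholds, key="Score"):
    """Return one (threshold, MP, n retained) row per candidate threshold."""
    rows = []
    for t in thresholds:
        kept = [c for c in cnvs if c[key] >= t]
        rows.append((t, mendelian_precision(kept), len(kept)))
    return rows

# Toy records, not real data.
cnvs = [
    {"Score": 20, "inherited": False},
    {"Score": 60, "inherited": True},
    {"Score": 60, "inherited": False},
    {"Score": 120, "inherited": True},
    {"Score": 180, "inherited": True},
]
rows = sweep_thresholds(cnvs, [10, 50, 100])
for threshold, mp, n in rows:
    print(f"Score >= {threshold}: MP = {mp:.0%} (n = {n})")
```

Returning the retained count next to MP keeps the quality versus quantity trade-off visible at every threshold.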

Caller concordance (if available):

When merging CNV callsets from multiple algorithms, two metrics quantify caller agreement:

1. NbreAlgos — Number of algorithms detecting the CNV

  • NbreAlgos = 1 → Detected by single algorithm only

  • NbreAlgos = 2 → Detected by 2 algorithms

  • NbreAlgos = 3 → Detected by 3 algorithms

2. %overlap — Reciprocal overlap percentage between detections

  • %overlap = 0% → No overlap (NbreAlgos = 1)

  • %overlap = 50% → 50% reciprocal overlap between algorithm calls

  • %overlap = 100% → Perfect overlap between algorithm calls

Relationship:

  • If NbreAlgos = 1 → %overlap = 0% (no concordance possible)

  • If NbreAlgos ≥ 2 → %overlap can range from 0.1% to 100%

Filtering strategy:

Apply a minimum %overlap threshold to require caller concordance:

Filter                Meaning
%overlap ≥ 0%        All CNVs (including single-caller)
%overlap ≥ 50%       ≥2 algos with ≥50% overlap
%overlap ≥ 70%       ≥2 algos with ≥70% overlap
%overlap ≥ 90%       ≥2 algos with ≥90% overlap

Note: Filtering by %overlap ≥ 50% implicitly requires NbreAlgos ≥ 2 (since single-caller CNVs have 0% overlap).
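
The sketch below illustrates the concordance filter in Python. It uses the common definition of reciprocal overlap (the smaller of the two overlap fractions), which is an assumption here: MCNV2's preprocessing may compute %overlap differently. The field name `pct_overlap` and the function names are hypothetical.

```python
def reciprocal_overlap_pct(start_a, end_a, start_b, end_b):
    """Reciprocal overlap (%) of two half-open intervals on the same chromosome.

    Defined here as the smaller of overlap/len(A) and overlap/len(B).
    """
    shared = min(end_a, end_b) - max(start_a, start_b)
    if shared <= 0:
        return 0.0
    return 100.0 * min(shared / (end_a - start_a), shared / (end_b - start_b))

def filter_by_concordance(cnvs, min_overlap_pct):
    """Keep CNVs whose %overlap meets the minimum threshold."""
    return [c for c in cnvs if c["pct_overlap"] >= min_overlap_pct]

# Toy calls: single-caller CNVs carry 0% overlap by construction.
calls = [
    {"id": "cnv1", "NbreAlgos": 1, "pct_overlap": 0.0},
    {"id": "cnv2", "NbreAlgos": 2,
     "pct_overlap": reciprocal_overlap_pct(100, 200, 150, 250)},
    {"id": "cnv3", "NbreAlgos": 2,
     "pct_overlap": reciprocal_overlap_pct(0, 100, 0, 400)},
]
kept = filter_by_concordance(calls, 50)
print([c["id"] for c in kept])
```

Because single-caller CNVs have 0% overlap, any positive threshold removes them without a separate NbreAlgos condition.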

Typical strategy:

  • Balanced approach: %overlap ≥ 50% (moderate concordance, retains more CNVs)

  • High precision: %overlap ≥ 70% or ≥ 80% (strong concordance required)

Rationale:

CNVs detected independently by multiple algorithms with high reciprocal overlap are more likely to be genuine. Each caller has different sensitivities to artifacts, so concordance helps filter out caller-specific false positives.

Tip

When working with merged callsets, caller concordance (%overlap ≥ 50%) is often an effective filter. Apply this before optimizing other quality scores.

Problematic region filters

Genomic regions prone to artifacts:

  • Segmental duplications — Highly similar sequences causing misalignment

  • Centromeres — Repetitive, poorly mappable

  • Telomeres — Repetitive, highly variable

  • HLA region — Extreme polymorphism

  • Low mappability regions — Reads cannot be uniquely placed

Important

Highly recommended: Apply problematic region filters to all CNV datasets.

CNVs overlapping these regions have substantially lower Mendelian Precision, for reasons that depend on the region type:

  • Read mismapping to paralogous sequences (segmental duplications)

  • Low coverage and poor mappability (centromeres, telomeres)

  • High genuine copy number variation (HLA region)

Filter approach:

  • Percent overlap threshold: Exclude CNVs with >X% overlap with problematic regions

  • Binary filter: Exclude any CNV overlapping problematic regions

  • Recommended strategy: Apply 50% threshold (exclude CNVs with >50% overlap)
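
The percent-overlap filter can be sketched as follows. This Python example assumes half-open coordinates, regions already restricted to the CNV's chromosome, and a CNV with end greater than start. The function names are illustrative and do not describe how MCNV2 computes the overlap internally.

```python
def pct_problematic(cnv_start, cnv_end, regions):
    """Percent of the CNV [cnv_start, cnv_end) covered by problematic regions.

    Regions may overlap each other; each covered base is counted once.
    """
    # Clip every intersecting region to the CNV, then walk them left to right.
    clipped = sorted(
        (max(start, cnv_start), min(end, cnv_end))
        for start, end in regions
        if start < cnv_end and end > cnv_start
    )
    covered = 0
    cursor = cnv_start  # rightmost position already counted
    for start, end in clipped:
        start = max(start, cursor)  # skip bases counted by an earlier region
        if end > start:
            covered += end - start
            cursor = end
    return 100.0 * covered / (cnv_end - cnv_start)

def keep_cnv(cnv_start, cnv_end, regions, max_pct=50.0):
    """Exclude CNVs with more than max_pct overlap (the 50% strategy above)."""
    return pct_problematic(cnv_start, cnv_end, regions) <= max_pct

# Toy regions: two overlapping segments plus one that extends past the CNV end.
regions = [(900, 1200), (1100, 1400), (1900, 2500)]
print(pct_problematic(1000, 2000, regions))
print(keep_cnv(1000, 2000, regions))
```

Merging before summing matters when annotation tracks overlap (for example, a segmental duplication inside a low-mappability region); otherwise the percentage could exceed 100.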

Transcript overlap filters

Percent transcript overlap:

  • Exclude CNVs with low genic content (e.g., <10% overlap with transcripts)

Use case:

  • When prioritizing functional variants

Gene-based filters

Exclusion lists:

Upload a file of Ensembl Gene IDs to exclude from MP calculation only.

Purpose:

Based on published studies, you may identify genes known to be highly constrained and enriched for genuine de novo CNVs. Excluding these genes from MP calculation helps assess technical precision separately from biological de novo events.

Examples:

  • Severe neurodevelopmental genes (MECP2, SCN1A, CDKL5)

  • Haploinsufficient genes from disease databases (ClinGen, DDG2P)

  • Genes with extreme constraint (pLI > 0.99)

File format:

Plain text file with one Ensembl Gene ID per line (no header):

ENSG00000169057
ENSG00000198712
ENSG00000130164

Important

Exclusion is for MP assessment only

These CNVs must be retained in your final dataset for downstream analyses as they may represent pathogenic de novo variants.

Workflow:

  1. Calculate MP (all CNVs) → e.g., 82%

  2. Calculate MP (excluding gene list) → e.g., 90%

  3. The difference (90% - 82% = 8 percentage points) estimates the de novo contribution
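
The workflow above can be sketched in Python as shown below. The example assumes MP is the fraction of CNVs flagged as inherited and that each CNV record carries a `genes` list of Ensembl Gene IDs. Those field names and the function names are hypothetical, and the records are toy values.

```python
def load_gene_ids(lines):
    """Parse an exclusion list: one Ensembl Gene ID per line, blank lines ignored."""
    return {line.strip() for line in lines if line.strip()}

def mendelian_precision(cnvs):
    """MP assumed to be inherited / total; None when the set is empty."""
    if not cnvs:
        return None
    return sum(1 for c in cnvs if c["inherited"]) / len(cnvs)

def mp_with_and_without(cnvs, excluded_genes):
    """Return (MP on all CNVs, MP after dropping CNVs that hit an excluded gene)."""
    kept = [c for c in cnvs if not (set(c["genes"]) & excluded_genes)]
    return mendelian_precision(cnvs), mendelian_precision(kept)

# In practice the lines would come from open("exclusion_list.txt").
excluded = load_gene_ids(["ENSG00000169057\n", "\n", "ENSG00000198712\n"])

# Toy records, not real data.
cnvs = [
    {"genes": ["ENSG00000169057"], "inherited": False},
    {"genes": [], "inherited": True},
    {"genes": ["ENSG00000130164"], "inherited": True},
    {"genes": [], "inherited": True},
    {"genes": [], "inherited": False},
]
mp_all, mp_filtered = mp_with_and_without(cnvs, excluded)
print(f"MP (all CNVs) = {mp_all:.0%}")
print(f"MP (excluding gene list) = {mp_filtered:.0%}")
```

The input list is never modified, so the excluded CNVs remain available for downstream analyses.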

Gene constraint filters (LOEUF):

Exclude CNVs affecting highly constrained genes (LOEUF < threshold, e.g., 0.6) from MP calculation only.

Rationale:

CNVs affecting constrained genes are enriched for genuine de novo events. These events reduce MP but are biologically valid. Excluding them from MP calculation allows you to:

  1. Assess technical precision — MP without likely de novo events

  2. Estimate de novo rate — Difference between filtered and unfiltered MP

Important

LOEUF filter for MP calculation only

Excluding CNVs in constrained genes (low LOEUF) helps distinguish:

  • Technical MP — Precision excluding likely de novo events

  • Overall MP — Precision including all non-inherited CNVs

CNVs affecting constrained genes are enriched for genuine de novo events, which reduce MP but are biologically real. Excluding them from MP calculation provides a cleaner assessment of technical false positive rate.

Critical: These CNVs must be retained in your final dataset for downstream analyses (disease association, burden tests) as they may represent pathogenic variants.

Typical workflow:

  1. Calculate MP (all CNVs) → e.g., 85% (technical + biological)

  2. Calculate MP (LOEUF ≥ 0.6) → e.g., 92% (technical only)

  3. The difference (92% - 85% = 7 percentage points) estimates the genuine de novo contribution
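
A minimal Python sketch of this comparison follows. It assumes each CNV record lists the LOEUF values of the genes it overlaps (an empty list for intergenic CNVs, which are treated as unconstrained) and that MP is the fraction of CNVs flagged as inherited. The field `loeuf_values` and the function names are hypothetical.

```python
def mendelian_precision(cnvs):
    """MP assumed to be inherited / total; None when the set is empty."""
    if not cnvs:
        return None
    return sum(1 for c in cnvs if c["inherited"]) / len(cnvs)

def is_constrained(cnv, loeuf_threshold=0.6):
    """True if any overlapped gene has LOEUF below the threshold."""
    return any(value < loeuf_threshold for value in cnv["loeuf_values"])

def mp_decomposition(cnvs, loeuf_threshold=0.6):
    """Return (overall MP, technical MP), the latter excluding constrained-gene CNVs."""
    unconstrained = [c for c in cnvs if not is_constrained(c, loeuf_threshold)]
    return mendelian_precision(cnvs), mendelian_precision(unconstrained)

# Toy records, not real data.
cnvs = [
    {"loeuf_values": [0.2], "inherited": False},
    {"loeuf_values": [0.9], "inherited": True},
    {"loeuf_values": [], "inherited": True},          # intergenic
    {"loeuf_values": [1.4, 0.7], "inherited": True},
    {"loeuf_values": [0.35, 1.1], "inherited": False},
    {"loeuf_values": [0.8], "inherited": False},
]
overall, technical = mp_decomposition(cnvs)
gap_points = (technical - overall) * 100
print(f"Overall MP = {overall:.0%}, technical MP = {technical:.0%}")
print(f"Estimated de novo contribution = {gap_points:.0f} percentage points")
```

A CNV counts as constrained as soon as one overlapped gene falls below the threshold; using the minimum LOEUF per CNV would be an equivalent formulation.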

MP Exploration filters

Purpose:

Assess baseline MP and identify broad filtering needs.

Available filters:

  • CNV size (slider)

  • Minimum % transcript overlap (slider)

  • Maximum % problematic regions overlap (slider)

  • Gene exclusion list (upload)

  • LOEUF threshold (slider)

Workflow:

  1. Load preprocessed file

  2. Choose transmission type (CNV-level or Gene-level)

  3. Apply filters via sliders

  4. Observe impact on MP (via plots and summary cards)

  5. Proceed to Fine-tuning for detailed optimization

Output:

  • MP by size range or MP by quality score stratified by size range

  • Filtered table with Download option

Fine-tuning filters

Purpose:

Systematically optimize MP.

Available filters:

  • CNV type: DEL or DUP (analyzed separately)

  • Quality metrics: Score, SNP, % overlap, nbre_algo

  • Operators: ≥, ≤, = (combine multiple conditions)

  • Additional filters: bp_overlap, LOEUF, t_Stop, t_Start
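
Combining several conditions can be pictured as in the Python sketch below. It assumes that multiple conditions are joined with a logical AND, which this guide does not state explicitly. The `Type` field, the condition tuples, and the function name are hypothetical and do not reflect MCNV2's internals.

```python
import operator

# The three operators offered by the interface, mapped to Python comparisons.
OPS = {">=": operator.ge, "<=": operator.le, "=": operator.eq}

def apply_conditions(cnvs, conditions):
    """Keep CNVs that satisfy every (column, operator, value) condition."""
    return [
        cnv for cnv in cnvs
        if all(OPS[op](cnv[column], value) for column, op, value in conditions)
    ]

# Toy records, not real data.
cnvs = [
    {"Type": "DEL", "Score": 120, "SNP": 25},
    {"Type": "DEL", "Score": 80, "SNP": 40},
    {"Type": "DUP", "Score": 200, "SNP": 60},
    {"Type": "DEL", "Score": 150, "SNP": 8},
]
conditions = [("Type", "=", "DEL"), ("Score", ">=", 100), ("SNP", ">=", 10)]
selected = apply_conditions(cnvs, conditions)
print(selected)
```

An empty condition list keeps every CNV, which corresponds to the "Before" state in the comparison plots.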

Workflow:

  1. Apply filters

  3. Compare “Before” vs “After” plots

  4. Evaluate subset analyses (Genic only, Intergenic only, etc.)

  5. Iterate until optimal threshold identified

Output:

  • 4 comparative plots:

      • Before additional filters

      • After additional filters

      • After + Genic CNVs only

      • After + Intergenic CNVs only

  • Downloadable tables for each plot

Subset analyses:

  • Genic CNVs only — CNVs overlapping at least one gene

  • Intergenic CNVs only — CNVs in non-genic regions

  • No excluded genes — Exclude CNVs in user-provided gene list

  • No constrained genes (LOEUF < 1) — Exclude CNVs in constrained genes

Use cases:

  • Compare genic vs intergenic MP

  • Assess impact of gene exclusion lists

  • Evaluate technical MP (excluding constrained genes)

Optimization strategies

Plateau-based optimization

Approach:

  1. Plot MP versus quality score threshold

  2. Identify plateau (where MP stops increasing)

  3. Use lowest threshold at plateau

Example:

Threshold → MP      → CNVs retained
≥50       → 75%     → 1000
≥100      → 85%     → 600
≥150      → 92%     → 300  ← Plateau starts
≥200      → 92%     → 280  ← No further MP gain

Optimal threshold: 150

Rationale:

Beyond the plateau, additional filtering removes genuine CNVs without improving MP.
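
One possible way to pick the plateau programmatically is sketched below in Python. The rule used (lowest threshold whose MP is within a tolerance of the best MP observed) and the default tolerance of one percentage point are assumptions for illustration, not MCNV2's method. The rows are assumed non-empty.

```python
def plateau_threshold(rows, tolerance=1.0):
    """Lowest threshold whose MP is within `tolerance` points of the best MP.

    rows: non-empty iterable of (threshold, mp_percent, n_retained).
    """
    rows = sorted(rows)  # ascending by threshold
    best_mp = max(mp for _, mp, _ in rows)
    for threshold, mp, _ in rows:
        if mp >= best_mp - tolerance:
            return threshold

# Values copied from the example table above.
table = [(50, 75, 1000), (100, 85, 600), (150, 92, 300), (200, 92, 280)]
print(plateau_threshold(table))  # 150
```

With noisy MP estimates a tolerance of zero tends to chase the highest threshold, so a small tolerance is usually the more practical choice.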

Size-specific optimization

Approach:

Optimize filters separately for each size range. Different size categories require different quality score thresholds.

Important

Critical: Do not apply the same quality threshold to all CNV sizes. This is a common mistake that leads to either:

  • Over-filtering large CNVs (losing genuine events)

  • Under-filtering small CNVs (retaining false positives)

Example:

Size range      Optimal Score threshold    MP after filtering
1-30kb          ≥200                       85%
30-50kb         ≥150                       90%
50-100kb        ≥100                       92%
100-200kb       ≥50                        94%
>200kb          ≥20                        95%

Rationale:

Small CNVs have lower signal-to-noise ratio and require more stringent filtering to achieve comparable MP. Large CNVs are inherently more reliable and can pass with lower quality scores.

Workflow:

  1. Stratify MP by size range

  2. For each size range, plot MP vs quality threshold

  3. Identify plateau for each size range

  4. Apply size-specific thresholds
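
Step 4 can be sketched as a lookup from CNV size to minimum Score. The Python example below reuses the illustrative thresholds from the table above. Treating each upper bound as exclusive (so a 30 kb CNV falls in the 30-50kb range) is an assumption, and the field and function names are hypothetical.

```python
# (exclusive upper size bound in bp, minimum Score), from the example table.
SIZE_THRESHOLDS = [
    (30_000, 200),        # 1-30kb
    (50_000, 150),        # 30-50kb
    (100_000, 100),       # 50-100kb
    (200_000, 50),        # 100-200kb
    (float("inf"), 20),   # >200kb
]

def min_score_for(size_bp):
    """Return the minimum Score required for a CNV of the given size."""
    for upper_bound, min_score in SIZE_THRESHOLDS:
        if size_bp < upper_bound:
            return min_score

def passes_size_specific(cnv):
    """True if the CNV's Score meets the threshold of its size range."""
    return cnv["Score"] >= min_score_for(cnv["size_bp"])

# Toy records: the same Score passes or fails depending on size.
small = {"size_bp": 20_000, "Score": 120}
large = {"size_bp": 150_000, "Score": 120}
print(passes_size_specific(small), passes_size_specific(large))
```

Keeping the thresholds in one table makes it easy to maintain separate tables for deletions and duplications.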

Type-specific optimization

Approach:

Optimize filters separately for deletions (DEL) and duplications (DUP).

Tip

Always analyze DEL and DUP separately. They have different baseline MP values and respond differently to filtering.

Typical patterns:

  • Deletions: Higher baseline MP (signal easier to detect)

  • Duplications: Lower baseline MP (signal harder to detect, more ambiguous)

Implications for filtering:

  • Deletions may achieve high MP (≥90%) with moderate filtering

  • Duplications may require more aggressive filtering to reach similar MP

Example:

CNV type    Size range    Optimal threshold    MP after filtering
DEL         50-100kb      Score ≥100           92%
DUP         50-100kb      Score ≥150           88%

Rationale:

Duplications are technically harder to detect than deletions. CNV callers typically have:

  • Higher sensitivity for deletions (easier to detect copy number loss)

  • Lower sensitivity for duplications (copy number gain harder to distinguish from noise)

This difference in detection difficulty translates to different optimal filtering thresholds.

Gene constraint consideration

Approach:

When evaluating filtering strategies, separate technical precision from biological de novo rate using LOEUF.

Workflow:

  1. Apply your candidate filter strategy

  2. Calculate MP (all CNVs)

  3. Calculate MP (excluding LOEUF < 0.6)

  4. If the difference is large (>10%), many non-inherited CNVs may be genuine de novo events

Example:

Filter: Score ≥100, Size ≥30kb

MP (all CNVs) = 85%
MP (LOEUF ≥ 0.6) = 92%

Interpretation:
- Technical precision: 92% (good)
- Estimated de novo rate: 7% (reasonable)
- Filter strategy is appropriate

vs.

Filter: Score ≥50, Size ≥10kb

MP (all CNVs) = 70%
MP (LOEUF ≥ 0.6) = 72%

Interpretation:
- Technical precision: 72% (poor)
- Estimated de novo rate: 2%
- High false positive rate → More aggressive filtering needed

Rationale:

This approach helps you distinguish:

  • Low MP due to technical false positives (requires filtering)

  • Low MP due to genuine de novo events (biologically expected)
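
The two example interpretations can be reproduced with a small helper. In the Python sketch below, the 85% target and the 10-point gap are taken from figures quoted elsewhere in this guide and are used as hypothetical defaults. The function name and the returned keys are illustrative.

```python
def summarize(mp_all_pct, mp_technical_pct, target_mp_pct=85.0, large_gap_pct=10.0):
    """Split MP into technical precision and an estimated de novo contribution."""
    gap = mp_technical_pct - mp_all_pct
    return {
        "technical_mp": mp_technical_pct,
        "de_novo_gap_points": gap,
        # Low technical MP points to false positives: filter more aggressively.
        "needs_stricter_filtering": mp_technical_pct < target_mp_pct,
        # A large gap suggests many non-inherited CNVs are genuine de novo events.
        "many_likely_de_novo": gap > large_gap_pct,
    }

# The two scenarios from the example above.
good = summarize(85.0, 92.0)
poor = summarize(70.0, 72.0)
print(good)
print(poor)
```

The helper reports both flags independently because a dataset can have low technical precision and a large de novo gap at the same time.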

Balancing precision and sensitivity

Trade-off:

  • More aggressive filtering → Higher MP, fewer CNVs

  • Lenient filtering → Lower MP, more CNVs

Considerations:

  • For discovery studies: Prioritize sensitivity (lenient filtering, MP ≥70%)

  • For clinical validation: Prioritize precision (aggressive filtering, MP ≥90%)

  • For method comparison: Use consistent filters across methods

Recommended approach:

  1. Start with minimal filters to assess baseline quality

  2. Identify optimal thresholds via Fine-tuning

  3. Apply filters that achieve MP ≥85% while retaining sufficient CNVs

  4. For clinical applications, target MP ≥90%

Best practices

  1. Always stratify by size before filtering — Do not apply uniform thresholds

  2. Always calculate MP separately for DEL and DUP — They have different quality profiles

  3. Apply problematic region filters — Highly recommended for all datasets

  4. Consider gene constraint (LOEUF) — Distinguish technical FP from biological de novo

  5. Caller concordance (if available): Require ≥2 algorithms with ≥50% reciprocal overlap

  6. Balance quality and quantity — Visualize the trade-off between MP (quality) and CNV count (quantity) to make informed filtering decisions based on your study goals

See also

  • Mendelian Precision — Understanding the MP metric

  • Preprocessing — Pre-filtering annotation and inheritance calculation

  • Outputs — Visualizing filtering impact and downloading filtered tables