Filtering strategies
====================

Filtering is essential to maximize Mendelian Precision while retaining biologically relevant CNVs. MCNV2 provides flexible filtering options across multiple dimensions.

Overview
--------

**Goal of filtering:**

* Increase Mendelian Precision by removing false positives
* Balance precision and sensitivity

.. admonition:: **Two-stage approach:**

   1. **MP Exploration** — Apply broad filters to assess baseline quality
   2. **Fine-tuning** — Refine filters systematically to identify optimal thresholds

**Key principles:**

* Calculate MP separately for deletions and duplications
* Stratify by CNV size (small vs large CNVs have different quality profiles)
* Consider quality metrics
* Consider gene constraint when evaluating MP

Filter categories
-----------------

Size-based filters
~~~~~~~~~~~~~~~~~~

**CNV size (bp):**

* **Minimum size:** Exclude very small CNVs (e.g., <10 kb)
* **Maximum size:** Optionally exclude very large CNVs (rare, may be artifacts)

**Rationale:**

Small CNVs (<30 kb) typically have lower Mendelian Precision due to:

* Lower signal-to-noise ratio
* Difficulty distinguishing from technical noise
* Higher breakpoint uncertainty
* Example: Analyse CNV from ≥30 kb

.. important::

   **Do not apply the same quality score threshold to all CNV sizes.** Small CNVs 
   require more stringent filtering than large CNVs. Always examine MP stratified 
   by size before setting filters.

Quality score filters
~~~~~~~~~~~~~~~~~~~~~

**Available metrics:**

* **Score** — Caller-specific quality score (higher = more confident)
* **SNP** — Number of SNP probes supporting the CNV (array data)
* **%overlap** — Reciprocal overlap percentage between CNVs from different algorithms
* **NbreAlgos** — Number of algorithms that detected the CNV (1, 2, 3, etc.)

**Rationale:**
Higher quality scores correlate with higher Mendelian Precision. The relationship is often non-linear, with MP plateauing at a certain threshold.

**Typical strategy:**

1. Plot MP versus quality score threshold
2. Observe the trade-off between quality and quantity
3. Select the quality score that meets your requirements

**Example:**

.. code-block:: text

   Deletions 50-100kb:
   
   Score ≥10  → MP = 60% (n = 800)
   Score ≥50  → MP = 75% (n = 500)
   Score ≥100 → MP = 85% (n = 300)
   Score ≥150 → MP = 91% (n = 150) 
   Score ≥200 → MP = 92% (n = 145)  
   
   Optimal threshold: 150 (trade-off between MP (quality) and CNV count (quantity))

**Caller concordance (if available):**

When merging CNV callsets from multiple algorithms, two metrics quantify caller agreement:

**1. NbreAlgos** — Number of algorithms detecting the CNV

* NbreAlgos = 1 → Detected by single algorithm only
* NbreAlgos = 2 → Detected by 2 algorithms
* NbreAlgos = 3 → Detected by 3 algorithms

**2. %overlap** — Reciprocal overlap percentage between detections

* %overlap = 0% → No overlap (NbreAlgos = 1)
* %overlap = 50% → 50% reciprocal overlap between algorithm calls
* %overlap = 100% → Perfect overlap between algorithm calls

**Relationship:**

* If **NbreAlgos = 1** → **%overlap = 0%** (no concordance possible)
* If **NbreAlgos ≥ 2** → **%overlap** can range from 0.1% to 100%

**Filtering strategy:**

Apply a minimum **%overlap** threshold to require caller concordance:

.. code-block:: text

   Filter                Meaning                              
   %overlap ≥ 0%        All CNVs (including single-caller)   
   %overlap ≥ 50%       ≥2 algos with ≥50% overlap           
   %overlap ≥ 70%       ≥2 algos with ≥70% overlap           
   %overlap ≥ 90%       ≥2 algos with ≥90% overlap           

**Note:** Filtering by **%overlap ≥ 50%** implicitly requires **NbreAlgos ≥ 2** 
(since single-caller CNVs have 0% overlap).

**Typical strategy:**

* **Balanced approach:** %overlap ≥ 50% (moderate concordance, retains more CNVs)
* **High precision:** %overlap ≥ 70% or ≥ 80% (strong concordance required)

**Rationale:**

CNVs detected independently by multiple algorithms with high reciprocal overlap 
are more likely to be genuine. Each caller has different sensitivities to 
artifacts, so concordance helps filter out caller-specific false positives.

.. tip::

   When working with merged callsets, **caller concordance (%overlap ≥ 50%)** 
   is often an effective filter. Apply this before optimizing other quality scores.

Problematic region filters
~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Genomic regions prone to artifacts:**

* **Segmental duplications** — Highly similar sequences causing misalignment
* **Centromeres** — Repetitive, poorly mappable
* **Telomeres** — Repetitive, highly variable
* **HLA region** — Extreme polymorphism
* **Low mappability regions** — Reads cannot be uniquely placed

.. important::

   **Highly recommended:** Apply problematic region filters to all CNV datasets. 
   
   CNVs overlapping these regions have substantially lower Mendelian Precision 
   due to technical artifacts:
   
   * Read mismapping to paralogous sequences (segmental duplications)
   * Low coverage and poor mappability (centromeres, telomeres)
   * High genuine copy number variation (HLA region)

**Filter approach:**

* **Percent overlap threshold:** Exclude CNVs with >X% overlap with problematic regions
* **Binary filter:** Exclude any CNV overlapping problematic regions
* **Recommended strategy:** Apply 50% threshold (exclude CNVs with >50% overlap)

Transcript overlap filters
~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Percent transcript overlap:**

* Exclude CNVs with low genic content (e.g., <10% overlap with transcripts)

**Use case:**

* When prioritizing functional variants

Gene-based filters
~~~~~~~~~~~~~~~~~~

**Exclusion lists:**

Upload a file of Ensembl Gene IDs to exclude **from MP calculation only**.

**Purpose:**

Based on published studies, you may identify genes known to be highly constrained and enriched for genuine *de novo* CNVs. Excluding these genes from MP calculation helps assess **technical precision** separately from biological *de novo* events.

**Examples:**

* Severe neurodevelopmental genes (*MECP2*, *SCN1A*, *CDKL5*)
* Haploinsufficient genes from disease databases (ClinGen, DDG2P)
* Genes with extreme constraint (pLI > 0.99)

**File format:**

Plain text file with one Ensembl Gene ID per line (no header):

.. code-block:: text

   ENSG00000169057
   ENSG00000198712
   ENSG00000130164

.. important::

   **Exclusion is for MP assessment only**
   
   These CNVs must be **retained** in your final dataset for downstream analyses 
   as they may represent pathogenic *de novo* variants.

**Workflow:**

1. Calculate MP (all CNVs) → e.g., 82%
2. Calculate MP (excluding gene list) → e.g., 90%
3. Difference (8%) estimates *de novo* contribution

**Gene constraint filters (LOEUF):**

Exclude CNVs affecting highly constrained genes (LOEUF < threshold, e.g., 0.6) **from MP calculation only**.

**Rationale:**

CNVs affecting constrained genes are enriched for genuine *de novo* events. These events reduce MP but are biologically valid. Excluding them from MP calculation allows you to:

1. **Assess technical precision** — MP without likely *de novo* events
2. **Estimate *de novo* rate** — Difference between filtered and unfiltered MP

.. important::

   **LOEUF filter for MP calculation only**
   
   Excluding CNVs in constrained genes (low LOEUF) helps distinguish:
   
   * **Technical MP** — Precision excluding likely *de novo* events
   * **Overall MP** — Precision including all non-inherited CNVs
   
   CNVs affecting constrained genes are enriched for genuine *de novo* events, 
   which reduce MP but are biologically real. Excluding them from MP calculation 
   provides a cleaner assessment of **technical false positive rate**.
   
   **Critical:** These CNVs must be **retained** in your final dataset for 
   downstream analyses (disease association, burden tests) as they may represent 
   pathogenic variants.

**Typical workflow:**

1. Calculate MP (all CNVs) → e.g., 85% (technical + biological)
2. Calculate MP (LOEUF ≥ 0.6) → e.g., 92% (technical only)
3. Difference (92% - 85% = 7%) estimates genuine *de novo* contribution

MP Exploration filters
----------------------

**Purpose:**

Assess baseline MP and identify broad filtering needs.

**Available filters:**

* CNV size (slider)
* Minimum % transcript overlap (slider)
* Maximum % problematic regions overlap (slider)
* Gene exclusion list (upload)
* LOEUF threshold (slider)

**Workflow:**

1. Load preprocessed file
2. Choose transmission type (CNV-level or Gene-level)
3. Apply filters via sliders
4. Observe impact on MP (via plots and summary cards)
5. Proceed to Fine-tuning for detailed optimization

**Output:**

* MP by size range or MP by quality score stratified by size range
* Filtered table with Download option

Fine-tuning filters
-------------------

**Purpose:**

Systematically optimize MP.

**Available filters:**

* **CNV type:** DEL or DUP (analyzed separately)
* **Quality metrics:** Score, SNP, % overlap, nbre_algo
* **Operators:** ≥, ≤, = (combine multiple conditions)
* **Additional filters:** bp_overlap, LOEUF , t_Stop, t_Start

**Workflow:**

1. Apply filters
3. Compare "Before" vs "After" plots
4. Evaluate subset analyses (Genic only, Intergenic only, etc.)
5. Iterate until optimal threshold identified

**Output:**

* 4 comparative plots:
  * Before additional filters
  * After additional filters
  * After + Genic CNVs only
  * After + Intergenic CNVs only
* Downloadable tables for each plot

**Subset analyses:**

* **Genic CNVs only** — CNVs overlapping at least one gene
* **Intergenic CNVs only** — CNVs in non-genic regions
* **No excluded genes** — Exclude CNVs in user-provided gene list
* **No constrained genes (LOEUF < 1)** — Exclude CNVs in constrained genes

**Use cases:**

* Compare genic vs intergenic MP
* Assess impact of gene exclusion lists
* Evaluate technical MP (excluding constrained genes)

Optimization strategies
-----------------------

Plateau-based optimization
~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Approach:**

1. Plot MP versus quality score threshold
2. Identify plateau (where MP stops increasing)
3. Use lowest threshold at plateau

**Example:**

.. code-block:: text

   Threshold → MP      → CNVs retained
   ≥50       → 75%     → 1000
   ≥100      → 85%     → 600
   ≥150      → 92%     → 300  ← Plateau starts
   ≥200      → 92%     → 280  ← No further MP gain
   
   Optimal threshold: 150

**Rationale:**

Beyond the plateau, additional filtering removes genuine CNVs without improving MP.

Size-specific optimization
~~~~~~~~~~~~~~~~~~~~~~~~~~

**Approach:**

Optimize filters separately for each size range. Different size categories require different quality score thresholds.

.. important::

   **Critical:** Do not apply the same quality threshold to all CNV sizes. 
   This is a common mistake that leads to either:
   
   * Over-filtering large CNVs (losing genuine events)
   * Under-filtering small CNVs (retaining false positives)

**Example:**

.. code-block:: text

   Size range      Optimal Score threshold    MP after filtering
   1-30kb          ≥200                       85%
   30-50kb         ≥150                       90%
   50-100kb        ≥100                       92%
   100-200kb       ≥50                        94%
   >200kb          ≥20                        95%

**Rationale:**

Small CNVs have lower signal-to-noise ratio and require more stringent filtering to achieve comparable MP. Large CNVs are inherently more reliable and can pass with lower quality scores.

**Workflow:**

1. Stratify MP by size range
2. For each size range, plot MP vs quality threshold
3. Identify plateau for each size range
4. Apply size-specific thresholds

Type-specific optimization
~~~~~~~~~~~~~~~~~~~~~~~~~~

**Approach:**

Optimize filters separately for deletions (DEL) and duplications (DUP).

.. tip::

   Always analyze DEL and DUP separately. They have different baseline MP values 
   and respond differently to filtering.

**Typical patterns:**

* **Deletions:** Higher baseline MP (signal easier to detect)
* **Duplications:** Lower baseline MP (signal harder to detect, more ambiguous)

**Implications for filtering:**

* Deletions may achieve high MP (≥90%) with moderate filtering
* Duplications may require more aggressive filtering to reach similar MP

**Example:**

.. code-block:: text

   CNV type    Size range    Optimal threshold    MP after filtering
   DEL         50-100kb      Score ≥100           92%
   DUP         50-100kb      Score ≥150           88%

**Rationale:**

Duplications are technically harder to detect than deletions. CNV callers typically have:

* Higher sensitivity for deletions (easier to detect copy number loss)
* Lower sensitivity for duplications (copy number gain harder to distinguish from noise)

This difference in detection difficulty translates to different optimal filtering thresholds.

Gene constraint consideration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Approach:**

When evaluating filtering strategies, separate technical precision from biological *de novo* rate using LOEUF.

**Workflow:**

1. Apply your candidate filter strategy
2. Calculate MP (all CNVs)
3. Calculate MP (excluding LOEUF < 0.6)
4. If the difference is large (>10%), many non-inherited CNVs may be genuine *de novo* events

**Example:**

.. code-block:: text

   Filter: Score ≥100, Size ≥30kb
   
   MP (all CNVs) = 85%
   MP (LOEUF ≥ 0.6) = 92%
   
   Interpretation:
   - Technical precision: 92% (good)
   - Estimated *de novo* rate: 7% (reasonable)
   - Filter strategy is appropriate

vs.

.. code-block:: text

   Filter: Score ≥50, Size ≥10kb
   
   MP (all CNVs) = 70%
   MP (LOEUF ≥ 0.6) = 72%
   
   Interpretation:
   - Technical precision: 72% (poor)
   - Estimated *de novo* rate: 2%
   - High false positive rate → More aggressive filtering needed

**Rationale:**

This approach helps you distinguish:

* Low MP due to technical false positives (requires filtering)
* Low MP due to genuine *de novo* events (biologically expected)

Balancing precision and sensitivity
------------------------------------

**Trade-off:**

* **More aggressive filtering** → Higher MP, fewer CNVs
* **Lenient filtering** → Lower MP, more CNVs

**Considerations:**

* **For discovery studies:** Prioritize sensitivity (lenient filtering, MP ≥70%)
* **For clinical validation:** Prioritize precision (aggressive filtering, MP ≥90%)
* **For method comparison:** Use consistent filters across methods

**Recommended approach:**

1. Start with minimal filters to assess baseline quality
2. Identify optimal thresholds via Fine-tuning 
3. Apply filters that achieve MP ≥85% while retaining sufficient CNVs
4. For clinical applications, target MP ≥90%

.. admonition:: Best practices

  1. **Always stratify by size** before filtering — Do not apply uniform thresholds
  2. **Always calculate MP separately for DEL and DUP** — They have different quality profiles
  3. **Apply problematic region filters** — Highly recommended for all datasets
  4. **Consider gene constraint (LOEUF)** — Distinguish technical FP from biological *de novo*
  5.  **Caller concordance (if available):** Require ≥2 algorithms with ≥50% reciprocal overlap 
  6. **Balance quality and quantity** — Visualize the trade-off between MP (quality) and CNV count (quantity) to make informed filtering decisions based on your study goals

See also
--------

* :doc:`mendelian_precision` — Understanding the MP metric
* :doc:`preprocessing` — Pre-filtering annotation and inheritance calculation
* :doc:`outputs` — Visualizing filtering impact and downloading filtered tables