Test Data
=========

MCNV2 provides simulated test datasets to help users get started quickly and to
satisfy the reproducibility requirements of peer-reviewed publication.

.. note::

   The test datasets are available on Zenodo:
   `https://zenodo.org/records/19860597 <https://zenodo.org/records/19860597>`_


Download
--------

.. code-block:: bash

   # Download test files
   wget https://zenodo.org/records/19860597/files/sim_cnvs.tsv
   wget https://zenodo.org/records/19860597/files/sim_pedigree.tsv
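
On systems without ``wget``, the same two files can be fetched from within R.
The following is a minimal sketch using base R's ``download.file()``; the helper
name ``download_test_data`` and its ``dest_dir`` argument are illustrative and
are not part of MCNV2.

.. code-block:: r

   # Hypothetical helper (not part of MCNV2): fetch the test files with base R.
   zenodo_base <- "https://zenodo.org/records/19860597/files"
   test_files  <- c("sim_cnvs.tsv", "sim_pedigree.tsv")
   test_urls   <- paste(zenodo_base, test_files, sep = "/")

   download_test_data <- function(dest_dir = ".") {
     dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
     dest <- file.path(dest_dir, test_files)
     for (i in seq_along(test_urls)) {
       # Skip files that are already present to avoid repeated downloads
       if (!file.exists(dest[i])) {
         download.file(test_urls[i], destfile = dest[i], mode = "wb")
       }
     }
     invisible(dest)
   }

   # download_test_data()  # uncomment to download into the working directory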


Quick Start with Test Data
--------------------------

.. note::

   The problematic regions file (GRCh38) is bundled with the MCNV2 package and
   does not need to be downloaded separately. It is accessed via
   ``system.file("resources", "problematic_regions_GRCh38.bed", package = "MCNV2")``.

**Step 1 — Run the preprocessing pipeline (CLI)**

.. code-block:: r

   library(MCNV2)

   # Locate bedtools; Sys.which() returns "" when it is not on the PATH
   bedtools <- Sys.which("bedtools")
   stopifnot(nzchar(bedtools))

   # 1. Annotate simulated CNVs
   # The problematic regions file ships with the package (no download needed)
   annotate(
     cnvs_file         = "sim_cnvs.tsv",
     prob_regions_file = system.file("resources", "problematic_regions_GRCh38.bed",
                                     package = "MCNV2"),
     output_file       = "sim_cnvs_annotated.tsv",
     genome_version    = 38,
     bedtools_path     = bedtools
   )

   # 2. Compute inheritance
   compute_inheritance(
     cnvs_file     = "sim_cnvs_annotated.tsv",
     pedigree_file = "sim_pedigree.tsv",
     output_file   = "sim_cnvs_inheritance.tsv",
     overlap       = 0.5
   )

**Step 2 — Explore results in the Shiny app**

.. code-block:: r

   # Launch the app
   MCNV2::launch(
     bedtools_path = Sys.which("bedtools"),
     results_dir   = "~/mcnv2_results"
   )

In the **Preprocessing** tab:

- **CNV file** → upload ``sim_cnvs.tsv``
- **Pedigree file** → upload ``sim_pedigree.tsv``
- **Problematic regions** → bundled with the package at
  ``system.file("resources", "problematic_regions_GRCh38.bed", package = "MCNV2")``;
  the app loads it automatically, so no download is needed

In the **MP Exploration** tab:

- Upload ``sim_cnvs_inheritance.tsv`` as the preprocessed input file
- Select a quality metric and click **Submit**


Dataset Description
-------------------

The simulated dataset contains 199 complete parent–offspring trios (597 individuals
in total) and is modeled on a whole-genome sequencing cohort.

.. list-table::
   :header-rows: 1
   :widths: 40 60

   * - Property
     - Value
   * - Trios
     - 199
   * - Total individuals
     - 597 (199 children + 199 fathers + 199 mothers)
   * - Total CNVs
     - 67,608
   * - Deletions (DEL)
     - 44,447 (65.7%)
   * - Duplications (DUP)
     - 23,161 (34.3%)
   * - Median CNV size
     - 3,952 bp
   * - Mean CNV size
     - 15,184 bp
   * - Genome build
     - GRCh38/hg38
   * - Chromosomes
     - chr1–chr22

**File formats**

``sim_cnvs.tsv`` — Tab-delimited CNV file (input to MCNV2):

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Column
     - Description
   * - Chr
     - Chromosome (e.g., chr1)
   * - Start
     - CNV start position (integer)
   * - Stop
     - CNV end position (integer)
   * - Type
     - CNV type: DEL or DUP
   * - SampleID
     - Sample identifier (SIM_C_XXXX for children, SIM_F_XXXX / SIM_M_XXXX for parents)
   * - Score
     - Simulated quality score
   * - SNP
     - Simulated SNP count
   * - NbreAlgos
     - Number of supporting algorithms (1 or 2)
   * - algos_overlap
     - Algorithm concordance overlap (0–1)
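
The column layout above can be checked before running the pipeline. The sketch
below uses only base R; ``check_cnv_table`` is a hypothetical helper, not an
MCNV2 function, and the specific checks (allowed ``Type`` values, ``Stop`` not
smaller than ``Start``, ``NbreAlgos`` in 1 or 2, ``algos_overlap`` within 0–1)
are assumptions read off the table rather than rules enforced by the package.

.. code-block:: r

   # Hypothetical helper (not part of MCNV2): basic checks on a CNV table.
   expected_cols <- c("Chr", "Start", "Stop", "Type", "SampleID",
                      "Score", "SNP", "NbreAlgos", "algos_overlap")

   check_cnv_table <- function(cnvs) {
     missing_cols <- setdiff(expected_cols, names(cnvs))
     if (length(missing_cols) > 0) {
       # Stop early: the remaining checks need these columns
       return(paste("missing column(s):", paste(missing_cols, collapse = ", ")))
     }
     problems <- character(0)
     if (!all(cnvs$Type %in% c("DEL", "DUP"))) {
       problems <- c(problems, "Type contains values other than DEL/DUP")
     }
     if (any(cnvs$Stop < cnvs$Start, na.rm = TRUE)) {
       problems <- c(problems, "Stop is smaller than Start for some CNVs")
     }
     if (!all(cnvs$NbreAlgos %in% c(1, 2))) {
       problems <- c(problems, "NbreAlgos contains values other than 1 or 2")
     }
     if (any(cnvs$algos_overlap < 0 | cnvs$algos_overlap > 1, na.rm = TRUE)) {
       problems <- c(problems, "algos_overlap falls outside the 0-1 range")
     }
     problems  # an empty character vector means no problems were found
   }

   # Usage with the downloaded file:
   # cnvs <- read.delim("sim_cnvs.tsv", stringsAsFactors = FALSE)
   # check_cnv_table(cnvs)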

``sim_pedigree.tsv`` — Tab-delimited pedigree file (input to MCNV2):

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Column
     - Description
   * - ChildID
     - Child sample identifier
   * - FatherID
     - Father sample identifier
   * - MotherID
     - Mother sample identifier
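
A quick way to confirm that a pedigree file matches this layout is to count
trios and distinct individuals. The sketch below uses base R only;
``summarise_pedigree`` is a hypothetical helper and not part of MCNV2.

.. code-block:: r

   # Hypothetical helper (not part of MCNV2): count trios and individuals.
   summarise_pedigree <- function(ped) {
     id_cols <- c("ChildID", "FatherID", "MotherID")
     stopifnot(all(id_cols %in% names(ped)))
     # Collect every identifier as character, regardless of column type
     ids <- unlist(lapply(ped[id_cols], as.character), use.names = FALSE)
     list(
       n_trios       = nrow(ped),
       n_individuals = length(unique(ids)),
       complete      = !anyNA(ids) && all(nzchar(ids))
     )
   }

   # Usage with the downloaded file:
   # ped <- read.delim("sim_pedigree.tsv", stringsAsFactors = FALSE)
   # summarise_pedigree(ped)

For the test data, the dataset description gives 199 trios and 597 individuals,
so those are the values to expect in ``n_trios`` and ``n_individuals``.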


Simulation Methodology
----------------------

The test datasets were generated by empirical parametric simulation calibrated on
a real whole-genome sequencing callset from the SPARK cohort (1,103 trios). The
real data cannot be shared due to privacy constraints; however, the simulation
script is available on Zenodo
(`https://zenodo.org/records/19860597 <https://zenodo.org/records/19860597>`_)
and fully documents the methodology.

**Simulation approach**

All parameters were estimated directly from the real callset. No arbitrary values
were used.

.. list-table::
   :header-rows: 1
   :widths: 30 35 35

   * - Variable
     - Method
     - Calibration
   * - CNVs per individual
     - Negative Binomial
     - μ and size fitted from observed mean and variance
   * - CNV type (DEL/DUP)
     - Bernoulli
     - Empirical proportions
   * - CNV size
     - Empirical resampling
     - Direct resampling stratified by CNV type
   * - Chromosome assignment
     - Multinomial
     - Observed chromosome-specific frequencies
   * - Genomic position
     - Uniform
     - Within GRCh38 chromosome boundaries
   * - Number of algorithms
     - Bernoulli
     - Empirical proportions
   * - Score and SNP count
     - Empirical resampling
     - Conditional on size bin × NbreAlgos × inheritance status
   * - Algorithm overlap
     - Beta distribution
     - Parameters estimated separately for inherited and de novo events
   * - Inheritance probability
     - Stratified lookup table
     - By CNV type × size bin × NbreAlgos, with progressive fallback
   * - Parental CNVs
     - Derived
     - Inherited: ≥95% reciprocal overlap with child CNV; non-transmitted: independent simulation
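
To illustrate the first row of the table, the sketch below fits a negative
binomial from a mean and a variance by the method of moments, using the
relation Var = μ + μ²/size in R's ``mu``/``size`` parameterisation. Whether the
simulation script uses exactly this estimator is an assumption;
``fit_negbin_moments`` and the toy counts are illustrative and do not come from
the real callset.

.. code-block:: r

   # Method-of-moments fit: for NegBin(mu, size), Var = mu + mu^2 / size,
   # so size = mu^2 / (Var - mu). This requires overdispersion (Var > mu).
   fit_negbin_moments <- function(counts) {
     mu <- mean(counts)
     v  <- var(counts)
     if (v <= mu) stop("variance must exceed the mean to fit a negative binomial")
     list(mu = mu, size = mu^2 / (v - mu))
   }

   # Toy counts standing in for the real per-individual CNV counts
   toy_counts <- c(2, 4, 6, 8, 10)
   fit <- fit_negbin_moments(toy_counts)   # mean 6, variance 10, so size = 9

   # Draw one simulated CNV count per unit, with a fixed seed for repeatability
   set.seed(42)
   n_cnvs <- rnbinom(199, size = fit$size, mu = fit$mu)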

**Simulation hierarchy**

Simulation follows a hierarchical structure preserving the trio family design:

1. For each trio (child + father + mother), draw the number of CNVs, N ~ NegBin(μ, size)
2. For each CNV: simulate type, size, chromosome, position, quality metrics, and inheritance status
3. If inherited: generate a matching parental CNV with ±5% position jitter
4. Generate additional non-transmitted parental CNVs independently
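
Step 3 can be sketched as follows. The code assumes that the jitter shifts the
whole interval by at most 5% of the CNV length, which keeps the reciprocal
overlap with the child CNV at or above 95%; the actual script may apply the
jitter differently. ``reciprocal_overlap`` and ``jitter_cnv`` are hypothetical
helpers, not MCNV2 functions, and CNV length is taken as ``stop - start``.

.. code-block:: r

   # Reciprocal overlap: shared length divided by the longer of the two intervals.
   reciprocal_overlap <- function(start1, stop1, start2, stop2) {
     shared <- max(0, min(stop1, stop2) - max(start1, start2))
     shared / max(stop1 - start1, stop2 - start2)
   }

   # Assumption: jitter shifts the whole interval by at most max_frac of its length.
   jitter_cnv <- function(start, stop, max_frac = 0.05) {
     len    <- stop - start
     offset <- trunc(runif(1, -max_frac, max_frac) * len)  # truncate toward zero
     c(start = start + offset, stop = stop + offset)
   }

   # Toy child CNV and its matching parental CNV
   set.seed(42)
   child  <- c(start = 1000000, stop = 1004000)
   parent <- jitter_cnv(child[["start"]], child[["stop"]])
   ro <- reciprocal_overlap(child[["start"]], child[["stop"]],
                            parent[["start"]], parent[["stop"]])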

**Downstream annotation**

Simulated CNVs do not include gene annotations, LOEUF scores, or inheritance
labels. These are computed by the MCNV2 pipeline (``annotate()`` and
``compute_inheritance()``) using real GRCh38 genomic resources, so downstream
annotations are derived from real genomic data rather than from simulated values.

**Reproducibility**

The simulation script uses ``set.seed(42)`` for full reproducibility. The script
is provided as a methodological reference; re-running it requires access to the
original SPARK callset, which is not publicly available.

**Methods text for citation**

   *Test datasets were generated by empirical parametric simulation calibrated on
   a real whole-genome sequencing callset (SPARK cohort). The number of CNVs per
   individual was modeled using a negative binomial distribution fitted to the
   observed counts. CNV type, size, chromosomal assignment, and technical quality
   variables (Score, SNP count, number of supporting algorithms) were simulated
   by conditional empirical resampling stratified by CNV type, size bin, algorithm
   concordance, and inheritance status. Algorithm overlap for concordant calls was
   drawn from a Beta distribution with parameters estimated separately for
   inherited and de novo events. Inheritance probability was modeled as a
   stratified lookup table by CNV type, size bin, and algorithm concordance.
   Simulated CNVs were then processed through the full MCNV2 annotation pipeline,
   ensuring that gene annotations, LOEUF scores, and inheritance labels were
   derived from real genomic resources.*