Test Data
=========

MCNV2 provides simulated test datasets to help users get started quickly and to satisfy the reproducibility requirements of peer-reviewed publication.

.. note::

   The test datasets are available on Zenodo:
   `https://zenodo.org/records/19860597 <https://zenodo.org/records/19860597>`_

Download
--------

.. code-block:: bash

   # Download test files
   wget https://zenodo.org/records/19860597/files/sim_cnvs.tsv
   wget https://zenodo.org/records/19860597/files/sim_pedigree.tsv

Quick Start with Test Data
--------------------------

.. note::

   The problematic regions file (GRCh38) is bundled with the MCNV2 package and
   does not need to be downloaded separately. It is accessed via
   ``system.file("resources", "problematic_regions_GRCh38.bed", package = "MCNV2")``.

**Step 1: Run the preprocessing pipeline (CLI)**

.. code-block:: r

   library(MCNV2)

   # 1. Annotate simulated CNVs
   # The problematic regions file is included in the package; no download needed
   annotate(
     cnvs_file = "sim_cnvs.tsv",
     prob_regions_file = system.file("resources",
                                     "problematic_regions_GRCh38.bed",
                                     package = "MCNV2"),
     output_file = "sim_cnvs_annotated.tsv",
     genome_version = 38,
     bedtools_path = Sys.which("bedtools")
   )

   # 2. Compute inheritance
   compute_inheritance(
     cnvs_file = "sim_cnvs_annotated.tsv",
     pedigree_file = "sim_pedigree.tsv",
     output_file = "sim_cnvs_inheritance.tsv",
     overlap = 0.5
   )

**Step 2: Explore results in the Shiny app**

.. code-block:: r

   # Launch the app
   MCNV2::launch(
     bedtools_path = Sys.which("bedtools"),
     results_dir = "~/mcnv2_results"
   )

In the **Preprocessing** tab:

- **CNV file** → upload ``sim_cnvs.tsv``
- **Pedigree file** → upload ``sim_pedigree.tsv``
- **Problematic regions** → bundled with the package at
  ``system.file("resources", "problematic_regions_GRCh38.bed", package = "MCNV2")``;
  no download is needed because the app loads it automatically

In the **MP Exploration** tab:

- Upload ``sim_cnvs_inheritance.tsv`` as the preprocessed input file
- Select a quality metric and click **Submit**

Dataset Description
-------------------

The simulated dataset represents 199 complete parent–offspring trios (597 individuals total) from a whole-genome sequencing cohort.

.. list-table::
   :header-rows: 1
   :widths: 40 60

   * - Property
     - Value
   * - Trios
     - 199
   * - Total individuals
     - 597 (199 children + 199 fathers + 199 mothers)
   * - Total CNVs
     - 67,608
   * - Deletions (DEL)
     - 44,447 (65.7%)
   * - Duplications (DUP)
     - 23,161 (34.3%)
   * - Median CNV size
     - 3,952 bp
   * - Mean CNV size
     - 15,184 bp
   * - Genome build
     - GRCh38/hg38
   * - Chromosomes
     - chr1–chr22

**File formats**

``sim_cnvs.tsv``: tab-delimited CNV file (input to MCNV2):

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Column
     - Description
   * - Chr
     - Chromosome (e.g., chr1)
   * - Start
     - CNV start position (integer)
   * - Stop
     - CNV end position (integer)
   * - Type
     - CNV type: DEL or DUP
   * - SampleID
     - Sample identifier (SIM_C_XXXX for children, SIM_F_XXXX / SIM_M_XXXX for parents)
   * - Score
     - Simulated quality score
   * - SNP
     - Simulated SNP count
   * - NbreAlgos
     - Number of supporting algorithms (1 or 2)
   * - algos_overlap
     - Algorithm concordance overlap (0–1)

``sim_pedigree.tsv``: tab-delimited pedigree file (input to MCNV2):

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Column
     - Description
   * - ChildID
     - Child sample identifier
   * - FatherID
     - Father sample identifier
   * - MotherID
     - Mother sample identifier

Simulation Methodology
----------------------

The test datasets were generated by empirical parametric simulation calibrated on a real whole-genome sequencing callset from the SPARK cohort (1,103 trios). The real data cannot be shared due to privacy constraints; however, the simulation script is available on Zenodo (`https://zenodo.org/records/19860597 <https://zenodo.org/records/19860597>`_) and fully documents the methodology.

**Simulation approach**

All parameters were estimated directly from the real callset. No arbitrary values were used.

.. list-table::
   :header-rows: 1
   :widths: 30 35 35

   * - Variable
     - Method
     - Calibration
   * - CNVs per individual
     - Negative Binomial
     - μ and size fitted from observed mean and variance
   * - CNV type (DEL/DUP)
     - Bernoulli
     - Empirical proportions
   * - CNV size
     - Empirical resampling
     - Direct resampling stratified by CNV type
   * - Chromosome assignment
     - Multinomial
     - Observed chromosome-specific frequencies
   * - Genomic position
     - Uniform
     - Within GRCh38 chromosome boundaries
   * - Number of algorithms
     - Bernoulli
     - Empirical proportions
   * - Score and SNP count
     - Empirical resampling
     - Conditional on size bin × NbreAlgos × inheritance status
   * - Algorithm overlap
     - Beta distribution
     - Parameters estimated separately for inherited and de novo events
   * - Inheritance probability
     - Stratified lookup table
     - By CNV type × size bin × NbreAlgos, with progressive fallback
   * - Parental CNVs
     - Derived
     - Inherited: ≥95% reciprocal overlap with child CNV; non-transmitted: independent simulation

**Simulation hierarchy**

Simulation follows a hierarchical structure preserving the trio family design:

1. For each trio (child + father + mother), draw N CNVs ~ NegBin(μ, size)
2. For each CNV: simulate type, size, chromosome, position, quality metrics, and inheritance status
3. If inherited: generate a matching parental CNV with ±5% position jitter
4. Generate additional non-transmitted parental CNVs independently

**Downstream annotation**

Simulated CNVs do not include gene annotations, LOEUF scores, or inheritance labels. These are computed by the MCNV2 pipeline (``annotate()`` and ``compute_inheritance()``) using real GRCh38 genomic resources, ensuring that downstream annotations reflect true biological signal rather than simulated values.

**Reproducibility**

The simulation script uses ``set.seed(42)`` for full reproducibility. The script is provided as a methodological reference; re-running it requires access to the original SPARK callset, which is not publicly available.

**Methods text for citation**

*Test datasets were generated by empirical parametric simulation calibrated on a real whole-genome sequencing callset (SPARK cohort). The number of CNVs per individual was modeled using a negative binomial distribution fitted to the observed counts. CNV type, size, chromosomal assignment, and technical quality variables (Score, SNP count, number of supporting algorithms) were simulated by conditional empirical resampling stratified by CNV type, size bin, algorithm concordance, and inheritance status. Algorithm overlap for concordant calls was drawn from a Beta distribution with parameters estimated separately for inherited and de novo events. Inheritance probability was modeled as a stratified lookup table by CNV type, size bin, and algorithm concordance. Simulated CNVs were then processed through the full MCNV2 annotation pipeline, ensuring that gene annotations, LOEUF scores, and inheritance labels were derived from real genomic resources.*
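
**Illustrative sketch of the simulation hierarchy**

The four steps of the simulation hierarchy can be illustrated with a minimal R sketch. This is not the simulation script published on Zenodo. Every numeric value in ``params`` is a hypothetical placeholder rather than a value fitted to the SPARK callset, and the helper names ``draw_cnvs()`` and ``simulate_trio()`` are invented for illustration. The sketch also simplifies the method: it uses a single toy chromosome, a flat inheritance probability instead of the stratified lookup table, and omits the quality metrics. The ±5% jitter is interpreted here as a shift of at most 5% of the CNV length, which is an assumption.

.. code-block:: r

   set.seed(42)

   # Hypothetical placeholder parameters (NOT the values fitted to the real callset)
   params <- list(
     mu = 100, size = 5,          # NegBin parameters for child CNV counts
     mu_extra = 10,               # NegBin mean for non-transmitted parental CNVs
     p_del = 0.65,                # probability that a CNV is a deletion
     p_inherited = 0.9,           # flat inheritance probability (simplification)
     size_pool = c(1200, 2500, 3952, 8000, 15000, 60000),  # toy size pool (bp)
     chr_length = 1e8             # toy chromosome length (bp)
   )

   # Draw n independent CNVs for one sample on a single toy chromosome
   draw_cnvs <- function(n, sample_id, p = params) {
     len   <- sample(p$size_pool, n, replace = TRUE)              # resampled size
     start <- floor(runif(n, min = 1, max = p$chr_length - len))  # uniform position
     data.frame(
       Chr = rep("chr1", n),
       Start = start,
       Stop = start + len,
       Type = ifelse(rbinom(n, 1, p$p_del) == 1, "DEL", "DUP"),   # Bernoulli type
       SampleID = rep(sample_id, n),
       stringsAsFactors = FALSE
     )
   }

   simulate_trio <- function(trio_id, p = params) {
     ids <- sprintf("SIM_%s_%04d", c("C", "F", "M"), trio_id)

     # Step 1: number of child CNVs ~ NegBin(mu, size)
     n_child <- rnbinom(1, size = p$size, mu = p$mu)

     # Step 2: type, size, position, and inheritance status for each CNV
     child <- draw_cnvs(n_child, ids[1], p)
     inherited <- rbinom(n_child, 1, p$p_inherited) == 1

     # Step 3: matching parental CNV, shifted by at most 5% of the CNV length
     transmitted <- child[inherited, ]
     len   <- transmitted$Stop - transmitted$Start
     shift <- trunc(runif(nrow(transmitted), -0.05, 0.05) * len)
     transmitted$Start <- transmitted$Start + shift
     transmitted$Stop  <- transmitted$Stop + shift
     # Assumption: each inherited CNV comes from the father or mother with equal odds
     transmitted$SampleID <- sample(ids[2:3], nrow(transmitted), replace = TRUE)

     # Step 4: independent non-transmitted CNVs for each parent
     non_transmitted <- rbind(
       draw_cnvs(rnbinom(1, size = p$size, mu = p$mu_extra), ids[2], p),
       draw_cnvs(rnbinom(1, size = p$size, mu = p$mu_extra), ids[3], p)
     )

     list(child = child, inherited = inherited,
          transmitted = transmitted, non_transmitted = non_transmitted)
   }

   trio <- simulate_trio(1L)
   all_cnvs <- rbind(trio$child, trio$transmitted, trio$non_transmitted)
   table(all_cnvs$SampleID)

Because the shift never exceeds 5% of the CNV length and the length is unchanged, each transmitted parental CNV keeps at least 95% reciprocal overlap with its child CNV, matching the criterion listed for inherited parental CNVs in the table above.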