Test Data

MCNV2 provides simulated test datasets to help users get started quickly and to satisfy the reproducibility requirements of peer-reviewed publication.

Note

The test datasets are available on Zenodo: https://zenodo.org/records/19860597

Download

# Download test files
wget https://zenodo.org/records/19860597/files/sim_cnvs.tsv
wget https://zenodo.org/records/19860597/files/sim_pedigree.tsv
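
If wget is not available, the same two files can be fetched from within R. The sketch below uses only base R; fetch_test_data() is a hypothetical helper written for this page, not an MCNV2 function, and it assumes the Zenodo file URLs follow the pattern used in the wget commands above.

```r
# Base URL and file names, taken from the wget commands above
zenodo_base <- "https://zenodo.org/records/19860597/files"
test_files  <- c("sim_cnvs.tsv", "sim_pedigree.tsv")

# Hypothetical helper (not part of MCNV2): download any file that is not
# already present in dest_dir and return the local paths
fetch_test_data <- function(files = test_files, dest_dir = ".") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  dest <- file.path(dest_dir, files)
  for (i in seq_along(files)) {
    if (!file.exists(dest[i])) {
      download.file(paste(zenodo_base, files[i], sep = "/"), dest[i], mode = "wb")
    }
  }
  dest
}

# Usage (requires network access):
# paths <- fetch_test_data()
```

Files that already exist are skipped, so the call is safe to repeat at the top of an analysis script.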

Quick Start with Test Data

Note

The problematic regions file (GRCh38) is bundled with the MCNV2 package and does not need to be downloaded separately. It is accessed via system.file("resources", "problematic_regions_GRCh38.bed", package = "MCNV2").

Step 1 — Run the preprocessing pipeline (CLI)

library(MCNV2)

# 1. Annotate simulated CNVs
# The problematic regions file is included in the package — no download needed
annotate(
  cnvs_file         = "sim_cnvs.tsv",
  prob_regions_file = system.file("resources", "problematic_regions_GRCh38.bed",
                                  package = "MCNV2"),
  output_file       = "sim_cnvs_annotated.tsv",
  genome_version    = 38,
  bedtools_path     = Sys.which("bedtools")
)

# 2. Compute inheritance
compute_inheritance(
  cnvs_file     = "sim_cnvs_annotated.tsv",
  pedigree_file = "sim_pedigree.tsv",
  output_file   = "sim_cnvs_inheritance.tsv",
  overlap       = 0.5
)
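
Sys.which("bedtools") returns an empty string when bedtools is not on the PATH, so it can help to check prerequisites before calling annotate(). The following is a minimal sketch; check_prerequisites() is a hypothetical helper, not part of MCNV2.

```r
# Hypothetical helper (not part of MCNV2): stop early with a clear message
# if bedtools or any input file is missing
check_prerequisites <- function(input_files,
                                bedtools_path = Sys.which("bedtools")) {
  # Sys.which() returns "" when the executable cannot be found
  if (!nzchar(bedtools_path)) {
    stop("bedtools not found on PATH; install it or pass its full path")
  }
  missing <- input_files[!file.exists(input_files)]
  if (length(missing) > 0) {
    stop("missing input file(s): ", paste(missing, collapse = ", "))
  }
  invisible(TRUE)
}

# Usage, before running the pipeline:
# check_prerequisites(c("sim_cnvs.tsv", "sim_pedigree.tsv"))
```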

Step 2 — Explore results in the Shiny app

# Launch the app
MCNV2::launch(
  bedtools_path = Sys.which("bedtools"),
  results_dir   = "~/mcnv2_results"
)

In the Preprocessing tab:

CNV file → upload sim_cnvs.tsv
Pedigree file → upload sim_pedigree.tsv
Problematic regions → nothing to upload; the app automatically loads the file bundled with the package (system.file("resources", "problematic_regions_GRCh38.bed", package = "MCNV2"))

In the MP Exploration tab:

Upload sim_cnvs_inheritance.tsv as the preprocessed input file
Select a quality metric and click Submit

Dataset Description

The simulated dataset represents 199 complete parent–offspring trios (597 individuals total) from a whole-genome sequencing cohort.

Property	Value
Trios	199
Total individuals	597 (199 children + 199 fathers + 199 mothers)
Total CNVs	67,608
Deletions (DEL)	44,447 (65.7%)
Duplications (DUP)	23,161 (34.3%)
Median CNV size	3,952 bp
Mean CNV size	15,184 bp
Genome build	GRCh38/hg38
Chromosomes	chr1–chr22
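
The figures in this table can be recomputed from the downloaded file as a quick integrity check. The sketch below is one way to do it; summarise_cnvs() is a hypothetical helper, and CNV size is assumed here to be Stop - Start, which may differ by one base from the convention used to produce the table.

```r
# Hypothetical helper (not part of MCNV2): recompute summary statistics
# from a CNV data frame with columns Start, Stop, Type and SampleID
summarise_cnvs <- function(cnvs) {
  size <- cnvs$Stop - cnvs$Start  # assumed size definition
  list(
    n_cnvs      = nrow(cnvs),
    n_samples   = length(unique(cnvs$SampleID)),  # samples with >= 1 CNV
    type_counts = table(cnvs$Type),
    type_pct    = round(100 * prop.table(table(cnvs$Type)), 1),
    median_size = median(size),
    mean_size   = mean(size)
  )
}

# Usage, assuming sim_cnvs.tsv is in the working directory:
# cnvs <- read.delim("sim_cnvs.tsv", stringsAsFactors = FALSE)
# str(summarise_cnvs(cnvs))
```

Note that n_samples counts only individuals carrying at least one CNV, so it can be lower than the number of individuals in the pedigree.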

File formats

sim_cnvs.tsv — Tab-delimited CNV file (input to MCNV2):

Column	Description
Chr	Chromosome (e.g., chr1)
Start	CNV start position (integer)
Stop	CNV end position (integer)
Type	CNV type: DEL or DUP
SampleID	Sample identifier (SIM_C_XXXX for children, SIM_F_XXXX / SIM_M_XXXX for parents)
Score	Simulated quality score
SNP	Simulated SNP count
NbreAlgos	Number of supporting algorithms (1 or 2)
algos_overlap	Algorithm concordance overlap (0–1)

sim_pedigree.tsv — Tab-delimited pedigree file (input to MCNV2):

Column	Description
ChildID	Child sample identifier
FatherID	Father sample identifier
MotherID	Mother sample identifier
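
When adapting your own data to these formats, a small validation step can catch problems before the pipeline runs. The sketch below checks the column names listed above and a few basic invariants; validate_inputs() is a hypothetical helper, and the specific checks are illustrative assumptions rather than the validation MCNV2 itself performs.

```r
cnv_cols <- c("Chr", "Start", "Stop", "Type", "SampleID",
              "Score", "SNP", "NbreAlgos", "algos_overlap")
ped_cols <- c("ChildID", "FatherID", "MotherID")

# Hypothetical helper (not part of MCNV2): return a character vector of
# problems found; an empty vector means all checks passed
validate_inputs <- function(cnvs, ped) {
  problems <- character(0)
  missing_cnv <- setdiff(cnv_cols, names(cnvs))
  missing_ped <- setdiff(ped_cols, names(ped))
  if (length(missing_cnv) > 0)
    problems <- c(problems, paste("CNV file lacks:", paste(missing_cnv, collapse = ", ")))
  if (length(missing_ped) > 0)
    problems <- c(problems, paste("pedigree file lacks:", paste(missing_ped, collapse = ", ")))
  # Column-level checks below need all columns, so stop here if any are missing
  if (length(problems) > 0) return(problems)

  if (!all(cnvs$Type %in% c("DEL", "DUP")))
    problems <- c(problems, "Type contains values other than DEL/DUP")
  if (any(cnvs$Stop < cnvs$Start))
    problems <- c(problems, "some CNVs have Stop < Start")
  unknown <- setdiff(unique(cnvs$SampleID), unlist(ped[ped_cols]))
  if (length(unknown) > 0)
    problems <- c(problems, paste(length(unknown), "SampleID(s) absent from the pedigree"))
  problems
}

# Usage:
# cnvs <- read.delim("sim_cnvs.tsv", stringsAsFactors = FALSE)
# ped  <- read.delim("sim_pedigree.tsv", stringsAsFactors = FALSE)
# validate_inputs(cnvs, ped)
```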

Simulation Methodology

The test datasets were generated by empirical parametric simulation calibrated on a real whole-genome sequencing callset from the SPARK cohort (1,103 trios). The real data cannot be shared due to privacy constraints; however, the simulation script is available on Zenodo (https://zenodo.org/records/19860597) and fully documents the methodology.

Simulation approach

All parameters were estimated directly from the real callset. No arbitrary values were used.

Variable	Method	Calibration
CNVs per individual	Negative Binomial	μ and size fitted from observed mean and variance
CNV type (DEL/DUP)	Bernoulli	Empirical proportions
CNV size	Empirical resampling	Direct resampling stratified by CNV type
Chromosome assignment	Multinomial	Observed chromosome-specific frequencies
Genomic position	Uniform	Within GRCh38 chromosome boundaries
Number of algorithms	Bernoulli	Empirical proportions
Score and SNP count	Empirical resampling	Conditional on size bin × NbreAlgos × inheritance status
Algorithm overlap	Beta distribution	Parameters estimated separately for inherited and de novo events
Inheritance probability	Stratified lookup table	By CNV type × size bin × NbreAlgos, with progressive fallback
Parental CNVs	Derived	Inherited: ≥95% reciprocal overlap with child CNV; non-transmitted: independent simulation
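
To illustrate the first row of the table, one common way to obtain μ and size from an observed mean and variance is the method of moments. Under the parameterisation used by R's rnbinom(mu = , size = ), the variance is μ + μ²/size, which gives size = μ² / (variance − μ). The sketch below assumes this approach; the simulation script may fit the parameters differently, and the toy values are made up rather than taken from the SPARK callset.

```r
# Hypothetical helper: method-of-moments estimates for R's (mu, size)
# parameterisation, where Var = mu + mu^2 / size
fit_negbin_moments <- function(counts) {
  m <- mean(counts)
  v <- var(counts)
  if (v <= m) stop("no overdispersion: variance must exceed the mean")
  list(mu = m, size = m^2 / (v - m))
}

# Toy check with made-up parameters (not the SPARK estimates)
set.seed(1)
toy_counts <- rnbinom(10000, mu = 100, size = 5)
fit <- fit_negbin_moments(toy_counts)
str(fit)
```

The estimates recovered from the toy counts should be close to the values used to generate them, which is a simple way to confirm the parameterisation before drawing simulated counts with rnbinom().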

Simulation hierarchy

Simulation follows a hierarchical structure preserving the trio family design:

For each trio (child + father + mother), draw N CNVs ~ NegBin(μ, size)
For each CNV: simulate type, size, chromosome, position, quality metrics, and inheritance status
If inherited: generate a matching parental CNV with ±5% position jitter
Generate additional non-transmitted parental CNVs independently
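
The inherited-CNV step can be sketched as follows. The exact jitter scheme is not described here, so the example assumes the simplest one compatible with both the ±5% jitter and the ≥95% reciprocal overlap stated above: shifting the whole interval by at most 5% of its length. Both function names are hypothetical and the coordinates are toy values.

```r
# Reciprocal overlap: shared length divided by the longer of the two
# intervals, i.e. the smaller of the two overlap fractions
reciprocal_overlap <- function(start1, stop1, start2, stop2) {
  shared <- pmax(0, pmin(stop1, stop2) - pmax(start1, start2))
  shared / pmax(stop1 - start1, stop2 - start2)
}

# Assumed jitter scheme: shift the whole interval by up to max_frac of its
# length; trunc() rounds toward zero so the bound is never exceeded
jitter_cnv <- function(start, stop, max_frac = 0.05) {
  len   <- stop - start
  shift <- trunc(runif(length(start), -max_frac, max_frac) * len)
  data.frame(Start = start + shift, Stop = stop + shift)
}

# Toy child CNVs and their simulated parental counterparts
set.seed(42)
child_start <- c(1000, 50000, 200000)
child_stop  <- c(5000, 51000, 900000)
parent <- jitter_cnv(child_start, child_stop)
ro <- reciprocal_overlap(child_start, child_stop, parent$Start, parent$Stop)
```

Because the shifted interval keeps its length, the reciprocal overlap equals 1 minus the relative shift, so every value in ro is at least 0.95 under this scheme.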

Downstream annotation

The simulated files contain no gene annotations, LOEUF scores, or inheritance labels. The MCNV2 pipeline (annotate() and compute_inheritance()) computes them using real GRCh38 genomic resources, so these downstream fields are derived by the pipeline rather than simulated.

Reproducibility

The simulation script uses set.seed(42) for full reproducibility. The script is provided as a methodological reference; re-running it requires access to the original SPARK callset, which is not publicly available.

Methods text for citation

Test datasets were generated by empirical parametric simulation calibrated on a real whole-genome sequencing callset (SPARK cohort). The number of CNVs per individual was modeled using a negative binomial distribution fitted to the observed counts. CNV type, chromosome, and number of supporting algorithms were drawn according to their empirical proportions, and CNV size was resampled empirically within each CNV type. Technical quality variables (Score, SNP count) were simulated by conditional empirical resampling stratified by size bin, number of supporting algorithms, and inheritance status. Algorithm overlap for concordant calls was drawn from a Beta distribution with parameters estimated separately for inherited and de novo events. Inheritance probability was modeled as a stratified lookup table by CNV type, size bin, and number of supporting algorithms. Simulated CNVs were then processed through the full MCNV2 annotation pipeline, ensuring that gene annotations, LOEUF scores, and inheritance labels were derived from real genomic resources.