Test Data

MCNV2 provides simulated test datasets to help users get started quickly and to satisfy the reproducibility requirements of peer-reviewed publication.

Note

The test datasets are available on Zenodo: https://zenodo.org/records/19860597

Download

# Download test files
wget https://zenodo.org/records/19860597/files/sim_cnvs.tsv
wget https://zenodo.org/records/19860597/files/sim_pedigree.tsv
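
On systems without wget (for example a default Windows installation), the same two files can be fetched from within R. The following is a minimal sketch using base R's download.file(). The helper names zenodo_file_url() and fetch_test_data() are illustrative and are not part of MCNV2.

```r
# Illustrative helpers (not part of MCNV2): build the Zenodo URLs used above
# and download the test files with base R.
zenodo_record <- "https://zenodo.org/records/19860597"
test_files    <- c("sim_cnvs.tsv", "sim_pedigree.tsv")

zenodo_file_url <- function(file, record = zenodo_record) {
  paste0(record, "/files/", file)
}

fetch_test_data <- function(dest_dir = ".", files = test_files) {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  for (f in files) {
    dest <- file.path(dest_dir, f)
    # Skip files that are already present to avoid repeated downloads
    if (!file.exists(dest)) {
      download.file(zenodo_file_url(f), destfile = dest, mode = "wb")
    }
  }
  invisible(file.path(dest_dir, files))
}

# fetch_test_data()  # uncomment to download into the working directory
```

The download call is left commented out so that the snippet can be sourced without network access.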

Quick Start with Test Data

Note

The problematic regions file (GRCh38) is bundled with the MCNV2 package and does not need to be downloaded separately. It is accessed via system.file("resources", "problematic_regions_GRCh38.bed", package = "MCNV2").

Step 1 — Run the preprocessing pipeline (CLI)

library(MCNV2)

# 1. Annotate simulated CNVs
# The problematic regions file is included in the package — no download needed
annotate(
  cnvs_file         = "sim_cnvs.tsv",
  prob_regions_file = system.file("resources", "problematic_regions_GRCh38.bed",
                                  package = "MCNV2"),
  output_file       = "sim_cnvs_annotated.tsv",
  genome_version    = 38,
  bedtools_path     = Sys.which("bedtools")
)

# 2. Compute inheritance
compute_inheritance(
  cnvs_file     = "sim_cnvs_annotated.tsv",
  pedigree_file = "sim_pedigree.tsv",
  output_file   = "sim_cnvs_inheritance.tsv",
  overlap       = 0.5
)
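
Sys.which("bedtools") returns an empty string when bedtools is not on the PATH, so it can be worth checking the result before passing it to annotate(). The sketch below shows one way to fail early with a readable message. require_tool() is a hypothetical helper, not an MCNV2 function, and how MCNV2 itself reacts to an empty bedtools_path is not covered here.

```r
# Illustrative helper (not part of MCNV2): locate an external tool on the PATH
# and stop with a clear message if it cannot be found.
require_tool <- function(name) {
  path <- unname(Sys.which(name))
  if (!nzchar(path)) {
    stop(sprintf("'%s' was not found on the PATH; install it or supply its full path.",
                 name),
         call. = FALSE)
  }
  path
}

# Example: resolve bedtools once and reuse the result
# bedtools <- require_tool("bedtools")
# annotate(..., bedtools_path = bedtools)
```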

Step 2 — Explore results in the Shiny app

# Launch the app
MCNV2::launch(
  bedtools_path = Sys.which("bedtools"),
  results_dir   = "~/mcnv2_results"
)

In the Preprocessing tab:

  • CNV file → upload sim_cnvs.tsv

  • Pedigree file → upload sim_pedigree.tsv

  • Problematic regions → bundled with the package at system.file("resources", "problematic_regions_GRCh38.bed", package = "MCNV2") — no download needed, the app loads it automatically

In the MP Exploration tab:

  • Upload sim_cnvs_inheritance.tsv as the preprocessed input file

  • Select a quality metric and click Submit

Dataset Description

The simulated dataset represents 199 complete parent–offspring trios (597 individuals total) from a whole-genome sequencing cohort.

| Property | Value |
| --- | --- |
| Trios | 199 |
| Total individuals | 597 (199 children + 199 fathers + 199 mothers) |
| Total CNVs | 67,608 |
| Deletions (DEL) | 44,447 (65.7%) |
| Duplications (DUP) | 23,161 (34.3%) |
| Median CNV size | 3,952 bp |
| Mean CNV size | 15,184 bp |
| Genome build | GRCh38/hg38 |
| Chromosomes | chr1–chr22 |

File formats

sim_cnvs.tsv — Tab-delimited CNV file (input to MCNV2):

| Column | Description |
| --- | --- |
| Chr | Chromosome (e.g., chr1) |
| Start | CNV start position (integer) |
| Stop | CNV end position (integer) |
| Type | CNV type: DEL or DUP |
| SampleID | Sample identifier (SIM_C_XXXX for children, SIM_F_XXXX / SIM_M_XXXX for parents) |
| Score | Simulated quality score |
| SNP | Simulated SNP count |
| NbreAlgos | Number of supporting algorithms (1 or 2) |
| algos_overlap | Algorithm concordance overlap (0–1) |
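
Before running the pipeline, it can help to confirm that a CNV file matches the column layout described above. The following is a minimal sketch, assuming the file has a header row with exactly these column names. check_cnvs() and the toy rows are illustrative and are not part of MCNV2.

```r
# Illustrative check (not part of MCNV2) against the documented column list
expected_cnv_cols <- c("Chr", "Start", "Stop", "Type", "SampleID",
                       "Score", "SNP", "NbreAlgos", "algos_overlap")

check_cnvs <- function(cnvs) {
  missing_cols <- setdiff(expected_cnv_cols, names(cnvs))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "), call. = FALSE)
  }
  stopifnot(
    all(cnvs$Type %in% c("DEL", "DUP")),
    all(cnvs$Start <= cnvs$Stop),
    all(cnvs$NbreAlgos %in% c(1, 2)),
    all(cnvs$algos_overlap >= 0 & cnvs$algos_overlap <= 1, na.rm = TRUE)
  )
  invisible(TRUE)
}

# Toy rows in the documented layout (all values are made up for illustration)
toy_cnvs <- data.frame(
  Chr = c("chr1", "chr2"), Start = c(1000L, 5000L), Stop = c(4000L, 9000L),
  Type = c("DEL", "DUP"), SampleID = c("SIM_C_0001", "SIM_F_0001"),
  Score = c(30.5, 12.0), SNP = c(12L, 4L), NbreAlgos = c(2L, 1L),
  algos_overlap = c(0.8, 0),
  stringsAsFactors = FALSE
)
check_cnvs(toy_cnvs)

# With the downloaded file:
# cnvs <- read.delim("sim_cnvs.tsv", stringsAsFactors = FALSE)
# check_cnvs(cnvs)
# table(cnvs$Type)  # compare with the DEL/DUP counts in Dataset Description
```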

sim_pedigree.tsv — Tab-delimited pedigree file (input to MCNV2):

| Column | Description |
| --- | --- |
| ChildID | Child sample identifier |
| FatherID | Father sample identifier |
| MotherID | Mother sample identifier |
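
A pedigree file can be checked in the same spirit: every row should describe a complete trio, and each individual should ideally appear in the CNV file. The sketch below is an assumption-laden illustration. check_pedigree() is a hypothetical helper, not part of MCNV2, and the sample IDs are made up.

```r
# Illustrative check (not part of MCNV2): each pedigree row is a complete trio.
# Returns the pedigree IDs that are absent from `sample_ids` (if supplied).
check_pedigree <- function(ped, sample_ids = NULL) {
  stopifnot(all(c("ChildID", "FatherID", "MotherID") %in% names(ped)))
  ids <- c(ped$ChildID, ped$FatherID, ped$MotherID)
  stopifnot(!anyNA(ids), anyDuplicated(ped$ChildID) == 0)
  if (is.null(sample_ids)) {
    return(character(0))
  }
  setdiff(ids, sample_ids)
}

# Toy trio (IDs are made up for illustration)
toy_ped <- data.frame(ChildID = "SIM_C_0001", FatherID = "SIM_F_0001",
                      MotherID = "SIM_M_0001", stringsAsFactors = FALSE)

# The mother has no CNV call in this toy sample list, so she is reported
check_pedigree(toy_ped, sample_ids = c("SIM_C_0001", "SIM_F_0001"))

# With the downloaded files:
# ped  <- read.delim("sim_pedigree.tsv", stringsAsFactors = FALSE)
# cnvs <- read.delim("sim_cnvs.tsv", stringsAsFactors = FALSE)
# check_pedigree(ped, unique(cnvs$SampleID))
```

An individual without any CNV call is not necessarily an error, so the helper reports such IDs instead of stopping.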

Simulation Methodology

The test datasets were generated by empirical parametric simulation calibrated on a real whole-genome sequencing callset from the SPARK cohort (1,103 trios). The real data cannot be shared due to privacy constraints; however, the simulation script is available on Zenodo (https://zenodo.org/records/19860597) and fully documents the methodology.

Simulation approach

All parameters were estimated directly from the real callset. No arbitrary values were used.

| Variable | Method | Calibration |
| --- | --- | --- |
| CNVs per individual | Negative Binomial | μ and size fitted from observed mean and variance |
| CNV type (DEL/DUP) | Bernoulli | Empirical proportions |
| CNV size | Empirical resampling | Direct resampling stratified by CNV type |
| Chromosome assignment | Multinomial | Observed chromosome-specific frequencies |
| Genomic position | Uniform | Within GRCh38 chromosome boundaries |
| Number of algorithms | Bernoulli | Empirical proportions |
| Score and SNP count | Empirical resampling | Conditional on size bin × NbreAlgos × inheritance status |
| Algorithm overlap | Beta distribution | Parameters estimated separately for inherited and de novo events |
| Inheritance probability | Stratified lookup table | By CNV type × size bin × NbreAlgos, with progressive fallback |
| Parental CNVs | Derived | Inherited: ≥95% reciprocal overlap with child CNV; non-transmitted: independent simulation |
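
To make the distributional choices above concrete, the sketch below draws one individual's CNVs with base R. It is not the authors' simulation script. Every numeric input (mean, variance, proportions, size pools, Beta shape parameters) is a made-up placeholder standing in for values that the real script estimates from the SPARK callset, and only three chromosomes are used for brevity. The negative binomial size parameter is obtained by the method of moments, which is one plausible reading of "fitted from observed mean and variance".

```r
set.seed(1)  # seed for this illustration only

# Made-up "observed" summaries standing in for the real callset
obs_mean  <- 110
obs_var   <- 900
p_del     <- 0.66
chr_freq  <- c(chr1 = 0.5, chr2 = 0.3, chr3 = 0.2)
chr_len   <- c(chr1 = 248956422, chr2 = 242193529, chr3 = 198295559)  # GRCh38
obs_sizes <- list(DEL = c(1200, 3500, 8000), DUP = c(2500, 15000, 60000))

# Negative binomial by method of moments: var = mu + mu^2 / size
nb_size <- obs_mean^2 / (obs_var - obs_mean)

# Number of CNVs, then per-CNV type (Bernoulli) and chromosome (multinomial)
n    <- rnbinom(1, mu = obs_mean, size = nb_size)
type <- ifelse(rbinom(n, 1, p_del) == 1, "DEL", "DUP")
chr  <- sample(names(chr_freq), n, replace = TRUE, prob = chr_freq)

# Empirical resampling of sizes, stratified by CNV type
size <- unname(vapply(type, function(ty) sample(obs_sizes[[ty]], 1), numeric(1)))

# Uniform start position, constrained so the CNV ends within the chromosome
start <- floor(runif(n, min = 1, max = unname(chr_len[chr]) - size))

# Beta-distributed algorithm overlap (shape parameters are placeholders)
algos_overlap <- rbeta(n, shape1 = 8, shape2 = 2)

sim <- data.frame(Chr = chr, Start = start, Stop = start + size, Type = type,
                  algos_overlap = algos_overlap, stringsAsFactors = FALSE)
head(sim)
```

The method-of-moments fit requires the observed variance to exceed the mean. Overdispersed counts of this kind are the usual reason for choosing a negative binomial over a Poisson model.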

Simulation hierarchy

Simulation follows a hierarchical structure preserving the trio family design:

  1. For each trio (child + father + mother), draw the number of CNVs N ~ NegBin(μ, size)

  2. For each CNV: simulate type, size, chromosome, position, quality metrics, and inheritance status

  3. If inherited: generate a matching parental CNV with ±5% position jitter

  4. Generate additional non-transmitted parental CNVs independently
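
Steps 2 and 3 can be sketched as follows. This is an illustration under stated assumptions, not the authors' code. The lookup table values and the fallback order (drop NbreAlgos, then size bin, then CNV type) are hypothetical. The jitter is modeled as a single shift of the whole interval by at most 5% of the CNV length, which keeps the reciprocal overlap with the child CNV at or above 95%. All function names are illustrative.

```r
set.seed(2)  # seed for this illustration only

# Reciprocal overlap: shared length divided by the longer of the two intervals
reciprocal_overlap <- function(s1, e1, s2, e2) {
  shared <- pmax(0, pmin(e1, e2) - pmax(s1, s2))
  shared / pmax(e1 - s1, e2 - s2)
}

# Hypothetical inheritance probabilities, from most to least specific stratum
p_inherit <- c("DEL|small|2" = 0.98, "DEL|small" = 0.95, "DEL" = 0.93, "ALL" = 0.90)

# Progressive fallback: use the most specific stratum present in the table
lookup_p <- function(type, size_bin, n_algos, table = p_inherit) {
  keys <- c(paste(type, size_bin, n_algos, sep = "|"),
            paste(type, size_bin, sep = "|"),
            type, "ALL")
  hit <- keys[keys %in% names(table)][1]
  unname(table[hit])
}

# Matching parental CNV: shift the interval by at most max_jitter * length.
# trunc() rounds toward zero, so the shift never exceeds the bound.
simulate_parent_cnv <- function(start, stop, max_jitter = 0.05) {
  len   <- stop - start
  shift <- trunc(runif(1, -max_jitter, max_jitter) * len)
  c(start = start + shift, stop = stop + shift)
}

child     <- c(start = 100000, stop = 120000)
inherited <- runif(1) < lookup_p("DEL", "small", 2)
if (inherited) {
  parent <- simulate_parent_cnv(child[["start"]], child[["stop"]])
  reciprocal_overlap(child[["start"]], child[["stop"]],
                     parent[["start"]], parent[["stop"]])
}
```

If each endpoint were jittered independently by up to 5%, the overlap could fall to about 90%, so the single-shift reading is the one consistent with the ≥95% criterion stated above.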

Downstream annotation

Simulated CNVs do not include gene annotations, LOEUF scores, or inheritance labels. These are computed by the MCNV2 pipeline (annotate() and compute_inheritance()) using real GRCh38 genomic resources, ensuring that downstream annotations reflect true biological signal rather than simulated values.

Reproducibility

The simulation script calls set.seed(42), so its output is deterministic for a given input callset. The script is provided as a methodological reference: re-running it requires access to the original SPARK callset, which is not publicly available.
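
As a brief illustration of what fixing the seed guarantees, the snippet below repeats the same random draw twice. The distribution parameters are arbitrary and unrelated to the real simulation.

```r
# Same seed, same draws: both vectors are identical within one R installation
set.seed(42)
a <- rnbinom(5, mu = 100, size = 10)
set.seed(42)
b <- rnbinom(5, mu = 100, size = 10)
identical(a, b)

# The generator settings are worth recording alongside the seed
RNGkind()
```

Seeded results are only guaranteed to match under the same random number generator settings. For example, R 3.6.0 changed the default algorithm used by sample(), so recording the R version (for instance with sessionInfo()) is a reasonable precaution.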

Methods text for citation

Test datasets were generated by empirical parametric simulation calibrated on a real whole-genome sequencing callset (SPARK cohort). The number of CNVs per individual was modeled using a negative binomial distribution fitted to the observed counts. CNV type and the number of supporting algorithms were drawn from Bernoulli distributions with empirical proportions, chromosomal assignment was drawn from a multinomial distribution with the observed chromosome-specific frequencies, and genomic position was drawn uniformly within GRCh38 chromosome boundaries. CNV size was simulated by empirical resampling stratified by CNV type, and Score and SNP count by empirical resampling conditional on size bin, number of supporting algorithms, and inheritance status. Algorithm overlap for concordant calls was drawn from a Beta distribution with parameters estimated separately for inherited and de novo events. Inheritance probability was modeled as a stratified lookup table by CNV type, size bin, and number of supporting algorithms. Simulated CNVs were then processed through the full MCNV2 annotation pipeline, so that gene annotations, LOEUF scores, and inheritance labels were computed by the pipeline from real genomic resources and the simulated pedigree, not simulated directly.