Test Data
MCNV2 provides simulated test datasets to help users get started quickly and to satisfy the reproducibility requirements of peer-reviewed publication.
Note
The test datasets are available on Zenodo: https://zenodo.org/records/19860597
Download
# Download test files
wget https://zenodo.org/records/19860597/files/sim_cnvs.tsv
wget https://zenodo.org/records/19860597/files/sim_pedigree.tsv
Quick Start with Test Data
Note
The problematic regions file (GRCh38) is bundled with the MCNV2 package and
does not need to be downloaded separately. It is accessed via
system.file("resources", "problematic_regions_GRCh38.bed", package = "MCNV2").
Step 1 — Run the preprocessing pipeline (CLI)
library(MCNV2)

# 1. Annotate simulated CNVs
# The problematic regions file is included in the package; no download needed
annotate(
  cnvs_file = "sim_cnvs.tsv",
  prob_regions_file = system.file("resources", "problematic_regions_GRCh38.bed",
                                  package = "MCNV2"),
  output_file = "sim_cnvs_annotated.tsv",
  genome_version = 38,
  bedtools_path = Sys.which("bedtools")
)

# 2. Compute inheritance
compute_inheritance(
  cnvs_file = "sim_cnvs_annotated.tsv",
  pedigree_file = "sim_pedigree.tsv",
  output_file = "sim_cnvs_inheritance.tsv",
  overlap = 0.5
)
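Both calls depend on a bedtools executable, the bundled BED file, and the two downloaded inputs. `Sys.which()` and `system.file()` return an empty string when nothing is found, so a short preflight check can surface a missing dependency before the pipeline starts. The sketch below is a convenience only; the helper `path_ok` is hypothetical and not part of MCNV2.

```r
# Hypothetical helper (not part of MCNV2): TRUE when `path` is non-empty
# and points to an existing file.
path_ok <- function(path) {
  path <- unname(as.character(path))
  nzchar(path) && file.exists(path)
}

checks <- c(
  # Sys.which() returns "" when the executable is not on the PATH.
  bedtools = path_ok(Sys.which("bedtools")),
  # system.file() returns "" when the package or the file is missing.
  problematic_regions = path_ok(system.file(
    "resources", "problematic_regions_GRCh38.bed", package = "MCNV2"
  )),
  # Input files are looked up in the current working directory.
  cnvs = path_ok("sim_cnvs.tsv"),
  pedigree = path_ok("sim_pedigree.tsv")
)

print(checks)
if (!all(checks)) {
  message("Not found: ", paste(names(checks)[!checks], collapse = ", "))
}
```

If `bedtools` is reported as not found, passing the full path to the executable as `bedtools_path` is likely the simplest fix.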
Step 2 — Explore results in the Shiny app
# Launch the app
MCNV2::launch(
  bedtools_path = Sys.which("bedtools"),
  results_dir = "~/mcnv2_results"
)
In the Preprocessing tab:
CNV file → upload sim_cnvs.tsv
Pedigree file → upload sim_pedigree.tsv
Problematic regions → bundled with the package at system.file("resources", "problematic_regions_GRCh38.bed", package = "MCNV2"); no download needed, the app loads it automatically
In the MP Exploration tab:
Upload sim_cnvs_inheritance.tsv as the preprocessed input file
Select a quality metric and click Submit
Dataset Description
The simulated dataset represents 199 complete parent–offspring trios (597 individuals total) from a whole-genome sequencing cohort.
| Property | Value |
|---|---|
| Trios | 199 |
| Total individuals | 597 (199 children + 199 fathers + 199 mothers) |
| Total CNVs | 67,608 |
| Deletions (DEL) | 44,447 (65.7%) |
| Duplications (DUP) | 23,161 (34.3%) |
| Median CNV size | 3,952 bp |
| Mean CNV size | 15,184 bp |
| Genome build | GRCh38/hg38 |
| Chromosomes | chr1–chr22 |
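The figures in this table can be recomputed from the CNV file with a few lines of base R. The sketch below is illustrative: the function name `summarise_cnvs` is hypothetical, and CNV size is assumed to be `Stop - Start`, which may differ from the convention used to produce the table. The toy rows are invented for the example and are not taken from sim_cnvs.tsv.

```r
# Summarise a CNV table that has the columns Start, Stop, Type and SampleID.
# Assumption: size is Stop - Start; the authors may use Stop - Start + 1.
summarise_cnvs <- function(cnvs) {
  size <- cnvs$Stop - cnvs$Start
  type_counts <- table(factor(cnvs$Type, levels = c("DEL", "DUP")))
  list(
    n_cnvs      = nrow(cnvs),
    n_samples   = length(unique(cnvs$SampleID)),
    type_pct    = setNames(
      round(100 * as.vector(type_counts) / nrow(cnvs), 1),
      names(type_counts)
    ),
    median_size = median(size),
    mean_size   = mean(size)
  )
}

# Toy input for illustration only.
toy <- data.frame(
  Chr      = c("chr1", "chr1", "chr2", "chr3"),
  Start    = c(1000, 5000, 200, 10000),
  Stop     = c(3000, 9000, 1200, 40000),
  Type     = c("DEL", "DEL", "DUP", "DEL"),
  SampleID = c("SIM_C_0001", "SIM_F_0001", "SIM_C_0001", "SIM_M_0001"),
  stringsAsFactors = FALSE
)

str(summarise_cnvs(toy))
```

On the downloaded file, `summarise_cnvs(read.delim("sim_cnvs.tsv"))` should give values close to the table if the size convention matches. The sample count can be lower than 597 if some individuals carry no CNV.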
File formats
sim_cnvs.tsv — Tab-delimited CNV file (input to MCNV2):
| Column | Description |
|---|---|
| Chr | Chromosome (e.g., chr1) |
| Start | CNV start position (integer) |
| Stop | CNV end position (integer) |
| Type | CNV type: DEL or DUP |
| SampleID | Sample identifier (SIM_C_XXXX for children, SIM_F_XXXX / SIM_M_XXXX for parents) |
| Score | Simulated quality score |
| SNP | Simulated SNP count |
| NbreAlgos | Number of supporting algorithms (1 or 2) |
| algos_overlap | Algorithm concordance overlap (0–1) |
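When preparing your own input in this layout, a quick schema check can catch most formatting mistakes before annotation. The sketch below encodes only the constraints stated in the column descriptions; `validate_cnvs` is a hypothetical helper, not an MCNV2 function, and the toy values for Score and SNP are arbitrary.

```r
# Return a character vector of problems; an empty vector means no issue found.
validate_cnvs <- function(cnvs) {
  required <- c("Chr", "Start", "Stop", "Type", "SampleID",
                "Score", "SNP", "NbreAlgos", "algos_overlap")
  missing_cols <- setdiff(required, names(cnvs))
  if (length(missing_cols) > 0) {
    return(paste("missing column:", missing_cols))
  }
  problems <- character(0)
  if (any(cnvs$Start > cnvs$Stop)) {
    problems <- c(problems, "Start greater than Stop")
  }
  if (!all(cnvs$Type %in% c("DEL", "DUP"))) {
    problems <- c(problems, "Type is not DEL or DUP")
  }
  if (!all(cnvs$NbreAlgos %in% c(1, 2))) {
    problems <- c(problems, "NbreAlgos is not 1 or 2")
  }
  # Missing overlap values are tolerated here (an assumption).
  if (any(cnvs$algos_overlap < 0 | cnvs$algos_overlap > 1, na.rm = TRUE)) {
    problems <- c(problems, "algos_overlap outside [0, 1]")
  }
  # Only the SIM_C_ / SIM_F_ / SIM_M_ prefix is checked, not the suffix.
  if (!all(grepl("^SIM_[CFM]_", cnvs$SampleID))) {
    problems <- c(problems, "SampleID without SIM_C_/SIM_F_/SIM_M_ prefix")
  }
  problems
}

# Toy rows for illustration only.
good <- data.frame(
  Chr = "chr1", Start = 100, Stop = 500, Type = "DEL",
  SampleID = "SIM_C_0001", Score = 30, SNP = 12,
  NbreAlgos = 2, algos_overlap = 0.9, stringsAsFactors = FALSE
)
bad <- good
bad$Type <- "INV"
bad$Stop <- 50

validate_cnvs(good)
validate_cnvs(bad)
```

The sample identifier check applies to the simulated data only; real cohorts will use their own naming scheme.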
sim_pedigree.tsv — Tab-delimited pedigree file (input to MCNV2):
| Column | Description |
|---|---|
| ChildID | Child sample identifier |
| FatherID | Father sample identifier |
| MotherID | Mother sample identifier |
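Because inheritance is computed by matching sample identifiers between the two files, it can be worth confirming that every sample in the CNV file belongs to a trio. The following sketch assumes only the three pedigree columns listed above; `check_pedigree` is a hypothetical helper and the toy identifiers are invented.

```r
# Cross-check a pedigree table against the sample IDs seen in a CNV file.
check_pedigree <- function(ped, cnv_samples) {
  stopifnot(all(c("ChildID", "FatherID", "MotherID") %in% names(ped)))
  members <- unique(c(ped$ChildID, ped$FatherID, ped$MotherID))
  list(
    n_trios                 = nrow(ped),
    n_individuals           = length(members),
    duplicated_children     = ped$ChildID[duplicated(ped$ChildID)],
    samples_not_in_pedigree = setdiff(unique(cnv_samples), members)
  )
}

# Toy pedigree with two trios, for illustration only.
ped <- data.frame(
  ChildID  = c("SIM_C_0001", "SIM_C_0002"),
  FatherID = c("SIM_F_0001", "SIM_F_0002"),
  MotherID = c("SIM_M_0001", "SIM_M_0002"),
  stringsAsFactors = FALSE
)
cnv_samples <- c("SIM_C_0001", "SIM_F_0002", "SIM_C_9999")

str(check_pedigree(ped, cnv_samples))
```

For the test dataset, reading both files with `read.delim()` and passing the `SampleID` column as `cnv_samples` should report 199 trios, 597 individuals, and no unmatched samples.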
Simulation Methodology
The test datasets were generated by empirical parametric simulation calibrated on a real whole-genome sequencing callset from the SPARK cohort (1,103 trios). The real data cannot be shared due to privacy constraints; however, the simulation script is available on Zenodo (https://zenodo.org/records/19860597) and fully documents the methodology.
Simulation approach
All parameters were estimated directly from the real callset. No arbitrary values were used.
| Variable | Method | Calibration |
|---|---|---|
| CNVs per individual | Negative Binomial | μ and size fitted from observed mean and variance |
| CNV type (DEL/DUP) | Bernoulli | Empirical proportions |
| CNV size | Empirical resampling | Direct resampling stratified by CNV type |
| Chromosome assignment | Multinomial | Observed chromosome-specific frequencies |
| Genomic position | Uniform | Within GRCh38 chromosome boundaries |
| Number of algorithms | Bernoulli | Empirical proportions |
| Score and SNP count | Empirical resampling | Conditional on size bin × NbreAlgos × inheritance status |
| Algorithm overlap | Beta distribution | Parameters estimated separately for inherited and de novo events |
| Inheritance probability | Stratified lookup table | By CNV type × size bin × NbreAlgos, with progressive fallback |
| Parental CNVs | Derived | Inherited: ≥95% reciprocal overlap with child CNV; non-transmitted: independent simulation |
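To make the first row concrete: in R's `(mu, size)` parameterisation the negative binomial has variance μ + μ²/size, so a fit from the observed mean and variance gives size = μ² / (variance − μ). The sketch below shows that calculation, together with a method-of-moments fit for the Beta distribution. The Beta estimator is an assumption, since the table does not say how those parameters were obtained, and both function names are hypothetical. The "observed" vectors are synthetic stand-ins, not SPARK data.

```r
# Method-of-moments fit for the negative binomial in R's (mu, size) form,
# where Var = mu + mu^2 / size. Requires overdispersion (variance > mean).
fit_negbin_moments <- function(counts) {
  m <- mean(counts)
  v <- var(counts)
  if (v <= m) stop("variance must exceed the mean")
  list(mu = m, size = m^2 / (v - m))
}

# Method-of-moments fit for a Beta distribution on (0, 1).
# Assumption: the original script may use a different estimator.
fit_beta_moments <- function(x) {
  m <- mean(x)
  v <- var(x)
  if (v >= m * (1 - m)) stop("variance too large for a Beta distribution")
  k <- m * (1 - m) / v - 1
  list(shape1 = m * k, shape2 = (1 - m) * k)
}

set.seed(42)

# Synthetic stand-in for observed per-individual CNV counts.
observed_counts <- rnbinom(5000, size = 3, mu = 100)
nb <- fit_negbin_moments(observed_counts)
simulated_counts <- rnbinom(199, size = nb$size, mu = nb$mu)

# Synthetic stand-in for observed algorithm overlap values.
observed_overlap <- rbeta(5000, shape1 = 8, shape2 = 2)
bt <- fit_beta_moments(observed_overlap)
simulated_overlap <- rbeta(199, shape1 = bt$shape1, shape2 = bt$shape2)

str(nb)
str(bt)
```

Fitting the Beta parameters separately for inherited and de novo events, as the table describes, would amount to calling the second function once per group.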
Simulation hierarchy
Simulation follows a hierarchical structure preserving the trio family design:
For each trio (child + father + mother), draw N CNVs ~ NegBin(μ, size)
For each CNV: simulate type, size, chromosome, position, quality metrics, and inheritance status
If inherited: generate a matching parental CNV with ±5% position jitter
Generate additional non-transmitted parental CNVs independently
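The steps above can be sketched for a single trio as follows. This is a toy reconstruction, not the authors' script: every function name is hypothetical, `mu`, `size`, `p_inherited` and the size vector are placeholders, chromosomes are drawn with equal weights, the count drawn for the trio is treated as the number of child CNVs, and non-transmitted parental CNVs are left out for brevity. The jitter is assumed to shift the whole CNV by at most 5% of its length, which is one reading that keeps reciprocal overlap at or above 95%.

```r
# Reciprocal overlap: overlap length divided by the longer interval,
# which equals the smaller of the two overlap fractions.
reciprocal_overlap <- function(s1, e1, s2, e2) {
  ov <- pmax(0, pmin(e1, e2) - pmax(s1, s2))
  ov / pmax(e1 - s1, e2 - s2)
}

# Assumption: shift the whole interval by at most max_frac of its length.
# trunc() rounds toward zero so the shift never exceeds that bound.
jitter_cnv <- function(start, stop, max_frac = 0.05) {
  shift <- trunc(runif(length(start), -max_frac, max_frac) * (stop - start))
  data.frame(Start = start + shift, Stop = stop + shift)
}

simulate_trio <- function(trio_id, mu, size, p_del, p_inherited,
                          sizes, chr_len) {
  n <- rnbinom(1, size = size, mu = mu)               # CNV count
  type <- ifelse(runif(n) < p_del, "DEL", "DUP")      # Bernoulli type
  len <- sizes[sample.int(length(sizes), n, replace = TRUE)]  # resampled size
  chr <- sample(names(chr_len), n, replace = TRUE)    # equal weights (toy)
  # Uniform start, leaving room for the CNV and its jittered copy.
  start <- floor(runif(n, min = len, max = chr_len[chr] - 2 * len))
  child <- data.frame(
    Chr = chr, Start = start, Stop = start + len, Type = type,
    SampleID = rep(sprintf("SIM_C_%04d", trio_id), n),
    inherited = runif(n) < p_inherited,
    stringsAsFactors = FALSE
  )
  # Inherited CNVs get a jittered copy in a randomly chosen parent.
  inh <- child[child$inherited, ]
  moved <- jitter_cnv(inh$Start, inh$Stop)
  role <- sample(c("F", "M"), nrow(inh), replace = TRUE)
  parents <- data.frame(
    Chr = inh$Chr, Start = moved$Start, Stop = moved$Stop, Type = inh$Type,
    SampleID = sprintf("SIM_%s_%04d", role, trio_id),
    stringsAsFactors = FALSE
  )
  list(child = child, parents = parents)
}

set.seed(42)
chr_len <- c(chr1 = 248956422, chr2 = 242193529)  # GRCh38 lengths
toy_sizes <- c(1200, 3952, 15000, 80000)          # placeholder sizes in bp
trio <- simulate_trio(1, mu = 20, size = 3, p_del = 0.657,
                      p_inherited = 0.8, sizes = toy_sizes, chr_len = chr_len)

inh <- trio$child[trio$child$inherited, ]
ro <- reciprocal_overlap(inh$Start, inh$Stop,
                         trio$parents$Start, trio$parents$Stop)
all(ro >= 0.95)
```

Weighting chromosomes by observed frequencies, as described in the simulation table, would only require passing a `prob` vector to `sample()`.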
Downstream annotation
Simulated CNVs do not include gene annotations, LOEUF scores, or inheritance
labels. These are computed by the MCNV2 pipeline (annotate() and
compute_inheritance()) using real GRCh38 genomic resources, ensuring that
downstream annotations reflect true biological signal rather than simulated values.
Reproducibility
The simulation script uses set.seed(42) for full reproducibility. The script
is provided as a methodological reference; re-running it requires access to the
original SPARK callset, which is not publicly available.
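As a brief illustration of what the fixed seed provides, the sketch below draws from a negative binomial twice after the same `set.seed()` call and confirms that the draws are identical. The distribution parameters are placeholders. Identical output across machines also depends on the R version and RNG settings; for example, R 3.6.0 changed the default algorithm used by `sample()`, so recording `RNGkind()` or `sessionInfo()` alongside the results is a reasonable precaution.

```r
# Same seed, same sequence of calls: the draws are identical.
set.seed(42)
first_run <- rnbinom(5, size = 3, mu = 20)

set.seed(42)
second_run <- rnbinom(5, size = 3, mu = 20)

identical(first_run, second_run)

# Record the generator settings used for the run.
RNGkind()
```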
Methods text for citation
Test datasets were generated by empirical parametric simulation calibrated on a real whole-genome sequencing callset (SPARK cohort). The number of CNVs per individual was modeled using a negative binomial distribution fitted to the observed counts. CNV type, size, chromosomal assignment, and technical quality variables (Score, SNP count, number of supporting algorithms) were simulated by conditional empirical resampling stratified by CNV type, size bin, algorithm concordance, and inheritance status. Algorithm overlap for concordant calls was drawn from a Beta distribution with parameters estimated separately for inherited and de novo events. Inheritance probability was modeled as a stratified lookup table by CNV type, size bin, and algorithm concordance. Simulated CNVs were then processed through the full MCNV2 annotation pipeline, ensuring that gene annotations, LOEUF scores, and inheritance labels were derived from real genomic resources.