CLI tutorial
============

MCNV2 provides R functions for **reproducible batch processing** and integration into automated pipelines.

.. note::

   These functions are R wrappers that call Python scripts internally. You must have 
   the MCNV2 R package installed and Python environment configured 
   (see :doc:`../getting-started/installation`).

Overview
--------

MCNV2 provides three main functions for command-line workflows:

1. **annotate()** — Annotate CNVs with genes, LOEUF scores, and problematic regions
2. **compute_inheritance()** — Calculate inheritance status (Transmitted_CNV and Transmitted_gene)
3. **compute_mp()** — Calculate Mendelian Precision with filtering options

Typical workflow
----------------

.. code-block:: R

   library(MCNV2)
   
   # 1. Annotate CNVs
   annotate(
     cnvs_file = "data/cnvs.tsv",
     prob_regions_file = "data/problematic_regions.bed",
     output_file = "results/cnvs_annotated.tsv",
     genome_version = 38,
     bedtools_path = "/usr/local/bin/bedtools"
   )
   
   # 2. Compute inheritance status
   compute_inheritance(
     cnvs_file = "results/cnvs_annotated.tsv",
     pedigree_file = "data/pedigree.tsv",
     output_file = "results/cnvs_inheritance.tsv",
     overlap = 0.5
   )
   
   # 3. Compute Mendelian Precision
   # Note: MP is stratified by CNV type (DEL vs DUP) by default
   compute_mp(
     inheritance_file = "results/cnvs_inheritance.tsv",
     output_file = "results/mp_summary.tsv",
     transmission_type = "cnv",
     min_size = 30000,
     max_prob_regions = 0.5
   )

Function reference
------------------

annotate()
~~~~~~~~~~

Annotate CNVs with gene information, LOEUF scores, and problematic region overlap.

**Usage:**

.. code-block:: R

   annotate(
     cnvs_file,
     prob_regions_file,
     output_file,
     genome_version = 38,
     bedtools_path
   )

**Parameters:**

* **cnvs_file** (character) — Path to CNV file (tab-delimited)

  Required columns: CHR, START, STOP, TYPE, SAMPLE_ID

* **prob_regions_file** (character) — Path to problematic regions BED file

  Default file provided in package if not specified

* **output_file** (character) — Path for output annotated file

* **genome_version** (numeric) — Genome build: 38 (GRCh38/hg38) or 37 (GRCh37/hg19)

  Default: 38

* **bedtools_path** (character) — Path to bedtools executable

  Example: "/usr/local/bin/bedtools"

**Returns:**

* 0 if successful
* 1 if output file was not created (error)

**Output columns:**

All input columns plus:

* GeneName — HGNC gene symbol
* GeneID — Ensembl gene ID
* Transcript — Ensembl transcript ID
* LOEUF — Loss-of-function constraint score (gnomAD v4)
* problematic_region_overlap — Percentage overlap with problematic regions

**Example:**

.. code-block:: R

   library(MCNV2)
   
   # Annotate CNVs
   status <- annotate(
     cnvs_file = "data/cnvs.tsv",
     prob_regions_file = system.file("resources", "problematic_regions.bed", 
                                     package = "MCNV2"),
     output_file = "results/cnvs_annotated.tsv",
     genome_version = 38,
     bedtools_path = "/usr/local/bin/bedtools"
   )
   
   if (status == 0) {
     message("Annotation completed successfully")
   } else {
     stop("Annotation failed")
   }

compute_inheritance()
~~~~~~~~~~~~~~~~~~~~~

Calculate inheritance status for each CNV based on parental data.

**Usage:**

.. code-block:: R

   compute_inheritance(
     cnvs_file,
     pedigree_file,
     output_file,
     overlap = 0.5
   )

**Parameters:**

* **cnvs_file** (character) — Path to annotated CNV file (output from annotate())

* **pedigree_file** (character) — Path to pedigree file

  Three columns: SAMPLE_ID, FATHER_ID, MOTHER_ID (tab-delimited, no header)

* **output_file** (character) — Path for output file with inheritance status

* **overlap** (numeric) — Minimum overlap for a CNV to be considered as inherited

  Range: 0.01 to 1.0
  
  Default: 0.5 (50%)

**Returns:**

* 0 if successful
* 1 if output file was not created (error)

**Output columns:**

All input columns plus:

* **Transmitted_CNV** — True/False (coordinate-based inheritance)
* **Transmitted_gene** — True/False/intergenic (gene-based inheritance)

**Example:**

.. code-block:: R

   library(MCNV2)
   
   # Compute inheritance with 50% overlap threshold
   status <- compute_inheritance(
     cnvs_file = "results/cnvs_annotated.tsv",
     pedigree_file = "data/pedigree.tsv",
     output_file = "results/cnvs_inheritance.tsv",
     overlap = 0.5
   )
   
   if (status == 0) {
     message("Inheritance calculation completed")
     
     # Read results
     cnvs <- read.table("results/cnvs_inheritance.tsv", 
                        header = TRUE, sep = "\t", quote = "")
     
     # Count inherited vs non-inherited
     table(cnvs$Transmitted_CNV)
   }

compute_mp()
~~~~~~~~~~~~

Calculate Mendelian Precision with optional filtering.

.. important::

   **MP is always stratified by CNV type (DEL vs DUP) by default.** 
   
   This is the recommended approach as deletions and duplications have different 
   quality profiles. Set ``stratify_by_type = FALSE`` only if you specifically need 
   a single global MP value (not recommended).

.. important::

   This function requires the output file from ``compute_inheritance()`` as input.

**Usage:**

.. code-block:: R

   compute_mp(
     inheritance_file,
     output_file,
     transmission_type = "cnv",
     min_size = NULL,
     max_size = NULL,
     min_score = NULL,
     max_prob_regions = NULL,
     min_loeuf = NULL,
     stratify_by_size = FALSE,
     stratify_by_type = TRUE
   )

**Parameters:**

* **inheritance_file** (character) — Path to inheritance file

  **Must be the output from compute_inheritance()**
  
  Required columns: Transmitted_CNV, Transmitted_gene

* **output_file** (character) — Path for MP summary output

* **transmission_type** (character) — Transmission matching type

  * "cnv" — Use Transmitted_CNV (coordinate-based)
  * "gene" — Use Transmitted_gene (gene-based)
  
  Default: "cnv"

* **min_size** (numeric) — Minimum CNV size in bp (optional)

  Example: 30000 (30 kb)

* **max_size** (numeric) — Maximum CNV size in bp (optional)

* **min_score** (numeric) — Minimum quality score (optional)

* **max_prob_regions** (numeric) — Maximum problematic regions overlap (0-1, optional)

  Example: 0.5 (exclude CNVs with >50% overlap)

* **min_loeuf** (numeric) — Minimum LOEUF threshold (optional)

  Example: 0.6 (exclude constrained genes)

* **stratify_by_size** (logical) — Stratify MP by size ranges

  Default: FALSE

* **stratify_by_type** (logical) — Stratify MP by CNV type (DEL/DUP)

  Default: TRUE (recommended)
  
  **Keep TRUE** unless you specifically need a single global MP value. 
  Deletions and duplications have different quality profiles and should be 
  evaluated separately.

**Returns:**

* 0 if successful
* 1 if output file was not created (error)

**Output:**

Tab-delimited file with MP statistics:

* CNV_type (DEL/DUP, or All if stratify_by_type=FALSE)
* Size_range (All, or specific ranges if stratify_by_size=TRUE)
* Total_CNVs
* Inherited_CNVs
* Non_inherited_CNVs
* MP (Mendelian Precision %)

**Example output (default: stratified by type):**

.. code-block:: text

   CNV_type  Size_range  Total_CNVs  Inherited_CNVs  Non_inherited_CNVs   MP
   DEL       All         5000        4250            750                  85.0
   DUP       All         3000        2400            600                  80.0

**Example output (stratified by type AND size):**

.. code-block:: text

   CNV_type  Size_range   Total_CNVs  Inherited_CNVs  Non_inherited_CNVs   MP
   DEL       1-30kb       800         600             200                  75.0
   DEL       30-50kb      600         540             60                   90.0
   DEL       50-100kb     900         855             45                   95.0
   DUP       1-30kb       600         420             180                  70.0
   DUP       30-50kb      500         425             75                   85.0
   DUP       50-100kb     700         665             35                   95.0

**Example 1: Basic MP calculation (stratified by type)**

.. code-block:: R

   library(MCNV2)
   
   # Calculate MP (CNV-level, stratified by DEL/DUP, no filters)
   compute_mp(
     inheritance_file = "results/cnvs_inheritance.tsv",
     output_file = "results/mp_summary.tsv",
     transmission_type = "cnv"
   )
   
   # Read MP results
   mp <- read.table("results/mp_summary.tsv", header = TRUE, sep = "\t")
   print(mp)
   #   CNV_type  Size_range  Total_CNVs  Inherited_CNVs  Non_inherited_CNVs   MP
   #   DEL       All         5000        4250            750                  85.0
   #   DUP       All         3000        2400            600                  80.0

**Example 2: MP with filtering**

.. code-block:: R

   # Calculate MP with size and quality filters
   # Still stratified by DEL/DUP (default)
   compute_mp(
     inheritance_file = "results/cnvs_inheritance.tsv",
     output_file = "results/mp_filtered.tsv",
     transmission_type = "cnv",
     min_size = 30000,           # ≥30 kb
     max_prob_regions = 0.5,     # ≤50% prob regions overlap
     min_score = 100             # Score ≥100
   )
   
   # Output shows DEL and DUP separately after filtering
   #   CNV_type  Size_range  Total_CNVs  Inherited_CNVs  Non_inherited_CNVs   MP
   #   DEL       All         3000        2775            225                  92.5
   #   DUP       All         1800        1620            180                  90.0

**Example 3: Technical MP (excluding constrained genes)**

.. code-block:: R

   # Calculate technical MP (excluding LOEUF < 0.6)
   compute_mp(
     inheritance_file = "results/cnvs_inheritance.tsv",
     output_file = "results/mp_technical.tsv",
     transmission_type = "gene",
     min_loeuf = 0.6
   )
   
   # Output:
   #   CNV_type  Size_range  Total_CNVs  Inherited_CNVs  Non_inherited_CNVs   MP
   #   DEL       All         4500        4185            315                  93.0
   #   DUP       All         2700        2511            189                  93.0

**Example 4: MP stratified by size**

.. code-block:: R

   # Calculate MP for each size range
   compute_mp(
     inheritance_file = "results/cnvs_inheritance.tsv",
     output_file = "results/mp_by_size.tsv",
     transmission_type = "cnv",
     stratify_by_size = TRUE,
     stratify_by_type = TRUE
   )
   
   # Output shows DEL and DUP for each size range
   #   CNV_type  Size_range   Total_CNVs  Inherited_CNVs  Non_inherited_CNVs   MP
   #   DEL       1-30kb       800         600             200                  75.0
   #   DEL       30-50kb      600         540             60                   90.0
   #   DEL       50-100kb     900         855             45                   95.0
   #   DEL       100-200kb    700         686             14                   98.0
   #   DUP       1-30kb       600         420             180                  70.0
   #   DUP       30-50kb      500         425             75                   85.0
   #   DUP       50-100kb     700         665             35                   95.0
   #   DUP       100-200kb    500         490             10                   98.0

**Example 5: Global MP (not recommended)**

.. code-block:: R

   # Calculate single global MP value (not stratified by type)
   # NOT RECOMMENDED: loses information about DEL vs DUP differences
   compute_mp(
     inheritance_file = "results/cnvs_inheritance.tsv",
     output_file = "results/mp_global.tsv",
     transmission_type = "cnv",
     stratify_by_type = FALSE
   )
   
   # Output (single row):
   #   CNV_type  Size_range  Total_CNVs  Inherited_CNVs  Non_inherited_CNVs   MP
   #   All       All         8000        6650            1350                 83.1

Batch processing example
------------------------

Process multiple datasets:

.. code-block:: R

   library(MCNV2)
   
   # List of datasets
   datasets <- c("cohort1", "cohort2", "cohort3")
   
   for (dataset in datasets) {
     message(paste("Processing", dataset))
     
     # 1. Annotate
     annotate(
       cnvs_file = paste0("data/", dataset, "_cnvs.tsv"),
       prob_regions_file = system.file("resources", "problematic_regions.bed", 
                                       package = "MCNV2"),
       output_file = paste0("results/", dataset, "_annotated.tsv"),
       genome_version = 38,
       bedtools_path = "/usr/local/bin/bedtools"
     )
     
     # 2. Inheritance
     compute_inheritance(
       cnvs_file = paste0("results/", dataset, "_annotated.tsv"),
       pedigree_file = paste0("data/", dataset, "_pedigree.tsv"),
       output_file = paste0("results/", dataset, "_inheritance.tsv"),
       overlap = 0.5
     )
     
     # 3. MP (multiple strategies)
     # All MP calculations are stratified by type (DEL/DUP) by default
     
     # 3a. Overall MP
     compute_mp(
       inheritance_file = paste0("results/", dataset, "_inheritance.tsv"),
       output_file = paste0("results/", dataset, "_mp_all.tsv"),
       transmission_type = "cnv"
     )
     
     # 3b. Filtered MP
     compute_mp(
       inheritance_file = paste0("results/", dataset, "_inheritance.tsv"),
       output_file = paste0("results/", dataset, "_mp_filtered.tsv"),
       transmission_type = "cnv",
       min_size = 30000,
       max_prob_regions = 0.5
     )
     
     # 3c. Technical MP
     compute_mp(
       inheritance_file = paste0("results/", dataset, "_inheritance.tsv"),
       output_file = paste0("results/", dataset, "_mp_technical.tsv"),
       transmission_type = "gene",
       min_loeuf = 0.6
     )
   }
   
   # Combine results (each file has DEL and DUP rows)
   all_mp <- do.call(rbind, lapply(datasets, function(d) {
     mp <- read.table(paste0("results/", d, "_mp_all.tsv"), 
                      header = TRUE, sep = "\t")
     mp$Dataset <- d
     return(mp)
   }))
   
   # Result has DEL and DUP rows for each dataset
   write.table(all_mp, "results/combined_mp.tsv", 
               sep = "\t", row.names = FALSE, quote = FALSE)

Pipeline integration
--------------------

Nextflow workflow
~~~~~~~~~~~~~~~~~

.. code-block:: groovy

   // main.nf
   process annotate {
       input:
       path cnv
       path prob_regions
       
       output:
       path "${cnv.baseName}_annotated.tsv"
       
       script:
       """
       Rscript -e '
       MCNV2::annotate(
           cnvs_file = "${cnv}",
           prob_regions_file = "${prob_regions}",
           output_file = "${cnv.baseName}_annotated.tsv",
           genome_version = 38,
           bedtools_path = "/usr/local/bin/bedtools"
       )'
       """
   }
   
   process inheritance {
       input:
       path annotated
       path pedigree
       
       output:
       path "${annotated.baseName}_inheritance.tsv"
       
       script:
       """
       Rscript -e '
       MCNV2::compute_inheritance(
           cnvs_file = "${annotated}",
           pedigree_file = "${pedigree}",
           output_file = "${annotated.baseName}_inheritance.tsv",
           overlap = 0.5
       )'
       """
   }
   
   process mp {
       input:
       path inheritance
       
       output:
       path "${inheritance.baseName}_mp.tsv"
       
       script:
       """
       Rscript -e '
       MCNV2::compute_mp(
           inheritance_file = "${inheritance}",
           output_file = "${inheritance.baseName}_mp.tsv",
           transmission_type = "cnv",
           min_size = 30000
       )'
       """
   }

Error handling
--------------

Check return codes:

.. code-block:: R

   library(MCNV2)
   
   # Annotate with error checking
   status <- annotate(
     cnvs_file = "data/cnvs.tsv",
     prob_regions_file = "data/problematic_regions.bed",
     output_file = "results/annotated.tsv",
     genome_version = 38,
     bedtools_path = "/usr/local/bin/bedtools"
   )
   
   if (status != 0) {
     stop("Annotation failed. Check input files and bedtools path.")
   }
   
   # Verify output exists and is not empty
   if (!file.exists("results/annotated.tsv")) {
     stop("Output file was not created")
   }
   
   if (file.size("results/annotated.tsv") == 0) {
     stop("Output file is empty")
   }
   
   message("Annotation completed successfully")

Performance considerations
--------------------------

**Memory requirements:**

* Annotation: ~4 GB for 100,000 CNVs
* Inheritance: ~8 GB for 100,000 CNVs in 1,000 trios
* MP calculation: ~2 GB

**Runtime (approximate):**

* Annotation: ~5 min for 100,000 CNVs
* Inheritance: ~10 min for 100,000 CNVs in 1,000 trios
* MP calculation: ~1 min

**Recommendations:**

* For large datasets (>500,000 CNVs), consider splitting by chromosome
* Use parallel processing for batch analysis of multiple cohorts
* Ensure sufficient disk space for intermediate files

See also
--------

* :doc:`../user-guide/preprocessing` — Preprocessing steps explained
* :doc:`../user-guide/inheritance` — Inheritance calculation details
* :doc:`../user-guide/mendelian_precision` — MP calculation and interpretation
* :doc:`../user-guide/filtering` — Filtering strategies