Usage

Command Line Interface

PhenoQC provides a command-line interface for batch processing of phenotypic data files.

Basic Usage

Process a single file:

phenoqc \
  --input examples/samples/sample_data.json \
  --output ./reports/ \
  --schema examples/schemas/pheno_schema.json \
  --config config.yaml \
  --impute mice \
  --unique_identifiers SampleID \
  --phenotype_columns '{"PrimaryPhenotype": ["HPO"], "DiseaseCode": ["DO"]}' \
  --ontologies HPO DO

Batch Processing

Process multiple files:

phenoqc \
  --input examples/samples/sample_data.csv examples/samples/sample_data.json \
  --output ./reports/ \
  --schema examples/schemas/pheno_schema.json \
  --config config.yaml \
  --impute none \
  --unique_identifiers SampleID \
  --ontologies HPO DO MPO

Parameters

  • --input: One or more data files or directories (.csv, .tsv, .json, .zip)

  • --output: Directory for saving processed data and reports

  • --schema: Path to the JSON schema for data validation

  • --config: YAML config file defining ontologies and settings

  • --custom_mappings: Path to custom term-mapping JSON (optional)

  • --impute: Strategy for missing data (mean, median, mode, knn, mice, svd, none)

  • --unique_identifiers: Columns that uniquely identify each record

  • --phenotype_columns: JSON mapping of columns to ontologies

  • --ontologies: List of ontology IDs

  • --recursive: Enable recursive scanning of directories

Graphical User Interface

Launch the GUI:

python run_gui.py

The GUI provides an interactive interface for:

  1. Uploading configuration and schema files

  2. Uploading data files

  3. Selecting unique identifiers and ontologies

  4. Choosing missing data strategies

  5. Running QC and viewing results

Configuration

PhenoQC uses a YAML configuration file to define settings. Example config.yaml:

ontologies:
  HPO:
    name: Human Phenotype Ontology
    source: url
    url: http://purl.obolibrary.org/obo/hp.obo
    format: obo
  DO:
    name: Disease Ontology
    source: url
    url: http://purl.obolibrary.org/obo/doid.obo
    format: obo

default_ontologies:
  - HPO
  - DO

fuzzy_threshold: 80
cache_expiry_days: 30

imputation_strategies:
  Age: mean
  Gender: mode
  Height: median

Output

PhenoQC generates:

  1. Validated and processed data files

  2. Quality control reports (PDF/Markdown)

  3. Visual summaries of data quality

  4. Detailed logs of the QC process

Troubleshooting

Common issues:

  1. Ontology Mapping Failures: Check if config.yaml points to valid ontology URLs

  2. Missing Required Columns: Ensure specified columns exist in the dataset

  3. Imputation Errors: Verify column data types match imputation strategy

  4. Logs: Check phenoqc_*.log for detailed error messages