API Reference

This section provides detailed API documentation for PhenoQC’s modules.

Validation Module

class src.validation.DataValidator(df: DataFrame, schema: Dict[str, Any], unique_identifiers: List[str], reference_data: DataFrame | None = None, reference_columns: List[str] | None = None)[source]

Bases: object

A comprehensive DataValidator that performs both row-level JSON schema validation and cell-level property checks, along with detection of duplicates, conflicts, referential integrity, and anomalies.

__init__(df: DataFrame, schema: Dict[str, Any], unique_identifiers: List[str], reference_data: DataFrame | None = None, reference_columns: List[str] | None = None)[source]

Parameters:

df (pd.DataFrame) – The phenotypic data to validate.
schema (dict) – A JSON schema dict describing expected fields, types, constraints, etc.
unique_identifiers (list) – Column names that uniquely identify a record.
reference_data (pd.DataFrame, optional) – A reference dataset for cross-checking references (if any).
reference_columns (list, optional) – Which columns in df must match reference_data.

validate_format_rowwise() → bool[source]

Checks each row as a whole against the JSON schema. If a row fails, the failure is recorded in self.integrity_issues and the row is marked with a ‘SchemaViolationFlag’ in self.df.

Returns: True if all rows pass, False if any row fails.
Return type: bool
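
The row-level check described above can be sketched in plain Python. This is a simplified stand-in, not PhenoQC's implementation: it checks only the "required" and "type" keywords rather than the full JSON Schema vocabulary, works on a list of dicts instead of a DataFrame, and the names row_is_valid and flag_schema_violations are hypothetical.

```python
from typing import Any, Dict, List

# Small subset of JSON Schema type names mapped to Python types (assumption).
JSON_TYPES = {
    "string": (str,),
    "number": (int, float),
    "integer": (int,),
    "null": (type(None),),
}


def row_is_valid(row: Dict[str, Any], schema: Dict[str, Any]) -> bool:
    """Return True if all required fields exist and typed fields match."""
    if any(field not in row for field in schema.get("required", [])):
        return False
    for field, rules in schema.get("properties", {}).items():
        if field not in row or "type" not in rules:
            continue
        declared = rules["type"]
        names = [declared] if isinstance(declared, str) else declared
        # Unknown type names accept any value in this sketch.
        allowed = tuple(t for n in names for t in JSON_TYPES.get(n, (object,)))
        if not isinstance(row[field], allowed):
            return False
    return True


def flag_schema_violations(rows: List[Dict[str, Any]], schema: Dict[str, Any]) -> bool:
    """Set SchemaViolationFlag on each row; return True only if every row passes."""
    all_valid = True
    for row in rows:
        valid = row_is_valid(row, schema)
        row["SchemaViolationFlag"] = not valid
        all_valid = all_valid and valid
    return all_valid
```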

validate_row_json_schema(row_idx: int, row_dict: Dict[str, Any]) → bool[source]: Validates a single row against the JSON schema. Returns True if valid, False if invalid.

validate_cells()[source]

Checks each cell in self.df against the schema’s “properties” constraints, such as type, minimum, and format.

self.invalid_mask[row, col] is set to True for every cell that fails.
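
To illustrate the idea of a per-cell invalid mask, here is a minimal sketch that uses a dict keyed by (row index, column) in place of a DataFrame-shaped mask. It handles only the "type": "number" and "minimum" constraints, and it assumes missing values (None) are not counted as invalid; both choices, and the name build_invalid_mask, are assumptions made for illustration.

```python
from typing import Any, Dict, List, Tuple


def build_invalid_mask(
    rows: List[Dict[str, Any]], properties: Dict[str, Dict[str, Any]]
) -> Dict[Tuple[int, str], bool]:
    """Return {(row_index, column): failed}, where True means the cell fails."""
    mask = {}
    for i, row in enumerate(rows):
        for col, rules in properties.items():
            value = row.get(col)
            failed = False
            if value is not None:  # missing cells are left to other checks
                is_number = isinstance(value, (int, float))
                if rules.get("type") == "number" and not is_number:
                    failed = True
                elif "minimum" in rules and is_number and value < rules["minimum"]:
                    failed = True
            mask[(i, col)] = failed
    return mask
```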

identify_duplicates() → DataFrame[source]: Identifies rows that share the same unique_identifiers.

detect_conflicts() → DataFrame[source]: Among the identified duplicates, detects rows that have conflicting info in columns other than unique_identifiers.
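
The relationship between duplicates and conflicts can be sketched as follows: rows are grouped by their identifier values, groups with more than one row are duplicates, and a duplicate group is a conflict when any non-identifier column holds more than one distinct value. The sketch uses lists of dicts rather than DataFrames, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def group_duplicates(
    rows: List[Dict[str, Any]], unique_identifiers: List[str]
) -> Dict[Tuple, List[Dict[str, Any]]]:
    """Group rows by identifier values and keep only groups with 2+ rows."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in unique_identifiers)
        groups[key].append(row)
    return {key: grp for key, grp in groups.items() if len(grp) > 1}


def find_conflicts(
    duplicate_groups: Dict[Tuple, List[Dict[str, Any]]], unique_identifiers: List[str]
) -> Dict[Tuple, List[str]]:
    """For each duplicate group, list the non-identifier columns that disagree."""
    conflicts = {}
    for key, grp in duplicate_groups.items():
        other_cols = [c for c in grp[0] if c not in unique_identifiers]
        differing = [c for c in other_cols if len({r.get(c) for r in grp}) > 1]
        if differing:
            conflicts[key] = differing
    return conflicts
```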

verify_integrity() → DataFrame[source]: Checks for required fields, typed constraints, referential integrity, etc.

check_referential_integrity()[source]: Ensures that values in self.reference_columns exist in self.reference_data.

detect_anomalies()[source]: Simple numeric outlier detection that flags values with a Z-score greater than 3.
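
A Z-score threshold of this kind can be sketched with the standard library. The sketch assumes the absolute Z-score is compared against the threshold and that the sample standard deviation is used; the reference above does not state either detail, and zscore_outliers is a hypothetical name.

```python
import statistics
from typing import List


def zscore_outliers(values: List[float], threshold: float = 3.0) -> List[int]:
    """Return the indices of values whose |z-score| exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (assumption)
    if std == 0:
        return []  # all values identical, so nothing can be an outlier
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]
```

Note that with very few data points no value can reach a Z-score of 3, so a check like this is only meaningful on columns with a reasonable number of observations.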

run_all_validations() → Dict[str, Any][source]: Runs row-level validation, cell-level checks, duplicates, conflicts, referential checks, and anomaly detection.

Mapping Module

class src.mapping.OntologyMapper(config_source)[source]

Bases: object

CACHE_DIR = '/home/docs/.phenoqc/ontologies'

__init__(config_source)[source]

Initializes the OntologyMapper by loading ontologies from a config source.

Parameters: config_source (Union[str, dict]) – Either a string path to the configuration file (YAML/JSON) or an already-loaded dict with configuration data.

load_config(config_path: str) → Dict[str, Any][source]

Loads the YAML configuration file.

Parameters: config_path (str) – Path to the configuration YAML file.
Returns: Configuration parameters.
Return type: dict

load_ontologies() → Dict[str, Dict[str, str]][source]

Loads all ontologies specified in the configuration file.

Returns: A dictionary where keys are ontology identifiers and values are term mapping dictionaries.
Return type: dict

fetch_ontology_file_with_cache(ontology_id: str, url: str, file_format: str) → str[source]

Fetches the ontology file from the cache or downloads it if not present or expired.

Parameters:

ontology_id (str) – The ontology identifier.
url (str) – The URL to download the ontology from.
file_format (str) – The format of the ontology file (‘obo’, ‘owl’, ‘json’).

Returns:

Path to the saved ontology file.

Return type:

str
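
The "use the cache unless the file is missing or expired" decision can be sketched as a file-age check. The 30-day expiry window is an assumption (the reference above does not state the actual period), and cached_file_is_fresh is a hypothetical helper rather than part of the documented API.

```python
import os
import time
from typing import Optional


def cached_file_is_fresh(
    path: str, max_age_days: float = 30, now: Optional[float] = None
) -> bool:
    """Return True if the cached file exists and is younger than max_age_days."""
    if not os.path.exists(path):
        return False  # nothing cached yet, so a download is needed
    current = time.time() if now is None else now
    age_seconds = current - os.path.getmtime(path)
    return age_seconds <= max_age_days * 86400
```

A caller would download the ontology only when this check returns False, then return the cached path either way.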

parse_ontology(ontology_file_path: str, file_format: str) → Dict[str, str][source]

Parses an ontology file into a mapping dictionary.

Parameters:

ontology_file_path (str) – Path to the ontology file.
file_format (str) – The format of the ontology file (‘obo’, ‘owl’, ‘json’).

Returns:

Mapping from term names and synonyms to their standardized IDs.

Return type:

dict
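
For the ‘obo’ case, the mapping from names and synonyms to IDs can be sketched with a small line-based parser. This is an illustration only: it reads OBO text from a string, handles just the id, name, and synonym tags inside [Term] stanzas, and lowercases keys, which is an assumption. The term ID in the usage example is fictional.

```python
import re
from typing import Dict

SYNONYM_RE = re.compile(r'^synonym:\s*"(.+?)"')


def parse_obo_text(text: str) -> Dict[str, str]:
    """Map lowercased term names and synonyms to term IDs ([Term] stanzas only)."""
    mapping = {}
    current_id = None
    in_term = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are of interest.
            in_term = line == "[Term]"
            current_id = None
        elif not in_term:
            continue
        elif line.startswith("id:"):
            current_id = line[3:].strip()
        elif line.startswith("name:") and current_id:
            mapping[line[5:].strip().lower()] = current_id
        elif line.startswith("synonym:") and current_id:
            match = SYNONYM_RE.match(line)
            if match:
                mapping[match.group(1).lower()] = current_id
    return mapping


sample = """format-version: 1.2

[Term]
id: EX:0000001
name: Short stature
synonym: "Reduced height" EXACT []

[Typedef]
id: part_of
name: part of
"""
obo_mapping = parse_obo_text(sample)
```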

map_term(term: str, target_ontologies: List[str] | None = None, custom_mappings: Dict[str, str] | None = None) → Dict[str, str | None][source]

Maps a phenotypic term to IDs in the specified ontologies.

Parameters:

term (str) – Phenotypic term to map.
target_ontologies (list, optional) – List of ontology identifiers to map to. If None, maps to the default ontologies.
custom_mappings (dict, optional) – Custom mappings for terms.

Returns:

Dictionary with ontology IDs mapped for the term.

Return type:

dict

map_terms(terms: List[str], target_ontologies: List[str] | None = None, custom_mappings: Dict[str, str] | None = None) → Dict[str, Dict[str, str | None]][source]

Maps a list of phenotypic terms to IDs in the specified ontologies.

Parameters:

terms (list) – List of phenotypic terms to map.
target_ontologies (list, optional) – List of ontology identifiers to map to. If None, maps to the default ontologies.
custom_mappings (dict, optional) – Custom mappings for terms.

Returns:

Nested dictionary {term: {ontology_id: mapped_id}}.

Return type:

dict
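
The return shapes of map_term and map_terms can be illustrated with plain dictionary lookups. The sketch below is not the OntologyMapper implementation: it takes the parsed ontologies as an explicit argument, matches terms case-insensitively (an assumption), and leaves out custom_mappings because the reference above does not say how they are combined with ontology lookups. The names lookup_term and lookup_terms, the ontology keys, and the term ID are hypothetical.

```python
from typing import Dict, List, Optional


def lookup_term(
    term: str,
    ontologies: Dict[str, Dict[str, str]],
    target_ontologies: Optional[List[str]] = None,
) -> Dict[str, Optional[str]]:
    """Return {ontology_id: mapped_id or None} for a single term."""
    key = term.strip().lower()
    targets = list(ontologies) if target_ontologies is None else target_ontologies
    return {onto: ontologies.get(onto, {}).get(key) for onto in targets}


def lookup_terms(
    terms: List[str],
    ontologies: Dict[str, Dict[str, str]],
    target_ontologies: Optional[List[str]] = None,
) -> Dict[str, Dict[str, Optional[str]]]:
    """Return the nested form {term: {ontology_id: mapped_id or None}}."""
    return {t: lookup_term(t, ontologies, target_ontologies) for t in terms}
```

Unmapped terms appear with None values rather than being dropped, which matches the `str | None` value type in the signatures above.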

get_supported_ontologies() → List[str][source]

Retrieves a list of supported ontology identifiers.

Returns: Supported ontology identifiers.
Return type: list

Missing Data Module

src.missing_data.detect_missing_data(df)[source]

Detects missing data in the DataFrame.

Parameters: df (pd.DataFrame) – DataFrame to check for missing data.
Returns: Count of missing values per column.
Return type: pd.Series

src.missing_data.flag_missing_data_records(df)[source]

Flags records with missing data for manual review.

Parameters: df (pd.DataFrame) – DataFrame to flag.
Returns: DataFrame with an additional ‘MissingDataFlag’ column.
Return type: pd.DataFrame
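
The two operations above, counting missing values per column and flagging incomplete records, can be sketched without pandas on a list of dicts. Treating both None and float NaN as missing is an assumption, and the helper names are hypothetical.

```python
import math
from typing import Any, Dict, List


def is_missing(value: Any) -> bool:
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def count_missing(rows: List[Dict[str, Any]]) -> Dict[str, int]:
    """Count missing values per column."""
    counts = {}
    for row in rows:
        for col, value in row.items():
            counts[col] = counts.get(col, 0) + (1 if is_missing(value) else 0)
    return counts


def flag_missing(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the rows with a boolean MissingDataFlag added."""
    return [
        {**row, "MissingDataFlag": any(is_missing(v) for v in row.values())}
        for row in rows
    ]
```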

src.missing_data.impute_missing_data(df, strategy='mean', field_strategies=None)[source]

Imputes missing data in the DataFrame using specified strategies, but only when appropriate (e.g., numeric columns for mean/median).

Parameters:

df (pd.DataFrame) – DataFrame to impute.
strategy (str) – Default imputation strategy (‘mean’, ‘median’, ‘mode’, ‘knn’, ‘mice’, ‘svd’, ‘none’).
field_strategies (dict) – Dictionary of column-specific imputation strategies, e.g. {"Height_cm": "median", "CategoryCol": "mode"}.

Returns:

DataFrame with imputed values.

Return type:

pd.DataFrame
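
How a default strategy combines with per-column overrides, and how mean/median are skipped for non-numeric columns, can be sketched as below. The sketch covers only ‘mean’, ‘median’, ‘mode’, and ‘none’; any other strategy (including ‘knn’, ‘mice’, and ‘svd’) leaves the column untouched here. It works on a dict of column lists with None as the missing marker, and the function names are hypothetical.

```python
import statistics
from typing import Any, Dict, List, Optional


def impute_column(values: List[Any], strategy: str) -> List[Any]:
    """Fill None entries in one column, or return it unchanged if not applicable."""
    present = [v for v in values if v is not None]
    if strategy == "none" or not present:
        return list(values)
    numeric = all(isinstance(v, (int, float)) for v in present)
    if strategy == "mean" and numeric:
        fill = statistics.mean(present)
    elif strategy == "median" and numeric:
        fill = statistics.median(present)
    elif strategy == "mode":
        fill = statistics.mode(present)
    else:
        return list(values)  # strategy does not apply to this column
    return [fill if v is None else v for v in values]


def impute(
    columns: Dict[str, List[Any]],
    strategy: str = "mean",
    field_strategies: Optional[Dict[str, str]] = None,
) -> Dict[str, List[Any]]:
    """Apply the per-column strategy if given, otherwise the default strategy."""
    overrides = field_strategies or {}
    return {
        name: impute_column(vals, overrides.get(name, strategy))
        for name, vals in columns.items()
    }
```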

Batch Processing Module

src.batch_processing.child_process_run(file_path, schema, ontology_mapper, unique_identifiers, custom_mappings, impute_strategy, field_strategies, output_dir, target_ontologies, report_format, chunksize, phenotype_columns, log_file_for_children)[source]: Top-level function called by each child process. It re-initializes logging in append mode, then runs process_file.

src.batch_processing.unique_output_name(file_path, output_dir, suffix='.csv')[source]

Creates a unique output filename using:

The original file’s base name (not the entire path),
A short 5-char hash based on that base name (to avoid collisions),
The original extension (e.g. .json -> ‘_json’),
And finally the desired suffix (.csv, _report.pdf, etc.).
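
The four components listed above can be assembled roughly as follows. The hash algorithm (SHA-256 here), the underscore separators, and the exact ordering of the pieces in the final name are assumptions; only the ingredients come from the description. The function name is prefixed with sketch_ to make clear it is not the library function.

```python
import hashlib
import os


def sketch_unique_output_name(file_path: str, output_dir: str, suffix: str = ".csv") -> str:
    """Build an output path from base name, 5-char hash, extension tag, and suffix."""
    base = os.path.basename(file_path)  # ignore the directory part
    stem, ext = os.path.splitext(base)
    # Short hash derived from the base name only (algorithm is an assumption).
    short_hash = hashlib.sha256(base.encode("utf-8")).hexdigest()[:5]
    ext_tag = ext.replace(".", "_")  # ".json" -> "_json"
    return os.path.join(output_dir, f"{stem}_{short_hash}{ext_tag}{suffix}")
```

Because the extension is folded into the name, inputs such as samples.csv and samples.json produce different output files even though they share a stem.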

src.batch_processing.convert_nans_to_none_for_string_cols(df, schema)[source]: Converts NaN to None for columns declared as type=["string", "null"] (or "string") in the JSON schema. This ensures row-level validation won’t flag them as float('NaN').
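
The reason for this conversion is that a NaN in a string column is a float, so a schema that allows "string" or "null" would reject it, whereas None maps to JSON null. A minimal sketch on a list of dicts, with a hypothetical function name:

```python
import math
from typing import Any, Dict, List


def nans_to_none_for_string_cols(
    rows: List[Dict[str, Any]], schema: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """Replace float NaN with None, but only in columns typed as string."""
    string_cols = set()
    for col, rules in schema.get("properties", {}).items():
        declared = rules.get("type")
        names = [declared] if isinstance(declared, str) else (declared or [])
        if "string" in names:
            string_cols.add(col)
    cleaned = []
    for row in rows:
        new_row = dict(row)  # leave the input rows unmodified
        for col in string_cols:
            value = new_row.get(col)
            if isinstance(value, float) and math.isnan(value):
                new_row[col] = None
        cleaned.append(new_row)
    return cleaned
```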

src.batch_processing.get_file_type(file_path)[source]: Returns ‘csv’, ‘tsv’, or ‘json’ depending on the file extension. Raises ValueError if unsupported.
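
Extension-based dispatch of this kind can be sketched as below. Case-insensitive matching and the wording of the error message are assumptions, and file_type_from_extension is a hypothetical name.

```python
import os

SUPPORTED_TYPES = {".csv": "csv", ".tsv": "tsv", ".json": "json"}


def file_type_from_extension(file_path: str) -> str:
    """Return 'csv', 'tsv', or 'json'; raise ValueError for anything else."""
    ext = os.path.splitext(file_path)[1].lower()
    if ext not in SUPPORTED_TYPES:
        raise ValueError(f"Unsupported file type: {ext!r}")
    return SUPPORTED_TYPES[ext]
```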

src.batch_processing.process_file(file_path, schema, ontology_mapper, unique_identifiers, custom_mappings=None, impute_strategy='mean', field_strategies=None, output_dir='reports', target_ontologies=None, report_format='pdf', chunksize=10000, phenotype_columns=None)[source]: Processes a single file, generating an output CSV and a PDF or Markdown report.

src.batch_processing.batch_process(files, schema_path, config_path, unique_identifiers, custom_mappings_path=None, impute_strategy='mean', output_dir='reports', target_ontologies=None, report_format='pdf', chunksize=10000, phenotype_columns=None, phenotype_column=None, log_file_for_children=None)[source]

Input Module

src.input.read_csv(file_path, chunksize=10000)[source]

Reads a CSV file and returns an iterator over pandas DataFrame chunks.

Parameters:

file_path (str) – Path to the CSV file.
na_values (list) – List of strings to be interpreted as NA/NaN.
keep_default_na (bool) – Whether to include the default NaN values.
chunksize (int) – Number of rows per chunk.

Returns:

DataFrame chunks.

Return type:

Iterator[pd.DataFrame]
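
Chunked reading means the caller iterates over fixed-size batches of rows instead of loading the whole file at once. The sketch below shows the pattern with the standard csv module, yielding lists of dict rows where the documented function yields DataFrames; iter_csv_chunks is a hypothetical name.

```python
import csv
import io
from itertools import islice
from typing import Dict, Iterator, List, TextIO


def iter_csv_chunks(handle: TextIO, chunksize: int = 10000) -> Iterator[List[Dict[str, str]]]:
    """Yield successive lists of at most chunksize rows from an open CSV file."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunksize))
        if not chunk:
            return  # input exhausted
        yield chunk


# Usage with an in-memory file; a real caller would pass an open file handle.
demo = io.StringIO("SampleID,Age\nS1,30\nS2,31\nS3,32\n")
chunk_sizes = [len(chunk) for chunk in iter_csv_chunks(demo, chunksize=2)]
```

Each chunk can be validated and written out before the next one is read, which keeps memory use bounded by the chunk size.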

src.input.read_tsv(file_path, chunksize=10000)[source]

Reads a TSV file and returns an iterator over pandas DataFrame chunks.

Parameters:

file_path (str) – Path to the TSV file.
chunksize (int) – Number of rows per chunk.

Returns:

DataFrame chunks.

Return type:

Iterator[pd.DataFrame]

src.input.read_json(file_path, chunksize=10000)[source]: Reads a JSON file and returns an iterator over pandas DataFrame chunks. Empty files and decode errors are handled gracefully.

src.input.load_data(file_path, file_type, chunksize=10000)[source]

Loads data from a file based on its type.

Parameters:

file_path (str) – Path to the data file.
file_type (str) – Type of the file (‘csv’, ‘tsv’, ‘json’).
chunksize (int) – Number of rows per chunk (for CSV/TSV).

Returns:

Data iterator for CSV/TSV/JSON.

Return type:

Iterator[pd.DataFrame]

Raises:

ValueError – If the file type is unsupported.

Reporting Module

src.reporting.generate_qc_report(validation_results, missing_data, flagged_records_count, mapping_success_rates, visualization_images, impute_strategy, quality_scores, output_path_or_buffer, report_format='pdf', file_identifier=None)[source]: Generates a quality control report (PDF or Markdown).

src.reporting.create_visual_summary(df, phenotype_columns=None, output_image_path=None)[source]

Creates visual summaries with extra steps to keep axis labels fully visible:

Missingness Heatmap (white/blue)
Bar plot of % missing per column
Numeric histograms ignoring ID columns
Optional bar/pie charts for phenotype columns

src.reporting.create_missingness_distribution(df)[source]: Returns a bar chart showing percent missingness per column.

src.reporting.create_missingness_heatmap(df)[source]: Generates a missingness heatmap with exactly two colors: White for present (0) and a pleasing blue (#3B82F6) for missing (1).

src.reporting.create_numeric_histograms(df, unique_id_cols=None, max_cols=5)[source]: Creates histogram figures for numeric columns, ignoring any columns that appear in unique_id_cols (if provided).

Configuration Module

src.configuration.load_config(config_file)[source]

Loads the configuration from a YAML or JSON file.

Parameters: config_file (str or file-like) – Path to the configuration file or an UploadedFile object.
Returns: Configuration settings.
Return type: dict

src.configuration.save_config(config, config_file)[source]

Saves the configuration to a YAML or JSON file.

Parameters:

config (dict) – Configuration settings.
config_file (str) – Path to save the configuration file.

Logging Module

src.logging_module.setup_logging(log_file=None, mode='w')[source]

Sets up the logging configuration once per process invocation.

If log_file is not specified, a new log filename is created automatically for the parent process by appending the current date and time to ‘phenoqc_’.

Child processes reuse the parent’s log_file, opened with mode='a'.

src.logging_module.log_activity(message, level='info')[source]

Logs an activity message to the current log file.

Parameters:

message (str) – Message to log.
level (str) – Logging level (‘info’, ‘warning’, ‘error’, ‘debug’).
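
The parent/child logging pattern described in this module can be sketched with the standard logging package: the parent opens the file in write mode, and each child re-initializes logging on the same file in append mode so earlier entries are kept. The timestamp format, the log line format, and the sketch_ function names are assumptions made for illustration.

```python
import logging
from datetime import datetime
from typing import Optional

LOG_FUNCTIONS = {
    "info": logging.info,
    "warning": logging.warning,
    "error": logging.error,
    "debug": logging.debug,
}


def sketch_setup_logging(log_file: Optional[str] = None, mode: str = "w") -> str:
    """Configure root logging to a file; generate a timestamped name if none given."""
    if log_file is None:
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # format is an assumption
        log_file = f"phenoqc_{stamp}.log"
    logging.basicConfig(
        filename=log_file,
        filemode=mode,
        level=logging.DEBUG,
        format="%(asctime)s - %(levelname)s - %(message)s",
        force=True,  # replace handlers from any earlier configuration (Python 3.8+)
    )
    return log_file


def sketch_log_activity(message: str, level: str = "info") -> None:
    """Log a message at the named level, falling back to info if unknown."""
    LOG_FUNCTIONS.get(level, logging.info)(message)
```

A parent would call sketch_setup_logging() once and pass the returned path to its children, which then call sketch_setup_logging(path, mode="a").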