API Reference
This section provides detailed API documentation for PhenoQC’s modules.
Validation Module
- class src.validation.DataValidator(df: DataFrame, schema: Dict[str, Any], unique_identifiers: List[str], reference_data: DataFrame | None = None, reference_columns: List[str] | None = None)[source]
Bases: object
A comprehensive DataValidator that performs both row-level JSON schema validation and cell-level property checks, and detects duplicates, conflicts, referential integrity violations, and anomalies.
- __init__(df: DataFrame, schema: Dict[str, Any], unique_identifiers: List[str], reference_data: DataFrame | None = None, reference_columns: List[str] | None = None)[source]
- Parameters:
df (pd.DataFrame) – The phenotypic data to validate.
schema (dict) – A JSON schema dict describing expected fields, types, constraints, etc.
unique_identifiers (list) – Column names that uniquely identify a record.
reference_data (pd.DataFrame, optional) – A reference dataset for cross-checking references (if any).
reference_columns (list, optional) – Which columns in df must match reference_data.
- validate_format_rowwise() → bool[source]
Checks each row as a whole against the JSON schema. If a row fails, we note it in self.integrity_issues and mark a ‘SchemaViolationFlag’ in self.df.
- Returns:
True if all rows pass, False if any row fails.
- Return type:
bool
- validate_row_json_schema(row_idx: int, row_dict: Dict[str, Any]) → bool[source]
Validates a single row against the JSON schema. Returns True if valid, False if invalid.
- validate_cells()[source]
Checks each cell in self.df against the constraints listed under the schema’s “properties”, such as type, minimum, and format.
self.invalid_mask[row, col] is set to True for every cell that fails.
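The cell-level pass can be pictured with a small standalone sketch. It is not PhenoQC's implementation: it uses plain lists of dicts instead of a DataFrame, supports only the `type` and `minimum` keywords, and the names `cell_is_valid` and `build_invalid_mask` are hypothetical.

```python
from typing import Any, Dict, List

# Subset of JSON Schema type names mapped to Python types (illustrative only).
TYPE_MAP = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "null": type(None),
}


def cell_is_valid(value: Any, prop: Dict[str, Any]) -> bool:
    """Check one value against a property's 'type' and 'minimum' constraints."""
    declared = prop.get("type")
    if declared is not None:
        names = declared if isinstance(declared, list) else [declared]
        # bool is a subclass of int in Python, so reject it explicitly here.
        if isinstance(value, bool):
            return False
        if not any(isinstance(value, TYPE_MAP.get(name, ())) for name in names):
            return False
    minimum = prop.get("minimum")
    if minimum is not None and isinstance(value, (int, float)) and value < minimum:
        return False
    return True


def build_invalid_mask(rows: List[Dict[str, Any]], schema: Dict[str, Any]) -> List[Dict[str, bool]]:
    """Return one dict per row; True marks a cell that fails its constraints."""
    props = schema.get("properties", {})
    return [
        {col: not cell_is_valid(row.get(col), prop) for col, prop in props.items()}
        for row in rows
    ]


schema = {
    "properties": {
        "SampleID": {"type": "string"},
        "Height_cm": {"type": ["number", "null"], "minimum": 0},
    }
}
rows = [
    {"SampleID": "S1", "Height_cm": 172.5},
    {"SampleID": "S2", "Height_cm": -4},    # violates minimum
    {"SampleID": 3, "Height_cm": None},     # wrong type for SampleID
]
mask = build_invalid_mask(rows, schema)
```

Here `mask[1]["Height_cm"]` and `mask[2]["SampleID"]` are True, and every other entry is False.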
- detect_conflicts() → DataFrame[source]
Among the identified duplicates, detects rows that have conflicting info in columns other than unique_identifiers.
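As a rough illustration of what "conflicting" means here, the sketch below groups rows by their identifier columns and keeps the groups whose remaining columns disagree. It is an assumption-laden stand-in that works on lists of dicts, not the library's code; `find_conflicts` is a hypothetical name.

```python
from collections import defaultdict
from typing import Any, Dict, List


def find_conflicts(rows: List[Dict[str, Any]], unique_identifiers: List[str]) -> List[Dict[str, Any]]:
    """Return duplicated rows whose non-identifier columns disagree."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in unique_identifiers)
        groups[key].append(row)

    conflicts = []
    for members in groups.values():
        if len(members) < 2:
            continue  # not a duplicate, so it cannot conflict
        other_cols = [col for col in members[0] if col not in unique_identifiers]
        # A group conflicts if any non-identifier column holds more than one value.
        if any(len({repr(m.get(col)) for m in members}) > 1 for col in other_cols):
            conflicts.extend(members)
    return conflicts


rows = [
    {"SampleID": "S1", "Age": 30},
    {"SampleID": "S1", "Age": 31},  # same ID, different Age: conflict
    {"SampleID": "S2", "Age": 40},
    {"SampleID": "S2", "Age": 40},  # exact duplicate: no conflict
    {"SampleID": "S3", "Age": 50},
]
conflicts = find_conflicts(rows, ["SampleID"])
```

Only the two `S1` rows end up in `conflicts`; the identical `S2` rows are duplicates but not conflicts.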
- verify_integrity() → DataFrame[source]
Checks required fields, type constraints, referential integrity, etc.
Mapping Module
- class src.mapping.OntologyMapper(config_source)[source]
Bases: object
- CACHE_DIR = '/home/docs/.phenoqc/ontologies'
- __init__(config_source)[source]
Initializes the OntologyMapper by loading ontologies from a config source.
- Parameters:
config_source (Union[str, dict]) – Either a string path to the configuration file (YAML/JSON) or an already-loaded dict with configuration data.
- load_config(config_path: str) → Dict[str, Any][source]
Loads the YAML configuration file.
- Parameters:
config_path (str) – Path to the configuration YAML file.
- Returns:
Configuration parameters.
- Return type:
dict
- load_ontologies() → Dict[str, Dict[str, str]][source]
Loads all ontologies specified in the configuration file.
- Returns:
A dictionary where keys are ontology identifiers and values are term mapping dictionaries.
- Return type:
dict
- fetch_ontology_file_with_cache(ontology_id: str, url: str, file_format: str) → str[source]
Fetches the ontology file from the cache or downloads it if not present or expired.
- Parameters:
ontology_id (str) – The ontology identifier.
url (str) – The URL to download the ontology from.
file_format (str) – The format of the ontology file (‘obo’, ‘owl’, ‘json’).
- Returns:
Path to the saved ontology file.
- Return type:
str
- parse_ontology(ontology_file_path: str, file_format: str) → Dict[str, str][source]
Parses an ontology file into a mapping dictionary.
- Parameters:
ontology_file_path (str) – Path to the ontology file.
file_format (str) – The format of the ontology file (‘obo’, ‘owl’, ‘json’).
- Returns:
Mapping from term names and synonyms to their standardized IDs.
- Return type:
dict
- map_term(term: str, target_ontologies: List[str] | None = None, custom_mappings: Dict[str, str] | None = None) → Dict[str, str | None][source]
Maps a phenotypic term to IDs in the specified ontologies.
- Parameters:
term (str) – Phenotypic term to map.
target_ontologies (list, optional) – List of ontology identifiers to map to. If None, maps to the default ontologies.
custom_mappings (dict, optional) – Custom mappings for terms.
- Returns:
Dictionary with ontology IDs mapped for the term.
- Return type:
dict
- map_terms(terms: List[str], target_ontologies: List[str] | None = None, custom_mappings: Dict[str, str] | None = None) → Dict[str, Dict[str, str | None]][source]
Maps a list of phenotypic terms to IDs in the specified ontologies.
- Parameters:
terms (list) – List of phenotypic terms to map.
target_ontologies (list, optional) – List of ontology identifiers to map to. If None, maps to the default ontologies.
custom_mappings (dict, optional) – Custom mappings for terms.
- Returns:
Nested dictionary {term: {ontology_id: mapped_id}}.
- Return type:
dict
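To make the nested return shape concrete, here is a toy lookup that mirrors the documented structure {term: {ontology_id: mapped_id}}. Everything in it is invented for illustration: the `EX1`/`EX2` ontologies, their IDs, the case-insensitive matching, and the `*_sketch` function names. Custom mappings are left out because their semantics are not described above.

```python
from typing import Dict, List, Optional

# Hypothetical term-to-ID tables, shaped like the output of parse_ontology.
TOY_ONTOLOGIES: Dict[str, Dict[str, str]] = {
    "EX1": {"seizure": "EX1:0000001", "fits": "EX1:0000001"},  # name and synonym
    "EX2": {"seizure": "EX2:0000042"},
}


def map_term_sketch(term: str, target_ontologies: Optional[List[str]] = None) -> Dict[str, Optional[str]]:
    """Look one term up in each target ontology; None marks an unmapped term."""
    key = term.strip().lower()  # assumption: matching ignores case and padding
    targets = target_ontologies if target_ontologies is not None else list(TOY_ONTOLOGIES)
    return {ont: TOY_ONTOLOGIES.get(ont, {}).get(key) for ont in targets}


def map_terms_sketch(terms: List[str], target_ontologies: Optional[List[str]] = None) -> Dict[str, Dict[str, Optional[str]]]:
    """Apply map_term_sketch to every term, keyed by the original term text."""
    return {term: map_term_sketch(term, target_ontologies) for term in terms}


result = map_terms_sketch(["Seizure", "fits", "unknown term"])
```

In this toy setup `result["fits"]` is `{"EX1": "EX1:0000001", "EX2": None}`: a term can map in one ontology and stay unmapped in another.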
Missing Data Module
- src.missing_data.detect_missing_data(df)[source]
Detects missing data in the DataFrame.
- Parameters:
df (pd.DataFrame) – DataFrame to check for missing data.
- Returns:
Count of missing values per column.
- Return type:
pd.Series
- src.missing_data.flag_missing_data_records(df)[source]
Flags records with missing data for manual review.
- Parameters:
df (pd.DataFrame) – DataFrame to flag.
- Returns:
DataFrame with an additional ‘MissingDataFlag’ column.
- Return type:
pd.DataFrame
- src.missing_data.impute_missing_data(df, strategy='mean', field_strategies=None)[source]
Imputes missing data in the DataFrame using specified strategies, but only when appropriate (e.g., numeric columns for mean/median).
- Parameters:
df (pd.DataFrame) – DataFrame to impute.
strategy (str) – Default imputation strategy (‘mean’, ‘median’, ‘mode’, ‘knn’, ‘mice’, ‘svd’, ‘none’).
field_strategies (dict) – Dictionary of column-specific imputation strategies, e.g. {"Height_cm": "median", "CategoryCol": "mode"}.
- Returns:
DataFrame with imputed values.
- Return type:
pd.DataFrame
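The interplay between the default strategy and field_strategies can be sketched without pandas. The code below is a simplified stand-in, not the library's implementation: columns are plain lists with None for missing values, only 'mean', 'median', 'mode', and 'none' are handled, and the function names are hypothetical.

```python
import statistics
from typing import Any, Dict, List, Optional


def impute_column(values: List[Any], strategy: str) -> List[Any]:
    """Fill None entries in one column; leave it untouched if the strategy does not apply."""
    observed = [v for v in values if v is not None]
    if strategy == "none" or not observed:
        return list(values)
    numeric = all(isinstance(v, (int, float)) for v in observed)
    if strategy == "mean" and numeric:
        fill = statistics.mean(observed)
    elif strategy == "median" and numeric:
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        return list(values)  # e.g. 'mean' requested for a text column
    return [fill if v is None else v for v in values]


def impute(columns: Dict[str, List[Any]], strategy: str = "mean",
           field_strategies: Optional[Dict[str, str]] = None) -> Dict[str, List[Any]]:
    """Use the per-column strategy when given, otherwise the default one."""
    field_strategies = field_strategies or {}
    return {
        name: impute_column(values, field_strategies.get(name, strategy))
        for name, values in columns.items()
    }


columns = {
    "Height_cm": [170, None, 180, 175],
    "CategoryCol": ["a", None, "a", "b"],
    "Weight_kg": [60, None, 80],
    "Notes": ["x", None],
}
filled = impute(columns, strategy="mean",
                field_strategies={"Height_cm": "median", "CategoryCol": "mode"})
```

`Height_cm` and `CategoryCol` follow their own strategies, `Weight_kg` falls back to the default mean, and `Notes` is left as-is because a mean is not meaningful for text.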
Batch Processing Module
- src.batch_processing.child_process_run(file_path, schema, ontology_mapper, unique_identifiers, custom_mappings, impute_strategy, field_strategies, output_dir, target_ontologies, report_format, chunksize, phenotype_columns, log_file_for_children)[source]
Top-level entry point that each child process calls. It re-initializes logging in append mode, then runs process_file.
- src.batch_processing.unique_output_name(file_path, output_dir, suffix='.csv')[source]
- Creates a unique output filename using:
The original file’s base name (not the entire path),
A short 5-character hash based on that base name (to avoid collisions),
The original extension with its dot replaced by an underscore (e.g. .json -> ‘_json’),
And finally the desired suffix (.csv, _report.pdf, etc.).
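The four ingredients listed above could be combined roughly as follows. This is a guess at the shape, not the actual function: the hash algorithm (SHA-256 here), the underscore separators, and the ordering of the parts are all assumptions, and `unique_output_name_sketch` is a hypothetical name.

```python
import hashlib
import os


def unique_output_name_sketch(file_path: str, output_dir: str, suffix: str = ".csv") -> str:
    """Build '<stem>_<5-char hash>_<ext><suffix>' inside output_dir."""
    base = os.path.basename(file_path)            # drop the directory part
    stem, ext = os.path.splitext(base)            # 'cohort', '.json'
    # Assumption: any stable digest of the base name would do; SHA-256 is used here.
    short_hash = hashlib.sha256(base.encode("utf-8")).hexdigest()[:5]
    ext_tag = "_" + ext.lstrip(".") if ext else ""  # '.json' -> '_json'
    return os.path.join(output_dir, f"{stem}_{short_hash}{ext_tag}{suffix}")


csv_name = unique_output_name_sketch("/data/run1/cohort.json", "reports")
report_name = unique_output_name_sketch("/data/run1/cohort.json", "reports", suffix="_report.pdf")
```

Because the hash depends only on the base name, two inputs with the same base name in different directories produce the same output name in this sketch.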
- src.batch_processing.convert_nans_to_none_for_string_cols(df, schema)[source]
Converts NaN to None for columns declared as type=["string", "null"] (or "string") in the JSON schema, so that row-level validation does not flag missing strings as float('NaN').
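The reason for this step is that a missing string loaded through pandas typically shows up as a float NaN, which a JSON schema validator would reject as a non-string. A minimal sketch of the conversion, using lists of dicts instead of a DataFrame and hypothetical function names, might look like this:

```python
import math
from typing import Any, Dict, List


def string_columns(schema: Dict[str, Any]) -> List[str]:
    """Names of properties whose declared type is or includes 'string'."""
    cols = []
    for name, prop in schema.get("properties", {}).items():
        declared = prop.get("type")
        types = declared if isinstance(declared, list) else [declared]
        if "string" in types:
            cols.append(name)
    return cols


def nans_to_none(rows: List[Dict[str, Any]], schema: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Copy rows, replacing float NaN with None in string-typed columns only."""
    cols = string_columns(schema)
    converted = []
    for row in rows:
        new_row = dict(row)
        for col in cols:
            value = new_row.get(col)
            if isinstance(value, float) and math.isnan(value):
                new_row[col] = None
        converted.append(new_row)
    return converted


schema = {
    "properties": {
        "Diagnosis": {"type": ["string", "null"]},
        "Height_cm": {"type": "number"},
    }
}
rows = [{"Diagnosis": float("nan"), "Height_cm": float("nan")}]
cleaned = nans_to_none(rows, schema)
```

Only `Diagnosis` becomes None; the NaN in the numeric column `Height_cm` is left for the missing-data handling to deal with.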
- src.batch_processing.get_file_type(file_path)[source]
Returns ‘csv’, ‘tsv’, or ‘json’ depending on the file extension. Raises ValueError if unsupported.
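A dispatch of this kind is usually a small lookup on the extension. The sketch below assumes case-insensitive matching, which the description above does not confirm; `get_file_type_sketch` is a hypothetical name.

```python
import os

# Supported extensions and the type string returned for each.
SUPPORTED = {".csv": "csv", ".tsv": "tsv", ".json": "json"}


def get_file_type_sketch(file_path: str) -> str:
    """Map a file extension to 'csv', 'tsv', or 'json'; raise ValueError otherwise."""
    ext = os.path.splitext(file_path)[1].lower()  # assumption: '.CSV' counts as csv
    if ext not in SUPPORTED:
        raise ValueError(f"Unsupported file type: {ext or '(no extension)'}")
    return SUPPORTED[ext]
```

Calling `get_file_type_sketch("cohort.xlsx")` raises ValueError, so an unsupported input fails before any parsing starts.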
- src.batch_processing.process_file(file_path, schema, ontology_mapper, unique_identifiers, custom_mappings=None, impute_strategy='mean', field_strategies=None, output_dir='reports', target_ontologies=None, report_format='pdf', chunksize=10000, phenotype_columns=None)[source]
Processes a single file, generating an output CSV and a PDF or Markdown report.
- src.batch_processing.batch_process(files, schema_path, config_path, unique_identifiers, custom_mappings_path=None, impute_strategy='mean', output_dir='reports', target_ontologies=None, report_format='pdf', chunksize=10000, phenotype_columns=None, phenotype_column=None, log_file_for_children=None)[source]
Input Module
- src.input.read_csv(file_path, chunksize=10000)[source]
Reads a CSV file and returns an iterator over pandas DataFrame chunks.
- Parameters:
file_path (str) – Path to the CSV file.
na_values (list) – List of strings to be interpreted as NA/NaN.
keep_default_na (bool) – Whether to include the default NaN values.
chunksize (int) – Number of rows per chunk.
- Returns:
DataFrame chunks.
- Return type:
Iterator[pd.DataFrame]
- src.input.read_tsv(file_path, chunksize=10000)[source]
Reads a TSV file and returns an iterator over pandas DataFrame chunks.
- Parameters:
file_path (str) – Path to the TSV file.
chunksize (int) – Number of rows per chunk.
- Returns:
DataFrame chunks.
- Return type:
Iterator[pd.DataFrame]
- src.input.read_json(file_path, chunksize=10000)[source]
Reads a JSON file and returns an iterator over pandas DataFrame chunks. Handles empty files and decode errors gracefully.
- src.input.load_data(file_path, file_type, chunksize=10000)[source]
Loads data from a file based on its type.
- Parameters:
file_path (str) – Path to the data file.
file_type (str) – Type of the file (‘csv’, ‘tsv’, ‘json’).
chunksize (int) – Number of rows per chunk (for CSV/TSV).
- Returns:
Data iterator for CSV/TSV/JSON.
- Return type:
Iterator[pd.DataFrame]
- Raises:
ValueError – If the file type is unsupported.
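The chunked-iterator contract can be illustrated with the standard library alone. This sketch is not how the module is implemented (the real readers return pandas DataFrame chunks): it yields lists of dict rows, covers only CSV and TSV, and its function names are hypothetical.

```python
import csv
import io
from itertools import islice
from typing import Dict, Iterator, List, TextIO


def iter_chunks(handle: TextIO, chunksize: int = 10000, delimiter: str = ",") -> Iterator[List[Dict[str, str]]]:
    """Yield successive lists of at most chunksize rows from a delimited file."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    while True:
        chunk = list(islice(reader, chunksize))
        if not chunk:
            return
        yield chunk


def load_rows(handle: TextIO, file_type: str, chunksize: int = 10000) -> Iterator[List[Dict[str, str]]]:
    """Pick the delimiter from file_type; reject anything else up front."""
    delimiters = {"csv": ",", "tsv": "\t"}
    if file_type not in delimiters:
        raise ValueError(f"Unsupported file type: {file_type}")
    return iter_chunks(handle, chunksize, delimiters[file_type])


data = io.StringIO("SampleID\tAge\nS1\t30\nS2\t41\nS3\t52\nS4\t28\nS5\t33\n")
chunk_sizes = [len(chunk) for chunk in load_rows(data, "tsv", chunksize=2)]
```

With five rows and `chunksize=2`, the iterator yields chunks of 2, 2, and 1 rows, so a caller never holds more than one chunk in memory at a time.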
Reporting Module
- src.reporting.generate_qc_report(validation_results, missing_data, flagged_records_count, mapping_success_rates, visualization_images, impute_strategy, quality_scores, output_path_or_buffer, report_format='pdf', file_identifier=None)[source]
Generates a quality control report (PDF or Markdown).
- src.reporting.create_visual_summary(df, phenotype_columns=None, output_image_path=None)[source]
- Creates visual summaries with extra steps to keep axis labels fully visible:
Missingness Heatmap (white/blue)
Bar plot of % missing per column
Numeric histograms ignoring ID columns
Optional bar/pie charts for phenotype columns
- src.reporting.create_missingness_distribution(df)[source]
Returns a bar chart showing percent missingness per column.
Configuration Module
Logging Module
- src.logging_module.setup_logging(log_file=None, mode='w')[source]
Sets up the logging configuration once per process invocation.
If log_file is not specified, a new log filename is created automatically for the parent process by appending the current date and time to ‘phenoqc_’.
Child processes reuse the parent’s log_file, opened with mode='a'.
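The parent/child pattern described here can be sketched with the standard logging module. The timestamp format, the `.log` extension, the log line format, and the use of `force=True` are assumptions made for this example, and `setup_logging_sketch` is a hypothetical name rather than the module's function.

```python
import logging
import os
import tempfile
from datetime import datetime


def setup_logging_sketch(log_file=None, mode="w"):
    """Configure root logging to a file and return the path that was used."""
    if log_file is None:
        # Assumption: the exact timestamp format is not documented.
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        log_file = f"phenoqc_{stamp}.log"
    logging.basicConfig(
        filename=log_file,
        filemode=mode,
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
        force=True,  # replace handlers from an earlier call (Python 3.8+)
    )
    return log_file


# Parent: start a fresh log file (mode='w').
log_path = os.path.join(tempfile.mkdtemp(), "phenoqc_demo.log")
setup_logging_sketch(log_path, mode="w")
logging.info("parent started")

# Child: reuse the same file, but append (mode='a') so nothing is overwritten.
setup_logging_sketch(log_path, mode="a")
logging.info("child appended")
logging.shutdown()  # flush and close the file handler
```

Opening the file with mode='w' in a child would truncate what the parent had already written, which is why the child side appends.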