DiagnosticFunctions

A set of diagnostic functions for cross-sectional datasets.

`fit_lmm_safe(df, formula_fixed, group_col='batch', reml=False, min_group_n=10, var_threshold=1e-08, optimizers=('lbfgs', 'bfgs', 'powell', 'cg'), maxiter=400, boundary_pvalue=True)`

Fit a random-intercept linear mixed model (LMM) with warnings captured and safe fallbacks. This helper is called by `Run_LMM_cross_sectional` to fit the LMM for each feature with robust error handling and diagnostics.

Returns a dictionary with the keys `success`, `mdf` / `ols`, `optimizer_used`, `notes`, `warning_types` / `warning_messages`, `stats`, and `status`.
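
The fallback pattern described above (try several optimizers in sequence and capture warnings rather than letting them print) can be pictured with a generic sketch. This is not the library's implementation: `fit_with_fallbacks`, `fit_once`, and `toy_fit` are hypothetical names, and only the standard-library `warnings` module is used.

```python
import warnings


def fit_with_fallbacks(fit_once, optimizers=("lbfgs", "bfgs", "powell", "cg")):
    """Try each optimizer in turn, recording warnings instead of printing them.

    `fit_once(method)` is a hypothetical callable that fits a model with the
    given optimizer and either returns a result or raises on failure.
    """
    notes, warning_messages = [], []
    for method in optimizers:
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")  # record every warning raised in this block
            try:
                result = fit_once(method)
            except Exception as exc:  # failed fit: note it and try the next optimizer
                notes.append(f"{method}: {type(exc).__name__}: {exc}")
                result = None
        warning_messages.extend(str(w.message) for w in caught)
        if result is not None:
            return {"success": True, "optimizer_used": method, "result": result,
                    "notes": notes, "warning_messages": warning_messages}
    return {"success": False, "optimizer_used": None, "result": None,
            "notes": notes, "warning_messages": warning_messages}


def toy_fit(method):
    """Stand-in fitter: 'lbfgs' fails, every other method warns but succeeds."""
    if method == "lbfgs":
        raise RuntimeError("did not converge")
    warnings.warn("variance estimate is on the boundary")
    return {"method": method}


out = fit_with_fallbacks(toy_fit)
print(out["optimizer_used"], out["notes"], out["warning_messages"])
```

In a real fit, `fit_once` would wrap the mixed-model call, and the final fallback after all optimizers fail would be an OLS fit rather than a bare failure record.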

`Run_LMM_cross_sectional(Data, batch, covariates=None, feature_names=None, group_col_name='batch', covariate_names=None, min_group_n=2, var_threshold=1e-08, reml=False, optimizers=('lbfgs', 'bfgs', 'powell', 'cg'), maxiter=400, boundary_pvalue=True)`

Run a random-intercept linear mixed model for each feature.

Batch is treated as the grouping variable, and any supplied covariates are included as fixed effects.

Parameters:

Name	Description	Default
`Data`	Array-like data matrix with shape `(n_samples, n_features)`.	required
`batch`	Array-like batch labels with length `n_samples`.	required
`covariates`	Optional covariate matrix or DataFrame with one row per sample.	`None`
`feature_names`	Optional names for the feature columns.	`None`
`group_col_name`	Column name used for the grouping variable in the temporary modeling DataFrame.	`'batch'`
`covariate_names`	Optional names for covariate columns.	`None`
`min_group_n`	Minimum batch size required before attempting mixed-model fitting.	`2`
`var_threshold`	Variance threshold below which a feature is skipped.	`1e-08`
`reml`	Whether to fit the mixed model with REML instead of ML.	`False`
`optimizers`	Optimizers to try in sequence for model fitting.	`('lbfgs', 'bfgs', 'powell', 'cg')`
`maxiter`	Maximum iterations per optimizer.	`400`
`boundary_pvalue`	Whether to compute the mixture p-value for variance components on the boundary.	`True`

Returns:

Type	Description
`tuple[DataFrame, dict[str, Any]]`	A tuple containing the per-feature results DataFrame and a summary dictionary of notes and warnings across features.

Notes

Features whose variance falls below `var_threshold` are skipped. Batches smaller than `min_group_n` trigger an OLS fallback instead of mixed-model fitting. Warnings raised during fitting are captured and included in the returned results.
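
A minimal sketch of the screening rules in these notes, under stated assumptions: `screen_features` and the labels `"skip"`, `"ols"`, and `"lmm"` are hypothetical, and the small-group check is assumed to apply to the smallest batch. The actual function may apply these checks differently.

```python
import numpy as np


def screen_features(data, batch, min_group_n=2, var_threshold=1e-8):
    """Label each feature column as 'skip', 'ols', or 'lmm' (hypothetical labels)."""
    data = np.asarray(data, dtype=float)
    batch = np.asarray(batch)
    _, counts = np.unique(batch, return_counts=True)
    groups_ok = counts.min() >= min_group_n  # assumption: every batch must be large enough
    plan = []
    for j in range(data.shape[1]):
        if np.nanvar(data[:, j]) < var_threshold:
            plan.append("skip")  # near-constant feature, nothing to model
        elif not groups_ok:
            plan.append("ols")   # a batch is too small for a mixed model
        else:
            plan.append("lmm")
    return plan


rng = np.random.default_rng(0)
data = np.column_stack([rng.normal(size=6), np.full(6, 3.0)])  # second column is constant
plan_balanced = screen_features(data, [0, 0, 0, 1, 1, 1])
plan_singleton = screen_features(data, [0, 0, 0, 0, 0, 1])
print(plan_balanced, plan_singleton)
```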

`RobustOLS_Orig(data, covariates, batch, covariate_names, covariate_types, report=None)`

Helper that can be called by the Cohen's d, variance-ratio, and KS-test functions to residualise out covariate effects before batch effects are calculated. It supports dummy encoding of categorical covariates and mean-centering of continuous covariates. Covariate types are read from `covariate_types`, coded as 0 (binary), 2 (categorical), or 3 (continuous); if this argument is not given, the function attempts to infer the types from the unique observed values. Batch is always treated as categorical and is dummy-encoded.

`RobustOLS(data, covariates, batch, covariate_names, covariate_types=None, report=None)`

Helper for residualising out covariate effects while preserving batch effects. It is used by the `Cohens_D`, `Variance_Ratios`, and `KS_Test` functions so that batch-effect calculations are not confounded by covariate effects.

Parameters:

Name	Description	Default
`data`	np.ndarray of shape (n_samples, n_features) - the data matrix to be residualised	required
`covariates`	np.ndarray of shape (n_samples, n_covariates) - the covariate matrix	required
`batch`	array-like of shape (n_samples,) - batch labels for each sample	required
`covariate_names`	list of length n_covariates - names for each covariate	required
`covariate_types`	list of length n_covariates with values 0 (binary), 2 (categorical), 3 (continuous) - optional, if not provided will be inferred	`None`
`report`	optional object with method log_text(str) for logging messages about covariate type inference and processing steps.	`None`

Returns: data_resid: np.ndarray of shape (n_samples, n_features) - the data matrix with covariate effects removed but batch effects preserved.
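
One way to remove covariate effects while keeping batch effects is to fit batch and covariates jointly and subtract only the covariate part of the fit. The sketch below illustrates that idea with NumPy; `residualise_keep_batch` is a hypothetical name, all covariates are treated as continuous, and the real `RobustOLS` may differ in its encoding and estimation details.

```python
import numpy as np


def residualise_keep_batch(data, covariates, batch):
    """Subtract the fitted covariate contribution, leaving batch shifts in place."""
    data = np.asarray(data, dtype=float)
    cov = np.asarray(covariates, dtype=float)
    cov = cov - cov.mean(axis=0)  # mean-centre the (continuous) covariates
    batch = np.asarray(batch)
    labels = np.unique(batch)
    # One indicator column per batch; together they play the role of the intercept.
    batch_dummies = (batch[:, None] == labels[None, :]).astype(float)
    design = np.hstack([batch_dummies, cov])
    beta, *_ = np.linalg.lstsq(design, data, rcond=None)
    beta_cov = beta[batch_dummies.shape[1]:]  # coefficients of the covariate columns only
    return data - cov @ beta_cov


rng = np.random.default_rng(0)
batch = np.repeat([0, 1], 100)
age = rng.normal(50, 10, size=200)
y = 2.0 * batch + 0.5 * age + rng.normal(0, 0.1, size=200)  # batch shift of 2, age slope of 0.5
resid = residualise_keep_batch(y[:, None], age[:, None], batch)
batch_shift = resid[batch == 1].mean() - resid[batch == 0].mean()
print(round(batch_shift, 2))
```

Fitting batch alongside the covariates matters: regressing on the covariates alone would let them absorb part of the batch difference whenever the two are correlated.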

`z_score(data, MAD=False)`

Z-score normalization of the data matrix (samples x features). Median centering is used by default because it is more robust to outliers and non-normal distributions.

`robust_z_score(data, method='mad', eps=1e-12)`

Apply robust z-scoring to a data matrix.

Parameters:

Name	Description	Default
`data`	Input data with shape `(n_samples, n_features)` or `(n_samples,)`.	required
`method`	Scaling method. Use `"mad"` for median absolute deviation, `"iqr"` for interquartile range, or `"std"` for a standard deviation scale around the median.	`'mad'`
`eps`	Small value used to avoid division by zero.	`1e-12`

Returns:

Type	Description
`ndarray`	The normalized data array.
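
The three scaling methods can be sketched as follows. `robust_z` is a hypothetical stand-in, and the consistency constants 1.4826 (MAD) and 1.349 (IQR), which make the scale comparable to a standard deviation under normality, are assumptions: the source does not say whether `robust_z_score` applies them.

```python
import numpy as np


def robust_z(data, method="mad", eps=1e-12):
    """Centre on the median and divide by a robust scale, column by column."""
    x = np.asarray(data, dtype=float)
    med = np.nanmedian(x, axis=0)
    if method == "mad":
        scale = 1.4826 * np.nanmedian(np.abs(x - med), axis=0)  # assumed constant
    elif method == "iqr":
        q75, q25 = np.nanpercentile(x, [75, 25], axis=0)
        scale = (q75 - q25) / 1.349  # assumed constant
    elif method == "std":
        scale = np.sqrt(np.nanmean((x - med) ** 2, axis=0))  # spread around the median
    else:
        raise ValueError(f"unknown method: {method!r}")
    return (x - med) / np.maximum(scale, eps)  # eps guards against a zero scale


x = np.array([[1.0, 7.0], [2.0, 7.0], [3.0, 7.0], [4.0, 7.0], [100.0, 7.0]])
z = robust_z(x)
print(z[:, 0])
```

The outlier in the first column barely affects the median or the MAD, so the other four values keep small scores, while the constant second column maps to zeros instead of dividing by zero.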

`Cohens_D(Data, batch_indices, covariates=None, BatchNames=None, covariate_names=None, covariate_types=None)`

Compute Cohen's d for each batch against the pooled remainder.

This function reports batch-versus-rest effect sizes to give a global view of how each batch differs from the overall distribution after optional covariate residualization.

Parameters:

Name	Description	Default
`Data`	Data matrix with shape `(n_samples, n_features)`.	required
`batch_indices`	Batch labels for each sample.	required
`covariates`	Optional covariate matrix to residualize before effect-size calculation.	`None`
`BatchNames`	Optional display names for batches.	`None`
`covariate_names`	Optional names for covariate columns.	`None`
`covariate_types`	Optional covariate type codes used by the residualizing workflow.	`None`

Returns:

Type	Description
`tuple[ndarray, list[tuple[str, str]]]`	Cohen's d values with shape `(n_batches, n_features)` and the corresponding batch-vs-rest labels.

Notes

Cohen's d is computed as (mean_batch - mean_other) / std_other, where mean_other and std_other are the unweighted averages across the other batches.
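
The formula in these notes can be sketched in NumPy as below. `cohens_d_vs_rest` is a hypothetical name; reading "unweighted averages" as the plain mean of the per-batch means and per-batch standard deviations, and using `ddof=1`, are assumptions.

```python
import numpy as np


def cohens_d_vs_rest(data, batch):
    """(mean_batch - mean_other) / std_other for every batch and feature."""
    data = np.asarray(data, dtype=float)
    batch = np.asarray(batch)
    labels = np.unique(batch)
    means = np.stack([data[batch == b].mean(axis=0) for b in labels])
    stds = np.stack([data[batch == b].std(axis=0, ddof=1) for b in labels])
    d = np.empty_like(means)
    for i in range(len(labels)):
        others = np.arange(len(labels)) != i
        mean_other = means[others].mean(axis=0)  # unweighted: each batch counts once
        std_other = stds[others].mean(axis=0)
        d[i] = (means[i] - mean_other) / std_other
    return d, [(str(b), "rest") for b in labels]


data = np.array([[0.0], [2.0], [10.0], [12.0], [20.0], [22.0]])
batch = np.array(["A", "A", "B", "B", "C", "C"])
d, pairs = cohens_d_vs_rest(data, batch)
print(d.ravel(), pairs)
```

Batch B sits exactly at the average of A and C, so its effect size is zero, while A and C get equal and opposite values.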

`PC_Correlations(Data, batch, N_components=None, covariates=None, variable_names=None, *, enforce_min_components_for_plotting=True)`

Perform PCA and correlate top PCs with batch and covariates if given, returning explained variance, scores, and correlation results.

Parameters:

Name	Description	Default
`Data`	np.ndarray of shape (n_samples, n_features) - the data matrix.	required
`batch`	array-like of shape (n_samples,) - batch labels for each sample (can	required
`N_components`	int or None - number of principal components to compute (default None means min(n_samples, n_features)).	`None`
`covariates`	optional np.ndarray of shape (n_samples, n_covariates)	`None`
`variable_names`	optional list of length covariates	`None`

Returns:

Name	Description
`explained_variance`	np.ndarray of shape (n_components,) with the percentage of variance explained by each PC.
`scores`	np.ndarray of shape (n_samples, n_components) with the PCA scores for each sample.
`PC_correlations`	dict mapping each variable name to a dict with keys 'correlation' (array of shape (n_components,)) and 'p_value' (array of shape (n_components,)) for the Pearson correlation of each PC with that variable.
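
The PCA-then-correlate idea can be sketched with NumPy alone. `pca_scores` and `pc_correlation` are hypothetical names, batch is assumed to have two levels coded 0/1 so that a Pearson correlation is meaningful, and p-values are left out (`scipy.stats.pearsonr` is one way to obtain them).

```python
import numpy as np


def pca_scores(data, n_components=None):
    """PCA via SVD of the column-centred data; returns (% variance, scores)."""
    x = np.asarray(data, dtype=float)
    x = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    k = len(s) if n_components is None else n_components
    explained = 100.0 * s**2 / np.sum(s**2)
    return explained[:k], (u * s)[:, :k]


def pc_correlation(scores, variable):
    """Pearson correlation of each PC score column with one numeric variable."""
    v = np.asarray(variable, dtype=float)
    return np.array([np.corrcoef(scores[:, i], v)[0, 1]
                     for i in range(scores.shape[1])])


rng = np.random.default_rng(0)
batch = np.repeat([0, 1], 50)
data = rng.normal(size=(100, 5))
data[:, 0] += 10.0 * batch  # strong batch shift in the first feature
explained, scores = pca_scores(data, n_components=2)
r = pc_correlation(scores, batch)
print(explained.round(1), r.round(2))
```

Because the batch shift dominates the total variance, the first PC lines up with batch and its correlation is close to plus or minus one; the sign of a PC is arbitrary.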

`Mahalanobis_Distance(Data=None, batch=None, covariates=None)`

Calculate the Mahalanobis distance between batches in the data. Takes optional covariates and returns distances between each batch pair both before and after regressing out covariates. Additionally provides distance of each batch to the overall centroid before and after residualizing covariates.

Parameters:

Name	Type	Description	Default
`Data`	`ndarray`	Data matrix where rows are samples (n) and columns are features (p).	`None`
`batch`	`ndarray`	1D array-like batch labels for each sample (length n).	`None`
`covariates`	`ndarray`	Covariate matrix (n x k). An intercept will be added automatically.	`None`

Returns:

Type	Description
`dict[str, Any]`	A dictionary containing pairwise and centroid Mahalanobis distances before and, when covariates are provided, after residualization. Inner dictionary keys use tuples such as `(b1, b2)` or `(b, "global")`.
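
As an illustration of the tuple-keyed output, the sketch below computes Mahalanobis distances between batch centroids and from each centroid to the global centroid. `batch_mahalanobis` is a hypothetical name, and using the pooled within-batch covariance with a pseudo-inverse is an assumption; the source does not state which covariance estimate `Mahalanobis_Distance` uses.

```python
from itertools import combinations

import numpy as np


def batch_mahalanobis(data, batch):
    """Distances keyed by (b1, b2) for batch pairs and (b, 'global') for centroids."""
    x = np.asarray(data, dtype=float)
    batch = np.asarray(batch)
    labels = np.unique(batch).tolist()
    centroids = {b: x[batch == b].mean(axis=0) for b in labels}
    # Assumption: pooled within-batch covariance shared by all comparisons.
    centred = np.vstack([x[batch == b] - centroids[b] for b in labels])
    cov = centred.T @ centred / (len(x) - len(labels))
    cov_inv = np.linalg.pinv(np.atleast_2d(cov))  # pinv tolerates singular covariances

    def dist(u, v):
        diff = u - v
        return float(np.sqrt(diff @ cov_inv @ diff))

    out = {(a, b): dist(centroids[a], centroids[b]) for a, b in combinations(labels, 2)}
    global_centroid = x.mean(axis=0)
    for b in labels:
        out[(b, "global")] = dist(centroids[b], global_centroid)
    return out


square = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
data = np.vstack([square, square + [3.0, 0.0]])  # second batch shifted by 3 along x
batch = np.array([0, 0, 0, 0, 1, 1, 1, 1])
distances = batch_mahalanobis(data, batch)
print(distances)
```

Running the same function on covariate-residualized data would give the "after" distances that the documented function reports next to the "before" ones.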

`Variance_Ratios(data, batch, covariates=None, covariate_names=None, covariate_types=None, mode='rest')`

Calculate feature-wise variance ratios for batches.

Multiple comparison modes are available depending on the desired reference set.

Parameters:

Name	Description	Default
`data`	NumPy array with shape `(n_samples, n_features)`.	required
`batch`	Batch labels with length `n_samples`.	required
`covariates`	Optional covariates passed to `RobustOLS` before computing the ratios.	`None`
`covariate_names`	Optional covariate names passed to `RobustOLS`.	`None`
`covariate_types`	Optional covariate type codes passed to `RobustOLS`.	`None`
`mode`	One of `{"pairwise", "rest", "unweighted_mean", "weighted_mean"}`.	`'rest'`

Returns:

Type	Description
`dict[Any, ndarray]`	A dictionary of variance-ratio arrays keyed by batch label or batch-pair tuple, depending on `mode`.

Notes

`pairwise` compares every unique batch pair. `rest` compares each batch against all remaining samples. `unweighted_mean` and `weighted_mean` compare each batch against the mean variance of the other batches.
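
The four modes can be sketched as follows. `variance_ratios` is a hypothetical re-implementation without covariate residualization; the use of `ddof=1`, the ratio direction (batch variance over reference variance), and sample-size weights for `weighted_mean` are assumptions.

```python
from itertools import combinations

import numpy as np


def variance_ratios(data, batch, mode="rest"):
    """Feature-wise variance ratios keyed by batch label or batch-pair tuple."""
    x = np.asarray(data, dtype=float)
    batch = np.asarray(batch)
    labels = np.unique(batch).tolist()
    var = {b: x[batch == b].var(axis=0, ddof=1) for b in labels}
    n = {b: int(np.sum(batch == b)) for b in labels}
    if mode == "pairwise":
        return {(a, b): var[a] / var[b] for a, b in combinations(labels, 2)}
    out = {}
    for b in labels:
        others = [o for o in labels if o != b]
        if mode == "rest":
            ref = x[batch != b].var(axis=0, ddof=1)  # pool all remaining samples
        elif mode == "unweighted_mean":
            ref = np.mean([var[o] for o in others], axis=0)
        elif mode == "weighted_mean":
            ref = np.average([var[o] for o in others], axis=0,
                             weights=[n[o] for o in others])  # assumed: weight by batch size
        else:
            raise ValueError(f"unknown mode: {mode!r}")
        out[b] = var[b] / ref
    return out


data = np.array([1, 2, 3, 2, 4, 6, 0, 3, 6, 9], dtype=float)[:, None]
batch = np.array(["a"] * 3 + ["b"] * 3 + ["c"] * 4)
pairwise = variance_ratios(data, batch, mode="pairwise")
unweighted = variance_ratios(data, batch, mode="unweighted_mean")
weighted = variance_ratios(data, batch, mode="weighted_mean")
rest = variance_ratios(data, batch, mode="rest")
print(pairwise[("a", "b")], unweighted["a"], weighted["a"], rest["a"])
```

Note that `rest` pools the remaining samples before taking the variance, so differences between the other batches' means inflate its reference variance, whereas the two mean modes average per-batch variances.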

`Levenes_Test(data, batch, centre='median')`

Perform Levene's test for variance differences between each unique batch pair.

Parameters:

Name	Description	Default
`data`	np.ndarray of shape (n_samples, n_features) - the data matrix.	required
`batch`	np.ndarray of shape (n_samples,) - the batch labels.	required
`centre`	str, optional - the method used to calculate the center for Levene's test (`'median'`, `'mean'`, or `'trimmed'`). The default, `'median'`, is more robust to outliers and non-normal distributions.	`'median'`

Returns:

Type	Description
`dict`	A dictionary where keys are tuples of batch pairs (batch1, batch2) and values are dictionaries containing 'statistic' and 'p_value' arrays of shape (n_features
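
To show what the statistic measures, here is a NumPy-only sketch of the Levene statistic for one feature and one batch pair. `levene_statistic` is a hypothetical name, the `'trimmed'` option is not handled, and no p-value is computed; the p-value comes from an F distribution with (k - 1, N - k) degrees of freedom, and `scipy.stats.levene` returns both values.

```python
import numpy as np


def levene_statistic(a, b, centre="median"):
    """Levene's W for two groups, centred on the median or the mean."""
    groups = [np.asarray(a, dtype=float), np.asarray(b, dtype=float)]
    centre_fn = np.median if centre == "median" else np.mean
    z = [np.abs(g - centre_fn(g)) for g in groups]  # absolute deviations per group
    n = np.array([len(zi) for zi in z])
    z_means = np.array([zi.mean() for zi in z])
    z_grand = np.concatenate(z).mean()
    k, total = len(groups), n.sum()
    between = np.sum(n * (z_means - z_grand) ** 2)
    within = sum(np.sum((zi - zm) ** 2) for zi, zm in zip(z, z_means))
    return float((total - k) / (k - 1) * between / within)


narrow = [1, 2, 3, 4, 5]
wide = [-10, -5, 0, 5, 10]
w_different = levene_statistic(narrow, wide)
w_same = levene_statistic(narrow, narrow)
print(w_different, w_same)
```

The statistic is a one-way ANOVA F ratio computed on absolute deviations, so it grows when one batch is consistently more spread out than the other.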

`KS_Test(data, batch, feature_names=None, covariates=None, compare_pairs=False, compare_to_overall_excluding_batch=True, min_batch_n=3, alpha=0.05, do_fdr=True, residualize_covariates=True, covariate_names=None, covariate_types=None)`

Perform two-sample Kolmogorov-Smirnov tests across batches.

The function can compare each batch against the pooled data, the pooled data excluding that batch, and optionally all unique batch pairs.

Parameters:

Name	Description	Default
`data`	NumPy array with shape `(n_samples, n_features)`.	required
`batch`	Batch labels with length `n_samples`.	required
`feature_names`	Optional feature names.	`None`
`covariates`	Optional covariate matrix for residualization.	`None`
`compare_pairs`	Whether to include pairwise batch-versus-batch tests.	`False`
`compare_to_overall_excluding_batch`	Whether to compare each batch against the pooled data excluding that batch.	`True`
`min_batch_n`	Minimum samples required per group for a feature-level test.	`3`
`alpha`	Significance threshold used in summaries.	`0.05`
`do_fdr`	Whether to compute Benjamini-Hochberg adjusted p-values.	`True`
`residualize_covariates`	Whether to residualize covariates before testing.	`True`
`covariate_names`	Optional names for covariate columns.	`None`
`covariate_types`	Optional covariate type codes used during residualization.	`None`

Returns:

Type	Description
`dict[tuple[Any, Any], dict[str, Any]]`	A dictionary keyed by comparison labels such as `(batch, "overall")` or `(batch1, batch2)`. Each value contains the per-feature KS statistics, p-values, optional FDR-adjusted p-values, sample counts, and summary metrics.
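
Two ingredients of this output, the two-sample KS statistic and the Benjamini-Hochberg adjustment, can be sketched with NumPy. `ks_statistic` and `bh_adjust` are hypothetical helpers for illustration only; the source does not say which routines `KS_Test` calls internally.

```python
import numpy as np


def ks_statistic(a, b):
    """Largest gap between the two empirical CDFs."""
    a, b = np.sort(np.asarray(a, dtype=float)), np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])  # the gap can only change at observed values
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))


def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    order = np.argsort(p)
    ranked = p[order] * len(p) / np.arange(1, len(p) + 1)
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]  # keep adjusted values monotone
    adjusted = np.empty_like(p)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted


d_separate = ks_statistic([1, 2, 3], [4, 5, 6])
d_identical = ks_statistic([1, 2, 3], [1, 2, 3])
p_adj = bh_adjust([0.01, 0.04, 0.03, 0.005])
print(d_separate, d_identical, p_adj)
```

In a per-feature workflow the adjustment would be applied across the feature-level p-values of one comparison, which is what `do_fdr=True` suggests.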