DiagnosticFunctions

A set of diagnostic functions for cross-sectional datasets.

fit_lmm_safe(df, formula_fixed, group_col='batch', reml=False, min_group_n=10, var_threshold=1e-08, optimizers=('lbfgs', 'bfgs', 'powell', 'cg'), maxiter=400, boundary_pvalue=True)

Fit a random-intercept LMM with warnings captured and safe fallbacks. This is a helper function for Run_LMM_cross_sectional to fit the LMM for each feature with robust error handling and diagnostics.

Returns a dictionary with the keys:

success, mdf / ols, optimizer_used, notes, warning_types / warning_messages, stats, status
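The "try several optimizers, capture warnings, fall back safely" pattern described above can be sketched in plain Python. This is only an illustration of the control flow, not the function's actual implementation: `fit_with_fallbacks`, `toy_fit`, and the `model` key are hypothetical names, and `fit_fn` stands in for whatever model-fitting call is being wrapped.

```python
import warnings


def fit_with_fallbacks(fit_fn, optimizers=("lbfgs", "bfgs", "powell", "cg")):
    """Try each optimizer in turn, recording warnings instead of emitting them.

    fit_fn is a hypothetical callable that accepts a `method` keyword.
    """
    result = {"success": False, "optimizer_used": None,
              "warning_types": [], "warning_messages": [], "notes": []}
    for method in optimizers:
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")  # make sure no warning is suppressed
            try:
                model = fit_fn(method=method)
            except Exception as exc:  # a failed optimizer is noted, not fatal
                result["notes"].append(f"{method} failed: {exc}")
                model = None
        result["warning_types"] += [type(w.message).__name__ for w in caught]
        result["warning_messages"] += [str(w.message) for w in caught]
        if model is not None:
            result.update(success=True, optimizer_used=method, model=model)
            break
    return result


def toy_fit(method):
    # Hypothetical fitter: the first optimizer fails, the second warns but succeeds.
    if method == "lbfgs":
        raise RuntimeError("did not converge")
    warnings.warn("variance estimate is on the boundary", RuntimeWarning)
    return {"method": method}


out = fit_with_fallbacks(toy_fit)
print(out["success"], out["optimizer_used"], out["warning_types"])
# True bfgs ['RuntimeWarning']
```

Recording warnings per attempt keeps a failed or noisy optimizer from hiding the diagnostics of the one that finally succeeds.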

Run_LMM_cross_sectional(Data, batch, covariates=None, feature_names=None, group_col_name='batch', covariate_names=None, min_group_n=2, var_threshold=1e-08, reml=False, optimizers=('lbfgs', 'bfgs', 'powell', 'cg'), maxiter=400, boundary_pvalue=True)

Run a random-intercept linear mixed model for each feature.

Batch is treated as the grouping variable, and any supplied covariates are included as fixed effects.

Parameters:

Data: Array-like data matrix with shape (n_samples, n_features). Required.
batch: Array-like batch labels with length n_samples. Required.
covariates: Optional covariate matrix or DataFrame with one row per sample. Default: None.
feature_names: Optional names for the feature columns. Default: None.
group_col_name: Column name used for the grouping variable in the temporary modeling DataFrame. Default: 'batch'.
covariate_names: Optional names for covariate columns. Default: None.
min_group_n: Minimum batch size required before attempting mixed-model fitting. Default: 2.
var_threshold: Variance threshold below which a feature is skipped. Default: 1e-08.
reml: Whether to fit the mixed model with REML instead of ML. Default: False.
optimizers: Optimizers to try in sequence for model fitting. Default: ('lbfgs', 'bfgs', 'powell', 'cg').
maxiter: Maximum iterations per optimizer. Default: 400.
boundary_pvalue: Whether to compute the mixture p-value for variance components on the boundary. Default: True.
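For context on boundary_pvalue: when a single variance component is tested against zero, the null value lies on the boundary of the parameter space, and a common correction refers the likelihood-ratio statistic to a 50:50 mixture of chi-square(0) and chi-square(1). The sketch below assumes that this is the mixture meant here; the function name is hypothetical and the source does not show the exact formula used.

```python
import math


def boundary_mixture_pvalue(lrt_stat):
    """p-value of a likelihood-ratio statistic under a 50:50 mixture of
    chi-square(0) and chi-square(1) (one variance component tested against zero)."""
    if lrt_stat <= 0:
        return 1.0
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    naive = math.erfc(math.sqrt(lrt_stat / 2.0))
    # The chi-square(0) half of the mixture contributes nothing for x > 0.
    return 0.5 * naive


p = boundary_mixture_pvalue(3.841458820694124)  # 95th percentile of chi-square(1)
print(round(p, 4))  # 0.025
```

In effect the naive chi-square(1) p-value is halved, which makes the test for a non-zero batch variance less conservative.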

Returns:

tuple[pd.DataFrame, dict[str, Any]]: A tuple containing the per-feature results DataFrame and a summary dictionary of notes and warnings across features.

Notes

Features below var_threshold are skipped. Small batch groups trigger an OLS fallback instead of mixed-model fitting. Warnings raised during fitting are captured and included in the returned results.
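The pre-fit checks in these notes can be sketched as a small planning step. This is an assumption-laden illustration, not the library's code: `plan_feature_fits` is a hypothetical name, and it assumes that any batch smaller than min_group_n triggers the OLS fallback for every feature.

```python
import numpy as np


def plan_feature_fits(data, batch, min_group_n=2, var_threshold=1e-8):
    """Return one label per feature: 'skip', 'ols', or 'lmm'."""
    data = np.asarray(data, dtype=float)
    _, counts = np.unique(np.asarray(batch), return_counts=True)
    # Assumption: a single undersized batch is enough to fall back to OLS.
    small_groups = counts.min() < min_group_n
    plans = []
    for j in range(data.shape[1]):
        if np.var(data[:, j]) < var_threshold:
            plans.append("skip")   # near-constant feature, nothing to model
        elif small_groups:
            plans.append("ols")    # too few samples per batch for a mixed model
        else:
            plans.append("lmm")
    return plans


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
X[:, 1] = 5.0  # constant feature

print(plan_feature_fits(X, ["a", "a", "a", "b", "b", "b"]))
# ['lmm', 'skip', 'lmm']
print(plan_feature_fits(X, ["a", "a", "a", "a", "a", "b"]))
# ['ols', 'skip', 'ols']
```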

RobustOLS_Orig(data, covariates, batch, covariate_names, covariate_types, report=None)

Helper called by the Cohen's d, variance-ratio, and KS-test functions to residualise covariate effects before batch effects are calculated. Categorical covariates are dummy-encoded and continuous covariates are mean-centred. covariate_types describes each covariate as 0 (binary), 2 (categorical), or 3 (continuous); if it is not given, the types are inferred from the unique observed values. Batch is always treated as categorical and is dummy-encoded.

RobustOLS(data, covariates, batch, covariate_names, covariate_types=None, report=None)

This is a helper function for residualising covariate effects while preserving batch effects. It is used in the Cohens_D, Variance_Ratios, and KS_Test functions so that batch-effect calculations are not confounded by covariate effects.

Parameters:

data: np.ndarray of shape (n_samples, n_features) - the data matrix to be residualised. Required.
covariates: np.ndarray of shape (n_samples, n_covariates) - the covariate matrix. Required.
batch: array-like of shape (n_samples,) - batch labels for each sample. Required.
covariate_names: list of length n_covariates - names for each covariate. Required.
covariate_types: list of length n_covariates with values 0 (binary), 2 (categorical), or 3 (continuous) - optional; inferred if not provided. Default: None.
report: optional object with a log_text(str) method for logging messages about covariate type inference and processing steps. Default: None.

Returns:

data_resid: np.ndarray of shape (n_samples, n_features) - the data matrix with covariate effects removed but batch effects preserved.
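One way to remove covariate effects while preserving batch effects is to fit batch and covariates jointly and then subtract only the covariate part of the fit. The sketch below shows that idea with NumPy for continuous covariates only (no dummy encoding of categorical covariates); `residualize_keep_batch` is a hypothetical name and the real function may differ in detail.

```python
import numpy as np


def residualize_keep_batch(data, covariates, batch):
    """Subtract the fitted covariate contribution, leaving batch offsets in place."""
    data = np.asarray(data, dtype=float)
    cov = np.asarray(covariates, dtype=float)
    if cov.ndim == 1:
        cov = cov[:, None]
    cov = cov - cov.mean(axis=0)                 # mean-centre continuous covariates
    labels, codes = np.unique(np.asarray(batch), return_inverse=True)
    batch_dummies = np.eye(len(labels))[codes]   # one indicator column per batch
    design = np.hstack([batch_dummies, cov])     # batch columns act as intercepts
    beta, *_ = np.linalg.lstsq(design, data, rcond=None)
    cov_beta = beta[len(labels):]                # coefficients of the covariates only
    return data - cov @ cov_beta


age = np.array([20.0, 30.0, 40.0, 50.0, 60.0, 70.0])
batch = np.array(["a", "a", "a", "b", "b", "b"])
offset = np.where(batch == "b", 3.0, 0.0)        # true batch effect of 3
y = (offset + 0.1 * age)[:, None]

resid = residualize_keep_batch(y, age, batch)
print(resid[3:].mean() - resid[:3].mean())       # close to 3.0: batch effect preserved
```

Fitting batch in the same model matters: regressing on the covariates alone would let them absorb part of the batch difference whenever the two are correlated, as age and batch are in this toy example.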

z_score(data, MAD=False)

Z-score normalization of the data matrix (samples x features). Median centring is used by default because it is more robust to outliers and non-normal distributions.

robust_z_score(data, method='mad', eps=1e-12)

Apply robust z-scoring to a data matrix.

Parameters:

data: Input data with shape (n_samples, n_features) or (n_samples,). Required.
method: Scaling method. Use "mad" for median absolute deviation, "iqr" for interquartile range, or "std" for a standard deviation scale around the median. Default: 'mad'.
eps: Small value used to avoid division by zero. Default: 1e-12.

Returns:

np.ndarray: The normalized data array.
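A minimal sketch of the three scaling options, assuming the usual consistency constants (1.4826 for MAD, 1.349 for IQR) that make each scale comparable to a standard deviation under normality. Whether the documented function applies these constants is not stated in the source, so treat them as an assumption; `robust_z` is a hypothetical name.

```python
import numpy as np


def robust_z(data, method="mad", eps=1e-12):
    """Centre on the median and divide by a robust scale estimate."""
    x = np.asarray(data, dtype=float)
    med = np.median(x, axis=0)
    if method == "mad":
        scale = 1.4826 * np.median(np.abs(x - med), axis=0)
    elif method == "iqr":
        q75, q25 = np.percentile(x, [75, 25], axis=0)
        scale = (q75 - q25) / 1.349
    elif method == "std":
        # Root-mean-square deviation around the median rather than the mean.
        scale = np.sqrt(np.mean((x - med) ** 2, axis=0))
    else:
        raise ValueError(f"unknown method: {method!r}")
    return (x - med) / (scale + eps)   # eps avoids division by zero


x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
z = robust_z(x)
print(z[2], z[4] > 60)   # 0.0 True: the median maps to 0, the outlier stays extreme
```

With an ordinary mean and standard deviation the value 100 would inflate the scale and shrink its own z-score; the median-based version keeps it clearly flagged.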

Cohens_D(Data, batch_indices, covariates=None, BatchNames=None, covariate_names=None, covariate_types=None)

Compute Cohen's d for each batch against the pooled remainder.

This function reports batch-versus-rest effect sizes to give a global view of how each batch differs from the overall distribution after optional covariate residualization.

Parameters:

Data: Data matrix with shape (n_samples, n_features). Required.
batch_indices: Batch labels for each sample. Required.
covariates: Optional covariate matrix to residualize before effect-size calculation. Default: None.
BatchNames: Optional display names for batches. Default: None.
covariate_names: Optional names for covariate columns. Default: None.
covariate_types: Optional covariate type codes used by the residualizing workflow. Default: None.

Returns:

tuple[np.ndarray, list[tuple[str, str]]]: Cohen's d values with shape (n_batches, n_features) and the corresponding batch-vs-rest labels.

Notes

Cohen's d is computed as (mean_batch - mean_other) / std_other, where mean_other and std_other are the unweighted averages across the other batches.
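The formula in this note can be sketched as follows. The example assumes sample standard deviations (ddof=1) and skips covariate residualization; `cohens_d_vs_rest` is a hypothetical name used only for illustration.

```python
import numpy as np


def cohens_d_vs_rest(data, batch):
    """Cohen's d of each batch against the unweighted average of the other batches."""
    data = np.asarray(data, dtype=float)
    batch = np.asarray(batch)
    labels = np.unique(batch)
    means = np.array([data[batch == b].mean(axis=0) for b in labels])
    stds = np.array([data[batch == b].std(axis=0, ddof=1) for b in labels])
    d = np.empty_like(means)
    for i in range(len(labels)):
        others = np.arange(len(labels)) != i
        mean_other = means[others].mean(axis=0)  # unweighted: each batch counts once
        std_other = stds[others].mean(axis=0)
        d[i] = (means[i] - mean_other) / std_other
    return labels, d


data = np.array([0, 1, 2, 2, 3, 4, 4, 5, 6], dtype=float)[:, None]
batch = np.array(["a"] * 3 + ["b"] * 3 + ["c"] * 3)
labels, d = cohens_d_vs_rest(data, batch)
print(d[:, 0].tolist())  # [-3.0, 0.0, 3.0]
```

Averaging per-batch statistics without weights means a large batch does not dominate the reference that the other batches are compared against.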

PC_Correlations(Data, batch, N_components=None, covariates=None, variable_names=None, *, enforce_min_components_for_plotting=True)

Perform PCA and correlate top PCs with batch and covariates if given, returning explained variance, scores, and correlation results.

Parameters:

Data: np.ndarray of shape (n_samples, n_features) - the data matrix. Required.
batch: array-like of shape (n_samples,) - batch labels for each sample. Required.
N_components: int or None - number of principal components to compute (None means min(n_samples, n_features)). Default: None.
covariates: optional np.ndarray of shape (n_samples, n_covariates). Default: None.
variable_names: optional list of length n_covariates. Default: None.

Returns:

explained_variance: np.ndarray of shape (n_components,) with the percentage of variance explained by each PC.
scores: np.ndarray of shape (n_samples, n_components) with the PCA scores for each sample.
PC_correlations: dict mapping each variable name to a dict with keys 'correlation' and 'p_value' (each an array of shape (n_components,)) for the Pearson correlation of each PC with that variable.
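A rough sketch of the same idea using only NumPy: PCA via SVD of the centred data, then a Pearson correlation between each PC score and each numeric variable. p-values are omitted, batch labels would first need a numeric encoding, and `pc_correlations` is a hypothetical name rather than the documented function.

```python
import numpy as np


def pc_correlations(data, variables, n_components=2):
    """Return explained variance (%), PC scores, and PC-variable correlations."""
    X = np.asarray(data, dtype=float)
    Xc = X - X.mean(axis=0)                       # centre each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = (U * S)[:, :n_components]            # projections onto the PCs
    explained = 100.0 * (S ** 2 / np.sum(S ** 2))[:n_components]
    corr = {}
    for name, values in variables.items():
        v = np.asarray(values, dtype=float)
        corr[name] = np.array(
            [np.corrcoef(scores[:, k], v)[0, 1] for k in range(n_components)]
        )
    return explained, scores, corr


rng = np.random.default_rng(0)
age = np.arange(10.0)
X = np.column_stack([age, 2 * age, rng.normal(scale=0.1, size=10)])

explained, scores, corr = pc_correlations(X, {"age": age})
print(explained[0] > 99, abs(corr["age"][0]) > 0.99)  # True True
```

The sign of a principal component is arbitrary, so it is the magnitude of the correlation that indicates how strongly a variable drives that component.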

Mahalanobis_Distance(Data=None, batch=None, covariates=None)

Calculate the Mahalanobis distance between batches in the data. Takes optional covariates and returns distances between each batch pair both before and after regressing out covariates. Additionally provides distance of each batch to the overall centroid before and after residualizing covariates.

Parameters:

Data (ndarray): Data matrix where rows are samples (n) and columns are features (p). Default: None.
batch (ndarray): 1D array-like batch labels for each sample (length n). Default: None.
covariates (ndarray): Covariate matrix (n x k). An intercept will be added automatically. Default: None.

Returns:

dict[str, Any]: A dictionary containing pairwise and centroid Mahalanobis distances before and, when covariates are provided, after residualization. Inner dictionary keys use tuples such as (b1, b2) or (b, "global").
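The tuple-keyed output can be illustrated with a small sketch that computes distances between batch centroids and from each centroid to the global centroid. It assumes a single covariance matrix estimated from all samples and uses a pseudo-inverse for stability; the documented function may estimate the covariance differently, and `batch_mahalanobis` is a hypothetical name.

```python
from itertools import combinations

import numpy as np


def batch_mahalanobis(data, batch):
    """Mahalanobis distances between batch centroids and to the global centroid."""
    X = np.asarray(data, dtype=float)
    batch = np.asarray(batch)
    labels = np.unique(batch).tolist()
    centroids = {b: X[batch == b].mean(axis=0) for b in labels}
    # Assumption: one covariance from all samples; pinv tolerates singular matrices.
    cov_inv = np.linalg.pinv(np.atleast_2d(np.cov(X, rowvar=False)))

    def dist(u, v):
        diff = u - v
        return float(np.sqrt(diff @ cov_inv @ diff))

    out = {(a, b): dist(centroids[a], centroids[b])
           for a, b in combinations(labels, 2)}
    global_centroid = X.mean(axis=0)
    for b in labels:
        out[(b, "global")] = dist(centroids[b], global_centroid)
    return out


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
X[10:] += [3.0, 0.0]                       # shift batch "b" along the first feature
batch = np.array(["a"] * 10 + ["b"] * 10)

dists = batch_mahalanobis(X, batch)
print(sorted(dists))  # [('a', 'b'), ('a', 'global'), ('b', 'global')]
```

Running the same computation on covariate-residualized data would give the "after" distances described above.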

Variance_Ratios(data, batch, covariates=None, covariate_names=None, covariate_types=None, mode='rest')

Calculate feature-wise variance ratios for batches.

Multiple comparison modes are available depending on the desired reference set.

Parameters:

data: NumPy array with shape (n_samples, n_features). Required.
batch: Batch labels with length n_samples. Required.
covariates: Optional covariates passed to RobustOLS before computing the ratios. Default: None.
covariate_names: Optional covariate names passed to RobustOLS. Default: None.
covariate_types: Optional covariate type codes passed to RobustOLS. Default: None.
mode: One of {"pairwise", "rest", "unweighted_mean", "weighted_mean"}. Default: 'rest'.

Returns:

dict[Any, np.ndarray]: A dictionary of variance-ratio arrays keyed by batch label or batch-pair tuple, depending on mode.

Notes

pairwise compares every unique batch pair. rest compares each batch against all remaining samples. unweighted_mean and weighted_mean compare each batch against the mean variance of the other batches.
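The pairwise and rest modes can be sketched as below. The example assumes sample variances (ddof=1) and leaves out the two mean-based modes and the covariate residualization step; `variance_ratios_sketch` is a hypothetical name.

```python
from itertools import combinations

import numpy as np


def variance_ratios_sketch(data, batch, mode="rest"):
    """Feature-wise variance ratios for the 'pairwise' and 'rest' modes only."""
    X = np.asarray(data, dtype=float)
    batch = np.asarray(batch)
    labels = np.unique(batch).tolist()
    var = {b: X[batch == b].var(axis=0, ddof=1) for b in labels}
    if mode == "pairwise":
        # One ratio per unique batch pair, keyed by the pair.
        return {(a, b): var[a] / var[b] for a, b in combinations(labels, 2)}
    if mode == "rest":
        # Each batch against all remaining samples pooled together.
        return {b: var[b] / X[batch != b].var(axis=0, ddof=1) for b in labels}
    raise ValueError(f"mode not covered by this sketch: {mode!r}")


data = np.array([0, 1, 2, 0, 2, 4], dtype=float)[:, None]
batch = np.array(["a", "a", "a", "b", "b", "b"])

print(variance_ratios_sketch(data, batch, "pairwise")[("a", "b")].tolist())  # [0.25]
print(variance_ratios_sketch(data, batch, "rest")["b"].tolist())             # [4.0]
```

A ratio near 1 suggests similar spread; values far from 1 in either direction point to a scale difference between batches.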

Levene_Test(data, batch, centre='median')

Perform Levene's test for variance differences between each unique batch pair.

Parameters:

data: np.ndarray of shape (n_samples, n_features) - the data matrix. Required.
batch: np.ndarray of shape (n_samples,) - the batch labels. Required.
centre: str, optional - the method used to calculate the centre for Levene's test ('median', 'mean', or 'trimmed'). 'median' is more robust to outliers and non-normal distributions. Default: 'median'.

Returns:

dict: A dictionary where keys are tuples of batch pairs (batch1, batch2) and values are dictionaries containing 'statistic' and 'p_value' arrays of shape (n_features,).
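For intuition, the median-centred variant of Levene's test (often called Brown-Forsythe) is a one-way ANOVA on absolute deviations from each group's median. The sketch below computes only the W statistic for one batch pair per feature; p-values would come from an F distribution, for example via scipy.stats.levene. The helper names are hypothetical.

```python
import numpy as np


def brown_forsythe_stat(x, y):
    """Levene's W statistic with median centring for two groups."""
    groups = [np.asarray(x, dtype=float), np.asarray(y, dtype=float)]
    z = [np.abs(g - np.median(g)) for g in groups]   # absolute deviations
    n = np.array([len(g) for g in z])
    z_means = np.array([g.mean() for g in z])
    z_all = np.concatenate(z).mean()
    between = np.sum(n * (z_means - z_all) ** 2)
    within = sum(np.sum((g - m) ** 2) for g, m in zip(z, z_means))
    k, total = len(groups), n.sum()
    return (total - k) / (k - 1) * between / within


def levene_pair(data, batch, b1, b2):
    """Apply the statistic feature by feature for one batch pair."""
    data = np.asarray(data, dtype=float)
    batch = np.asarray(batch)
    return np.array([
        brown_forsythe_stat(data[batch == b1, j], data[batch == b2, j])
        for j in range(data.shape[1])
    ])


data = np.array([1, 2, 3, 4, 5, 2, 4, 6, 8, 10], dtype=float)[:, None]
batch = np.array(["a"] * 5 + ["b"] * 5)
w = levene_pair(data, batch, "a", "b")
print(round(float(w[0]), 3))  # 2.057
```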

KS_Test(data, batch, feature_names=None, covariates=None, compare_pairs=False, compare_to_overall_excluding_batch=True, min_batch_n=3, alpha=0.05, do_fdr=True, residualize_covariates=True, covariate_names=None, covariate_types=None)

Perform two-sample Kolmogorov-Smirnov tests across batches.

The function can compare each batch against the pooled data, the pooled data excluding that batch, and optionally all unique batch pairs.

Parameters:

data: NumPy array with shape (n_samples, n_features). Required.
batch: Batch labels with length n_samples. Required.
feature_names: Optional feature names. Default: None.
covariates: Optional covariate matrix for residualization. Default: None.
compare_pairs: Whether to include pairwise batch-versus-batch tests. Default: False.
compare_to_overall_excluding_batch: Whether to compare each batch against the pooled data excluding that batch. Default: True.
min_batch_n: Minimum samples required per group for a feature-level test. Default: 3.
alpha: Significance threshold used in summaries. Default: 0.05.
do_fdr: Whether to compute Benjamini-Hochberg adjusted p-values. Default: True.
residualize_covariates: Whether to residualize covariates before testing. Default: True.
covariate_names: Optional names for covariate columns. Default: None.
covariate_types: Optional covariate type codes used during residualization. Default: None.

Returns:

dict[tuple[Any, Any], dict[str, Any]]: A dictionary keyed by comparison labels such as (batch, "overall") or (batch1, batch2). Each value contains the per-feature KS statistics, p-values, optional FDR-adjusted p-values, sample counts, and summary metrics.
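Two building blocks of this workflow can be sketched with NumPy alone: the two-sample KS statistic (the largest gap between two empirical CDFs) and the Benjamini-Hochberg adjustment applied across features. These are generic textbook versions with hypothetical names, not the library's code; KS p-values are not computed here.

```python
import numpy as np


def ks_statistic(x, y):
    """Two-sample KS statistic: maximum distance between empirical CDFs."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    grid = np.concatenate([x, y])                     # evaluate at every observation
    cdf_x = np.searchsorted(x, grid, side="right") / len(x)
    cdf_y = np.searchsorted(y, grid, side="right") / len(y)
    return float(np.max(np.abs(cdf_x - cdf_y)))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working from the largest p-value downwards.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out


print(ks_statistic([1, 2, 3], [4, 5, 6]))   # 1.0 (no overlap)
print(ks_statistic([1, 2, 3], [1, 2, 3]))   # 0.0 (identical samples)
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

For a batch-versus-rest comparison, `x` would hold one feature's values in the batch and `y` the values from all other samples; the adjustment is then applied to the per-feature p-values of each comparison.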