DiagnosticFunctions
A set of diagnostic functions for cross-sectional datasets.
fit_lmm_safe(df, formula_fixed, group_col='batch', reml=False, min_group_n=10, var_threshold=1e-08, optimizers=('lbfgs', 'bfgs', 'powell', 'cg'), maxiter=400, boundary_pvalue=True)
Fit a random-intercept LMM with warnings captured and safe fallbacks. This is a helper function for Run_LMM_cross_sectional to fit the LMM for each feature with robust error handling and diagnostics.
Returns a dictionary with the keys success, mdf / ols, optimizer_used, notes, warning_types / warning_messages, stats, and status.
Run_LMM_cross_sectional(Data, batch, covariates=None, feature_names=None, group_col_name='batch', covariate_names=None, min_group_n=2, var_threshold=1e-08, reml=False, optimizers=('lbfgs', 'bfgs', 'powell', 'cg'), maxiter=400, boundary_pvalue=True)
Run a random-intercept linear mixed model for each feature.
Batch is treated as the grouping variable, and any supplied covariates are included as fixed effects.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `Data` | | Array-like data matrix with shape | required |
| `batch` | | Array-like batch labels with length | required |
| `covariates` | | Optional covariate matrix or DataFrame with one row per sample. | `None` |
| `feature_names` | | Optional names for the feature columns. | `None` |
| `group_col_name` | | Column name used for the grouping variable in the temporary modeling DataFrame. | `'batch'` |
| `covariate_names` | | Optional names for covariate columns. | `None` |
| `min_group_n` | | Minimum batch size required before attempting mixed-model fitting. | `2` |
| `var_threshold` | | Variance threshold below which a feature is skipped. | `1e-08` |
| `reml` | | Whether to fit the mixed model with REML instead of ML. | `False` |
| `optimizers` | | Optimizers to try in sequence for model fitting. | `('lbfgs', 'bfgs', 'powell', 'cg')` |
| `maxiter` | | Maximum iterations per optimizer. | `400` |
| `boundary_pvalue` | | Whether to compute the mixture p-value for variance components on the boundary. | `True` |
Returns:
| Type | Description |
|---|---|
| `tuple[DataFrame, dict[str, Any]]` | A tuple containing the per-feature results DataFrame and a summary dictionary of notes and warnings across features. |
Notes
Features below `var_threshold` are skipped.
Small batch groups trigger an OLS fallback instead of mixed-model fitting.
Warnings raised during fitting are captured and included in the returned results.
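For orientation, the random-intercept model described above can be written as follows. The notation is illustrative and is not taken from the package source: $y_{ij}$ is one feature for sample $i$ in batch $j$, $\mathbf{x}_{ij}$ holds the covariates, and $u_j$ is the batch-level random intercept.

```latex
y_{ij} = \beta_0 + \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_j + \varepsilon_{ij},
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2),
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma_e^2)

% Testing \sigma_u^2 = 0 puts the parameter on the boundary of its range.
% A common correction uses a 50:50 mixture of \chi^2_0 and \chi^2_1:
p = \tfrac{1}{2}\, P\!\left(\chi^2_1 \ge \mathrm{LRT}\right)
```

Whether `boundary_pvalue=True` applies exactly this mixture is an assumption. The formula is shown only to indicate what a boundary-corrected p-value for a variance component usually means.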
RobustOLS_Orig(data, covariates, batch, covariate_names, covariate_types, report=None)
Helper that can be called by the Cohen's d, variance ratio, and KS test functions to residualise out covariate effects before calculating batch effects. Categorical covariates are dummy-encoded and continuous covariates are mean-centred. An optional variable describes each covariate's type: 0 binary, 2 categorical, 3 continuous. If it is not given, the function attempts to infer the types from the unique observations. Batch is always treated as categorical, and a dummy array is created for it.
RobustOLS(data, covariates, batch, covariate_names, covariate_types=None, report=None)
This is a helper function for residualising out covariate effects while preserving batch effects. It is used in the Cohens_D, Variance_Ratios, and KS_Test functions to ensure that the batch effect calculations are not confounded by covariate effects.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | | np.ndarray of shape (n_samples, n_features) - the data matrix to be residualised | required |
| `covariates` | | np.ndarray of shape (n_samples, n_covariates) - the covariate matrix | required |
| `batch` | | array-like of shape (n_samples,) - batch labels for each sample | required |
| `covariate_names` | | list of length n_covariates - names for each covariate | required |
| `covariate_types` | | list of length n_covariates with values 0 (binary), 2 (categorical), 3 (continuous) - optional, if not provided will be inferred | `None` |
| `report` | | optional object with method log_text(str) for logging messages about covariate type inference and processing steps. | `None` |
Returns: data_resid: np.ndarray of shape (n_samples, n_features) - the data matrix with covariate effects removed but batch effects preserved.
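The idea of removing covariate effects while keeping batch effects can be sketched with ordinary least squares: fit covariates and batch dummies together, then subtract only the covariate part of the fit. This is a minimal illustration, not the RobustOLS implementation. The function name `residualise_keep_batch_sketch` is hypothetical, all covariates are assumed continuous, and no type inference or robust fitting is attempted.

```python
import numpy as np


def residualise_keep_batch_sketch(data, covariates, batch):
    """Remove fitted covariate effects from `data`, leaving batch effects in place.

    Hypothetical helper for illustration; assumes continuous covariates only.
    """
    data = np.asarray(data, dtype=float)
    batch = np.asarray(batch)
    cov = np.asarray(covariates, dtype=float)
    cov = cov - cov.mean(axis=0)  # mean-centre the covariates

    # Dummy-encode batch, dropping the first level as the reference.
    labels = np.unique(batch)
    dummies = (batch[:, None] == labels[None, 1:]).astype(float)

    intercept = np.ones((data.shape[0], 1))
    design = np.hstack([intercept, dummies, cov])

    # Fit all features at once; rows of beta follow the design columns.
    beta, *_ = np.linalg.lstsq(design, data, rcond=None)
    beta_cov = beta[1 + dummies.shape[1]:]

    # Subtract only the covariate contribution, so the batch shift survives.
    return data - cov @ beta_cov


age = np.array([[1.0], [2.0], [3.0], [1.0], [2.0], [3.0]])
batch = np.array([0, 0, 0, 1, 1, 1])
data = 2.0 * age + 5.0 * batch[:, None]  # age effect of 2, batch shift of 5
resid = residualise_keep_batch_sketch(data, age, batch)
print(resid.ravel())
```

In this toy example the age slope is removed while the difference of 5 between the two batches is still present in the residuals.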
z_score(data, MAD=False)
Z-score normalization of the data matrix (samples x features). Uses median centering by default, as it is more robust to outliers and non-normal distributions.
robust_z_score(data, method='mad', eps=1e-12)
Apply robust z-scoring to a data matrix.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | | Input data with shape | required |
| `method` | | Scaling method. Use | `'mad'` |
| `eps` | | Small value used to avoid division by zero. | `1e-12` |
Returns:
| Type | Description |
|---|---|
| `ndarray` | The normalized data array. |
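A minimal sketch of median/MAD z-scoring, the kind of scaling that `method='mad'` suggests. It is not the package implementation: the function name `robust_z_sketch` is hypothetical, and both the 1.4826 consistency constant and the way `eps` enters the denominator are assumptions.

```python
import numpy as np


def robust_z_sketch(data, eps=1e-12):
    """Median/MAD z-score per feature (column); illustrative only."""
    data = np.asarray(data, dtype=float)
    median = np.median(data, axis=0)
    mad = np.median(np.abs(data - median), axis=0)
    # 1.4826 rescales the MAD to match the standard deviation under normality
    # (assumed here; the documented function may scale differently).
    scale = 1.4826 * mad
    # eps keeps the division finite for constant features.
    return (data - median) / (scale + eps)


x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [100.0, 40.0]])
z = robust_z_sketch(x)
constant = robust_z_sketch(np.ones((4, 1)))
```

The outlier `100.0` in the first column barely moves the median or the MAD, which is the motivation for robust scaling over mean and standard deviation.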
Cohens_D(Data, batch_indices, covariates=None, BatchNames=None, covariate_names=None, covariate_types=None)
Compute Cohen's d for each batch against the pooled remainder.
This function reports batch-versus-rest effect sizes to give a global view of how each batch differs from the overall distribution after optional covariate residualization.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `Data` | | Data matrix with shape | required |
| `batch_indices` | | Batch labels for each sample. | required |
| `covariates` | | Optional covariate matrix to residualize before effect-size calculation. | `None` |
| `BatchNames` | | Optional display names for batches. | `None` |
| `covariate_names` | | Optional names for covariate columns. | `None` |
| `covariate_types` | | Optional covariate type codes used by the residualizing workflow. | `None` |
Returns:
| Type | Description |
|---|---|
| `tuple[ndarray, list[tuple[str, str]]]` | Cohen's d values with shape |
Notes
Cohen's d is computed as (mean_batch - mean_other) / std_other, where mean_other and std_other are the unweighted averages across the other batches.
PC_Correlations(Data, batch, N_components=None, covariates=None, variable_names=None, *, enforce_min_components_for_plotting=True)
Perform PCA and correlate top PCs with batch and covariates if given, returning explained variance, scores, and correlation results.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `Data` | | np.ndarray of shape (n_samples, n_features) - the data matrix. | required |
| `batch` | | array-like of shape (n_samples,) - batch labels for each sample (can | required |
| `N_components` | | int or None - number of principal components to compute (default None means min(n_samples, n_features)). | `None` |
| `covariates` | | optional np.ndarray of shape (n_samples, n_covariates) | `None` |
| `variable_names` | | optional list of length covariates | `None` |
Returns:
explained_variance: np.ndarray of shape (n_components,) with percentage of variance explained by each PC.
scores: np.ndarray of shape (n_samples, n_components) with the PCA scores for each sample.
PC_correlations: dict mapping variable name to dict with keys 'correlation' (array of shape (n_components,)) and 'p_value' (array of shape (n_components,)) for the Pearson correlation of each PC with that variable.
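The PCA-then-correlate workflow can be sketched with an SVD and `np.corrcoef`. This is a simplified illustration, not the PC_Correlations implementation: the function name `pc_correlations_sketch` is hypothetical, the variable is assumed to be numeric already (how categorical batch labels are encoded is not specified here), and p-values are omitted.

```python
import numpy as np


def pc_correlations_sketch(data, variable, n_components=2):
    """PCA via SVD, then Pearson correlation of each PC with one numeric variable."""
    data = np.asarray(data, dtype=float)
    centred = data - data.mean(axis=0)

    # Rows of vt are the principal axes; u * s gives the sample scores.
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    scores = (u * s)[:, :n_components]
    explained = 100.0 * s**2 / np.sum(s**2)  # percentage of variance per PC

    corr = np.array(
        [np.corrcoef(scores[:, k], variable)[0, 1] for k in range(n_components)]
    )
    return explained[:n_components], scores, corr


rng = np.random.default_rng(0)
batch = np.repeat([0, 1], 20)
data = rng.normal(size=(40, 5))
data[:, 0] += 10 * batch  # inject a strong batch shift into one feature
explained, scores, corr = pc_correlations_sketch(data, batch)
print(explained.round(1), corr.round(2))
```

With a batch shift this large, the first PC should align with the shifted feature and correlate strongly with the batch indicator. The sign of the correlation is arbitrary because the sign of a principal axis is.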
Mahalanobis_Distance(Data=None, batch=None, covariates=None)
Calculate the Mahalanobis distance between batches in the data. Takes optional covariates and returns distances between each batch pair both before and after regressing out covariates. Additionally provides distance of each batch to the overall centroid before and after residualizing covariates.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `Data` | `ndarray` | Data matrix where rows are samples (n) and columns are features (p). | `None` |
| `batch` | `ndarray` | 1D array-like batch labels for each sample (length n). | `None` |
| `covariates` | `ndarray` | Covariate matrix (n x k). An intercept will be added automatically. | `None` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | A dictionary containing pairwise and centroid Mahalanobis distances before and, when covariates are provided, after residualization. Inner dictionary keys use tuples such as |
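A minimal sketch of pairwise Mahalanobis distances between batch centroids. It is not the Mahalanobis_Distance implementation: the function name `pairwise_mahalanobis_sketch` is hypothetical, the pooled within-batch covariance and the pseudo-inverse are assumptions, and the covariate-residualized and batch-to-overall-centroid variants are left out.

```python
from itertools import combinations

import numpy as np


def pairwise_mahalanobis_sketch(data, batch):
    """Mahalanobis distance between every pair of batch centroids (illustrative)."""
    data = np.asarray(data, dtype=float)
    batch = np.asarray(batch)
    labels = np.unique(batch).tolist()
    centroids = {b: data[batch == b].mean(axis=0) for b in labels}

    # Pooled within-batch covariance (an assumed choice of covariance estimate).
    resid = np.vstack([data[batch == b] - centroids[b] for b in labels])
    pooled = resid.T @ resid / (data.shape[0] - len(labels))
    inv = np.linalg.pinv(pooled)  # pseudo-inverse tolerates singular covariance

    distances = {}
    for a, b in combinations(labels, 2):
        diff = centroids[a] - centroids[b]
        distances[(a, b)] = float(np.sqrt(diff @ inv @ diff))
    return distances


square = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
data = np.vstack([square, square + np.array([3.0, 4.0])])
batch = np.array([0, 0, 0, 0, 1, 1, 1, 1])
dist = pairwise_mahalanobis_sketch(data, batch)
print(dist)
```

Unlike the Euclidean distance between centroids, this scales the separation by the within-batch spread, so tight batches that sit close together can still register as far apart.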
Variance_Ratios(data, batch, covariates=None, covariate_names=None, covariate_types=None, mode='rest')
Calculate feature-wise variance ratios for batches.
Multiple comparison modes are available depending on the desired reference set.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | | NumPy array with shape | required |
| `batch` | | Batch labels with length | required |
| `covariates` | | Optional covariates passed to | `None` |
| `covariate_names` | | Optional covariate names passed to | `None` |
| `covariate_types` | | Optional covariate type codes passed to | `None` |
| `mode` | | One of | `'rest'` |
Returns:
| Type | Description |
|---|---|
| `dict[Any, ndarray]` | A dictionary of variance-ratio arrays keyed by batch label or batch-pair tuple, depending on |
Notes
`pairwise` compares every unique batch pair.
`rest` compares each batch against all remaining samples.
`unweighted_mean` and `weighted_mean` compare each batch against the mean variance of the other batches.
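The `rest` mode described in the notes can be sketched as a per-feature ratio of the batch variance to the variance of all remaining samples. This is an illustration, not the Variance_Ratios implementation: the function name `variance_ratio_rest_sketch` is hypothetical, the sample variance (`ddof=1`) is assumed, and covariate residualization is omitted.

```python
import numpy as np


def variance_ratio_rest_sketch(data, batch):
    """Per-feature variance of each batch divided by the variance of the rest."""
    data = np.asarray(data, dtype=float)
    batch = np.asarray(batch)
    ratios = {}
    for b in np.unique(batch):
        in_batch = batch == b
        var_batch = data[in_batch].var(axis=0, ddof=1)
        var_rest = data[~in_batch].var(axis=0, ddof=1)  # all remaining samples pooled
        ratios[b] = var_batch / var_rest
    return ratios


data = np.array([[0.0], [2.0], [0.0], [4.0]])
batch = np.array([0, 0, 1, 1])
ratios = variance_ratio_rest_sketch(data, batch)
print(ratios)
```

A ratio near 1 indicates similar spread. Pooling the remaining samples means that mean differences between the other batches inflate `var_rest`, which is one reason the mean-variance modes exist as alternatives.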
Levene_Test(data, batch, centre='median')
Perform Levene's test for variance differences between each unique batch pair.
Args:
data: np.ndarray of shape (n_samples, n_features) - the data matrix.
batch: np.ndarray of shape (n_samples,) - the batch labels.
centre: str, optional - the method to calculate the center for Levene's test ('median', 'mean', or 'trimmed'). Default is 'median', which is more robust to outliers and non-normal distributions.
Returns:
dict: A dictionary where keys are tuples of batch pairs (batch1, batch2) and values are dictionaries containing 'statistic' and 'p_value' arrays of shape (n_features
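For reference, the Levene statistic with median centring (the Brown-Forsythe variant) can be computed by hand for a single feature as sketched below. This is not the Levene_Test implementation: the function name `levene_statistic_sketch` is hypothetical, only the statistic is computed, and the p-value (from an F distribution with k - 1 and N - k degrees of freedom) is omitted.

```python
import numpy as np


def levene_statistic_sketch(groups, centre=np.median):
    """Levene W statistic for one feature across k groups (illustrative)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    # Absolute deviations from each group's centre.
    z = [np.abs(g - centre(g)) for g in groups]
    n = np.array([len(g) for g in groups])
    total, k = n.sum(), len(groups)

    z_bar_i = np.array([zi.mean() for zi in z])  # per-group mean deviation
    z_bar = np.concatenate(z).mean()             # overall mean deviation

    between = np.sum(n * (z_bar_i - z_bar) ** 2)
    within = sum(np.sum((zi - zb) ** 2) for zi, zb in zip(z, z_bar_i))
    return (total - k) / (k - 1) * between / within


narrow = [1.0, 2.0, 3.0, 4.0, 5.0]
wide = [2.0, 4.0, 6.0, 8.0, 10.0]
w = levene_statistic_sketch([narrow, wide])
w_same = levene_statistic_sketch([narrow, narrow])
```

The statistic is an ANOVA on the absolute deviations, so it is zero when the groups have identical spread and grows as the spreads diverge.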
KS_Test(data, batch, feature_names=None, covariates=None, compare_pairs=False, compare_to_overall_excluding_batch=True, min_batch_n=3, alpha=0.05, do_fdr=True, residualize_covariates=True, covariate_names=None, covariate_types=None)
Perform two-sample Kolmogorov-Smirnov tests across batches.
The function can compare each batch against the pooled data, the pooled data excluding that batch, and optionally all unique batch pairs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | | NumPy array with shape | required |
| `batch` | | Batch labels with length | required |
| `feature_names` | | Optional feature names. | `None` |
| `covariates` | | Optional covariate matrix for residualization. | `None` |
| `compare_pairs` | | Whether to include pairwise batch-versus-batch tests. | `False` |
| `compare_to_overall_excluding_batch` | | Whether to compare each batch against the pooled data excluding that batch. | `True` |
| `min_batch_n` | | Minimum samples required per group for a feature-level test. | `3` |
| `alpha` | | Significance threshold used in summaries. | `0.05` |
| `do_fdr` | | Whether to compute Benjamini-Hochberg adjusted p-values. | `True` |
| `residualize_covariates` | | Whether to residualize covariates before testing. | `True` |
| `covariate_names` | | Optional names for covariate columns. | `None` |
| `covariate_types` | | Optional covariate type codes used during residualization. | `None` |
Returns:
| Type | Description |
|---|---|
| `dict[tuple[Any, Any], dict[str, Any]]` | A dictionary keyed by comparison labels such as … contains the per-feature KS statistics, p-values, optional FDR-adjusted p-values, sample counts, and summary metrics. |
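Two building blocks of this workflow can be sketched in plain NumPy: the two-sample KS statistic for one feature, and the Benjamini-Hochberg adjustment applied across features. This is an illustration, not the KS_Test implementation. The function names `ks_statistic_sketch` and `bh_adjust_sketch` are hypothetical, and KS p-values are not computed here.

```python
import numpy as np


def ks_statistic_sketch(x, y):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    grid = np.concatenate([x, y])
    # Empirical CDFs evaluated at every observed value.
    cdf_x = np.searchsorted(x, grid, side="right") / x.size
    cdf_y = np.searchsorted(y, grid, side="right") / y.size
    return float(np.max(np.abs(cdf_x - cdf_y)))


def bh_adjust_sketch(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted


d_separated = ks_statistic_sketch([1, 2, 3], [4, 5, 6])
d_overlap = ks_statistic_sketch([1, 2, 3, 4], [3, 4, 5, 6])
adjusted = bh_adjust_sketch([0.01, 0.04, 0.03, 0.005])
print(d_separated, d_overlap, adjusted)
```

The KS statistic responds to any difference in distribution shape, not only to shifts in mean or variance, which is why it complements the effect-size and variance-ratio diagnostics.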