DiagnosticFunctionsLong

A set of diagnostic functions used in longitudinal, test-retest scenarios.

SubjectOrder_long(idp_matrix, subjects, timepoints, idp_names=None, nPerm=10000, seed=None)

Compute pairwise Spearman correlations between matched subjects across all ordered timepoint pairs, with permutation-based significance testing.

For each pair of timepoints (order preserved from first appearance), subjects present at both timepoints are matched and Spearman’s rho is computed for each IDP (column). Significance is assessed via permutation testing by shuffling the second-timepoint values within matched pairs (nPerm iterations). P-values use the +1 correction: p = (1 + count_ge) / (1 + valid_null_count).
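
The permutation scheme described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the library's implementation: the helper names (`average_ranks`, `spearman_rho`, `permutation_pvalue`) are hypothetical, and `count_ge` is assumed to count null correlations greater than or equal to the observed rho (a one-sided comparison).

```python
import numpy as np

def average_ranks(x):
    """Rank values 1..n, giving tied values their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    # Average the ranks within each group of tied values.
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    sums = np.bincount(inverse, weights=ranks)
    return sums[inverse] / counts[inverse]

def spearman_rho(a, b):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])

def permutation_pvalue(a, b, n_perm=10000, seed=None):
    """Observed rho, mean null rho, and +1-corrected permutation p-value.

    a, b: values of one IDP for the same matched subjects at two timepoints.
    Assumption: the comparison is one-sided (null rho >= observed rho).
    """
    rng = np.random.default_rng(seed)
    observed = spearman_rho(a, b)
    # Shuffle the second-timepoint values to break the subject matching.
    null = np.array([spearman_rho(a, rng.permutation(b)) for _ in range(n_perm)])
    valid = null[~np.isnan(null)]
    count_ge = int(np.sum(valid >= observed))
    p_value = (1 + count_ge) / (1 + valid.size)
    return observed, float(valid.mean()), p_value
```

With the +1 correction the smallest attainable p-value is 1 / (1 + nPerm), so it is never exactly zero.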

Args:

idp_matrix : array-like, shape (n_samples, n_idps)
    Numeric matrix of IDP values.
subjects : sequence of length n_samples
    Subject identifiers (matched across timepoints).
timepoints : sequence of length n_samples
    Timepoint labels (order defines comparison order).
idp_names : sequence of length n_idps, optional
    Names of IDPs; defaults to ["idp_1", ...].
nPerm : int, default=10000
    Number of permutations (>=1).
seed : int, optional
    Random seed for reproducibility.

Returns:

Type Description
DataFrame

pd.DataFrame: Columns ["TimeA", "TimeB", "IDP", "nPairs", "SpearmanRho", "NullMeanRho", "pValue"]. For rows with fewer than 3 matched pairs, the statistics are NaN.

WithinSubjVar_long(idp_matrix, subjects, timepoints, idp_names=None)

Compute within-subject variability (percent) for each IDP across timepoints.

For each subject, variability is calculated per IDP using available (non-NaN) observations:

  • If exactly 2 timepoints: absolute percent difference relative to the mean, |x1 - x2| / mean * 100.
  • If >2 timepoints: coefficient of variation (sample SD, ddof=1) relative to the mean, SD / mean * 100.
  • If mean is 0 or no valid data: returns NaN.
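
The three rules above can be written as a small per-subject, per-IDP helper. This is a minimal sketch, not the library's code: the function name `within_subject_variability` is hypothetical, and returning NaN when only one valid observation remains is an assumption (the rules above do not cover that case).

```python
import numpy as np

def within_subject_variability(values):
    """Percent variability of one subject's repeated measurements of one IDP."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]              # use only available observations
    if x.size < 2:                   # assumption: a single value has no variability
        return float("nan")
    mean = x.mean()
    if mean == 0:
        return float("nan")
    if x.size == 2:
        # Absolute percent difference relative to the mean.
        return float(abs(x[0] - x[1]) / mean * 100)
    # Coefficient of variation with the sample SD (ddof=1).
    return float(x.std(ddof=1) / mean * 100)
```

For example, the pair (10, 12) gives |10 - 12| / 11 * 100, roughly 18.2 percent, while the triple (9, 10, 11) has sample SD 1 and mean 10, giving 10 percent.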

Parameters:

Name Type Description Default
idp_matrix ndarray

Numeric matrix of IDP values with shape (n_samples, n_idps).

required
subjects Sequence

Subject identifiers used to group repeated measurements.

required
timepoints Sequence

Timepoint labels required for input alignment.

required
idp_names Optional[Sequence[str]]

Optional names of IDPs. Defaults to ["idp_1", ...].

None

Returns:

Type Description
DataFrame

pd.DataFrame: One row per subject, with a "subject" column followed by one column per IDP; each IDP value is that subject's within-subject percent variability.

Raises:

Type Description
ValueError

If idp_matrix is not 2-D, if input sequence lengths do not match the number of rows, or if idp_names length does not match the number of columns.

MultiVariateBatchDifference_long(idp_matrix, batch, idp_names=None, return_info=False)

Compute multivariate batch/site differences as Mahalanobis distances of site means from the overall mean, with numerically-stable handling of covariance estimation and inversion.

For each batch (site) this routine:

  • Computes the site mean vector after dropping rows with any NaN across features.
  • Estimates each site's covariance (zero matrix if n_samples_retained <= 1).
  • Averages site covariances to form an overall covariance.
  • Computes the Mahalanobis distance (MD) between each site mean and the overall mean. If the overall covariance is ill-conditioned or singular, the function falls back to an SVD-based pseudoinverse (with tolerance-based truncation) for numeric stability.
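
The steps above can be sketched with NumPy as follows. This is an illustration under assumptions rather than the function's actual code: the name `site_mahalanobis` is hypothetical, the overall mean is assumed to be the mean of all retained rows, the conditioning threshold (1 / machine epsilon) is an assumed choice, and sites with no retained rows are simply absent from the result instead of being reported as NaN.

```python
import numpy as np

def site_mahalanobis(X, batch):
    """Mahalanobis distance of each site mean from the overall mean."""
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    keep = ~np.isnan(X).any(axis=1)            # drop rows with any NaN
    X, batch = X[keep], batch[keep]
    n_features = X.shape[1]
    overall_mean = X.mean(axis=0)              # assumption: mean of retained rows

    sites = list(dict.fromkeys(batch.tolist()))  # order of first appearance
    means, covs = {}, []
    for site in sites:
        rows = X[batch == site]
        means[site] = rows.mean(axis=0)
        if len(rows) > 1:
            covs.append(np.atleast_2d(np.cov(rows, rowvar=False)))
        else:
            covs.append(np.zeros((n_features, n_features)))
    pooled = np.mean(covs, axis=0)             # simple average of site covariances

    # Direct solve when well-conditioned, SVD-based pseudoinverse otherwise.
    well_conditioned = np.linalg.cond(pooled) < 1.0 / np.finfo(float).eps
    distances = {}
    for site in sites:
        d = means[site] - overall_mean
        if well_conditioned:
            solved = np.linalg.solve(pooled, d)
        else:
            solved = np.linalg.pinv(pooled) @ d
        distances[site] = float(np.sqrt(max(float(d @ solved), 0.0)))
    return distances
```

`np.linalg.pinv` computes the pseudoinverse through an SVD and discards small singular values, which is one way to get the tolerance-based truncation mentioned above.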

Parameters:

Name Type Description Default
idp_matrix ndarray

Numeric matrix of features or IDPs with shape (n_samples, n_features).

required
batch Series | Sequence

Batch or site labels for each row.

required
idp_names Optional[Sequence[str]]

Optional feature names used for intermediate DataFrame columns.

None
return_info bool

Whether to also return a diagnostics dictionary containing site categories, retained counts, covariance conditioning metadata, and the averaged covariance matrix.

False

Returns:

Type Description
DataFrame | Tuple[DataFrame, Dict[str, Any]]

pd.DataFrame | tuple[pd.DataFrame, dict[str, Any]]: A DataFrame with columns ["batch", "mdval"], optionally returned alongside a diagnostics dictionary when return_info is True.

Raises:

Type Description
ValueError

If idp_matrix is not 2-D, if batch length does not match the number of rows, or if idp_names length does not match the number of features.

Notes

  • Rows with any NaN across features are dropped when computing a site's mean and covariance.
  • If a site has zero retained rows, a warning is emitted and its mean is left as NaN; the MD for that site is then NaN.
  • If a site has one retained row, its covariance is taken to be the zero matrix.
  • The averaged covariance is the simple mean of per-site covariances.
  • Mahalanobis distances are computed as sqrt((mu_i - mu_overall)' Σ^{-1} (mu_i - mu_overall)).
  • For numerical stability, the function attempts a direct linear solve when the averaged covariance is well-conditioned; otherwise it uses an SVD-based pseudoinverse with a tolerance derived from machine epsilon.

build_mixed_formula(tbl_in, response_var, fix_eff, ran_eff, batch_vars, force_categorical=(), force_numeric=(), zscore_vars=(), zscore_response=True)

Build the longitudinal mixed-model formulas used by the pipeline.

Parameters:

Name Type Description Default
tbl_in DataFrame

Input DataFrame containing the response and predictor columns.

required
response_var str

Response variable to model.

required
fix_eff Iterable[str]

Fixed-effect terms to include.

required
ran_eff Iterable[str]

Random-effect grouping terms to include.

required
batch_vars Iterable[str]

Batch-related terms to include in the full model only.

required
force_categorical Iterable[str]

Columns to coerce to categorical dtype.

()
force_numeric Iterable[str]

Columns to coerce to numeric dtype.

()
zscore_vars Iterable[str]

Columns to z-score before formula construction.

()
zscore_response bool

Whether to z-score the response column.

True

Returns:

Type Description
Tuple[DataFrame, List[str]]

tuple[pd.DataFrame, list[str]]: The transformed DataFrame and a list of formulas ordered as full model, subject-only or null model, and fixed-effects-only model.

MixedEffects_long(idp_matrix, subjects, timepoints, batches, idp_names, *, covariates=None, fix_eff=(), ran_eff=(), force_categorical=(), force_numeric=(), zscore_var=(), do_zscore=True, p_thr=0.05, p_corr=1, reml=True)

Run a mixed-effects modeling pipeline per-IDP for longitudinal, multi-site data.

For each IDP (column) this function:

  1. Builds three formulas (full, subject-only / null, no-batch) via build_mixed_formula, honoring forced categorical/numeric conversions and optional z-scoring of variables.
  2. Fits a full mixed model (fixed effects including batch + specified random effects) using statsmodels MixedLM.
  3. Runs pairwise Wald contrasts between batch levels to count significant site differences.
  4. Fits a subject-only random-intercept model to extract subject (between) variance and residual (within) variance, then computes ICC and WCV.
  5. Fits a fixed-effects-only model (no batch terms) to extract coefficient estimates, p-values and confidence intervals for requested fixed effects.
  6. Collects diagnostics and returns one summary row per IDP. Model failures yield NaNs for that IDP but do not stop the pipeline.
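
Step 4 turns two variance components into summary ratios. The sketch below shows the usual ICC formula, between-subject variance divided by total variance. The WCV formula is an assumption: it is taken here to be the within-subject SD divided by the absolute grand mean, and the exact definition used by the pipeline is not stated on this page. Both function names are hypothetical.

```python
import math

def icc_from_variances(var_between, var_within):
    """Intraclass correlation: share of total variance due to subjects."""
    total = var_between + var_within
    if total <= 0:
        return float("nan")
    return var_between / total

def wcv_from_variance(var_within, grand_mean):
    """Within-subject coefficient of variation (assumed definition)."""
    if grand_mean == 0 or var_within < 0:
        return float("nan")
    return math.sqrt(var_within) / abs(grand_mean)
```

For example, a subject variance of 3 and a residual variance of 1 give an ICC of 0.75.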

Parameters:

Name Type Description Default
idp_matrix ndarray

Numeric matrix of IDP values with shape (n_samples, n_idps).

required
subjects Sequence

Subject identifiers.

required
timepoints Sequence

Timepoint labels.

required
batches Sequence

Batch or site labels, converted internally to categorical.

required
idp_names Sequence

Names for IDP columns.

required
covariates Optional[Dict[str, Sequence]]

Optional mapping of name -> sequence for additional covariates.

None
fix_eff Sequence

Fixed-effect variable names to include.

()
ran_eff Sequence

Random-effect grouping variables.

()
force_categorical Sequence

Columns to coerce to categorical dtype.

()
force_numeric Sequence

Columns to coerce to numeric dtype.

()
zscore_var Sequence

Variables to z-score before model fitting.

()
do_zscore bool

Whether to use zscore_... columns when available.

True
p_thr float

Nominal alpha for pairwise Wald tests.

0.05
p_corr int

Multiple-comparison correction mode for pairwise tests.

1
reml bool

Whether to fit MixedLM using REML.

True

Returns:

Type Description
Tuple[DataFrame, list]

tuple[pd.DataFrame, list]: A tuple (results_df, model_defs) where results_df contains one row per IDP with mixed-model diagnostics and fixed-effect summaries, and model_defs stores the formulas used for each feature.

Raises:

Type Description
ValueError

If idp_matrix is not 2-D or if input sequence lengths do not match the number of rows.

KeyError

If requested variables in fix_eff, ran_eff, or force_* are not present in the assembled DataFrame.

Notes

  • Column names exposed to users (for fix_eff / ran_eff / force_*) are exactly 'subjects', 'timepoints', and 'batches'. These names are inserted into the working DataFrame, so callers should use them when referring to these variables.
  • The function reorders batch categories so that the largest group becomes the reference level before fitting, which helps keep the contrast parameterization stable.
  • The primary grouping column for mixed models is the first valid entry of ran_eff (or 'subjects' when ran_eff is not specified).
  • If model fitting fails for an IDP, the pipeline records NaNs for that IDP and continues; failures do not stop the whole run.
  • Pairwise contrasts are computed with pairwise_site_tests using the fit object's parameters and covariance; p-values are two-sided Wald z-tests.
  • Confidence intervals and p-values are extracted from the fitted statsmodels result objects when available; missing names or extraction failures result in NaNs for those fields.
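
A two-sided Wald z-test for a linear contrast of fitted coefficients can be sketched as below. This is a generic illustration and not the code of pairwise_site_tests, whose internals are not shown here; the function name `wald_contrast` and its arguments are hypothetical. The contrast vector picks out the difference between two batch coefficients (or a single coefficient when comparing against the reference level).

```python
import math
import numpy as np

def wald_contrast(params, cov, contrast):
    """Two-sided Wald z-test for the linear contrast c' beta.

    params: fitted coefficient vector; cov: its covariance matrix;
    contrast: vector c, e.g. +1 and -1 on the two batch levels compared.
    """
    beta = np.asarray(params, dtype=float)
    V = np.asarray(cov, dtype=float)
    c = np.asarray(contrast, dtype=float)
    estimate = float(c @ beta)
    variance = float(c @ V @ c)
    if variance <= 0:
        return float("nan"), float("nan")
    z = estimate / math.sqrt(variance)
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value
```

A pair would then count as a significant site difference when its p-value, after whatever correction p_corr selects, falls below p_thr.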

AdditiveEffect_long(data=None, idp_matrix=None, subjects=None, timepoints=None, batch_name=None, idp_names=None, covariates=None, *, idvar=None, batchvar=None, timevar=None, fix_eff=None, ran_eff=None, do_zscore=True, reml=False, verbose=True)

Test for additive (mean/location) batch effects per feature using mixed models.

For each feature (IDP) this routine:

  • Builds a per-feature local DataFrame including the response, specified fixed predictors and the batch column; numeric predictors are z-scored per-feature.
  • Optionally z-scores the response per-feature when do_zscore=True (default).
  • Fits a full mixed model lhs ~ <fixed_terms> + C(batch) with random effects given by ran_eff (defaults to idvar when ran_eff is None).
  • Fits a reduced mixed model lhs ~ <fixed_terms> (same random structure).
  • Primary test: likelihood-ratio test (LRT) using model log-likelihoods: LR = 2 * (llf_full - llf_reduced), df = n_levels(batch) - 1 (fallback to 1 if unknown). If LRT is not available or fails, falls back to a multivariate Wald test on the batch-related parameters (or a pseudoinverse-based Wald if the covariance is singular).
  • Records test statistic, degrees of freedom, p-value and which method was used: "LRT", "Wald", or "Wald_pinv".
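
The likelihood-ratio step above reduces to a short calculation once both log-likelihoods are known. The sketch below is an assumption-labeled illustration, not the routine's code: `likelihood_ratio_test` and `chi2_sf` are hypothetical names, and the chi-square upper tail is computed with the standard library only (for integer degrees of freedom) so that the example has no dependencies.

```python
import math

def chi2_sf(x, df):
    """Upper-tail chi-square probability for integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed forms for the two base cases, df=1 and df=2.
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(half)), 1
    else:
        q, k = math.exp(-half), 2
    # Recurrence: Q(k+2) = Q(k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1).
    while k < df:
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return min(q, 1.0)

def likelihood_ratio_test(llf_full, llf_reduced, n_batch_levels=None):
    """LR statistic, degrees of freedom, and p-value for the batch term."""
    lr = 2.0 * (llf_full - llf_reduced)
    # df = number of batch levels - 1, falling back to 1 if unknown.
    if n_batch_levels is not None and n_batch_levels > 1:
        df = n_batch_levels - 1
    else:
        df = 1
    return lr, df, chi2_sf(lr, df)
```

Likelihood-ratio comparisons of models that differ in their fixed effects are generally made with maximum-likelihood rather than REML fits, which is consistent with the reml=False default of this function.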

Parameters:

Name Type Description Default
data Optional[DataFrame]

Optional DataFrame used directly when provided.

None
idp_matrix Optional[ndarray]

Optional feature matrix used when data is not supplied.

None
subjects Optional[Sequence]

Optional subject IDs used when building a DataFrame from arrays.

None
timepoints Optional[Sequence]

Optional timepoint labels used when building a DataFrame from arrays.

None
batch_name Optional[Sequence]

Optional batch labels used when building a DataFrame from arrays.

None
idp_names Optional[Iterable[str]]

Optional feature names for idp_matrix.

None
covariates Optional[Dict[str, Sequence]]

Optional mapping of name -> sequence for additional covariates.

None
idvar Optional[str]

Column name for subject IDs.

None
batchvar Optional[str]

Column name for batch labels.

None
timevar Optional[str]

Column name for timepoints.

None
fix_eff Optional[Iterable[str]]

Fixed-effect predictors.

None
ran_eff Optional[Iterable[str]]

Random-effect grouping variables.

None
do_zscore bool

Whether to z-score the response per feature.

True
reml bool

Whether to fit MixedLM using REML.

False
verbose bool

Whether to print progress and model formulas.

True

Returns:

Type Description
DataFrame

pd.DataFrame: One row per feature with additive batch-effect test statistics, degrees of freedom, p-values, and method labels.

Raises:

Type Description
KeyError

If ran_eff variables are not found in the assembled DataFrame.

ValueError

If idp_matrix is not 2-D or if input sequence lengths do not match the number of rows.

Notes

  • Per-feature predictor z-scoring is always applied to numeric fix_eff (local to each feature) via _build_fixed_formula_terms.
  • do_zscore=True (default): z-scores the response per feature and uses the z-scored response (z_<feature>) as LHS. Set do_zscore=False to keep original units.
  • reml=False (default): mixed models are fitted with REML disabled. Pass reml=True to use REML.
  • Rows with NaN responses are dropped per-feature. Features with fewer than 3 retained rows are skipped and returned with NaNs.
  • If the full or reduced mixed fit fails, that feature is reported with NaNs.
  • The Wald fallback constructs contrasts for batch-related parameters found in the fitted parameter names and uses the parameter covariance matrix to compute a chi-square statistic; a pseudoinverse is used if needed.
  • Because predictors are z-scored per-feature, coefficient magnitudes are comparable across features only in the z-scored scale (unless do_zscore=False).
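
The Wald fallback can be illustrated with the quadratic form b' V^{-1} b over the batch-related coefficients. This is a sketch under assumptions, not the routine's implementation: `wald_statistic` is a hypothetical name, singularity is detected with a rank check, and using the rank of the covariance as the degrees of freedom in the pseudoinverse case is an assumed choice.

```python
import numpy as np

def wald_statistic(batch_params, batch_cov):
    """Chi-square Wald statistic for H0: all batch coefficients are zero.

    Returns (statistic, degrees of freedom, method label).
    """
    b = np.asarray(batch_params, dtype=float)
    V = np.atleast_2d(np.asarray(batch_cov, dtype=float))
    rank = int(np.linalg.matrix_rank(V))
    if rank == b.size:
        # Full-rank covariance: ordinary Wald statistic via a linear solve.
        return float(b @ np.linalg.solve(V, b)), int(b.size), "Wald"
    # Singular covariance: fall back to the Moore-Penrose pseudoinverse.
    return float(b @ np.linalg.pinv(V) @ b), rank, "Wald_pinv"
```

The p-value would then be the chi-square upper-tail probability of the statistic at the returned degrees of freedom.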

MultiplicativeEffect_long(data=None, idp_matrix=None, subjects=None, timepoints=None, batch_name=None, idp_names=None, covariates=None, *, idvar=None, batchvar=None, timevar=None, fix_eff=None, ran_eff=None, do_zscore=True, reml=False, verbose=True)

Test for multiplicative (variance / heteroskedasticity) batch effects per feature.

For each feature (IDP) this routine:

  • Builds a per-feature local DataFrame including the response, specified fixed predictors and the batch column; numeric predictors are z-scored per-feature.
  • Optionally z-scores the response per-feature when do_zscore=True (default).
  • Fits a full mixed model lhs ~ <fixed_terms> + C(batch) with random effects given by ran_eff (defaults to idvar when ran_eff is None) — the residuals from this fit are used for variance comparisons.
  • Tests whether residual variability differs across batches using Fligner's test (a robust, non-parametric test for homogeneity of variances). The reported statistic is the Fligner chi-square and the p-value is from that test.
  • Records test statistic, DF (n_groups - 1), p-value and method "Fligner".
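
The variance comparison above can be sketched with SciPy's Fligner-Killeen test applied to residuals grouped by batch. This is an illustration only: `fligner_by_batch` is a hypothetical helper, the residuals are assumed to have been extracted from the full mixed-model fit already, and whether the routine itself calls `scipy.stats.fligner` is an assumption.

```python
import numpy as np
from scipy.stats import fligner

def fligner_by_batch(residuals, batch):
    """Fligner-Killeen test for equal residual variance across batches.

    Returns (statistic, degrees of freedom, p-value), or NaNs when fewer
    than two batches have residuals.
    """
    residuals = np.asarray(residuals, dtype=float)
    batch = np.asarray(batch)
    ok = ~np.isnan(residuals)
    residuals, batch = residuals[ok], batch[ok]
    # One residual sample per batch level that still has data.
    groups = [residuals[batch == level] for level in np.unique(batch)]
    if len(groups) < 2:
        return float("nan"), float("nan"), float("nan")
    statistic, p_value = fligner(*groups)
    return float(statistic), len(groups) - 1, float(p_value)
```

A small p-value suggests that residual spread differs between batches, which is the multiplicative effect this routine looks for.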

Parameters:

Name Type Description Default
data Optional[DataFrame]

Optional DataFrame used directly when provided.

None
idp_matrix Optional[ndarray]

Optional feature matrix used when data is not supplied.

None
subjects Optional[Sequence]

Optional subject IDs used when building a DataFrame from arrays.

None
timepoints Optional[Sequence]

Optional timepoint labels used when building a DataFrame from arrays.

None
batch_name Optional[Sequence]

Optional batch labels used when building a DataFrame from arrays.

None
idp_names Optional[Iterable[str]]

Optional feature names for idp_matrix.

None
covariates Optional[Dict[str, Sequence]]

Optional mapping of name -> sequence for additional covariates.

None
idvar Optional[str]

Column name for subject IDs.

None
batchvar Optional[str]

Column name for batch labels.

None
timevar Optional[str]

Column name for timepoints.

None
fix_eff Optional[Iterable[str]]

Fixed-effect predictors.

None
ran_eff Optional[Iterable[str]]

Random-effect grouping variables.

None
do_zscore bool

Whether to z-score the response per feature.

True
reml bool

Whether to fit MixedLM using REML.

False
verbose bool

Whether to print progress and model formulas.

True

Returns:

Type Description
DataFrame

pd.DataFrame: One row per feature with multiplicative batch-effect test statistics, degrees of freedom, p-values, and method labels.

Raises:

Type Description
KeyError

If fix_eff or ran_eff variables are not found in the assembled DataFrame.

ValueError

If idp_matrix is not 2-D or if input sequence lengths do not match the number of rows.

Notes

  • Per-feature predictor z-scoring is always applied to numeric fix_eff (local to each feature) via _build_fixed_formula_terms.
  • do_zscore=True (default): z-scores the response per feature and uses the z-scored response (z_<feature>) as LHS. Set do_zscore=False to keep original units.
  • reml=False (default): MixedLM fits are run with REML disabled. Pass reml=True to use REML.
  • ran_eff defaults to [idvar] (where idvar defaults to 'subject_ids').
  • Residuals used for the variance test are extracted from the full mixed-model fit. If fitting fails for a given feature, that feature is returned with NaNs.
  • The Fligner test requires at least two groups and non-empty residual samples for each group; otherwise the test is not run for that feature.
  • Rows with NaN responses are dropped per-feature. Features with fewer than 3 retained rows are skipped and returned with NaNs.
  • Because predictors and possibly responses are z-scored per-feature, the residuals used for heteroskedasticity testing are on the z-scored scale when do_zscore=True.