DiagnosticFunctionsLong

A set of diagnostic functions used in longitudinal, test-retest scenarios.

`SubjectOrder_long(idp_matrix, subjects, timepoints, idp_names=None, nPerm=10000, seed=None)`

Compute pairwise Spearman correlations between matched subjects across all ordered timepoint pairs, with permutation-based significance testing.

For each pair of timepoints (order preserved from first appearance), subjects present at both timepoints are matched and Spearman’s rho is computed for each IDP (column). Significance is assessed via permutation testing by shuffling the second-timepoint values within matched pairs (nPerm iterations). P-values use the +1 correction: p = (1 + count_ge) / (1 + valid_null_count).

Args:

idp_matrix : array-like, shape (n_samples, n_idps)
    Numeric matrix of IDP values.
subjects : sequence of length n_samples
    Subject identifiers (matched across timepoints).
timepoints : sequence of length n_samples
    Timepoint labels (order defines comparison order).
idp_names : sequence of length n_idps, optional
    Names of IDPs; defaults to ["idp_1", ...].
nPerm : int, default=10000
    Number of permutations (>=1).
seed : int, optional
    Random seed for reproducibility.

Returns:

Type	Description
`DataFrame`	Columns: ["TimeA","TimeB","IDP","nPairs", "SpearmanRho","NullMeanRho","pValue"]. Rows with fewer than 3 matched pairs have NaN for all statistics.
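
The permutation scheme described above can be sketched for a single IDP and a single timepoint pair as follows. This is a minimal illustration, not the library's implementation: the helper name `spearman_perm_pvalue` is hypothetical, and the one-sided comparison (null rho >= observed rho) is an assumption based on the `count_ge` name in the formula.

```python
import numpy as np
from scipy.stats import spearmanr


def spearman_perm_pvalue(x, y, n_perm=1000, seed=None):
    """Return (observed rho, mean null rho, +1-corrected permutation p-value).

    x and y hold the first- and second-timepoint values of matched subjects.
    Hypothetical helper; one-sided test is an assumption.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    rho_obs = spearmanr(x, y)[0]

    null = np.empty(n_perm)
    for i in range(n_perm):
        # Shuffling the second-timepoint values breaks the subject matching.
        null[i] = spearmanr(x, rng.permutation(y))[0]

    valid = null[~np.isnan(null)]          # ignore undefined correlations
    count_ge = int(np.sum(valid >= rho_obs))
    p_value = (1 + count_ge) / (1 + valid.size)
    return float(rho_obs), float(valid.mean()), float(p_value)
```

With the +1 correction the smallest attainable p-value is 1 / (1 + nPerm), so the result is never exactly zero.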

`WithinSubjVar_long(idp_matrix, subjects, timepoints, idp_names=None)`

Compute within-subject variability (percent) for each IDP across timepoints.

This version uses one consistent metric for all subjects: mean pairwise RPD across all available non-missing measurements.

Output columns:

- subject
- n_obs
- metric_type
- one column per IDP

Parameters

idp_matrix : np.ndarray
    Numeric matrix of shape (n_samples, n_idps).
subjects : Sequence
    Subject identifiers aligned to rows of idp_matrix.
timepoints : Sequence
    Timepoint labels aligned to rows of idp_matrix. Kept for validation.
idp_names : Optional[Sequence[str]]
    Optional list of IDP names. Defaults to idp_1, idp_2, ...

Returns

pd.DataFrame
    One row per subject with within-subject variability values.
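
As an illustration of the metric, the sketch below computes a mean pairwise RPD for one subject and one IDP. It assumes RPD means relative percent difference, defined as 100 * |a - b| / ((|a| + |b|) / 2); that definition, the zero-denominator handling, and the helper name `mean_pairwise_rpd` are assumptions and are not taken from the library.

```python
from itertools import combinations

import numpy as np


def mean_pairwise_rpd(values):
    """Mean relative percent difference over all pairs of non-missing values.

    Hypothetical helper; the RPD definition used here is an assumption.
    """
    vals = np.asarray(values, dtype=float)
    vals = vals[~np.isnan(vals)]           # keep available measurements only
    if vals.size < 2:
        return float("nan")                # no pair to compare

    rpds = []
    for a, b in combinations(vals, 2):
        denom = (abs(a) + abs(b)) / 2.0
        # Both values are zero when denom == 0, so the difference is zero too.
        rpds.append(0.0 if denom == 0 else 100.0 * abs(a - b) / denom)
    return float(np.mean(rpds))
```

For a subject measured as 100 and 110, the single pair gives 100 * 10 / 105, roughly 9.52 percent.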

`MultiVariateBatchDifference_long(idp_matrix, batch, idp_names=None, return_info=False)`

Compute multivariate batch/site differences as Mahalanobis distances of site means from the overall mean, with numerically-stable handling of covariance estimation and inversion.

For each batch (site) this routine:

1. Computes the site mean vector after dropping rows with any NaN across features.
2. Estimates each site's covariance (zero matrix if n_samples_retained <= 1).
3. Averages site covariances to form an overall covariance.
4. Computes the Mahalanobis distance (MD) between each site mean and the overall mean. If the overall covariance is ill-conditioned or singular, the function falls back to an SVD-based pseudoinverse (with tolerance-based truncation) for numeric stability.

Parameters:

Name	Type	Description	Default
`idp_matrix`	`ndarray`	Numeric matrix of features or IDPs with shape `(n_samples, n_features)`.	required
`batch`	`Series \| Sequence`	Batch or site labels for each row.	required
`idp_names`	`Optional[Sequence[str]]`	Optional feature names used for intermediate DataFrame columns.	`None`
`return_info`	`bool`	Whether to also return a diagnostics dictionary containing site categories, retained counts, covariance conditioning metadata, and the averaged covariance matrix.	`False`

Returns:

Type	Description
`DataFrame \| Tuple[DataFrame, Dict[str, Any]]`	A DataFrame with columns `["batch", "mdval"]`, optionally returned alongside a diagnostics dictionary when `return_info` is `True`.

Raises:

Type	Description
`ValueError`	If `idp_matrix` is not 2-D, if `batch` length does not match the number of rows, or if `idp_names` length does not match the number of features.

Notes

- Rows with any NaN across features are dropped when computing a site's mean and covariance.
- If a site has zero retained rows, a warning is emitted and its mean is left as NaN (MD will be NaN).
- If a site has one retained row, its covariance is taken to be the zero matrix.
- The averaged covariance is the simple mean of per-site covariances.
- Mahalanobis distances are computed as sqrt((mu_i - mu_overall)' Σ^{-1} (mu_i - mu_overall)).
- For numerical stability, the function attempts a direct linear solve when the averaged covariance is well-conditioned; otherwise it uses an SVD-based pseudoinverse with a tolerance derived from machine epsilon.
- The function returns NaN for MD if a site's mean vector is NaN.
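
The procedure in these notes can be sketched with NumPy alone. This is a simplified illustration under stated assumptions, not the library code: the helper name `mahalanobis_site_means`, the `cond_limit` threshold, and the choice of the overall mean as the mean of the site means are all hypothetical.

```python
import numpy as np


def mahalanobis_site_means(X, batch, cond_limit=1e12):
    """Mahalanobis distance of each site mean from the overall mean.

    Hypothetical helper. Assumes the overall mean is the mean of site means
    and that cond_limit separates well- from ill-conditioned covariances.
    """
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    sites = list(dict.fromkeys(batch.tolist()))   # order of first appearance
    p = X.shape[1]

    means, covs = [], []
    for s in sites:
        rows = X[batch == s]
        rows = rows[~np.isnan(rows).any(axis=1)]  # drop rows with any NaN
        means.append(rows.mean(axis=0) if rows.shape[0] else np.full(p, np.nan))
        if rows.shape[0] > 1:
            covs.append(np.atleast_2d(np.cov(rows, rowvar=False)))
        else:
            covs.append(np.zeros((p, p)))         # zero matrix for <= 1 row

    S = np.mean(covs, axis=0)                     # simple mean of site covariances
    mu = np.nanmean(np.vstack(means), axis=0)
    cond = np.linalg.cond(S)
    use_solve = np.isfinite(cond) and cond < cond_limit

    out = {}
    for s, m in zip(sites, means):
        if np.isnan(m).any():
            out[s] = float("nan")
            continue
        d = m - mu
        # Direct solve when well-conditioned, SVD-based pseudoinverse otherwise.
        z = np.linalg.solve(S, d) if use_solve else np.linalg.pinv(S) @ d
        out[s] = float(np.sqrt(max(d @ z, 0.0)))
    return out
```

`np.linalg.pinv` truncates small singular values using a relative tolerance, which plays the role of the tolerance-based truncation mentioned above.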

`build_mixed_formula(tbl_in, response_var, fix_eff, ran_eff, batch_vars, force_categorical=(), force_numeric=(), zscore_vars=(), zscore_response=True)`

Build the longitudinal mixed-model formulas used by the pipeline.

Parameters:

Name	Type	Description	Default
`tbl_in`	`DataFrame`	Input DataFrame containing the response and predictor columns.	required
`response_var`	`str`	Response variable to model.	required
`fix_eff`	`Iterable[str]`	Fixed-effect terms to include.	required
`ran_eff`	`Iterable[str]`	Random-effect grouping terms to include.	required
`batch_vars`	`Iterable[str]`	Batch-related terms to include in the full model only.	required
`force_categorical`	`Iterable[str]`	Columns to coerce to categorical dtype.	`()`
`force_numeric`	`Iterable[str]`	Columns to coerce to numeric dtype.	`()`
`zscore_vars`	`Iterable[str]`	Columns to z-score before formula construction.	`()`
`zscore_response`	`bool`	Whether to z-score the response column.	`True`

Returns:

Type	Description
`Tuple[DataFrame, List[str]]`	The transformed DataFrame and a list of formulas ordered as full model, subject-only or null model, and fixed-effects-only model.

`AdditiveEffect_long(data=None, idp_matrix=None, subjects=None, timepoints=None, batch_name=None, idp_names=None, covariates=None, *, idvar=None, batchvar=None, timevar=None, fix_eff=None, ran_eff=None, do_zscore=True, reml=False, verbose=True)`

Test for additive (mean/location) batch effects per feature using mixed models.

For each feature (IDP) this routine:

1. Builds a per-feature local DataFrame including the response, specified fixed predictors and the batch column; numeric predictors are z-scored per-feature.
2. Optionally z-scores the response per-feature when do_zscore=True (default).
3. Fits a full mixed model lhs ~ <fixed_terms> + C(batch) with random effects given by ran_eff (defaults to idvar when ran_eff is None).
4. Fits a reduced mixed model lhs ~ <fixed_terms> (same random structure).
5. Primary test: likelihood-ratio test (LRT) using model log-likelihoods: LR = 2 * (llf_full - llf_reduced), df = n_levels(batch) - 1 (fallback to 1 if unknown). If LRT is not available or fails, falls back to a multivariate Wald test on the batch-related parameters (or a pseudoinverse-based Wald if the covariance is singular).
6. Records test statistic, degrees of freedom, p-value and which method was used: "LRT", "Wald", or "Wald_pinv".

Parameters:

Name	Type	Description	Default
`data`	`Optional[DataFrame]`	Optional DataFrame used directly when provided.	`None`
`idp_matrix`	`Optional[ndarray]`	Optional feature matrix used when `data` is not supplied.	`None`
`subjects`	`Optional[Sequence]`	Optional subject IDs used when building a DataFrame from arrays.	`None`
`timepoints`	`Optional[Sequence]`	Optional timepoint labels used when building a DataFrame from arrays.	`None`
`batch_name`	`Optional[Sequence]`	Optional batch labels used when building a DataFrame from arrays.	`None`
`idp_names`	`Optional[Iterable[str]]`	Optional feature names for `idp_matrix`.	`None`
`covariates`	`Optional[Dict[str, Sequence]]`	Optional mapping of `name -> sequence` for additional covariates.	`None`
`idvar`	`Optional[str]`	Column name for subject IDs.	`None`
`batchvar`	`Optional[str]`	Column name for batch labels.	`None`
`timevar`	`Optional[str]`	Column name for timepoints.	`None`
`fix_eff`	`Optional[Iterable[str]]`	Fixed-effect predictors.	`None`
`ran_eff`	`Optional[Iterable[str]]`	Random-effect grouping variables.	`None`
`do_zscore`	`bool`	Whether to z-score the response per feature.	`True`
`reml`	`bool`	Whether to fit `MixedLM` using REML.	`False`
`verbose`	`bool`	Whether to print progress and model formulas.	`True`

Returns:

Type	Description
`DataFrame`	One row per feature with additive batch-effect test statistics, degrees of freedom, p-values, and method labels.

Raises:

Type	Description
`KeyError`	If `ran_eff` variables are not found in the assembled DataFrame.
`ValueError`	If `idp_matrix` is not 2-D or if input sequence lengths do not match the number of rows.

Notes

- Per-feature predictor z-scoring is always applied to numeric fix_eff (local to each feature) via _build_fixed_formula_terms.
- do_zscore=True (default): z-scores the response per feature and uses the z-scored response (z_<feature>) as LHS. Set do_zscore=False to keep original units.
- reml=False (default): mixed models are fitted with REML disabled. Pass reml=True to use REML.
- Rows with NaN responses are dropped per-feature. Features with fewer than 3 retained rows are skipped and returned with NaNs.
- If the full or reduced mixed fit fails, that feature is reported with NaNs.
- The Wald fallback constructs contrasts for batch-related parameters found in the fitted parameter names and uses the parameter covariance matrix to compute a chi-square statistic; pseudoinverse is used if needed.
- Because predictors are z-scored per-feature, coefficient magnitudes are comparable across features only in the z-scored scale (unless do_zscore=False).
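
The likelihood-ratio step reduces to a few lines once the two log-likelihoods are available. The sketch below assumes they come from full and reduced models fitted by maximum likelihood; the helper name `lrt_pvalue` and the clamping of a negative statistic to zero are assumptions, not documented behaviour.

```python
from scipy.stats import chi2


def lrt_pvalue(llf_full, llf_reduced, n_batch_levels=None):
    """Likelihood-ratio test for the batch term.

    Hypothetical helper. Returns (LR statistic, degrees of freedom, p-value).
    """
    lr = 2.0 * (llf_full - llf_reduced)
    # Assumption: numerical noise can make LR slightly negative; clamp to 0.
    lr = max(lr, 0.0)
    # df = number of batch levels - 1, falling back to 1 when unknown.
    if n_batch_levels is not None and n_batch_levels > 1:
        df = n_batch_levels - 1
    else:
        df = 1
    p_value = float(chi2.sf(lr, df))       # upper tail of chi-square(df)
    return lr, df, p_value
```

For example, log-likelihoods of -100 (full) and -103 (reduced) with three batch levels give LR = 6 on 2 degrees of freedom.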

`MultiplicativeEffect_long(data=None, idp_matrix=None, subjects=None, timepoints=None, batch_name=None, idp_names=None, covariates=None, *, idvar=None, batchvar=None, timevar=None, fix_eff=None, ran_eff=None, do_zscore=True, reml=False, verbose=True)`

Test for multiplicative (variance / heteroskedasticity) batch effects per feature.

For each feature (IDP) this routine:

1. Builds a per-feature local DataFrame including the response, specified fixed predictors and the batch column; numeric predictors are z-scored per-feature.
2. Optionally z-scores the response per-feature when do_zscore=True (default).
3. Fits a full mixed model lhs ~ <fixed_terms> + C(batch) with random effects given by ran_eff (defaults to idvar when ran_eff is None); the residuals from this fit are used for variance comparisons.
4. Tests whether residual variability differs across batches using Fligner's test (a robust, non-parametric test for homogeneity of variances). The reported statistic is the Fligner chi-square and the p-value is from that test.
5. Records test statistic, DF (n_groups - 1), p-value and method "Fligner".

Parameters:

Name	Type	Description	Default
`data`	`Optional[DataFrame]`	Optional DataFrame used directly when provided.	`None`
`idp_matrix`	`Optional[ndarray]`	Optional feature matrix used when `data` is not supplied.	`None`
`subjects`	`Optional[Sequence]`	Optional subject IDs used when building a DataFrame from arrays.	`None`
`timepoints`	`Optional[Sequence]`	Optional timepoint labels used when building a DataFrame from arrays.	`None`
`batch_name`	`Optional[Sequence]`	Optional batch labels used when building a DataFrame from arrays.	`None`
`idp_names`	`Optional[Iterable[str]]`	Optional feature names for `idp_matrix`.	`None`
`covariates`	`Optional[Dict[str, Sequence]]`	Optional mapping of `name -> sequence` for additional covariates.	`None`
`idvar`	`Optional[str]`	Column name for subject IDs.	`None`
`batchvar`	`Optional[str]`	Column name for batch labels.	`None`
`timevar`	`Optional[str]`	Column name for timepoints.	`None`
`fix_eff`	`Optional[Iterable[str]]`	Fixed-effect predictors.	`None`
`ran_eff`	`Optional[Iterable[str]]`	Random-effect grouping variables.	`None`
`do_zscore`	`bool`	Whether to z-score the response per feature.	`True`
`reml`	`bool`	Whether to fit `MixedLM` using REML.	`False`
`verbose`	`bool`	Whether to print progress and model formulas.	`True`

Returns:

Type	Description
`DataFrame`	One row per feature with multiplicative batch-effect test statistics, degrees of freedom, p-values, and method labels.

Raises:

Type	Description
`KeyError`	If `fix_eff` or `ran_eff` variables are not found in the assembled DataFrame.
`ValueError`	If `idp_matrix` is not 2-D or if input sequence lengths do not match the number of rows.

Notes

- Per-feature predictor z-scoring is always applied to numeric fix_eff (local to each feature) via _build_fixed_formula_terms.
- do_zscore=True (default): z-scores the response per feature and uses the z-scored response (z_<feature>) as LHS. Set do_zscore=False to keep original units.
- reml=False (default): MixedLM fits are run with REML disabled. Pass reml=True to change that.
- ran_eff defaults to [idvar] (where idvar defaults to 'subject_ids').
- Residuals used for the variance test are extracted from the full mixed-model fit. If fitting fails for a given feature, that feature is returned with NaNs.
- The Fligner test requires at least two groups and non-empty residual samples for each group; otherwise the test is not run for that feature.
- Rows with NaN responses are dropped per-feature. Features with fewer than 3 retained rows are skipped and returned with NaNs.
- Because predictors and possibly responses are z-scored per-feature, the residuals used for heteroskedasticity testing are on the z-scored scale when do_zscore=True.
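
The variance test itself can be sketched with `scipy.stats.fligner`, given residuals that have already been extracted from a fitted model. The helper name `fligner_by_batch` and the NaN-tuple return for unusable input are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import fligner


def fligner_by_batch(residuals, batch):
    """Fligner test for equal residual variance across batches.

    Hypothetical helper. Returns (statistic, df, p-value), or NaNs when
    fewer than two groups exist or any group has no usable residuals.
    """
    residuals = np.asarray(residuals, dtype=float)
    batch = np.asarray(batch)
    labels = list(dict.fromkeys(batch.tolist()))      # order of first appearance
    groups = [residuals[batch == b] for b in labels]
    groups = [g[~np.isnan(g)] for g in groups]        # drop missing residuals

    if len(groups) < 2 or any(g.size == 0 for g in groups):
        return float("nan"), float("nan"), float("nan")

    stat, p_value = fligner(*groups)
    return float(stat), len(groups) - 1, float(p_value)
```

The degrees of freedom follow the n_groups - 1 rule given in the steps above.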