
DiagnosticFunctionsLong

A set of diagnostic functions used in longitudinal, test-retest scenarios.

SubjectOrder_long(idp_matrix, subjects, timepoints, idp_names=None, nPerm=10000, seed=None)

Compute pairwise Spearman correlations between matched subjects across all ordered timepoint pairs, with permutation-based significance testing.

For each pair of timepoints (order preserved from first appearance), subjects present at both timepoints are matched and Spearman’s rho is computed for each IDP (column). Significance is assessed via permutation testing by shuffling the second-timepoint values within matched pairs (nPerm iterations). P-values use the +1 correction: p = (1 + count_ge) / (1 + valid_null_count).

Args:

idp_matrix : array-like, shape (n_samples, n_idps)
    Numeric matrix of IDP values.
subjects : sequence of length n_samples
    Subject identifiers (matched across timepoints).
timepoints : sequence of length n_samples
    Timepoint labels (order defines comparison order).
idp_names : sequence of length n_idps, optional
    Names of IDPs; defaults to ["idp_1", ...].
nPerm : int, default=10000
    Number of permutations (>=1).
seed : int, optional
    Random seed for reproducibility.

Returns:

pd.DataFrame
    Columns: ["TimeA", "TimeB", "IDP", "nPairs", "SpearmanRho", "NullMeanRho", "pValue"]. Rows with fewer than 3 matched pairs return NaNs for statistics.
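
The permutation scheme described above can be sketched for a single IDP and a single timepoint pair as follows. This is a minimal illustration, not the library's implementation: the helper name `spearman_perm_test` is hypothetical, the test is assumed to be one-sided (`count_ge` counts null correlations at least as large as the observed rho), and the shuffle is assumed to permute second-timepoint values across the matched subjects.

```python
import numpy as np
from scipy.stats import rankdata


def spearman_perm_test(x, y, n_perm=10000, seed=None):
    """Spearman rho for paired values with a +1-corrected permutation p-value.

    Hypothetical helper: returns (rho, null_mean_rho, p_value).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape[0] < 3:
        # Mirrors the documented behaviour: too few pairs yields NaNs.
        return np.nan, np.nan, np.nan

    rng = np.random.default_rng(seed)
    # Spearman's rho is the Pearson correlation of the ranks.
    rx, ry = rankdata(x), rankdata(y)

    def corr(a, b):
        return float(np.corrcoef(a, b)[0, 1])

    rho = corr(rx, ry)
    # Null distribution: break the subject matching by permuting one side.
    null = np.array([corr(rx, rng.permutation(ry)) for _ in range(n_perm)])
    null = null[np.isfinite(null)]  # keep only valid null correlations
    if null.size == 0:
        return rho, np.nan, np.nan

    # Assumption: one-sided comparison against the observed rho.
    count_ge = int(np.sum(null >= rho))
    p_value = (1 + count_ge) / (1 + null.size)
    return rho, float(null.mean()), p_value
```

With the +1 correction the smallest attainable p-value is 1 / (1 + valid_null_count), so the p-value can never be exactly zero.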

WithinSubjVar_long(idp_matrix, subjects, timepoints, idp_names=None)

Compute within-subject variability (percent) for each IDP across timepoints.

This version uses one consistent metric for all subjects: the mean pairwise relative percent difference (RPD) across all available non-missing measurements.

Output columns:

  • subject
  • n_obs
  • metric_type
  • one column per IDP

Parameters

idp_matrix : np.ndarray
    Numeric matrix of shape (n_samples, n_idps).
subjects : Sequence
    Subject identifiers aligned to rows of idp_matrix.
timepoints : Sequence
    Timepoint labels aligned to rows of idp_matrix. Kept for validation.
idp_names : Optional[Sequence[str]]
    Optional list of IDP names. Defaults to idp_1, idp_2, ...

Returns

pd.DataFrame
    One row per subject with within-subject variability values.
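
As an illustration of the metric, the sketch below computes a mean pairwise RPD for one subject and one IDP. The exact RPD formula is not stated on this page, so the definition used here, 100 * |a - b| / mean(|a|, |b|), is an assumption, and the helper name `mean_pairwise_rpd` is hypothetical.

```python
from itertools import combinations
import math


def mean_pairwise_rpd(values):
    """Mean relative percent difference over all pairs of non-missing values.

    Assumed definition: RPD(a, b) = 100 * |a - b| / ((|a| + |b|) / 2).
    Returns NaN when fewer than two usable measurements are available.
    """
    vals = [v for v in values if v is not None and not math.isnan(v)]
    rpds = []
    for a, b in combinations(vals, 2):
        denom = (abs(a) + abs(b)) / 2.0
        if denom > 0:  # skip pairs where both values are zero
            rpds.append(100.0 * abs(a - b) / denom)
    return sum(rpds) / len(rpds) if rpds else float("nan")


# One subject measured at three timepoints, one of them missing.
print(mean_pairwise_rpd([100.0, 110.0, float("nan")]))
```

Because every pair of available measurements contributes, the same metric applies whether a subject has two timepoints or many.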

MultiVariateBatchDifference_long(idp_matrix, batch, idp_names=None, return_info=False)

Compute multivariate batch/site differences as Mahalanobis distances of site means from the overall mean, with numerically-stable handling of covariance estimation and inversion.

For each batch (site) this routine:

  • Computes the site mean vector after dropping rows with any NaN across features.
  • Estimates each site's covariance (zero matrix if n_samples_retained <= 1).
  • Averages site covariances to form an overall covariance.
  • Computes the Mahalanobis distance (MD) between each site mean and the overall mean. If the overall covariance is ill-conditioned or singular, the function falls back to an SVD-based pseudoinverse (with tolerance-based truncation) for numeric stability.

Parameters:

idp_matrix : ndarray
    Numeric matrix of features or IDPs with shape (n_samples, n_features).
batch : Series | Sequence
    Batch or site labels for each row.
idp_names : Optional[Sequence[str]], default=None
    Optional feature names used for intermediate DataFrame columns.
return_info : bool, default=False
    Whether to also return a diagnostics dictionary containing site categories, retained counts, covariance conditioning metadata, and the averaged covariance matrix.

Returns:

pd.DataFrame | tuple[pd.DataFrame, dict[str, Any]]
    A DataFrame with columns ["batch", "mdval"], optionally returned alongside a diagnostics dictionary when return_info is True.

Raises:

ValueError
    If idp_matrix is not 2-D, if batch length does not match the number of rows, or if idp_names length does not match the number of features.

Notes

  • Rows with any NaN across features are dropped when computing a site's mean and covariance.
  • If a site has zero retained rows, a warning is emitted and its mean is left as NaN (MD will be NaN).
  • If a site has one retained row, its covariance is taken to be the zero matrix.
  • The averaged covariance is the simple mean of per-site covariances.
  • Mahalanobis distances are computed as sqrt((mu_i - mu_overall)' Σ^{-1} (mu_i - mu_overall)).
  • For numerical stability, the function attempts a direct linear solve when the averaged covariance is well-conditioned; otherwise it uses an SVD-based pseudoinverse with a tolerance derived from machine epsilon.
  • The function returns NaN for MD if a site's mean vector is NaN.
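
The computation in these notes can be sketched with NumPy as follows. This is a simplified illustration under stated assumptions, not the library's code: the helper name `site_mahalanobis` and the `cond_limit` threshold are hypothetical, the overall mean is assumed to be the mean of all retained rows, and sites with zero retained rows (which the real function reports as NaN with a warning) are not handled.

```python
import numpy as np


def site_mahalanobis(X, batch, cond_limit=1e12):
    """Mahalanobis distance of each site mean from the overall mean.

    Hypothetical helper returning {site_label: distance}.
    """
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    keep = ~np.isnan(X).any(axis=1)  # drop rows with any NaN feature
    X, batch = X[keep], batch[keep]
    sites = list(dict.fromkeys(batch.tolist()))  # first-appearance order
    p = X.shape[1]

    means, covs = {}, []
    for s in sites:
        Xs = X[batch == s]
        means[s] = Xs.mean(axis=0)
        if Xs.shape[0] > 1:
            covs.append(np.atleast_2d(np.cov(Xs, rowvar=False)))
        else:
            covs.append(np.zeros((p, p)))  # single row: zero covariance
    sigma = np.mean(covs, axis=0)  # simple mean of per-site covariances

    # Assumption: overall mean taken over all retained rows.
    mu = X.mean(axis=0)

    # Direct solve when well-conditioned, pseudoinverse otherwise.
    use_solve = np.linalg.cond(sigma) < cond_limit
    sigma_pinv = None if use_solve else np.linalg.pinv(sigma)

    out = {}
    for s in sites:
        d = means[s] - mu
        q = d @ np.linalg.solve(sigma, d) if use_solve else d @ sigma_pinv @ d
        out[s] = float(np.sqrt(max(q, 0.0)))
    return out
```

`np.linalg.pinv` is itself SVD-based and truncates small singular values relative to the largest one, which matches the kind of fallback described above.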

build_mixed_formula(tbl_in, response_var, fix_eff, ran_eff, batch_vars, force_categorical=(), force_numeric=(), zscore_vars=(), zscore_response=True)

Build the longitudinal mixed-model formulas used by the pipeline.

Parameters:

tbl_in : DataFrame
    Input DataFrame containing the response and predictor columns.
response_var : str
    Response variable to model.
fix_eff : Iterable[str]
    Fixed-effect terms to include.
ran_eff : Iterable[str]
    Random-effect grouping terms to include.
batch_vars : Iterable[str]
    Batch-related terms to include in the full model only.
force_categorical : Iterable[str], default=()
    Columns to coerce to categorical dtype.
force_numeric : Iterable[str], default=()
    Columns to coerce to numeric dtype.
zscore_vars : Iterable[str], default=()
    Columns to z-score before formula construction.
zscore_response : bool, default=True
    Whether to z-score the response column.

Returns:

tuple[pd.DataFrame, list[str]]
    The transformed DataFrame and a list of formulas ordered as full model, subject-only or null model, and fixed-effects-only model.

AdditiveEffect_long(data=None, idp_matrix=None, subjects=None, timepoints=None, batch_name=None, idp_names=None, covariates=None, *, idvar=None, batchvar=None, timevar=None, fix_eff=None, ran_eff=None, do_zscore=True, reml=False, verbose=True)

Test for additive (mean/location) batch effects per feature using mixed models.

For each feature (IDP) this routine:

  • Builds a per-feature local DataFrame including the response, specified fixed predictors and the batch column; numeric predictors are z-scored per-feature.
  • Optionally z-scores the response per-feature when do_zscore=True (default).
  • Fits a full mixed model lhs ~ <fixed_terms> + C(batch) with random effects given by ran_eff (defaults to idvar when ran_eff is None).
  • Fits a reduced mixed model lhs ~ <fixed_terms> (same random structure).
  • Primary test: likelihood-ratio test (LRT) using model log-likelihoods: LR = 2 * (llf_full - llf_reduced), df = n_levels(batch) - 1 (fallback to 1 if unknown). If LRT is not available or fails, falls back to a multivariate Wald test on the batch-related parameters (or a pseudoinverse-based Wald if the covariance is singular).
  • Records test statistic, degrees of freedom, p-value and which method was used: "LRT", "Wald", or "Wald_pinv".
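
The primary test in the steps above reduces to a chi-square tail probability on the log-likelihood difference. The sketch below shows only that step, taking the two log-likelihoods as inputs; the helper name `likelihood_ratio_test` is hypothetical, and clipping a slightly negative LR to zero is an assumption rather than documented behaviour.

```python
from scipy.stats import chi2


def likelihood_ratio_test(llf_full, llf_reduced, n_batch_levels=None):
    """Likelihood-ratio test for the batch term.

    Hypothetical helper returning (LR statistic, df, p-value).
    """
    lr = 2.0 * (llf_full - llf_reduced)
    # Assumption: numerical noise can make LR slightly negative; clip at 0.
    lr = max(lr, 0.0)
    # df = number of batch levels minus one, falling back to 1 if unknown.
    if n_batch_levels is not None and n_batch_levels > 1:
        df = n_batch_levels - 1
    else:
        df = 1
    p_value = float(chi2.sf(lr, df))
    return lr, df, p_value


# Example with made-up log-likelihoods for a three-level batch variable.
lr, df, p = likelihood_ratio_test(-100.0, -103.0, n_batch_levels=3)
print(lr, df, p)
```

Because the full and reduced models differ in their fixed effects, the log-likelihoods are comparable only under maximum-likelihood fitting, which is consistent with reml=False being the default.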

Parameters:

data : Optional[DataFrame], default=None
    Optional DataFrame used directly when provided.
idp_matrix : Optional[ndarray], default=None
    Optional feature matrix used when data is not supplied.
subjects : Optional[Sequence], default=None
    Optional subject IDs used when building a DataFrame from arrays.
timepoints : Optional[Sequence], default=None
    Optional timepoint labels used when building a DataFrame from arrays.
batch_name : Optional[Sequence], default=None
    Optional batch labels used when building a DataFrame from arrays.
idp_names : Optional[Iterable[str]], default=None
    Optional feature names for idp_matrix.
covariates : Optional[Dict[str, Sequence]], default=None
    Optional mapping of name -> sequence for additional covariates.
idvar : Optional[str], default=None
    Column name for subject IDs.
batchvar : Optional[str], default=None
    Column name for batch labels.
timevar : Optional[str], default=None
    Column name for timepoints.
fix_eff : Optional[Iterable[str]], default=None
    Fixed-effect predictors.
ran_eff : Optional[Iterable[str]], default=None
    Random-effect grouping variables.
do_zscore : bool, default=True
    Whether to z-score the response per feature.
reml : bool, default=False
    Whether to fit MixedLM using REML.
verbose : bool, default=True
    Whether to print progress and model formulas.

Returns:

pd.DataFrame
    One row per feature with additive batch-effect test statistics, degrees of freedom, p-values, and method labels.

Raises:

KeyError
    If ran_eff variables are not found in the assembled DataFrame.
ValueError
    If idp_matrix is not 2-D or if input sequence lengths do not match the number of rows.

Notes

  • Per-feature predictor z-scoring is always applied to numeric fix_eff (local to each feature) via _build_fixed_formula_terms.
  • do_zscore=True (default): z-scores the response per feature and uses the z-scored response (z_<feature>) as LHS. Set do_zscore=False to keep original units.
  • reml=False (default): mixed models are fitted with REML disabled. Pass reml=True to use REML.
  • Rows with NaN responses are dropped per-feature. Features with fewer than 3 retained rows are skipped and returned with NaNs.
  • If the full or reduced mixed fit fails, that feature is reported with NaNs.
  • The Wald fallback constructs contrasts for batch-related parameters found in the fitted parameter names and uses the parameter covariance matrix to compute a chi-square statistic; pseudoinverse is used if needed.
  • Because predictors are z-scored per-feature, coefficient magnitudes are comparable across features only in the z-scored scale (unless do_zscore=False).
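
The Wald fallback mentioned in these notes can be sketched as a quadratic form in the batch-related coefficients. This is an illustrative sketch, not the library's code: the helper name `wald_chi2` is hypothetical, the inputs are assumed to be the batch coefficients and the matching block of the parameter covariance matrix, and using the number of coefficients as the degrees of freedom in the pseudoinverse case is an assumption.

```python
import numpy as np
from scipy.stats import chi2


def wald_chi2(beta, cov):
    """Multivariate Wald test that all coefficients in beta are zero.

    Hypothetical helper returning (statistic, df, p-value, method).
    """
    beta = np.asarray(beta, dtype=float).ravel()
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    try:
        # W = beta' V^{-1} beta via a linear solve (no explicit inverse).
        stat = float(beta @ np.linalg.solve(cov, beta))
        method = "Wald"
    except np.linalg.LinAlgError:
        # Singular covariance: fall back to the Moore-Penrose pseudoinverse.
        stat = float(beta @ np.linalg.pinv(cov) @ beta)
        method = "Wald_pinv"
    df = beta.size  # assumption: df equals the number of tested coefficients
    return stat, df, float(chi2.sf(stat, df)), method


# Made-up coefficients and covariance for two batch contrasts.
print(wald_chi2([2.0, 0.0], np.eye(2)))
```

Note that `np.linalg.solve` only raises for exactly singular matrices; a nearly singular covariance would need an explicit conditioning check before choosing between the two paths.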

MultiplicativeEffect_long(data=None, idp_matrix=None, subjects=None, timepoints=None, batch_name=None, idp_names=None, covariates=None, *, idvar=None, batchvar=None, timevar=None, fix_eff=None, ran_eff=None, do_zscore=True, reml=False, verbose=True)

Test for multiplicative (variance / heteroskedasticity) batch effects per feature.

For each feature (IDP) this routine:

  • Builds a per-feature local DataFrame including the response, specified fixed predictors and the batch column; numeric predictors are z-scored per-feature.
  • Optionally z-scores the response per-feature when do_zscore=True (default).
  • Fits a full mixed model lhs ~ <fixed_terms> + C(batch) with random effects given by ran_eff (defaults to idvar when ran_eff is None) — the residuals from this fit are used for variance comparisons.
  • Tests whether residual variability differs across batches using Fligner's test (a robust, non-parametric test for homogeneity of variances). The reported statistic is the Fligner chi-square and the p-value is from that test.
  • Records test statistic, DF (n_groups - 1), p-value and method "Fligner".
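
The variance comparison in the steps above can be sketched with `scipy.stats.fligner`, applied to residuals grouped by batch. The sketch skips the mixed-model fit and uses synthetic residuals instead; the helper name `fligner_by_batch` and the simulated data are illustrative assumptions.

```python
import numpy as np
from scipy.stats import fligner


def fligner_by_batch(residuals, batch):
    """Fligner-Killeen test for equal residual variance across batches.

    Hypothetical helper returning (statistic, df, p-value); NaNs are returned
    when fewer than two groups exist or any group has no residuals.
    """
    residuals = np.asarray(residuals, dtype=float)
    batch = np.asarray(batch)
    labels = list(dict.fromkeys(batch.tolist()))  # first-appearance order
    groups = [residuals[batch == b] for b in labels]
    groups = [g[~np.isnan(g)] for g in groups]  # drop missing residuals
    if len(groups) < 2 or any(g.size == 0 for g in groups):
        return np.nan, np.nan, np.nan
    stat, p_value = fligner(*groups)
    return float(stat), len(groups) - 1, float(p_value)


# Synthetic residuals: batch B has five times the spread of batch A.
rng = np.random.default_rng(0)
resid = np.concatenate([rng.normal(0, 1, 200), rng.normal(0, 5, 200)])
labels = np.array(["A"] * 200 + ["B"] * 200)
stat, df, p = fligner_by_batch(resid, labels)
print(stat, df, p)
```

Fligner's test is rank-based, so it is less sensitive to non-normal residuals than Bartlett's test.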

Parameters:

data : Optional[DataFrame], default=None
    Optional DataFrame used directly when provided.
idp_matrix : Optional[ndarray], default=None
    Optional feature matrix used when data is not supplied.
subjects : Optional[Sequence], default=None
    Optional subject IDs used when building a DataFrame from arrays.
timepoints : Optional[Sequence], default=None
    Optional timepoint labels used when building a DataFrame from arrays.
batch_name : Optional[Sequence], default=None
    Optional batch labels used when building a DataFrame from arrays.
idp_names : Optional[Iterable[str]], default=None
    Optional feature names for idp_matrix.
covariates : Optional[Dict[str, Sequence]], default=None
    Optional mapping of name -> sequence for additional covariates.
idvar : Optional[str], default=None
    Column name for subject IDs.
batchvar : Optional[str], default=None
    Column name for batch labels.
timevar : Optional[str], default=None
    Column name for timepoints.
fix_eff : Optional[Iterable[str]], default=None
    Fixed-effect predictors.
ran_eff : Optional[Iterable[str]], default=None
    Random-effect grouping variables.
do_zscore : bool, default=True
    Whether to z-score the response per feature.
reml : bool, default=False
    Whether to fit MixedLM using REML.
verbose : bool, default=True
    Whether to print progress and model formulas.

Returns:

pd.DataFrame
    One row per feature with multiplicative batch-effect test statistics, degrees of freedom, p-values, and method labels.

Raises:

KeyError
    If fix_eff or ran_eff variables are not found in the assembled DataFrame.
ValueError
    If idp_matrix is not 2-D or if input sequence lengths do not match the number of rows.

Notes

  • Per-feature predictor z-scoring is always applied to numeric fix_eff (local to each feature) via _build_fixed_formula_terms.
  • do_zscore=True (default): z-scores the response per feature and uses the z-scored response (z_<feature>) as LHS. Set do_zscore=False to keep original units.
  • reml=False (default): MixedLM fits are run with REML disabled. Pass reml=True to change that.
  • ran_eff defaults to [idvar] (where idvar defaults to 'subject_ids').
  • Residuals used for the variance test are extracted from the full mixed-model fit. If fitting fails for a given feature, that feature is returned with NaNs.
  • The Fligner test requires at least two groups and non-empty residual samples for each group; otherwise the test is not run for that feature.
  • Rows with NaN responses are dropped per-feature. Features with fewer than 3 retained rows are skipped and returned with NaNs.
  • Because predictors and possibly responses are z-scored per-feature, the residuals used for heteroskedasticity testing are on the z-scored scale when do_zscore=True.