HarmonisationFunctions

A set of harmonisation functions that are frequently used in the literature. We will continue to expand these with methods we develop.

Script containing self contained harmonisation functions that can be used in conjunction with the diagnostic tools:

`design_mat(mod, numerical_covariates, batch_levels)`

Construct design matrix for ComBat, ensuring batch levels are in the correct order and handling numerical covariates.

`aprior(delta_hat)`

Calculate the aprior parameter for the inverse gamma distribution based on the method of moments.

`bprior(delta_hat)`

Calculate the bprior parameter for the inverse gamma distribution based on the method of moments.

`postmean(g_hat, g_bar, n, d_star, t2)`

Calculate the posterior mean for the batch effect parameters.

`postvar(sum2, n, a, b)`

Calculate the posterior variance for the batch effect parameters.

`itSol(sdat_batch, gamma_hat, delta_hat, gamma_bar, t2, a, b, conv=0.001, return_hist=False)`

Iteratively solve for posterior mean and variance of batch-effect parameters.

If return_hist=True, also returns: count : number of iterations hist : dictionary storing EB values at each iteration

`combat(data, batch, mod=[], parametric=True, DeltaCorrection=True, UseEB=True, ReferenceBatch=None, RegressCovariates=False, GammaCorrection=True, covbat_mode=False, return_priors=False)`

Run ComBat harmonisation on the data and return the harmonized data.

This version accepts numpy arrays or pandas DataFrame/Series for data, batch, and mod. If a DataFrame is supplied, columns are treated as samples (so data.shape == (n_features, n_samples)). The function will auto-transpose data or mod if it detects that samples were provided as rows. The returned bayesdata is the same type as input data (DataFrame -> DataFrame, ndarray -> ndarray).

Note: helper functions aprior, bprior, itSol must be defined in scope.

Parameters:

Name	Type	Description	Default
`data`	`array or DataFrame`	The data matrix to be harmonized, with shape (n_features, n_samples).	required
`batch`	`array or Series`	A vector of batch labels for each sample, with length n_samples.	required
`mod`	`array or DataFrame`	An optional design matrix of covariates to adjust for, with shape (n_samples, n_covariates).	`[]`
`parametric`	`bool`	Whether to use parametric adjustments. Default is True.	`True`
`DeltaCorrection`	`bool`	Whether to apply delta (scale) correction. Default is True.	`True`
`UseEB`	`bool`	Whether to use empirical Bayes adjustments. Default is True.	`True`
`ReferenceBatch`	`str or int`	If provided, the name or index of the reference batch to use for fitting priors. Default is None (no reference).	`None`
`RegressCovariates`	`bool`	Whether to regress out covariate effects before harmonisation. Default is False.	`False`
`GammaCorrection`	`bool`	Whether to apply gamma (mean) correction. Default is True.	`True`
`covbat_mode`	`bool`	Whether to run in CovBat mode which includes additional covariance correction steps. Default is False.	`False`
`return_priors`	`bool`	Whether to return the estimated parameters from the ComBat model along with the harmonized data. Default is False.	`False`

Returns:

Name	Type	Description
`bayesdata`	`array or DataFrame`	The harmonized data, in the same format as the input data.
`priors`	`(dict, optional)`	A dictionary containing the estimated parameters from the ComBat model, including: - gamma_hat: raw batch effect mean estimates (n_batch, n_features) - delta_hat: raw batch effect variance estimates (n_batch, n_features) - gamma_star: empirical Bayes adjusted batch effect means (n_batch, n_features) - delta_star: empirical Bayes adjusted batch effect variances (n_batch, n_features) - gamma_bar: mean of gamma_hat across batches (n_batch,) - t2: variance of gamma_hat across batches (n_batch,) - a_prior: aprior parameters for each batch (n_batch,) - b_prior: bprior parameters for each batch (n_batch,)

Note: If using this version of ComBat, please cite:

Jean-Philippe Fortin, Drew Parker, Birkan Tunc, Takanori Watanabe, Mark A Elliott, Kosha Ruparel, David R Roalf, Theodore D Satterthwaite, Ruben C Gur, Raquel E Gur, Robert T Schultz, Ragini Verma, Russell T Shinohara. Harmonisation Of Multi-Site Diffusion Tensor Imaging Data. NeuroImage, 161, 149-170, 2017 Jean-Philippe Fortin, Nicholas Cullen, Yvette I. Sheline, Warren D. Taylor, Irem Aselcioglu, Philip A. Cook, Phil Adams, Crystal Cooper, Maurizio Fava, Patrick J. McGrath, Melvin McInnis, Mary L. Phillips, Madhukar H. Trivedi, Myrna M. Weissman, Russell T. Shinohara. Harmonisation of cortical thickness measurements across scanners and sites. NeuroImage, 167, 104-120, 2018 W. Evan Johnson and Cheng Li, Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 8(1):118-127, 2007.

`combat_modular(data, batch, mod=None, mean_model='ols', gam_opts=None, prior_mode='global', prior_weight_methods=None, prior_weight_opts=None, parametric=True, DeltaCorrection=True, UseEB=True, ReferenceBatch=None, RegressCovariates=False, GammaCorrection=True, covbat_mode=False, return_priors=False, **kwargs)`

Modular ComBat entrypoint.

This function is a modular wrapper around the existing combat implementation. By default (mean_model='ols' and prior_mode='global') it calls the original combat function to preserve exact baseline behaviour. Experimental modular paths (GAM mean models, local priors) will be implemented here incrementally.

Parameters (additional options):
    - `mean_model`: 'ols' (default) or 'gam' to fit flexible spline-based covariate effects.
    - `prior_mode`: 'global' (default) or 'local' to enable feature-wise local priors.
    - `prior_weight_methods`: sequence of weight method names (see `_construct_local_priors`).
    - `prior_weight_opts`: dict of options passed to `_construct_local_priors` (method weights, spatial coords, directional params).

Returns:
    - If `return_priors=True`, a dict with keys `bayesdata`, `B_hat`, `priors` (includes `local_priors` when requested).
    - Otherwise returns the harmonized data array.

`it_sol(sdat, g_hat, d_hat, g_bar, t2, a, b, conv=0.0001)`

Iteratively solve for the posterior mean and variance of the batch effect parameters. This version is used by CovBat and taken from: https://github.com/andy1764/CovBat_Harmonisation Chen, A. A., Beer, J. C., Tustison, N. J., Cook, P. A., Shinohara, R. T., Shou, H., & Initiative, T. A. D. N. (2022). Mitigating site effects in covariance for machine learning in neuroimaging data. Human Brain Mapping, 43(4), 1179–1195. https://doi.org/10.1002/hbm.25688)

`lme_harmonisation(data, batch, mod, variable_names)`

Fits a per feature linear mixed model to harmonize data across batches while adjusting for covariates. This function is an alternative to ComBat that uses mixed effects modeling to estimate and remove batch effects.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame or array`	The data matrix to be harmonized, with shape (n_samples, n_features).	required
`batch`	`Series or array`	A vector of batch labels for each sample, with length n_samples.	required
`mod`	`DataFrame or array`	A design matrix of covariates to adjust for, with shape (n_samples, n_covariates).	required
`variable_names`	`list of str`	A list of column names corresponding to the covariates in `mod`, used for formula construction.	required

Returns: np.array: A harmonized data matrix of the same shape as the input data, with batch effects removed according to the fitted mixed models.

Note

This function fits a separate linear mixed model for each feature (column) in the data matrix, with the batch variable as a random effect and the covariates as fixed effects. The residuals from these models are returned as the harmonized data.

lme_iqm_harmonise(data, idp_list, qc_list, preserve_covars=('age',), adjust_covars=('timepoint',), reverse_guard_covars=None, categorical_covars=('timepoint', 'batch'), batch_col='batch', subject_col='subjectID', age_source_col='Final_Age', batch_source_col='Site', age_col='age', iqm_variance=95, p_thr=0.05, max_qcs=float('inf'), apply_pca=True, outfilename='iqm_harmonised.csv', summary_csv='qc_selection_summary.csv', additive_detail_csv='qc_selection_additive_details.csv', multiplicative_detail_csv='qc_selection_multiplicative_details.csv', selected_additive_qcs_csv='selected_additive_qcs_by_volume.csv', selected_multiplicative_qcs_csv='selected_multiplicative_qcs_by_volume.csv', allow_ols_fallback=True, optimizer_order=('lbfgs', 'powell', 'cg', 'nm'), maxiter=2000, verbose_model_fits=False)

Harmonise IDPs using IQM/QC-based additive and multiplicative correction.

The function selects QC features that: - are associated with batch/site effects, - do not encode preserve covariates, - do not encode the response proxy, - and pass reverse-guard tests when enabled.

Depending on apply_pca, QC variables are either: - z-scored and reduced with PCA, or - used directly after standardisation.

Parameters

data : pandas.DataFrame Main input dataframe containing IDPs, QC metrics, subject IDs, and covariates. idp_list : Sequence[str] List of IDP columns to harmonise. qc_list : Sequence[str] List of QC metric columns used as candidate harmonisation regressors. preserve_covars : Sequence[str], default=("age",) Covariates that QC variables are not allowed to encode. adjust_covars : Sequence[str], default=("timepoint",) Covariates included as adjustment variables in the models. reverse_guard_covars : Sequence[str] | None, default=None Optional subset of preserve covariates to test with reverse-response guards. If None, numeric preserve covariates are used automatically. categorical_covars : Sequence[str], default=("timepoint", "batch") Covariates treated as categorical in regression formulas. batch_col : str, default="batch" Batch/site variable used as the harmonisation target. subject_col : str, default="subjectID" Subject identifier used for random intercepts in MixedLM. age_source_col : str, default="Final_Age" Source column used to create age_col if age_col is missing. batch_source_col : str, default="Site" Source column used to create batch_col if batch_col is missing. age_col : str, default="age" Working age column used in the models. iqm_variance : float, default=95 Percentage of QC variance to retain when PCA is applied. p_thr : float, default=0.05 P-value threshold used for QC selection and multiplicative testing. max_qcs : float, default=inf Maximum number of QCs allowed per volume. apply_pca : bool, default=True If True, run PCA on z-scored QC variables and retain enough PCs to explain iqm_variance percent variance. If False, use z-scored raw QCs. outfilename : str, default="iqm_harmonised.csv" Output CSV for the harmonised dataframe. summary_csv : str, default="qc_selection_summary.csv" Output CSV for the per-volume summary table. additive_detail_csv : str, default="qc_selection_additive_details.csv" Output CSV for additive QC selection diagnostics. multiplicative_detail_csv : str, default="qc_selection_multiplicative_details.csv" Output CSV for multiplicative QC selection diagnostics. selected_additive_qcs_csv : str, default="selected_additive_qcs_by_volume.csv" Output CSV listing selected additive QCs by volume. selected_multiplicative_qcs_csv : str, default="selected_multiplicative_qcs_by_volume.csv" Output CSV listing selected multiplicative QCs by volume. allow_ols_fallback : bool, default=True If MixedLM fitting fails, fall back to OLS. optimizer_order : Sequence[str], default=("lbfgs", "powell", "cg", "nm") Optimizers attempted in order for MixedLM fitting. maxiter : int, default=2000 Maximum number of optimizer iterations for MixedLM. verbose_model_fits : bool, default=False If True, print detailed model fitting diagnostics.

Returns

data : pandas.DataFrame Input dataframe with harmonised_<IDP> columns added. qc_selection : dict[str, Any] Summary of selected QCs and correction decisions. add_detail_df : pandas.DataFrame Additive QC selection diagnostic table. mult_detail_df : pandas.DataFrame Multiplicative QC selection diagnostic table. summary_df : pandas.DataFrame Per-volume summary table.

Examples

df = pd.read_csv("test_data/alldata.csv") idp_list = pd.read_csv("test_data/IDP_list.csv", header=None).iloc[:, 0].dropna().astype(str).str.strip().tolist() iqm_list = pd.read_csv("test_data/IQM_list.csv", header=None).iloc[:, 0].dropna().astype(str).str.strip().tolist() data_out, qc_selection, add_detail_df, mult_detail_df, summary_df = lme_iqm_harmonise( ... data=df, ... idp_list=idp_list, ... qc_list=iqm_list, ... preserve_covars=("age",), ... adjust_covars=("timepoint",), ... reverse_guard_covars=("age",), ... categorical_covars=("timepoint", "scan_session"), ... batch_col="scan_session", ... subject_col="subject", ... age_source_col="age", ... batch_source_col="scan_session", ... age_col="age", ... iqm_variance=95, ... p_thr=0.05, ... apply_pca=True, ... verbose_model_fits=True, ... )

Examples

Local priors: directional bias weighting

from DiagnoseHarmonisation.HarmonisationFunctions import combat_modular

# data: (n_features, n_samples) array
# batch: length n_samples vector of batch labels
# coords: optional feature coordinates used for spatial weighting (n_features, 2)
out = combat_modular(
    data,
    batch,
    mean_model='ols',
    prior_mode='local',
    prior_weight_methods=['correlation_similarity', 'directional_bias'],
    prior_weight_opts={
        'method_weights': {'correlation_similarity': 0.4, 'directional_bias': 0.6},
        'min_effective': 5,
        'directional': {'dir_sigma': 0.5, 'dir_power': 1.5, 'use_global_sign': True}
    },
    return_priors=True,
)
priors = out['priors']
local = priors['local_priors']
weights = local['weights']  # shape (n_batch, n_features, n_features)

GAM mean model with spline fallback

out = combat_modular(data, batch, mod=covariates_df, mean_model='gam', gam_opts={'n_splines':8})