Harmonisation Methods
This document provides an overview of the harmonisation methods implemented in the DiagnoseHarmonisation package.
Overview
Batch effects are systematic biases introduced by differences in data collection equipment, scanners, protocols, or sites. Harmonisation (or harmonization) aims to remove or reduce these batch effects while preserving biological or phenotypic signals of interest.
The methods below range from simple linear models to empirical Bayes and quality-metric-driven approaches.
1. Linear Modelling
The simplest harmonisation approach models batch effects as fixed or random effects in a linear regression framework.
Method: Fits a linear model of the form \(Y = X\beta + Z_b u_b + \epsilon\),
where \(Y\) is the data, \(X\) contains covariates of interest, \(Z_b\) encodes batch membership, \(u_b\) are batch random effects, and \(\epsilon\) is residual error.
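A minimal sketch of this idea, assuming batch is treated as a fixed effect and fitted by ordinary least squares with NumPy. The function name `linear_harmonise` is hypothetical and is not part of the package API; a random-effects fit would need a mixed-model library instead.

```python
import numpy as np

def linear_harmonise(Y, X, batch):
    """Remove batch offsets estimated jointly with covariate effects by OLS.

    Y: (n_samples, n_features) data, X: (n_samples, n_covariates) covariates,
    batch: (n_samples,) batch labels. Batch is treated as a fixed effect.
    """
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float)
    labels, codes = np.unique(np.asarray(batch), return_inverse=True)
    Z = np.eye(len(labels))[codes]            # one-hot batch membership (Z_b)
    design = np.hstack([Z, X])                # batch intercepts + covariates
    coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
    batch_fit = Z @ coef[: len(labels)]       # estimated batch term per sample
    # Subtract the batch term, then add back its average so the overall
    # mean of each feature is unchanged.
    return Y - batch_fit + batch_fit.mean(axis=0)


# Toy example: one feature, one covariate, two batches with offsets +1 and -1.
x = np.array([0.0, 1.0, 2.0, 0.0, 1.0, 2.0])
batch = np.array(["A", "A", "A", "B", "B", "B"])
offset = np.where(batch == "A", 1.0, -1.0)
Y = (2.0 * x + offset)[:, None]
Y_h = linear_harmonise(Y, x[:, None], batch)
```

Because the covariate and batch terms are estimated in the same fit, the covariate slope is left in the harmonised data while the batch offsets are removed.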
Advantages:
- Highly interpretable and transparent
- Computationally efficient
- No hyperparameters to tune
- Direct inference on covariate effects
Limitations:
- Assumes linear relationships
- May not capture complex batch interactions
- Assumes homogeneous variance across batches
Use case: A quick baseline harmonisation, or when interpretability is paramount.
2. ComBat
ComBat is a widely used empirical Bayes method for batch harmonisation that originated in bioinformatics (Johnson et al., 2007).
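For orientation, the location-scale model of Johnson et al. (2007) can be written as follows. The index names (\(i\) for batch, \(j\) for sample, \(g\) for feature) follow that paper and are not defined elsewhere in this document:

\[Y_{ijg} = \alpha_g + X_{ij}\beta_g + \gamma_{ig} + \delta_{ig}\,\epsilon_{ijg}\]

Here \(\gamma_{ig}\) is the additive (location) batch effect and \(\delta_{ig}\) the multiplicative (scale) batch effect. With the standardised data \(Z_{ijg} = (Y_{ijg} - \hat\alpha_g - X_{ij}\hat\beta_g)/\hat\sigma_g\), the adjusted values are

\[Y^{*}_{ijg} = \frac{\hat\sigma_g}{\hat\delta^{*}_{ig}}\left(Z_{ijg} - \hat\gamma^{*}_{ig}\right) + \hat\alpha_g + X_{ij}\hat\beta_g\]

where \(\hat\gamma^{*}_{ig}\) and \(\hat\delta^{*}_{ig}\) are the empirical Bayes estimates of the batch effects.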
Method: ComBat uses empirical Bayes to estimate and correct batch-specific location (mean) and scale (variance) shifts:
1. Fits a parametric model with batch-specific parameters.
2. Uses empirical Bayes shrinkage to estimate these parameters.
3. Adjusts the data by subtracting the estimated batch means and dividing by the estimated batch standard deviations.
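The three steps above can be sketched in a few lines of NumPy. This is a deliberately simplified illustration, not the package's implementation: it omits covariates, shrinks only the batch means (a single non-iterated step using the normal-prior posterior mean), and leaves batch variances unshrunk. The name `combat_sketch` is hypothetical.

```python
import numpy as np

def combat_sketch(Y, batch):
    """Simplified location-scale correction with shrinkage of batch means.

    Y: (n_samples, n_features), batch: (n_samples,) labels.
    Needs at least 2 samples per batch and at least 2 features.
    """
    Y = np.asarray(Y, dtype=float)
    batch = np.asarray(batch)
    labels = np.unique(batch)

    # Step 1: standardise each feature using the grand mean and the
    # pooled within-batch standard deviation.
    grand_mean = Y.mean(axis=0)
    resid = Y.copy()
    for b in labels:
        m = batch == b
        resid[m] -= Y[m].mean(axis=0)
    pooled_sd = np.sqrt((resid ** 2).mean(axis=0))
    Z = (Y - grand_mean) / pooled_sd

    out = np.empty_like(Y)
    for b in labels:
        m = batch == b
        n_b = m.sum()
        gamma_hat = Z[m].mean(axis=0)             # per-feature batch mean
        delta2_hat = Z[m].var(axis=0, ddof=1)     # per-feature batch variance
        # Step 2: shrink each feature's batch mean towards the batch-wide
        # average across features (prior mean gamma_bar, prior variance tau2).
        gamma_bar = gamma_hat.mean()
        tau2 = gamma_hat.var(ddof=1)
        gamma_star = (n_b * tau2 * gamma_hat + delta2_hat * gamma_bar) / (
            n_b * tau2 + delta2_hat
        )
        # Step 3: remove the batch location, rescale by the batch standard
        # deviation, and map back to the original units.
        out[m] = (Z[m] - gamma_star) / np.sqrt(delta2_hat) * pooled_sd + grand_mean
    return out
```

Shrinkage matters most when batches are small: a noisy per-feature batch mean is pulled towards the average shift seen across all features in that batch.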
Key features:
- Handles variable batch sizes
- Empirical Bayes shrinkage prevents overfitting
- Can incorporate biological covariates (mod variable) to avoid removing true signal
- Fast and widely benchmarked
Advantages:
- Well-established and validated in neuroimaging studies
- Preserves biological covariates
- Computationally efficient
Limitations:
- Assumes location-scale model (mean and variance only)
- Homogeneity assumption across features
- May over-correct in small-sample scenarios
Use case: Standard choice for multi-site neuroimaging studies.
3. ComBat with GAM
ComBat-GAM extends ComBat by replacing the parametric mean model with a Generalized Additive Model (GAM).
Method:
1. Uses GAM to model covariate effects (more flexible than linear)
2. Combines flexible covariate adjustment with ComBat's batch correction
3. Estimates batch effects on the GAM residuals
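The following sketch illustrates the same sequence under strong simplifications. A truncated-power cubic spline fitted by least squares stands in for a real GAM, there is a single covariate, and the batch correction is a plain per-batch location-scale adjustment without empirical Bayes shrinkage. The names `cubic_spline_basis` and `combat_gam_sketch` are hypothetical.

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Truncated-power cubic spline basis (no intercept column)."""
    cols = [x, x ** 2, x ** 3] + [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

def combat_gam_sketch(Y, covariate, batch, n_knots=3):
    """Smooth covariate fit, then per-batch location-scale correction of residuals.

    Y: (n_samples, n_features), covariate: (n_samples,), batch: (n_samples,).
    """
    Y = np.asarray(Y, dtype=float)
    batch = np.asarray(batch)
    c = np.asarray(covariate, dtype=float)
    c = (c - c.min()) / (c.max() - c.min())       # rescale to [0, 1] for conditioning
    knots = np.quantile(c, np.linspace(0, 1, n_knots + 2)[1:-1])
    B = cubic_spline_basis(c, knots)

    labels, codes = np.unique(batch, return_inverse=True)
    k = len(labels)
    Z = np.eye(k)[codes]                          # one-hot batch membership
    # Fit batch intercepts and the smooth jointly so that batch differences
    # do not leak into the estimated covariate curve.
    coef, *_ = np.linalg.lstsq(np.hstack([Z, B]), Y, rcond=None)
    smooth = B @ coef[k:] + (Z @ coef[:k]).mean(axis=0)   # curve + overall level

    # Residuals now contain batch effects plus noise.
    resid = Y - smooth
    centred = resid.copy()
    for b in labels:
        m = batch == b
        centred[m] -= resid[m].mean(axis=0)
    pooled_sd = np.sqrt((centred ** 2).mean(axis=0))

    out = np.empty_like(Y)
    for b in labels:
        m = batch == b
        scaled = centred[m] / centred[m].std(axis=0, ddof=1) * pooled_sd
        out[m] = scaled + smooth[m]               # add the covariate curve back
    return out
```

A production implementation would typically use penalised splines with automatic smoothness selection, which this sketch does not attempt.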
Key features:
- Captures non-linear covariate relationships
- Spline basis functions for smooth covariate effects
- Falls back to linear model if spline fitting fails
Advantages:
- More flexible than ComBat for non-linear covariate patterns
- Maintains empirical Bayes batch correction
- Graceful degradation (falls back to linear)
Limitations:
- More complex and less interpretable than linear/ComBat
- Requires more data to accurately estimate splines
- Slightly slower than parametric ComBat
Use case: When covariate effects are expected to be non-linear (e.g., age, scanner field strength).
4. CovBat
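CovBat (Correcting Covariance Batch Effects; Chen et al., 2022) was proposed to remove site effects in the covariance between features as well as in their means and variances. As described here, it refines ComBat by jointly modelling covariate and batch effects without requiring a separate mod variable.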
Method:
1. Jointly estimates covariate and batch effects in a unified model
2. Uses empirical Bayes for improved robustness
3. Solves for optimal ridge regression weights
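To make the joint ridge formulation in the steps above concrete, here is a small sketch that fits the intercept, covariates, and batch indicators in one penalised regression and then subtracts the batch part. It follows the description in this section only. It is not taken from a published reference implementation, it penalises the batch coefficients alone (an assumption), and the name `ridge_joint_harmonise` and the `lam` parameter are hypothetical.

```python
import numpy as np

def ridge_joint_harmonise(Y, X, batch, lam=1.0):
    """Jointly fit covariates and batch by ridge regression, then remove batch.

    Y: (n_samples, n_features), X: (n_samples, n_covariates),
    batch: (n_samples,) labels, lam: ridge penalty on batch coefficients.
    """
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float)
    labels, codes = np.unique(np.asarray(batch), return_inverse=True)
    k = len(labels)
    Z = np.eye(k)[codes]                          # one-hot batch membership
    ones = np.ones((Y.shape[0], 1))
    D = np.hstack([ones, X, Z])                   # unified design matrix

    # Penalise only the batch coefficients. With lam > 0 this also makes
    # the system solvable despite the intercept/indicator redundancy.
    penalty = np.zeros(D.shape[1])
    penalty[-k:] = lam
    w = np.linalg.solve(D.T @ D + np.diag(penalty), D.T @ Y)

    return Y - Z @ w[-k:]                         # subtract estimated batch part
```

Larger values of `lam` shrink the estimated batch offsets towards zero, which trades some residual batch effect for stability when batches are small.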
Key features:
- Internally handles covariate adjustment
- No separate mod matrix required
- Provides confidence intervals for batch effect estimates
- Computationally stable ridge regression approach
Advantages:
- Simpler interface than ComBat (no mod matrix)
- Better calibrated confidence intervals
- Unified treatment of covariates and batch
Limitations:
- Slightly slower than standard ComBat
- Less widely used (newer method)
Use case: When you want integrated covariate and batch handling without manual mod matrix construction.
5. Reference-Based ComBat
Reference-Based ComBat adapts ComBat by designating one or more sites as a "reference" against which the other sites are harmonised.
Method:
1. Treats reference site(s) as gold standard
2. Estimates batch effects of non-reference sites relative to reference
3. Applies ComBat correction targeting reference distribution
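The core of the idea can be illustrated with a sketch that maps each non-reference batch onto the per-feature mean and standard deviation of the reference samples and leaves the reference data untouched. It ignores covariates and empirical Bayes shrinkage, so it is only an approximation of the steps above; `reference_align` is a hypothetical name.

```python
import numpy as np

def reference_align(Y, batch, reference):
    """Align non-reference batches to the reference mean and SD per feature.

    Y: (n_samples, n_features), batch: (n_samples,) labels,
    reference: one label or a list of labels treated as the reference.
    """
    Y = np.asarray(Y, dtype=float)
    batch = np.asarray(batch)
    is_ref = np.isin(batch, np.atleast_1d(reference))
    ref_mean = Y[is_ref].mean(axis=0)
    ref_sd = Y[is_ref].std(axis=0, ddof=1)

    out = Y.copy()                                # reference rows stay as they are
    for b in np.unique(batch[~is_ref]):
        m = batch == b
        z = (Y[m] - Y[m].mean(axis=0)) / Y[m].std(axis=0, ddof=1)
        out[m] = z * ref_sd + ref_mean            # target the reference distribution
    return out
```

Passing several labels as `reference` pools those sites into a single target distribution.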
Key features:
- Preserves reference site characteristics
- Useful when one acquisition protocol is known to be high-quality
- Can use multiple reference sites
Advantages:
- Meaningful reference frame (e.g., gold-standard scanner)
- Avoids blending all sites into an average
- Interpretable target distribution
Limitations:
- Requires prior designation of reference site(s)
- May not work well if reference is atypical
- Assumes reference effects are minimal
Use case: Multi-site studies where one site is known to have optimal data quality.
6. IQM-Based Harmonisation
IQM-Based Harmonisation (Image Quality Metric harmonisation) leverages data quality metrics to guide harmonisation.
Method:
1. Computes Image Quality Metrics (IQMs) for each scan (e.g., contrast-to-noise, signal stability)
2. Identifies scanner/site effects using the IQMs
3. Adjusts data based on IQM-derived quality scores
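As a rough illustration of adjusting data using quality metrics, the sketch below regresses each feature on the covariates and the mean-centred IQMs together, then removes only the IQM-explained component. A linear model is used here purely as a stand-in: the cited BARTharm work uses Bayesian non-parametric regression, and the function name `remove_iqm_effect` is hypothetical.

```python
import numpy as np

def remove_iqm_effect(Y, X, iqm):
    """Remove the linear IQM-explained component while keeping covariate effects.

    Y: (n_samples, n_features), X: (n_samples, n_covariates),
    iqm: (n_samples, n_metrics) quality metrics, one row per scan.
    """
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float)
    Q = np.asarray(iqm, dtype=float)
    Qc = Q - Q.mean(axis=0)                       # centre so feature means are kept
    ones = np.ones((Y.shape[0], 1))
    D = np.hstack([ones, X, Qc])
    coef, *_ = np.linalg.lstsq(D, Y, rcond=None)
    q = Q.shape[1]
    return Y - Qc @ coef[-q:]                     # subtract only the IQM part
```

Fitting covariates and IQMs in the same model keeps covariate signal from being attributed to scan quality when the two are correlated.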
Key features:
- Incorporates domain knowledge about quality metrics
- Can be combined with other methods
- Adaptive to acquisition variations
- Physically interpretable
Advantages:
- Links harmonisation to measurable quality indicators
- Handles non-linear quality-related batch effects
- Transparent and auditable
Limitations:
- Requires valid IQM computation
- Sensitive to IQM accuracy
- May not capture all batch factors
Use case: MRI, diffusion imaging, or other modalities with standard quality metrics.
7. SV-ComBat (In Development)
SV-ComBat (Similarity-guided ComBat) is an in-development method that uses similarity matrices to inform batch prior pooling in ComBat.
Method:
1. Computes pairwise similarity between batch effects (e.g., based on covariate distributions, effect correlation)
2. Uses similarity to inform Bayesian prior pooling
3. Dynamically pools batch parameters based on similarity structure
4. Applies adjusted ComBat correction
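Since this method is still in development, the following is only a guess at what similarity-informed pooling could look like, not a description of the actual algorithm. Each batch's mean estimate is pulled towards a similarity-weighted average of the other batches' estimates, with less pulling for larger batches. The function name, the `kappa` strength parameter, and the weighting rule are all assumptions.

```python
import numpy as np

def similarity_pooled_means(gamma_hat, similarity, n_per_batch, kappa=10.0):
    """Pool batch mean estimates towards those of similar batches.

    gamma_hat: (n_batches, n_features) raw batch mean estimates,
    similarity: (n_batches, n_batches) non-negative similarity matrix,
    n_per_batch: (n_batches,) sample counts, kappa: pooling strength.
    """
    gamma_hat = np.asarray(gamma_hat, dtype=float)
    W = np.asarray(similarity, dtype=float).copy()
    np.fill_diagonal(W, 0.0)                      # a batch is not its own neighbour
    row_sum = W.sum(axis=1, keepdims=True)
    has_neighbours = row_sum > 0
    W = W / np.where(has_neighbours, row_sum, 1.0)

    # Prior mean: similarity-weighted average of the other batches.
    # Batches with no similar neighbours keep their own estimate.
    prior = np.where(has_neighbours, W @ gamma_hat, gamma_hat)

    n = np.asarray(n_per_batch, dtype=float)[:, None]
    weight = n / (n + kappa)                      # larger batches trust their own data
    return weight * gamma_hat + (1.0 - weight) * prior
```

With an all-ones similarity matrix every batch is pooled towards the plain average of the other batches, so the similarity matrix is what makes the pooling selective.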
Key features:
- Adaptive prior pooling based on batch relationships
- Preserves structure in batch effects
- Can incorporate multiple similarity metrics
- Research-stage implementation
Advantages:
- Exploits relationships between batch effects
- Improved prior estimates when batches are similar
- Flexible similarity metric design
- Principled Bayesian approach
Limitations:
- Still in active development
- Not yet validated in large-scale studies
- Requires tuning similarity metric
- Increased computational cost
Use case: Studies with many similar batches (e.g., same scanner model at different sites, replicated protocols).
Status: ⚠️ Experimental/In Development — Use with caution and report findings carefully.
Comparison Table
| Method | Complexity | Speed | Covariate Flexibility | Batch Effect Model | Status |
|---|---|---|---|---|---|
| Linear Modelling | Low | Very fast | Linear | Fixed or random effect | Stable |
| ComBat | Medium | Fast | Linear (mod matrix) | Location-scale | Stable |
| ComBat-GAM | Medium-High | Fast | Non-linear (GAM) | Location-scale + GAM | Stable |
| CovBat | Medium | Moderate | Linear (integrated) | Location-scale (unified) | Stable |
| Reference-Based ComBat | Medium | Fast | Linear (mod matrix) | Location-scale (relative to reference) | Stable |
| IQM-Based | Medium-High | Moderate | Metric-driven | Quality-metric-informed | Stable |
| SV-ComBat | High | Moderate-Slow | Linear (mod matrix) | Similarity-pooled priors | Experimental |
When to Use Each Method
Quick/exploratory analysis
→ Linear Modelling
Standard multi-site neuroimaging
→ ComBat
Non-linear covariate effects expected
→ ComBat-GAM
Integrated covariate handling preferred
→ CovBat
One high-quality reference site available
→ Reference-Based ComBat
Physical quality metrics are key
→ IQM-Based Harmonisation
Many structurally similar batches
→ SV-ComBat (with caution — experimental)
Implementation in DiagnoseHarmonisation
Most methods are implemented in the HarmonisationFunctions module, in particular through the combat_modular() function, which supports:
- Different mean models ('ols', 'gam', etc.)
- Prior mode selection ('global', 'local', etc.)
- Custom prior weighting strategies
For programmatic usage, see the API documentation.
References
- ComBat: Johnson, W. E., Li, C., & Rabinovic, A. (2007). Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 8(1), 118–127.
  Fortin et al. (2018). Harmonization of cortical thickness measurements across scanners and sites. NeuroImage, 167, 104–120. https://doi.org/10.1016/j.neuroimage.2017.11.024
  Fortin et al. (2017). Harmonization of multi-site diffusion tensor imaging data. NeuroImage, 161, 149–170. https://doi.org/10.1016/j.neuroimage.2017.08.047
- ComBat-GAM: Pomponio et al. (2020). Harmonization of large MRI datasets for the analysis of brain imaging patterns throughout the lifespan. NeuroImage, 208, 116450. https://doi.org/10.1016/j.neuroimage.2019.116450
- Reference batch ComBat: Jacob Turnbull et al. (2026). bioRxiv 2026.05.22.726536. https://doi.org/10.64898/2026.05.22.726536
- CovBat: Chen, A. A., Beer, J. C., Tustison, N. J., Cook, P. A., Shinohara, R. T., Shou, H., & the Alzheimer's Disease Neuroimaging Initiative (2022). Mitigating site effects in covariance for machine learning in neuroimaging data. Human Brain Mapping, 43(4), 1179–1195. https://doi.org/10.1002/hbm.25688
- IQM-based: Emma Prevot, Dieter A. Häring, Laura Gaetano, Russell T. Shinohara, Chris C. Holmes, Thomas E. Nichols, Habib Ganjgahi (2025). BARTharm: MRI Harmonization Using Image Quality Metrics and Bayesian Non-parametric. bioRxiv 2025.06.04.657792. https://doi.org/10.1101/2025.06.04.657792
  Gaurav Bhalerao et al. (2026). Harmonising Structural Brain MRI from Multiple Sites with Limited Sample Sizes. medRxiv 2026.04.21.26351106. https://doi.org/10.64898/2026.04.21.26351106
Citation
When using any harmonisation method from DiagnoseHarmonisation in your research, please cite the original method papers as well as this package.