A Two-Axis Framework for Comparing Bias Metrics in Encoder-Only Language Models

  • Danial Hosseinpour

Student thesis: Master's Thesis

Abstract

This thesis investigates bias and fairness evaluation for encoder-only Transformer language models used in downstream classification, motivated by the heterogeneity of bias metrics and the possibility of inconsistent conclusions even within a single model family. Prior work provides limited evidence on whether bias measured on pre-trained checkpoints corresponds to the disparity observed after fine-tuning, and it remains unclear which metric families provide mutually comparable evidence across checkpoint stages and measurement signals. Accordingly, this thesis quantifies the correspondence (correlation) between bias metrics along two organising axes: checkpoint stage (before versus after fine-tuning) and signal type (embedding-based, probability-based, and output-based). Two research questions are addressed: (i) to what extent do bias metrics measured before fine-tuning correlate with those measured after fine-tuning; and (ii) to what extent do embedding-, probability-, and output-based metrics correlate with one another within the evaluated cohort. A systematic comparative, correlational design is adopted using a stage × signal framework that separates checkpoint stage (pre-trained versus fine-tuned checkpoints) from signal type (representations, probabilities, outputs), so that each metric can be attributed transparently to a stage and a signal source. The empirical study evaluates seven encoder-only models at both pre-trained and fine-tuned checkpoints. The downstream case study is toxicity detection in the Jigsaw/Civil Comments setting, using mention-based gender cohorts (male versus female mentions) and a fixed gap convention ∆ = male − female.
The metric inventory spans: (i) CEAT (CEAT-6/7/8) computed before and after fine-tuning; (ii) StereoSet Stereotype Score computed on pre-trained checkpoints under SSLL and SSPLL; (iii) threshold-independent downstream diagnostics on fine-tuned checkpoints (overall ROC–AUC and subgroup gaps: ∆Subgroup AUC, ∆BPSN AUC, ∆BNSP AUC); and (iv) threshold-dependent output metrics on fine-tuned checkpoints, with emphasis on ∆FPR under a fixed decision threshold. Within the stage × signal framework, Spearman’s rank correlation with two-sided testing is used to compare heterogeneous metrics across models without normalising scores across incomparable scales; where the objective is alignment in disparity magnitude, correlations are computed on absolute gaps. After fine-tuning, strong rank correspondence was observed between the threshold-independent probability-based diagnostics (∆BPSN AUC and ∆BNSP AUC) and the threshold-dependent output-based disparity ∆FPR. A further rank correspondence was observed between CEAT-6 after fine-tuning and ∆FPR. Across all evaluated models, ∆FPR is positive (male minus female), indicating that a higher proportion of non-toxic items was incorrectly classified as toxic for the male-mention slice than for the female-mention slice. The thesis contributes a stage × signal organisational framework for bias-metric comparability and an empirical correlation study spanning pre-trained and fine-tuned checkpoints across embedding-, probability-, and output-based metrics. Conclusions are explicitly bounded to encoder-only models, the toxicity detection setting, mention-based male versus female cohorts, and associational inference rather than causal claims.
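As an illustration of two quantities named in the abstract, the following is a minimal sketch, not the thesis implementation. It computes ∆FPR under a fixed decision threshold with the convention ∆ = male − female, and Spearman's rank correlation on absolute gaps across models. All function names, the threshold of 0.5, and the numbers are hypothetical toy values chosen for illustration; none of them are results or settings taken from the thesis.

```python
from statistics import mean


def false_positive_rate(labels, scores, threshold=0.5):
    """Share of non-toxic items (label 0) scored at or above the threshold."""
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not negatives:
        raise ValueError("slice contains no non-toxic items")
    return sum(s >= threshold for s in negatives) / len(negatives)


def delta_fpr(male, female, threshold=0.5):
    """Gap convention from the abstract: delta = male - female."""
    return (false_positive_rate(*male, threshold=threshold)
            - false_positive_rate(*female, threshold=threshold))


def average_ranks(values):
    """1-based ranks; tied values receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Toy slices (hypothetical): (labels, scores), label 1 = toxic.
male_slice = ([0, 0, 0, 0, 1], [0.9, 0.6, 0.2, 0.1, 0.8])
female_slice = ([0, 0, 0, 0, 1], [0.7, 0.3, 0.2, 0.1, 0.9])
gap = delta_fpr(male_slice, female_slice)

# Toy per-model gaps (hypothetical), one entry per model.
bpsn_gaps = [-0.04, 0.01, -0.02, 0.03, -0.05]
fpr_gaps = [0.04, 0.01, 0.02, 0.03, 0.05]

rho_signed = spearman_rho(bpsn_gaps, fpr_gaps)
rho_abs = spearman_rho([abs(g) for g in bpsn_gaps],
                       [abs(g) for g in fpr_gaps])
```

In this toy data the signed and absolute-gap correlations differ, which is why the choice between them depends on whether direction or magnitude of disparity is being compared. The sketch returns only the coefficient; a two-sided p-value could be obtained from `scipy.stats.spearmanr`, whose default alternative is two-sided.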
Date of Award: 29 Apr 2026
Original language: English
Supervisor: Richard Hill (Main Supervisor) & Clay Palmeira Da Silva (Co-Supervisor)
