Word embeddings (WEs) often reflect biases present in their training data, and various bias mitigation and evaluation techniques have been proposed to address this. Existing benchmarks for comparing debiasing methods, however, overlook two factors: the choice of training words and model hyper-parameters. We propose a robust comparison methodology that accounts for both, combining nested cross-validation, hyper-parameter optimization, and the corrected paired Student's t-test. Our results show that, under this evaluation approach, many recent debiasing methods offer no statistically significant improvement over the original hard debiasing method.
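
The abstract names the corrected paired Student's t-test as the significance test used in the comparison; below is a minimal Python sketch of that test in its standard corrected resampled form (Nadeau and Bengio's variance correction), not the paper's own implementation. The array of per-fold score differences and the split sizes are illustrative assumptions.

```python
# Sketch of the corrected (resampled) paired Student's t-test for comparing
# two debiasing methods across cross-validation folds. Variable names and
# the example data are hypothetical, not taken from the paper.
import numpy as np
from scipy import stats

def corrected_paired_ttest(diffs, n_train, n_test):
    """Return (t-statistic, two-sided p-value) for per-fold score differences."""
    diffs = np.asarray(diffs, dtype=float)
    k = diffs.shape[0]                      # number of cross-validation folds
    mean_diff = diffs.mean()
    var_diff = diffs.var(ddof=1)            # unbiased sample variance
    # The correction inflates the variance by (1/k + n_test/n_train) to account
    # for the overlap between training sets across folds, which the standard
    # paired t-test ignores.
    corrected_var = (1.0 / k + n_test / n_train) * var_diff
    t_stat = mean_diff / np.sqrt(corrected_var)
    p_value = 2.0 * stats.t.sf(abs(t_stat), df=k - 1)
    return t_stat, p_value

# Example usage with 10 outer folds and an 80/20 train/test split (made-up numbers).
rng = np.random.default_rng(0)
fold_diffs = rng.normal(loc=0.01, scale=0.05, size=10)
print(corrected_paired_ttest(fold_diffs, n_train=4000, n_test=1000))
```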