Word embeddings (WEs) often reflect biases present in their training data, and various bias mitigation and evaluation techniques have been proposed to address this. Existing benchmarks for comparing debiasing methods, however, overlook two factors: the choice of training words and model hyper-parameters. We propose a robust comparison methodology that accounts for both, combining nested cross-validation, hyper-parameter optimization, and the corrected paired Student's t-test. Our results show that, under this evaluation approach, many recent debiasing methods offer no statistically significant improvement over the original hard debiasing method.
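
The abstract names the corrected paired Student's t-test as the significance test used in the comparison; below is a minimal Python sketch of that test in its standard corrected resampled form (Nadeau and Bengio's variance correction), not the paper's own implementation. The array of per-fold score differences and the split sizes are illustrative assumptions.

```python
# Sketch of the corrected (resampled) paired Student's t-test for comparing
# two debiasing methods across cross-validation folds. Variable names and
# the example data are hypothetical, not taken from the paper.
import numpy as np
from scipy import stats

def corrected_paired_ttest(diffs, n_train, n_test):
    """Return (t-statistic, two-sided p-value) for per-fold score differences."""
    diffs = np.asarray(diffs, dtype=float)
    k = diffs.shape[0]                      # number of cross-validation folds
    mean_diff = diffs.mean()
    var_diff = diffs.var(ddof=1)            # unbiased sample variance
    # The correction inflates the variance by (1/k + n_test/n_train) to account
    # for the overlap between training sets across folds, which the standard
    # paired t-test ignores.
    corrected_var = (1.0 / k + n_test / n_train) * var_diff
    t_stat = mean_diff / np.sqrt(corrected_var)
    p_value = 2.0 * stats.t.sf(abs(t_stat), df=k - 1)
    return t_stat, p_value

# Example usage with 10 outer folds and an 80/20 train/test split (made-up numbers).
rng = np.random.default_rng(0)
fold_diffs = rng.normal(loc=0.01, scale=0.05, size=10)
print(corrected_paired_ttest(fold_diffs, n_train=4000, n_test=1000))
```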