Most predictive modeling in information fusion is performed using ensembles. When designing ensembles, the prevailing opinion is that base classifier diversity is vital for how well the ensemble generalizes to new observations. Unfortunately, the key term diversity is not uniquely defined, which has led to several diversity measures and many methods for diversity creation. In addition, no specific diversity measure has been shown to correlate strongly with generalization accuracy. The purpose of this paper is to empirically evaluate ten different diversity measures, using neural network ensembles and eight publicly available data sets. The main result is that all evaluated diversity measures show low or very low correlation with test set accuracy. Moreover, the most diverse ensembles often obtain very poor accuracy. Based on these results, explicitly utilizing diversity when optimizing ensembles appears to be quite a challenge.
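As an illustration of the kind of pairwise measure studied in this literature, the following is a minimal sketch of Yule's Q-statistic, one commonly used diversity measure, computed from two classifiers' correctness vectors. The function name and NumPy implementation are illustrative assumptions, not taken from the paper; the paper itself evaluates ten measures, which may or may not include this one.

```python
import numpy as np

def q_statistic(correct_i, correct_j):
    """Yule's Q-statistic for a pair of classifiers.

    Inputs are boolean arrays marking which test observations each
    classifier got right. Q is 1 when the two classifiers make
    identical errors, near 0 when their errors are independent, and
    negative when they tend to err on different observations.
    """
    a = np.sum(correct_i & correct_j)    # both correct  (N11)
    d = np.sum(~correct_i & ~correct_j)  # both wrong    (N00)
    b = np.sum(correct_i & ~correct_j)   # only i correct (N10)
    c = np.sum(~correct_i & correct_j)   # only j correct (N01)
    denom = a * d + b * c
    return (a * d - b * c) / denom if denom else 0.0
```

For an ensemble, such a pairwise score is typically averaged over all classifier pairs, and the paper's experiment then correlates the averaged score with test set accuracy across many ensembles.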