Ethical yet realistic evaluation of usable security mechanisms is both critical and challenging. We study a particular and important case: the security achieved by different defenses against phishing, where users play a key role in detecting the attacks. We argue that proper evaluation of such anti-phishing defenses requires users to act "naturally", as they would in real life, without excessive awareness of being tested on their ability to detect attacks. We draw on our experience conducting one of the most extensive, long-term usable security experiments, which evaluated anti-phishing defenses [5]. We discuss the ethical and operational challenges involved and present our recommendations.