There is considerable debate about the use of routine outcome monitoring (ROM) data for scientific or benchmarking purposes. We discuss pitfalls associated with the assessment, analysis, and interpretation of ROM data, using data from 376 patients. Of these, 206 (55%) completed at least one follow-up measurement. Mixed-model analysis showed significant improvement in symptomatology, quality of life, and autonomy, with the degree of improvement differing between subgroups. Effect sizes ranged from small to large, depending on the outcome measure and subgroup. Subtle variations in analytic strategy changed the effect sizes substantially. We illustrate how problems inherent in the design and analysis of ROM data preclude conclusions about (comparative) treatment effectiveness.