In health research, ordinal scales are used extensively, and the reproducibility of ratings made with these scales is important for assessing their quality. This study aimed to compare two methods for analyzing reproducibility: the weighted Kappa statistic and log-linear models.

The contributions of each method to the assessment of reproducibility of ratings on ordinal scales were compared using intra- and interobserver data from three different fields: the Crow's feet scale in dermatology, a dysplasia scale in oncology, and the updated Sydney scale in gastroenterology.

Both methods provided a level of agreement. In addition, log-linear models allowed evaluation of the structure of agreement. For the Crow's feet scale, both methods gave equivalently high agreement levels. For the dysplasia scale, log-linear models highlighted defects in the scale, while the Kappa statistic showed only moderate agreement. For the updated Sydney scale, log-linear models revealed a null distinguishability between two adjacent categories, whereas the Kappa statistic gave a high global agreement level.

Methods that can investigate both the level and the structure of agreement between ordinal ratings are valuable tools, since they may highlight heterogeneities within a scale's structure and suggest modifications to improve its reproducibility.