The use of p values in null hypothesis statistical tests (NHST) is controversial in the history of applied statistics, owing to a number of problems. They are: arbitrary levels of Type I error, failure to trade off Type I and Type II error, misunderstanding of p values, failure to report effect sizes, and overlooking better means of reporting estimates of policy impacts, such as effect sizes, interpreted confidence intervals, and conditional frequentist tests. This paper analyzes the theory of p values and summarizes the problems with NHST. Using a large data set of public school districts in the United States, we demonstrate empirically the unreliability of p values and hypothesis tests as predicted by the theory. We offer specific suggestions for reporting policy research.