We describe results of a case study whose intent was to determine whether new techniques for hardware/software partitioning of an application's binary are competitive with partitioning at the C source code level. While such competitiveness has been shown previously for standard benchmark suites involving smaller or unoptimized applications, the case study instead focuses on a complete 16,000-line highly-optimized commercial-grade application, namely an H.264 video decoder. The several month study revealed that binary partitioning was indeed competitive, achieving nearly identical 2.5x speedups as source level partitioning, compared to a standard microprocessor. Furthermore, the study revealed that several simple C-level coding modifications, including pass by value-return, function specialization, algorithmic specialization, hardware-targeted reimplementation, global array elimination, hoisting and sinking of error code, and conversion to explicit control flow, could lead to improved application speedups approaching 7x for both source level and binary level partitioning.