Audio-visual speech recognition systems can be categorised into those that integrate audio-visual features before a decision is made (feature fusion) and those that integrate the decisions of separate recognisers for each modality (decision fusion). Decision fusion has been applied at the level of individual analysis time frames, phone segments and isolated words, but in its basic form it cannot be used for continuous speech recognition because of the combinatorial explosion of word-string hypotheses that must be evaluated. We present a case for decision fusion at the utterance level and propose an algorithm, which we call N-best decision fusion, that can be applied efficiently to continuous speech recognition tasks. The system was tested on a single-speaker, continuous digit recognition task in which the audio stream was contaminated by additive multi-speaker babble noise. The audio-visual system achieved lower word error rates than the audio-only system under all signal-to-noise conditions tested, with the magnitude of the improvement depending on the signal-to-noise ratio.
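The core idea of utterance-level N-best decision fusion can be illustrated with a minimal sketch: the audio recogniser emits an N-best list of word-string hypotheses with scores, and each hypothesis is rescored with the visual recogniser before the final decision. The function name, the interpolation weight `lam`, and the log-linear combination shown here are illustrative assumptions, not the paper's exact formulation.

```python
def nbest_decision_fusion(audio_nbest, visual_score, lam=0.7):
    """Sketch of utterance-level N-best decision fusion (assumed form).

    audio_nbest  : list of (hypothesis, audio_log_likelihood) pairs,
                   the N-best output of the audio-only recogniser
    visual_score : callable mapping a hypothesis string to its visual
                   log-likelihood (from the visual recogniser)
    lam          : hypothetical audio stream weight in [0, 1]; in practice
                   it would be tuned per signal-to-noise condition
    """
    best_hyp, best_score = None, float("-inf")
    for hyp, a_score in audio_nbest:
        # Log-linear combination of the two modality scores; only the
        # N audio hypotheses are evaluated, avoiding the combinatorial
        # explosion of scoring all possible word strings.
        combined = lam * a_score + (1.0 - lam) * visual_score(hyp)
        if combined > best_score:
            best_hyp, best_score = hyp, combined
    return best_hyp, best_score
```

With noisy audio, the visual scores can overturn the audio-only ranking: a hypothesis placed second by the audio recogniser wins once its stronger visual evidence is folded in.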