In this paper, we present research on the design and processing of an audio-visual speech database for automatic Russian speech recognition, recorded with an Oktava MK-012 microphone and a JAI Pulnix RMC-6740GE high-speed camera (200 frames per second). We describe the audio-visual speech recording system we developed, which synchronizes and fuses the audio and video data captured by the two independent sensors. The system automatically detects voice activity in the audio signal and stores only the speech fragments, discarding non-informative signals. It also accounts for and processes the natural asynchrony between the two speech modalities. We present methods for extracting acoustic features (based on Mel-frequency cepstral coefficients) and visual speech features (pixel-based features of the mouth region), as well as for temporal segmentation of the multimodal data (by forced alignment).
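
To make the recording step concrete, the following is a minimal sketch of how voice activity detection in the audio signal could select the speech fragments to keep, and how those fragments map onto the 200 frames-per-second video timeline. The abstract does not state which detection method the system uses, so a plain short-time energy threshold stands in for it here. The 16 kHz audio sampling rate, the threshold value, and the names `frame_energies`, `speech_segments`, and `frames_to_seconds` are all assumptions made for illustration.

```python
import math

VIDEO_FPS = 200                        # camera frame rate given in the text
AUDIO_RATE = 16000                     # assumption: audio sampling rate in Hz
FRAME_LEN = AUDIO_RATE // VIDEO_FPS    # audio samples spanned by one video frame


def frame_energies(samples, frame_len=FRAME_LEN):
    """Mean squared amplitude of each non-overlapping audio frame."""
    n_frames = len(samples) // frame_len
    return [
        sum(x * x for x in samples[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


def speech_segments(energies, threshold):
    """Return (start, end) frame index pairs, end exclusive, where energy > threshold."""
    segments, start = [], None
    for i, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = i                      # speech begins
        elif energy <= threshold and start is not None:
            segments.append((start, i))    # speech ends
            start = None
    if start is not None:                  # signal ended during speech
        segments.append((start, len(energies)))
    return segments


def frames_to_seconds(segment, fps=VIDEO_FPS):
    """Convert a (start, end) frame pair to (start, end) times in seconds."""
    start, end = segment
    return start / fps, end / fps


# Synthetic test signal: 0.5 s silence, 0.25 s of a 440 Hz tone, 0.25 s silence.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / AUDIO_RATE) for n in range(4000)]
signal = [0.0] * 8000 + tone + [0.0] * 4000

# The threshold of 0.01 is an arbitrary illustrative value, not from the paper.
segments = speech_segments(frame_energies(signal), threshold=0.01)
for seg in segments:
    print(seg, frames_to_seconds(seg))
```

Because each audio analysis frame is chosen to span exactly one video frame, a detected segment's frame indices can be used directly to cut the matching video frames. A real system would more likely use overlapping windows and a smoothed or adaptive threshold, and would need to handle clock drift between the independent sensors.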