The DIALOG corpus is one of two collections of spoken language gathered in the audio-visual studio at the Czech Language Institute of the Czech Academy of Sciences. The article begins by recalling the establishment of the corpus in 1997 as part of the project 'Dialogue in a World of People and Machines'. It then defines the aim that motivated the data collection, formulates the criteria that distinguish the corpus as a specifically 'spoken' one in terms of time, interaction and genre, and in part also as topic-specific, and attempts to define the types of spoken dialogue that the corpus can help analyse. It characterizes speech in the media, which is a focal point here, and details the procedures for storing the audio and video recordings of this speech and the resulting transcriptions. The second part provides an overview of the fundamentals of transcription systems and offers theoretical support for selecting a transcription method according to the aim of capturing segmental, supra-segmental, sequential, para-linguistic and extra-linguistic phenomena, with several examples of practical solutions. The third part reports on how the corpus has so far been used in linguistic research, both in building a contemporary Czech theory of dialogue and in analysing specific features of spoken Czech. The article concludes by outlining the prospects for further use of the corpus.