Domain corpus independent vocabulary generation for embedded continuous speech recognition

Minkyu Lim; Kwang-Ho Kim; Ji-Hwan Kim

doi:10.1109/TCE.2009.5278036

Source

IEEE Transactions on Consumer Electronics, vol. 55, no. 3, pp. 1631-1636, 2009

Abstract

This paper proposes a domain-corpus-independent vocabulary generation algorithm to improve vocabulary coverage for embedded continuous speech recognition (CSR). A CSR vocabulary is normally derived from a word frequency list, so its coverage depends on the domain corpus from which that list was built. We present an improved method of vocabulary generation that uses a part-of-speech (POS) tagged corpus and a knowledge base. We investigate the 152 POS tags defined in a POS tagged corpus, together with its word-POS tag pairs. We analyze all words paired with 101 of the 152 POS tags and decide on a set of words that must be included in a vocabulary of any size. The other 51 POS tags fall mainly into noun-related, named entity (NE)-related, and verb-related categories. For these, we introduce a domain-corpus-independent word inclusion method that uses a knowledge base. For noun-related POS tags, we generate synonym groups and analyze their relative importance using Google search. We categorize verbs by lemma and analyze the relative importance of each lemma from pre-analyzed verb statistics. We determine the inclusion order of NEs through Google search. The proposed method shows at least a 28.6% relative improvement in coverage on an SMS text corpus for vocabulary sizes of 5 K, 10 K, 15 K, and 20 K. In particular, the coverage of the 15 K vocabulary generated by the proposed method reaches 97.8%, a relative improvement of 44.2%.
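To make the baseline concrete, the following minimal sketch shows the conventional approach the abstract contrasts itself with: a vocabulary taken from the top of a word frequency list, and its coverage measured on text from a different source. It is not the paper's algorithm. The function names, the toy corpora, and the token-level definition of coverage (fraction of test tokens found in the vocabulary) are assumptions made for illustration.

```python
from collections import Counter


def build_frequency_vocabulary(training_tokens, size):
    """Return the `size` most frequent words of a corpus (hypothetical helper)."""
    counts = Counter(training_tokens)
    # Sort by descending count; break ties alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word for word, _ in ranked[:size]}


def coverage(vocabulary, test_tokens):
    """Fraction of test tokens present in the vocabulary (assumed definition)."""
    if not test_tokens:
        return 0.0
    covered = sum(1 for token in test_tokens if token in vocabulary)
    return covered / len(test_tokens)


# Toy corpora, invented for this example only.
domain_corpus = "the market rose and the market fell and the index rose".split()
sms_corpus = "see you at the market and call me".split()

vocab = build_frequency_vocabulary(domain_corpus, 3)
sms_coverage = coverage(vocab, sms_corpus)
print(sorted(vocab), sms_coverage)
```

Because the vocabulary is chosen purely from the frequencies of one corpus, words common in the test text but rare in that corpus are left out, which is the domain dependence the abstract describes.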

Identifiers

Journal ISSN: 0098-3063
DOI: 10.1109/TCE.2009.5278036

Keywords

vocabulary; search engines; speech recognition; verb-related part of speech; domain corpus independent vocabulary generation algorithm; embedded continuous speech recognition; word frequency list; part-of-speech tagged corpus; domain corpus independent word inclusion method; Google search; knowledge base system; named entity related part-of-speech; Artificial neural networks; Knowledge based systems; Acoustics; Data mining; Speech; Embedded speech recognition; Domain corpus independent; Coverage

Additional information

Data set: ieee

Publisher

IEEE

