The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
The growing use of informal social text messages on Twitter is one of the known sources of big data. These type of messages are noisy and frequently rife with acronyms, slangs, grammatical errors and non-standard words causing grief for natural language processing (NLP) techniques. In this study, our contribution is to target non-standard words in the short text and propose a method to which the given...
Due to the high volume of information and electronic documents on the Web, it is almost impossible for a human to study, research and analyze this volume of text. Summarizing the main idea and the major concept of the context enables the humans to read the summary of a large volume of text quickly and decide whether to further dig into details. Most of the existing summarization approaches have applied...
We propose learning sentiment-specific word embeddings dubbed sentiment embeddings in this paper. Existing word embedding learning algorithms typically only use the contexts of words but ignore the sentiment of texts. It is problematic for sentiment analysis because the words with similar contexts but opposite sentiment polarity, such as good and bad, are mapped to neighboring word vectors. We address...
An important factor of a corpus is its domain, usually the quality of a SMT system trained on an in-domain corpus increases by adding out-of-domain sentences to its training corpus. In this paper we have shown out-of-domain corpora may also contains sentences which are proper for improving the quality of in-domain corpus. These sentences have words and phrases that occur in indomain corpora so, their...
Text understanding models exploit semantic networks of words as basic components. Automatically enriching and expanding these resources is then an important challenge for NLP. Existing models for enriching semantic resources based on lexical-syntactic patterns make little use of structural properties of target semantic relations. In this paper, we propose a novel approach to include transitivity in...
Human conversation is the best example of loosely coupled distributed communication known to man. This study specifically considers those parts of the human conversation that show evidence of dynamism and attempts to model these parts with the objective of translating the knowledge of conversational dynamics into the domain of computerised distributed systems. The discipline of metaphorical modelling...
This paper describes the analysis results about the influence of coarticulation effect and syllable duration on variations of Vietnamese tones in continuous speech. Based on these results a new method for generating F0 contours is proposed. Firstly the contour of tonal register, which is calculated from relative register ratios between two adjacent tones, is produced. And then the tone patterns are...
In this paper, we introduced an unsupervised method to remove fillers in spoken dialogues semi-automatically based on their probability distribution and the effect of removing fillers to induce semantic classes. We conduct the unigram and bigram distribution of fillers on our Chinese voice search data and find that only using these distributions, fillers are in the first 1% of all words. We also test...
In this paper, we introduced an unsupervised method to remove fillers in spoken dialogues semi-automatically based on their probability distribution. Disfluencies such as fillers, repairs often make the sentence ill-formed, longer and hard to process. Fillers were emphasized instead of repairs in this paper. We conduct the unigram and bigram distribution of fillers on our Chinese voice search data...
As a kind of data model, a formal context must be extracted from some actual data sources such as documents. For case of unstructured Chinese document, it is the first question to decide how to express the document. Vector space model (VSM) which is the dominant model of document expression now takes a single word as a feature item, so that neglects the lexical semantic relationship between words...
It is well known that bursts and voiced formant transitions serve as separate cues to the place of articulation of initial stop consonants. The Vietnamese presents three final voiceless stop consonants /p, t, k/ without bursts. It is an opportunity to study these final stop consonants and to compare their characteristics with those of the corresponding initial stop consonants. As final consonants...
Emotion recognition on text has wide applications. In this study we propose a method of emotion recognition at sentence level based on a relative large emotion annotation corpus (Ren-CECps). From this corpus, we get the emotion lexicons for the eight basic emotions (expect, joy, love, surprise, anxiety, sorrow, angry and hate). Statistics show that the emotion lexicons derived from Ren-CECps are used...
In this paper, we would like to introduce a new approach to recover Vietnamese text's accents. Given a Vietnamese text in which accents are lost, our goal is to seek for a recovered text that yields a best lexical probability. Using a dynamic programming approach, we first build a model of language for Vietnamese as a lexical database which gives lexical probabilities to Vietnamese sentences. Second,...
Classical information retrieval models are based on representation of document terms without considering linguistic elements. This article presents a model based on the Discourse Nominal Structure; which lets us take linguistic characteristics of text into account. The model presented is evaluated in comparison with the vector space model. Based on observations during the experimentation we propose...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.