Product and service reviews are abundantly available online, but selecting relevant information from them involves a significant amount of time. The authors address this problem with Starlet, a novel approach for extracting multidocument summarizations that considers aspect rating distributions and language modeling. These features encourage the inclusion of sentences in the summary that preserve...
Due to the limited coverage of existing bilingual dictionaries, it is often difficult to translate Out-Of-Vocabulary (OOV) terms in many natural language processing tasks. In this paper, we propose a general three-step cascade mining technique that leverages the OOV category to optimize the effectiveness of each step. An OOV-category-based expansion policy is suggested to get more relevant mixed-language...
Text data pertaining to socio-technical networks often are analyzed separately from relational data, or are reduced to the fact and strength of the flow of information between nodes. Disregarding the content of text data for network analysis can limit our understanding of the effects of language use in networks. We present a computational and interdisciplinary methodology that addresses this limitation...
Aiming at the deficiencies of traditional blog hotness evaluation methods, the paper presents a blog hotness evaluation model based on text opinion analysis (named BHEM-TOA). The model not only considers the number of reviews, comments and publication time of the blog topic, but also focuses on the comment opinion. BHEM-TOA emphasizes subjective opinions of reviewers about the blog topic. It utilizes...
It is well known that information retrieval systems based entirely on syntactic contents have serious limitations. In order to achieve high precision and recall on IR systems, the incorporation of natural language processing techniques that provide semantic information is needed. For this reason, by determining the semantic for the constituents of documents, a clustering method is presented in this...
Current text clustering algorithms ignore semantic relationships between words and suffer from high data dimensionality and computational complexity when dealing with Chinese texts. To address these problems, this paper presents a new method to cluster Chinese texts based on semantics in a specific field: the TCBS (Text Clustering Based on Semantics) algorithm. The algorithm is based on the agglomerative hierarchical clustering algorithm,...
Analysis of emotions in texts has wide-ranging applications. In the analysis of emotional expressions, degree words are important for expressing the intensity of emotions. With the support of a large Chinese emotion corpus (Ren-CECps), in this paper we present an analysis of degree words for Chinese emotion expressions based on syntactic parsing and rules. First, Ren-CECps is used to extract the...
In order to solve the problem of Katakana reduced to English in Japanese-English translation, we employ the phrase-based statistical machine translation model to perform Katakana phrase (or word) translation from Japanese to English. The Katakana phrase is segmented into words by CRF, and then Japanese-English and English-Japanese bi-directional integration translation is carried out on those segmented...
We present Antelogue, a novel pronoun resolution architecture for dialogues based on efficient filtering of potential antecedents through a simple look-up of information using existing resources (gender, number, NER, etc). Our system does not require large labelled datasets for training or complex handcrafted rules. We will demo the system's real time performance on dialogues extracted from the screenplays...
With the rapid development of text summarization, evaluation methods for automatic Chinese text summarization systems are becoming more and more important in natural language processing, as they can greatly promote the development of text summarization. This paper analyzes the existing methods for automatic summarization evaluation, and introduces a new evaluation method based on clustering. The main idea of...
In this paper, we discuss a method to improve the sentence ordering task in Chinese. Our approach is based on the analysis of a Markov model, whose transition probabilities can be trained on a raw corpus. We iteratively calculate the path with the largest transition probability in the Markov model to confirm the correct order. The method avoids judging the first sentence, which could lead to an unstable result in...
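The idea of picking the sentence order with the largest transition probability can be illustrated with a minimal Python sketch. The abstract is truncated, so this is not the authors' algorithm: it swaps their iterative calculation for an exhaustive search over all orderings, and the function name `best_order` and the toy transition matrix are hypothetical. Because every ordering is scored as a whole, no separate decision about the first sentence is needed.

```python
from itertools import permutations


def best_order(sentences, trans):
    """Return (ordering, score) maximizing the product of pairwise
    transition probabilities trans[a][b] for consecutive sentences a -> b.

    Exhaustive search: only practical for a handful of sentences.
    """
    best, best_score = None, -1.0
    for perm in permutations(range(len(sentences))):
        score = 1.0
        for a, b in zip(perm, perm[1:]):
            score *= trans[a][b]
        if score > best_score:
            best, best_score = perm, score
    return [sentences[i] for i in best], best_score


# Toy data (assumed, not from the paper): trans[i][j] is the estimated
# probability that sentence j directly follows sentence i.
sentences = ["A", "B", "C"]
trans = [
    [0.0, 0.8, 0.2],
    [0.1, 0.0, 0.9],
    [0.5, 0.5, 0.0],
]

order, score = best_order(sentences, trans)
print(order, round(score, 2))
```

In practice the transition probabilities would be estimated from a corpus, and a search that scales better than factorial time would be needed for longer texts.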
The traditional English text chunking approach identifies phrases using only one model and the same features. It is shown that one model cannot capture each phrase's characteristics, and that the same features are not suitable for all phrases. In this paper, a multi-agent text chunking model is proposed. This model uses individual sensitive features of each phrase to identify different phrases. Through...
Steganography is a technique for embedding secret messages into carriers. Linguistic steganography is a branch of text steganography. Research on attacking methods against linguistic steganography plays an important role in the information security (IS) area. In this paper, a linguistic steganography detection algorithm using a statistical language model (SLM) is presented. An experiment to detect text...
Classical information retrieval models are based on the representation of document terms without considering linguistic elements. This article presents a model based on the Discourse Nominal Structure, which lets us take linguistic characteristics of text into account. The model presented is evaluated in comparison with the vector space model. Based on observations during the experimentation we propose...
Textual Entailment (TE) recognition is the task of recognizing whether a textual expression, the text T, entails another expression, the hypothesis H. Recently it has been treated as a common solution for modeling language variability. Textual entailment captures a broad range of semantically oriented inferences needed for many Natural Language Processing (NLP) applications, like Information Retrieval...
Chinese feature extraction is indispensable in Chinese natural language processing because it benefits Chinese text knowledge discovery and information retrieval. Chinese segmentation is the precondition of feature extraction. To overcome the disadvantages of current Chinese segmentation methods, such as lexicon-based schemes, syntax and rules-based schemes, statistics-based schemes and...
Motivated by the probabilistic characteristics of syntax compositions especially POS (part of speech) matching of Chinese textual information and the inner structures of most unlexicalized Chinese domain terms, a system framework to recognize and extract domain-specific Chinese terms based on hidden Markov model (HMM) was proposed and implemented. The system learns the HMM parameters by the input...
In this paper, we propose an algorithm and data structure for computing the term contributed frequency (tcf) for all N-grams in a text corpus. Although term frequency is one of the standard notions of frequency in Corpus-Based Natural Language Processing (NLP), there are some problems in applying the concept to N-gram approaches, such as the distortion of phrase frequencies. We attempt to...
The conditional random fields (CRFs) model is a valid probabilistic model for segmenting and labeling sequence data. Compared with other statistical models, such as HMM and MEHMM, CRFs process the data sequence in terms of the context of the data. Chunk analysis is a shallow parsing method to simplify natural language processing. Entity relation extraction is used in establishing relationships between entities...
With the rapid development of text summarization, evaluation methods for automatic summarization systems are becoming more and more important in natural language processing, as they can greatly promote the development of text summarization. This paper analyzes the existing methods for automatic summarization evaluation, and introduces a new evaluation method based on HowNet. The original tests have shown that...