Capturing the temporal dynamics of news streams has drawn increasing attention in recent work on sequential data mining. Most of these approaches are based on the intuition that a “burst” of a topic is signaled by high-intensity growth of relevant words during a period of time. Such “burst features” can be identified efficiently by Kleinberg's two-state automaton model. The resolution is an important parameter...
Hot event detection in text streams has drawn increasing attention in recent work on sequential data mining. Unlike the traditional TDT task, which finds clusters for all real events, hot event detection identifies only the hot events that concern the public. This paper proposes a novel approach to identifying those events based on burst terms, term co-occurrence, and a generative probabilistic model. Experiments...
This paper aims to improve the discrimination capability of the LDA model through unsupervised feature selection. Experimental results show that if the interference of general words and general topics can be removed, the discrimination capability of the LDA model increases. The key problem is how to find supervised information to evaluate features. The LDA topics are assumed to be reasonable. Therefore,...
Error-correcting output code (ECOC) is an effective approach to multiclass classification with SVMs. In this paper, a probabilistic approach based on ECOC is proposed. In the training stage, a coding scheme is predefined, and a special model is trained on samples. In the classification stage, in addition to the usual labels from the SVM, posterior probabilities of labels are also calculated. They...
This paper focuses on the task of text sentiment analysis in hybrid online articles and web pages. Traditional approaches to text sentiment analysis typically work at a particular level, such as the phrase, sentence, or document level, which might not be suitable for documents with too few or too many words. Considering that analysis at each level has its own advantages, we expect that a combination model...
Cross-document coreference resolution plays an important part in the field of natural language processing (NLP). It captures the ability to gather documents for information about a certain entity. Most previous algorithms identify the underlying entity of a given document from the original text, which is unreliable if the original text contains multiple parts with different themes. In this paper,...
In natural languages, compound words play an important role, and their automatic extraction is very helpful in information retrieval, information extraction, and text classification. In this paper, we introduce a semi-supervised approach to Chinese compound extraction based on an HMM with bootstrapping. First, we define a set of tags BEMI {beginning, end, middle, independence}, which means the position...
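As a rough illustration of position tags of this kind, the sketch below labels each character of an already segmented word with B, M, E, or I. The abstract is truncated before it defines the tags, so the reading used here (each tag marks a character's position within a word, with I for a single-character word) is an assumption, and the function names `bemi_tags` and `tag_sequence` are hypothetical.

```python
def bemi_tags(word):
    """Tag each character of one word by its position (assumed reading of BEMI).

    B = beginning, M = middle, E = end, I = independent single character.
    """
    if len(word) == 0:
        return []
    if len(word) == 1:
        return ["I"]
    # First character is B, last is E, everything in between is M.
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]


def tag_sequence(words):
    """Flatten a segmented sentence into (character, tag) pairs."""
    pairs = []
    for word in words:
        pairs.extend(zip(word, bemi_tags(word)))
    return pairs


# Hypothetical segmented input; not taken from the paper.
example = tag_sequence(["计算机", "很", "有用"])
```

Sequences of (character, tag) pairs like `example` are the kind of labeled data a sequence model such as an HMM could be trained on; how the paper actually builds and bootstraps its training data is not stated in the visible text.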
This paper introduces an algorithm for clustering Chinese short texts to mine web topics, based on Chinese chunks. Given the characteristics of Chinese short texts, the algorithm employs N-gram feature extraction to capture Chinese chunks from texts, which reflect the semantic structure and character dependency of the text. The RPCL algorithm is then applied to cluster the texts with high precision,...
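To make the N-gram feature extraction step concrete, here is a minimal sketch that counts contiguous character N-grams in a short text. It is only an illustration: the choice of N values (2 and 3) and the helper names `char_ngrams` and `ngram_features` are assumptions, and the paper's actual chunk selection and the RPCL clustering stage are not shown.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return every contiguous character n-gram of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_features(text, n_values=(2, 3)):
    """Count character n-grams for each n in n_values (assumed: bigrams and trigrams)."""
    counts = Counter()
    for n in n_values:
        counts.update(char_ngrams(text, n))
    return counts


# Hypothetical short text; not taken from the paper's data.
features = ngram_features("文本聚类算法")
```

Working at the character level avoids the need for a word segmenter, which is one common reason to use N-grams on short Chinese texts; whether that is the paper's motivation is not stated in the visible abstract.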
This paper introduces a locality discriminating indexing (LDI) algorithm for text categorization. The LDI algorithm offers a manifold approach to discriminant analysis. Based on the hypothesis that samples from different classes reside in class-specific manifold structures, the algorithm depicts the manifold structures with a nearest-native graph and an invader graph. A new locality discriminant criterion...
Finding experts accurately and automatically is becoming difficult, especially in large organizations. This paper presents a probabilistic model that applies language modeling techniques to find experts in enterprise corpora. The expertise of each candidate expert is modeled through the associated experience. We employ a qualification of experience, and validate this qualification as a measure of...
In Chinese word segmentation, disambiguation and unknown word identification are the two key issues. In this paper, a system based on a two-stage strategy is constructed to deal with these problems. First, an n-gram based model is applied to perform the basic segmentation as well as disambiguation to some extent. Then, in the second stage, a language tagging template, named POC-NLW, is adopted to...
It is generally thought that semantic and grammatical information is very significant for better understanding and processing of text. But in simple text categorization tasks, the absence of this information does not always degrade classifier performance. In this paper, we discuss the application of a character-level statistical method to text categorization, which extracts character-level...
To improve the performance of Chinese text categorization, a new method based on angle distribution is presented. The new method describes the text with a more precise model and proposes a new categorization algorithm that employs angle distribution. Simulation results on an open Chinese text collection show that the precision and recall of most classes have been...