With the development of computer and network technology and the explosion of digital Chinese news text, massive amounts of unstructured news data are now available. A better way to extract and store knowledge can, on the one hand, help readers understand the core content of the news and, on the other hand, build up accumulated news knowledge that supports reporting. In recent years, information extraction technology...
Chinese text categorization, a key technology for processing massive Chinese text data, has been applied to information retrieval, document management, text filtering, etc. However, categorization accuracy has been the major difficulty in upgrading these applications. To improve the performance of Chinese text categorization, feature selection, as an important and indispensable...
Text classification is the foundation and core of text mining. Naive Bayes is an effective method for text classification. This paper improves the accuracy of Naive Bayes classification using improved information gain, a feature extraction method, by reducing the impact of low-frequency words. In this paper, we use a widely used corpus from NLTK. According to the test results, the accuracy of the...
Subjective text recognition is a prerequisite for emotion computation. Current methods are dictionary-based or statistics-based; they ignore subjective clues that carry rich emotional information, and their accuracy is not high. To solve this problem, this paper selects the associated word, the emotional word as well as the indicative verb, the interjection, the degree adverb,...
This paper presents our experimental work on machine classification of Nepali texts. We have implemented a Naive Bayes classifier for the task and then augmented it through multinomial lexicon pooling. The lexicon-pooled Naive Bayes classifier obtains better results on the classification task than a plain Naive Bayes implementation. This hybrid approach also helps in dealing with the unavailability...
Text categorization tasks have gained the attention of researchers over the last 10 years with the increase in web-based document content. For finding a particular document on the web or in any large document collection, text or document categorization is a most useful task. Better systems and enhanced machine learning classifiers are needed to accomplish the task of document categorization. We designed...
Identifying the language of a text is a very important preliminary phase in the categorization of multilingual documents and in information retrieval. This phase becomes difficult if we consider only the word as the basic unit of information in texts: that may be possible for some languages such as French or English, but it is very difficult for others such as German, Chinese and Arabic. In...
Identifying the language of a text means assigning the text to the language in which it is written. This identification becomes important because of the increased diversity of textual data in different languages on the web. Real recognition of the text language is not possible if we consider only the word as the basic unit of information. It could be possible in some languages but very difficult...
A major difficulty of text categorization is the extremely high dimensionality of the text feature space. Feature selection techniques are desirable for large-scale text categorization tasks because they improve accuracy and efficiency. The χ2 statistic and the simplified χ2 are two effective feature selection methods in text categorization. Using these two feature selection criteria, for a term, one needs to...
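As an illustration of the χ2 criterion mentioned in this abstract, the sketch below computes the textbook χ2 statistic of a term and a category from a 2x2 document-count table. The abstract is truncated, so this is the standard definition rather than the paper's own procedure. The function name `chi_square` and the count names `a`, `b`, `c`, `d` are assumptions made for this example, and the simplified χ2 variant is not shown.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic of a term and a category (standard 2x2 form).

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    (All four names are assumptions made for this sketch.)
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A marginal count is zero, so the term carries no usable signal here.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


# Hypothetical counts: a term that appears only inside the category.
score = chi_square(10, 0, 0, 10)
```

A higher score means the term and the category are more strongly dependent, so terms are typically ranked by this score and only the top ones are kept as features.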
Every day, the mass of information available to us increases. This information would be irrelevant if our ability to access it efficiently did not increase as well. For maximum benefit, we need tools that allow us to search, sort, index, store, and analyze the available data. We also need tools that help us find the desired information in a reasonable time by performing certain tasks for us. One of...
Error-correcting output code (ECOC) is an effective approach to the multiclass SVM problem. In this paper, a probabilistic approach based on ECOC is proposed. In the training stage, a coding scheme is predefined, and a special model is trained on samples. In the classification stage, besides the usual labels from the SVM, posterior probabilities of the labels are also calculated. They...
In this paper, we first give an overview of short text research and short text classification. Building on several commonly used classic text classification algorithms, and mainly according to the major feature extraction methods, a short text classification method based on statistics and rules is proposed. Experiments show that this algorithm has better performance than other algorithms...
Based on an analysis of the basic concepts and process of text mining, the present study proposes some new methods for text feature extraction, feature set reduction, extraction of learning and knowledge patterns, and model quality evaluation. It also compares two types of text categorization, text classification and text clustering, and it briefly explores...
To address noise samples in the training dataset, this paper proposes a new method for reducing the size of the training dataset, applicable to text classification. This method describes the distribution of the training dataset according to the representativeness score of each sample in the class it belongs to, so as to reveal the representative samples and the noise samples in each class. The new method...
This work proposes a hybrid model for text document classification for information retrieval using Naive Bayes and rough set theory. Rough set theory is used for feature reduction, and the Naive Bayes theorem is used to classify documents into the predefined categories by means of probabilistic values. The deployment of the proposed model is planned through an enhanced method of the utilization...
A new term weighting approach is used to construct the simplest linear weighting classifier (SL). The weighting is computed using the probability standard deviation of terms as a baseline weighting, regulated by term distribution parameters based on subjective logic reasoning. In assessing the term distribution parameters, the model of term reputation in document categories based on Beta...
Although an improvement of hierarchical text classification can be achieved by using hierarchical structure information, existing hierarchical text classification methods suffer from two problems: data skew (especially in large-scale hierarchy) and error propagation. In this paper, we first define the concept of path-based semantic vector for the presentation of categories. Then a set of additional...
Feature selection is an important method for improving the efficiency and accuracy of text categorization algorithms by removing redundant and irrelevant terms from the corpus. Extensive research has been done to improve the performance of individual feature selection methods, but not much on their combinations. In this paper, we propose a method of combining multiple feature selection methods by...
When traditional text classification technologies classify academic dissertations, the dimension of the extracted feature terms is high, and the terms cannot represent the theme of the thesis. As a result, efficiency is very low and accuracy is not high. Topic words are few in number and reflect the theme of the thesis well. Accordingly, the paper proposes to extract the topic words with topic...
To perform text classification and spam filtering, the Naive Bayesian algorithm estimates which class a text belongs to from statistical probability values computed on the features of the training samples, but it is prone to the overflow problem. This article optimizes the algorithm by setting a threshold; the optimization strategy is comparing the times...
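The abstract above is truncated, so the paper's threshold-based strategy is not reproduced here. As a point of comparison only, the sketch below shows a different and commonly used way to keep Naive Bayes scores numerically stable: summing log-probabilities instead of multiplying raw probabilities. The function names (`train`, `log_scores`, `classify`), the toy documents, and the add-one smoothing are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """Collect counts for a multinomial Naive Bayes model from (tokens, label) pairs."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def log_scores(model, tokens):
    """Return log P(class) + sum of log P(word | class) for every class."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for w in tokens:
            if w not in vocab:
                continue  # skip words never seen in training
            # Add-one (Laplace) smoothing keeps every probability above zero.
            p = (word_counts[label][w] + 1) / (total_words + len(vocab))
            score += math.log(p)
        scores[label] = score
    return scores


def classify(model, tokens):
    """Pick the class with the highest log score."""
    scores = log_scores(model, tokens)
    return max(scores, key=scores.get)


# Hypothetical toy data, for illustration only.
docs = [
    (["buy", "cheap", "pills"], "spam"),
    (["cheap", "offer"], "spam"),
    (["meeting", "at", "noon"], "ham"),
    (["lunch", "meeting"], "ham"),
]
model = train(docs)
long_message = ["cheap", "pills"] * 500  # long enough that a raw product would round to zero
label = classify(model, long_message)
```

Because the scores are sums of logarithms, they stay finite even for very long messages, whereas the direct product of a thousand small probabilities would round to zero in floating point.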