The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
Stemming is a technique used to reduce inflected and derived words to their basic forms (stem or root). It is a very important step of pre-processing in text mining, and generally used in many areas of research such as: Natural language Processing NLP, Text Categorization TC, Text Summarizing TS, Information Retrieval IR, and other tasks in text mining. Stemming is frequently useful in text categorization...
A major difficulty of text categorization is the high dimensionality of the feature space. Feature selection is an important step in text categorization to reduce the feature space. Automatic feature selection methods such as document frequency thresholding (DF), information gain (IG), mutual information (MI), and so on are commonly applied in text categorization, but they do not use term frequency...
Naïve Bayes classifier is proved to be one of the most effective classifier an be used widely. It applies statistical theory to text classification. This paper researched and implemented a Chinese text classifier using JAVA base on Naïve Bayes Method. First of all, this paper described test classification system, the content includes text information expressing, extracting and the method of Chinese...
Feature selection method is the critical technique of the automatic text categorization. A new method of the text feature selection based on the quantum genetic algorithm is proposed in this paper. First of all, using the ECE statistical method to remove redundant features and noise features for the original feature set, Genetic algorithms are used to optimal feature subset; finally the best feature...
Text Categorization (TC) is an important component in many information organization and information management tasks. In many TC applications, the case-base grows at a fast rate and this causes inefficiency in the case retrieval process. Using Case-Base Maintenance learning via the GC (Generalization Capability) algorithm, which can reduce the case number into KNN algorithm, can improve efficiency...
Classification and clustering are frequently-used methods in data excavation technology. This paper introduces the idea of text clustering into the categorization algorithm study. The authors also attempt to use the text categorization pattern of self'-initiated learning to design a clustering-based text categorization algorithm, in the purpose of reducing the dimension of training set and raising...
The determination of a Bayesian network structure, especially in the case of wide domains, can be often complex, time consuming and imprecise. Therefore the interest of scientific community in learning Bayesian network structure from data is increasing: many techniques or disciplines, as data mining, text categorization, ontology building, can take advantage from structural learning. In literature...
Traditional statistics based text classification methods almost construct their characteristic vectors with some key terms, and they consider terms are independent of each other and there are no semantic relations among them. However, in the real world, words used to have semantic relationships, such as synonym, hyponymy and so on. Therefore, classification methods based on statistics do not conform...
Multi-instance learning (MIL) has many applications, including image and text categorization. One of the most effective approaches to MIL is by using support vector machines with multi-instance kernels. In this paper we propose a multi-instance kernel, called MIR-kernel, that takes into account the relational information of instances when computing similarities between bags. The relational information...
Text classification is enduring to be one of the most researched problems due to continuously-increasing amount of electronic documents and digital data. Naive Bayes is an effective and a simple classifier for data mining tasks, but does not show much satisfactory results in automatic text classification problems. In this paper, the performance of naive Bayes classifier is analyzed by training the...
Feature selection is a hot topic in current search field, especially in the field of text categorization. To overcome the shortcomings of traditional χ2 (CHI) approach, an improved χ2 (CHI) statistics method is proposed in this paper. It comprehensively takes criterions such as Document Frequency and Class Accuracy of the traditional statistical methods to improve χ2 (CHI) statistical method. The...
Text classification plays an important role in information extraction and summarization, text retrieval, and question-answering. The discriminative multinomial naive Bayes classifier has been a focus of research in the field of text classification. This paper increases the accuracy of discriminative multinomial Bayesian classifier with the usage of the feature selection technique that evaluates the...
Text classification is continuing to be one of the most researched problems due to continuously-increasing amount of electronic documents and digital data. Classifying documents to closely related categories is the most complex task in text categorization. Feature selection is an essential preprocessing step for improving the efficiency and accuracy of the text classifiers by removing redundant and...
Aspect identification in entity reviews involving multiple aspects is a top priority for aspect-based opinion mining. Most of previous studies adopted machine learning techniques taking it as a multi-class text classification task. However, since building labeled training data is often expensive, some researchers put more interest in unsupervised techniques. With the subject of online restaurant reviews,...
We consider the problem of both supervised and unsupervised classification for multidimensional data that are non-Gaussian and of mixed types (continuous and/or discrete). An important subclass of graphical model techniques called generalized linear statistics (GLS) is used to capture the underlying statistical structure of these complex data. GLS exploits the properties of exponential family distributions,...
Text classification refers to determine the class of an unknown text according to its content in the given classification system. In order to extract fewer features to express the information in the text as much as possible, the paper analysis the various features' statistical properties and to extract the global features according to Zipf's law; and then, based on the statistical analysis of the...
Researchers have concentrated on topic-based text classification while the genre of a document is rarely considered. In this article, we discuss the automatic genre classification and its application. We argue that word level features and sentence level features are two important measures which vary in number among different genres. Word level features include word frequency and POS (part of speech)...
Key words are expressions that indicate and express the subject concept of a text, the major property of key words is to denote subject. Based on the domainal inhomogeneity and critical region of key words, subject degree is brought up and calculated by statistical model to cue textpsila subject concept. Based on key words and itspsila subject degree, constructed a comprehensive auto-indexing system,...
Due to the exponential growth of available text documents in digital form, it is of great importance to develop techniques for automatic document classification based on the textual contents. Earlier document classification techniques have used keyword-based features and related statistics to achieve good results when applied to certain datasets. More recently, some of these techniques have been extended...
Text categorization is a key issue of text mining. Although there are many studies on this problem, the majority of them are focused on classification of rough categories. In this kind of problem, there are obviously different features that can differentiate one category from others. Only very few researches concerned fine text categorization (FTC) problem which is characterized by many duplicated...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.