The field of Text Mining has evolved over the past years to analyze textual resources. However, it can be used in several other applications. In this research, we are particularly interested in applying text mining techniques to audio materials after converting them into text in order to detect the speakers' emotions. We describe our overall methodology and present our experimental results. In...
Since the traditional classification algorithm does not work well for short-text classification, we propose a search-based method employing the Naive Bayes classification algorithm. This paper describes the whole process, including the classification algorithms, training, and evaluation. The results indicate that the classifier performs better than other methods.
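As a rough illustration of the Naive Bayes classifier named in the abstract above, here is a minimal multinomial sketch with add-one smoothing. The abstract does not describe the search-based component, so it is not modeled here; the function names `train_nb` and `predict_nb`, the smoothing choice, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """Collect per-class counts from (tokens, label) pairs."""
    class_docs = Counter()              # number of documents per class
    term_counts = defaultdict(Counter)  # term frequencies per class
    vocab = set()
    for tokens, label in samples:
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, term_counts, vocab

def predict_nb(model, tokens):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_docs, term_counts, vocab = model
    n_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_label in class_docs.items():
        total = sum(term_counts[label].values())
        score = math.log(n_label / n_docs)  # log prior
        for term in tokens:
            if term not in vocab:
                continue  # ignore terms never seen in training
            score += math.log((term_counts[label][term] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training set of short texts.
samples = [
    (["cheap", "pills"], "spam"),
    (["cheap", "offer"], "spam"),
    (["meeting", "agenda"], "ham"),
    (["project", "meeting"], "ham"),
]
model = train_nb(samples)
prediction = predict_nb(model, ["cheap", "meeting", "offer"])
```

Short texts carry few tokens, so each product has few factors and a single unseen term can dominate the score; that sparsity is one plausible reason plain classifiers struggle in this setting.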
Stemming is a fundamental step in processing textual data, preceding the tasks of text mining, Information Retrieval (IR), and natural language processing (NLP). The common goal of stemming is to standardize words by reducing each word to its base (root or stem), so it can also be considered a feature reduction technique. This paper aims at presenting a new dictionary-free, content-based Arabic stemmer...
The traditional TF-IDF algorithm is a common method for measuring feature weight in text categorization. However, the algorithm does not take the inter-class and intra-class distribution of feature terms into consideration. Consequently, it cannot effectively weigh the distribution proportion of feature items. In order to solve this problem, information entropy in inter-class...
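For reference, the traditional TF-IDF weighting that the abstract above takes as its baseline can be sketched as follows. This uses one common variant (relative term frequency times the natural log of inverse document frequency); the entropy-based correction is not shown because the abstract is truncated before describing it. The function name `tf_idf` and the toy documents are assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus; note that class labels play no role in the weights.
docs = [
    ["cheap", "pills", "cheap"],
    ["meeting", "agenda", "pills"],
    ["meeting", "notes"],
]
weights = tf_idf(docs)
```

Because the weight depends only on document counts, a term spread evenly across all classes and a term concentrated in one class can receive the same score, which is the limitation the abstract points to.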
The paper deals with the text classification problem where labeled training samples are very limited while unlabeled data are readily available in large quantities. The paper proposes an efficient classification algorithm that incorporates a weighted k-means clustering scheme into an Expectation Maximization (EM) process. It aims to balance predictive values between labeled and unlabeled training...
Feature selection is a key issue in text classification because of the large number of attributes. In this paper, we propose a new algorithm, OR+SVM-RFE, that integrates Odds Ratio (OR) with recursive feature elimination based on SVM (SVM-RFE). Odds Ratio is first used to roughly and rapidly select a feature subset. Then SVM-RFE is used to refine it into a smaller feature subset. Experiment...
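The coarse first stage described above can be illustrated with a small odds-ratio scorer. This is only a sketch of the general odds-ratio criterion, not the paper's exact formulation: the 0.5 smoothing constant, the document-level probabilities, and the names `log_odds_ratio` and `top_terms` are assumptions, and the SVM-RFE stage is omitted because it requires an SVM implementation.

```python
import math

def log_odds_ratio(term, docs, labels, target):
    """Log odds ratio of a term for the target class, with 0.5 smoothing (an assumption)."""
    pos = [set(d) for d, y in zip(docs, labels) if y == target]
    neg = [set(d) for d, y in zip(docs, labels) if y != target]
    # Smoothed probability that a document of each group contains the term.
    p = (sum(term in d for d in pos) + 0.5) / (len(pos) + 1)
    q = (sum(term in d for d in neg) + 0.5) / (len(neg) + 1)
    return math.log(p * (1 - q) / ((1 - p) * q))

def top_terms(docs, labels, target, k):
    """Keep the k terms most indicative of the target class."""
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab, key=lambda t: log_odds_ratio(t, docs, labels, target), reverse=True)
    return ranked[:k]

# Hypothetical toy data.
docs = [["cheap", "pills"], ["cheap", "offer"], ["meeting", "agenda"], ["project", "meeting"]]
labels = ["spam", "spam", "ham", "ham"]
selected = top_terms(docs, labels, "spam", 1)
```

Scoring every term this way takes a single pass over the corpus, which fits the abstract's description of a rough, fast filter ahead of a more expensive elimination step.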
To address noise samples in the training dataset, this paper proposes a new method for reducing the size of the training dataset that is applicable to text classification. The method describes the distribution of the training dataset according to the representativeness score of each sample within the class it belongs to, so as to reveal the representative samples and noise samples in each class. The new method...
The Internet has become a huge resource for sharing and collecting information, including health-related information. Some health-related information is written by patients (lay persons) discussing their experience with health problems and treatments. This paper introduces our initial work on providing physicians with clinically useful patient health writings. More specifically, the paper presented...
The continuing explosive growth of textual content within the World Wide Web has given rise to the need for sophisticated Text Classification (TC) techniques that combine efficiency with high quality of results. E-mail filtering is one application that has the potential to affect every user of the internet. Even though a large body of research has delved into this problem, there is a paucity of survey...
Because automatic word segmentation of Chinese text loses information, a word segmentation method that uses the lexical chunk as the segmentation unit is proposed. The Chinese text is first segmented with a traditional segmentation method; then the mutual information between two lexical entries and the adjacent frequency of two or more lexical entries are calculated, and according to these calculated values the method judges and...
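The mutual information between adjacent lexical entries mentioned above can be illustrated with pointwise mutual information (PMI) over neighboring tokens. The abstract does not give its exact formula or thresholds, so this is a sketch under assumptions: the function name `adjacent_pmi` is hypothetical, and the romanized tokens stand in for the output of a conventional segmenter.

```python
import math
from collections import Counter

def adjacent_pmi(sentences):
    """Return {(x, y): PMI} for every pair of adjacent tokens."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs only
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    return {
        (x, y): math.log((count / n_bi) / ((unigrams[x] / n_uni) * (unigrams[y] / n_uni)))
        for (x, y), count in bigrams.items()
    }

# Hypothetical pre-segmented sentences (romanized for readability).
sentences = [
    ["jiqi", "xuexi", "fangfa"],
    ["jiqi", "xuexi", "moxing"],
    ["xuexi", "fangfa"],
]
pmi = adjacent_pmi(sentences)
```

A pair with high PMI and high adjacent frequency co-occurs more often than chance would predict, which makes it a candidate for merging into a single lexical chunk.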
Question classification plays a crucial role in question answering systems. Recent research on open-domain question classification mostly concentrates on using machine learning methods to solve this special kind of text classification. This paper presents our research on Chinese question classification using machine learning methods and gives our approach based on SVM and semantic...
Feature selection and feature weight calculation are key preprocessing steps in text classification. A new feature selection approach based on average interaction gain (AIG) is presented, along with a new feature weight adjustment technique (WA) that takes inter-class and intra-class distribution into consideration. A new approach combining AIG with WA, called AIG-WA, is then presented. In...
Text classification is one of the core applications in data mining due to the huge amount of uncategorized digital data available. Training a text classifier generates a model that reflects the characteristics of the domain. However, if no training data is available, labeled data from a related but different domain might be exploited to perform cross-domain classification. In our work, we aim to...
Automatic text categorization has been one of the hotspots in the information processing field. Given the important impact of feature weight calculation on text classification accuracy, the relationship between the text representation model and feature weight calculation is first studied, and the existing methods of feature weight calculation are analyzed; then the common idea of feature weighting...
In this paper, we introduce a method for categorizing digital items according to their topic, only relying on the document's metadata, such as author name and title information. The proposed approach is based on a set of lexical resources constructed for our purposes (e.g., journal titles, conference names) and on a traditional machine-learning classifier that assigns one category to each document...
Text classification categorizes Web documents in large collections into predefined classes based on their contents. Unfortunately, the classification process can be time-consuming, and users are still required to spend a considerable amount of time scanning through the classified Web documents to identify the ones that satisfy their information needs. To solve this problem, we first introduce CorSum,...
Text classification continues to be one of the most researched problems due to the continuously increasing amount of electronic documents and digital data. Classifying documents into closely related categories is the most complex task in text categorization. Feature selection is an essential preprocessing step for improving the efficiency and accuracy of text classifiers by removing redundant and...
Most Chinese text classification methods are based on Chinese word segmentation and bag of words (BOW). The classification performance largely relies on the accuracy of segmentation. Unfortunately, perfect precision and disambiguation of segmentation cannot be reached. In order to solve this problem, a novel Chinese text classification method using a string kernel is presented. The string kernel computes...
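To show how a string kernel can compare texts without word segmentation, here is a sketch of the p-spectrum kernel, one of the simplest string kernels. The abstract is truncated before specifying which kernel the paper uses, so this is a stand-in chosen for illustration; the function name `spectrum_kernel` and the default `p=2` are assumptions.

```python
from collections import Counter

def spectrum_kernel(s, t, p=2):
    """Inner product of the length-p substring count vectors of s and t."""
    grams_s = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    grams_t = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    # Only substrings present in both strings contribute to the sum.
    return sum(count * grams_t[gram] for gram, count in grams_s.items())

similarity = spectrum_kernel("abab", "bab")
```

The kernel operates directly on characters, so it applies to raw Chinese strings as they are, and the resulting similarity values can be supplied to any kernel-based classifier such as an SVM.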
We introduce a new method for dimensionality reduction by attribute extraction and evaluate its impact on text classification. The textual contents of the body sections of the news items in Reuters-21578 are the selected attributes for classification. Using the proposed method, the high-dimensional attributes (words extracted from the news bodies) are projected onto a new hyperplane having dimensions equal to...
Automatically classifying text documents is an important field in machine learning. Unsupervised text classification does not need training data but is often criticized for clustering blindly. Supervised text classification needs large quantities of labeled training data to achieve high accuracy. However, in practice, labeled samples are often difficult, expensive, or time-consuming to obtain. Meanwhile,...