This paper presents a pilot study on an intelligent tutoring system for domain-independent argument making. Students' responses to an open-ended question were collected as instances for supervised text classification, labeled with the grade the instructor assigned using the Structure of the Observed Learning Outcome (SOLO) taxonomy. The responses were processed using Coh-Metrix as well as n-gram models to...
Text classification is one of the most significant topics in natural language processing research. In most real cases, text classification is a multi-label learning task. Currently, three mainstream attribute measures (information gain, document frequency and chi-square test values) are often used to describe documents. The three attribute measures have been applied...
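As a rough illustration of the three attribute measures named in this abstract, the following minimal sketch computes document frequency, the chi-square statistic and information gain for a single term, treating the term as a binary present/absent feature. The function name `term_scores` and the toy corpus are assumptions made for illustration; they are not taken from the paper, and the sketch handles single-label documents only.

```python
import math
from collections import Counter


def term_scores(docs, labels, term):
    """Return (document frequency, chi-square, information gain) for one term.

    Hypothetical helper: the term is a binary feature (present or absent
    in each whitespace-tokenized document).
    """
    n = len(docs)
    present = [term in d.split() for d in docs]
    df = sum(present)  # number of documents containing the term

    # Chi-square over the (term present/absent) x (class) contingency table.
    chi2 = 0.0
    for c in set(labels):
        class_total = labels.count(c)
        for flag in (True, False):
            observed = sum(1 for p, y in zip(present, labels) if p == flag and y == c)
            row_total = df if flag else n - df
            expected = row_total * class_total / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected

    def entropy(ys):
        total = len(ys)
        if total == 0:
            return 0.0
        return -sum((k / total) * math.log2(k / total) for k in Counter(ys).values())

    # Information gain: class entropy minus entropy conditioned on term presence.
    with_term = [y for p, y in zip(present, labels) if p]
    without_term = [y for p, y in zip(present, labels) if not p]
    ig = (entropy(labels)
          - (len(with_term) / n) * entropy(with_term)
          - (len(without_term) / n) * entropy(without_term))
    return df, chi2, ig


# Toy corpus (invented for illustration only).
docs = ["good movie", "good plot", "bad movie", "bad plot"]
labels = ["pos", "pos", "neg", "neg"]
print(term_scores(docs, labels, "good"))
print(term_scores(docs, labels, "movie"))
```

In this toy corpus both terms have the same document frequency, but only "good" separates the classes, so its chi-square and information gain scores are higher than those of "movie".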
The paper proposes a solution for document-level and aspect-level sentiment analysis of unstructured documents written in the Romanian language. The opinion extraction relies on two different approaches for polarity identification. At the aspect level we propose a rule-based approach. For the document level we consider supervised learning techniques, based on features extracted and filtered in different...
Fishery information processing can help fishery researchers obtain the information they need easily and quickly. Current information processing techniques have not solved the problem of high-dimensional features in fishery information processing. This paper proposes an SVM-RFE-based feature selection method tailored to the characteristics of fishery texts. It removed...
Domain word clustering has important theoretical and practical significance in text categorization, ontology research, machine learning and many other research areas. The domain word clustering method in this article is based on word2vec and semantic similarity computation. First, we obtain the candidate word set by using word2vec tools for a preliminary clustering of words. Then we construct...
In this paper, a public opinion analysis system is built. It consists of a crawler that retrieves online microblog content and a text classifier that identifies content expressing sentiment. The system is used to identify public opinion on certain topics. Microblogs are divided into three categories based on their emotional tendency, namely "positive", "negative" and "objective",...
This paper presents the categorization of Croatian texts using Non-Standard Words (NSWs) as features. Non-standard words include numbers, dates, acronyms, abbreviations, currency, etc. NSWs in the Croatian language are determined according to the Croatian NSW taxonomy. For the purpose of this research, 390 text documents were collected to form the SKIPEZ collection with 6 classes: official, literary, informative,...
Classification based on topic (content) rather than genre (form) prevails in the text data mining and search engine communities. To simplify this work, a BOW (Bag of Words) strategy, which counts topic-related words as features, is widely used to make the final decision. However, texts can also be categorized by expression style rather than theme. The brief biography is a typical text class which...
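The BOW strategy mentioned in this abstract can be sketched in a few lines: a document is reduced to a vector of counts over a fixed list of topic-related words. The function name `bag_of_words`, the tokenization rule and the example vocabulary are assumptions for illustration, not details from the paper.

```python
import re
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text.

    Hypothetical helper: tokens are lowercase runs of ASCII letters,
    which is a simplification suited only to plain English text.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Example vocabulary of topic-related words (invented for illustration).
vocabulary = ["match", "team", "goal"]
features = bag_of_words("The match ended; the team won the match.", vocabulary)
print(features)  # [2, 1, 0]
```

Because the vector records only word counts, two texts with the same vocabulary but different expression styles receive the same features, which is the limitation the abstract points to.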
Data mining extracts novel and useful knowledge from large repositories of data and has become an effective means of analysis and decision-making in corporations. In many information processing tasks, labels are usually expensive while unlabeled data points are abundant. To reduce the cost of collecting labels, it is crucial to predict which unlabeled examples are the most informative, i.e., improve the classifier...
Identifying the language of a text is a very important preliminary phase in the categorization of multilingual documents and in information retrieval. This phase becomes difficult if we consider only the word as the basic unit of information in texts. Doing so may be feasible for some languages, such as French or English, but it is very difficult for others, such as German, Chinese and Arabic. In...
In this paper, we present our solution and experimental results for applying semi-supervised machine learning techniques and an improved SVM algorithm to build text classification applications. First, we create a feature model based on labeled data, and then we improve it using the unlabeled data. The technique for assigning a label to new data is based on...
The classification performance of the conventional IG algorithm may decline noticeably when classes and features are unevenly distributed. To address this, an improved text feature selection method, UDsIG, is proposed. First, we select features by class to reduce the impact on feature selection when the classes are unevenly distributed. After that, we use feature equilibrium of distribution to decrease the...
Identifying the language of a text means assigning the text to the language in which it is written. This identification has become important because of the increased diversity of textual data in different languages on the web. Reliable recognition of the text language is not possible if we consider only the word as the basic unit of information. It could be possible in some languages but very difficult...
Page segmentation into text and non-text elements is an essential preprocessing step before optical character recognition (OCR). When segmentation is poor, an OCR classification engine produces garbage characters due to the presence of non-text elements. This paper presents a method to separate the textual and non-textual components in document images using a graph-based modeling and...
Feature selection is one of several factors affecting text classification systems. Feature selection aims to choose a representative subset of all features to reduce the complexity of classification problems. Usually a single method is used for feature selection. For English, several attempts were reported examining the combination of different feature selection methods. To the best of our knowledge...
Dimension reduction is an important component in automatic text categorization, especially biomedical literature classification. Many studies have shown that statistics-based dimension reduction algorithms, like Information Gain (IG), are very effective in document categorization. However, these algorithms still suffer from major drawbacks. One is that they tend to use all the words as features...
The naïve Bayes classifier is widely used in machine learning for its simplicity and efficiency. However, most of the existing work on naïve Bayes has focused on improving the Bayes model itself or on whether the “naïve assumption” is satisfied. In this paper, the performance of naïve Bayes in text classification is analyzed and the corresponding results from different points of view are presented, then an improving...
The RLS-MARS (Regularized Least Squares-Multi Angle Regression and Shrinkage) feature selection model is used to select the relevant information, and both keeping and leaving out the regularizer are considered. The RLS-MARS model aims to find a series of directions in multidimensional space, leading the gradient vectors to change along those directions which would make the gradient matrix's...
Given the importance of organizing and managing the rapidly growing body of Arabic electronic content, this study introduces the Weirdness Coefficient (W) as a new feature selection method for Arabic special domain text classification. The proposed method was used to classify a dataset comprising five Islamic topics using Naïve Bayes (NB) and K-nearest neighbor (K-NN) classifiers, and three representation...