The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
A fuzzy confusion matrix based cursive handwritten text categorization has been implemented. Printed text is obtained from handwritten text through Modified Optimal Clustering Algorithm (MOCA). Optimal Clustering Algorithm (OCA) groups texts into different subject categories. Learning is conducted to extract the attributes along with corresponding weights for each subjects. Fuzzy confusion matrix...
Feature selection algorithm has a great influence on the accuracy of text categorization. The traditional information gain (IG) feature selection algorithm usually selects the features that rarely appear in the specified categories, but frequently appear in other categories. To overcome this drawback, on the basis of in-depth analysis of the related algorithms, an improved IG feature selection method...
Principal component analysis (PCA) is a commonly used method for feature extraction and dimensionality reduction. This paper proposes PCA based on similarity/correlation criteria instead of covariance to gain low-dimensional features with high performance in text classification. Experimental results have demonstrated the advantages and usefulness of the proposed method in text classification in high-dimensional...
Text classification is one of the key methods used in text mining. Generally, traditional classification algorithms from machine learning field are used in text classification. These algorithms are primarily designed for structured data. In this paper, we propose a new classifier for textual data, called Supervised Meaning Classifier (SMC). The new SMC classifier uses meaning measure, which is based...
Naïve Bayes is a commonly used algorithm in text categorization because of its easy implementation and low complexity. Naïve Bayes has mainly two event models used for text categorization which are multivariate Bernoulli and multinomial models. A very large number of studies choose multinomial model and Laplace smoothing just based on the assumption that it performs better than multivariate model...
Feature selection is a strategy that aims at making text classifiers more efficient and accurate. In this paper, we proposed a novel feature selection method based on Tibetan grammar for Tibetan classification. Tibetan language express grammatical meaning through the function words and word order, and the function word has large proportions. By analyzing the Tibetan grammar and distribution of part...
We offer an automated way of estimating the author of a song using only its lyrics content. To this end, we introduce a complete text classification framework which takes raw lyrics data as input and report estimated songwriter. The performance of the system is evaluated based on its classification and retrieval ability on a large dataset of Turkish songs, which was collected in this study. The results...
Eliminating all stop words from the feature space is a standard practice of preprocessing in text mining, regardless of the domain which it is applied to. However, this may result in loss of important information, which adversely affects the accuracy of the text mining algorithm. Therefore, this paper proposes a novel methodology for selecting the optimal set of domain specific stop words for improved...
The main objective of this work is to classify Hindi and Telugu stories based on their structure into three genres: Fable, Folk-tale and Legend. In this work, each story is divided into three parts: (i) introduction, (ii) main and (iii) climax. The objective of this work is to explore how story genre information is embedded in different parts of the story. We are proposing a framework for story classification...
NeuroinformaticsNatural Language Processing (NeuroNLP) relies on clustering and classification for information categorization of biologically relevant extraction targets and for interconnections to knowledge-related patterns in event and text mined datasets. The accuracy of machine learning algorithms depended on quality of text-mined data while efficacy relied on the context of the choice of techniques...
Machine learning classifiers are widely used for text categorization however a classifier misclassifies some of the instances into a category that is relevant to their actual category. The categorization ability of a classifier can be improved by filtering dataset with better classifier and removing such category for misclassified instances. In this paper we proposed a two level approach where level-1...
Text Categorization plays an important role in the fields of information retrieval, machine learning, natural language processing, data mining and others. With the development of computer and information technology, there have been many classification algorithms. Each text classification algorithms will get result at differing speeds and efficiency due to the various feature of test text. It has been...
Text classification is the foundation and core of text mining. Naive Bayes is an effective method for text classification. This paper improves the accuracy of Naive Bayes classification using improved information gain, one of methods of feature extraction, by reducing the impact of low-frequency words. In this paper, we use a widely corpus of NLTK. According to the test results, The accuracy of the...
Social media is widely used as a channel of communication in general purposes, including the comment that are related to retail business. It is a highly effective communication tool for direct interacting with their customers. Growth rate of the users is rapidly increasing, because they use this channel to receive information and share something interesting. In this paper, we present a comparison...
This paper deals with the problem of topic identification of Arabic noisy texts, which is an important research field, regarding the growing amount of shared textual information in the world. The dataset used in this survey is constructed by collecting several corrupted Arabic texts from different discussion forums related to six different topics. The proposed algorithms use the k-nearest neighbor...
Chinese text classification is always challenging, especially when data are high dimensional and sparse. In this paper, we are interested in the way of text representation and dimension reduction in Chinese text classification. First, we introduces a topic model — Latent Dirichlet Allocation(LDA), which is uses LDA model as a dimension reduction method. Second, we choose Support Vector Machine(SVM)...
Current technology facilitates and increases connections through social media, allowing individuals everywhere to spread their ideas to the world. One social media platform is Twitter. One characteristic of a tweet is the requirement of conveying a message in a limited number of words. Proverbs are a feature of language that convey messages effectively in the least number of words. Therefore, we selected...
In the languages, the occur of words are indicated about meaning of contents in text. Generative models for text, such as the topic model, have the potential to make important contributions to the statistical analysis of large document collections, and the development of a deeper understanding of human language learning and processing. In this paper, we proposed a novel method for building Vietnamese...
The objective of the present work is to design a HADOOP based parallel Marathi content retrieval system using clustering technique to get the efficient and optimized result than existing systems. The system also focuses on providing the personalized documents in Marathi language to the end user based on their interests identified from the browsing history and using time session mechanism for re ranking...
Software companies spend over 45 percent of cost in dealing with software bugs. An inevitable step of fixing bugs is bug triage, which aims to correctly assign a developer to a new bug. To decrease the time cost in manual work, text classification techniques are applied to conduct automatic bug triage. In this paper, we address the problem of data reduction for bug triage, i.e., how to reduce the...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.