This paper presents a detailed study of technologies based on Hadoop and MapReduce available over the cloud for large-scale data mining and predictive analytics. Although some studies may have shown that cloud technologies relying on the MapReduce framework do not perform as well as parallel database management systems, e.g., with ad hoc queries and interactive applications, MapReduce has still been...
Since the inception of the concept of social networking, communication patterns have shifted drastically with the unmitigated trend in socializing over the Internet, especially when people began connecting via mobile devices. Nowadays people tend to use these modern communication systems to share their emotions with each other. Human emotions play a vital role in human relationships and people share...
Webpage text classification is an important problem that has been studied through different approaches and algorithms. It aims to assign a predefined category to a Webpage based on its content and linguistic features. It has many applications, such as word sense disambiguation, document indexing, text filtering, hierarchical categorization of Webpages and document organization. This study is a part of...
The large-scale hierarchical classification problem concerns how to classify web documents into the categories of a class hierarchy. Because the class hierarchy is very large, containing thousands or even tens of thousands of categories, classification performance remains low. While a reduce-and-conquer strategy has been proposed to make the problem tractable, candidate search is a bottleneck...
The classification performance of the previous information gain (IG) algorithm may decline markedly when classes and features are unevenly distributed; to address this, an improved text feature selection method, UDsIG, is proposed. First, we select features by classes to reduce the impact on feature selection when the classes are unevenly distributed. After that, we use feature equilibrium of distribution to decrease the...
The purpose of the present work is to create an intelligent system to retrieve desired documents in the Marathi language. The system also focuses on providing personalized documents in the Marathi language to end users based on their interests, as identified from their browsing history. This paper presents the automatic categorization of Marathi documents and the literature survey of the related work done...
Feature selection plays an important role in text classification and contributes directly to the accuracy of the classification. To correct defects of the mutual information (MI)-based feature selection method, such as its tendency to select rare words and words from small samples as features, and its negative MI values, this paper proposes a new improved feature evaluation function for automatic text...
In this paper, a pattern classification task was regarded as a sample selection problem in which a sparse subset of samples from the labeled training set was chosen. We proposed an adaptive learning algorithm utilizing the least square function to address this problem. Using these selected samples, which we call informative vectors, a classifier capable of recognizing the test samples was established...
Feature selection is one of several factors affecting text classification systems. Feature selection aims to choose a representative subset of all features to reduce the complexity of classification problems. Usually a single method is used for feature selection. For English, several attempts were reported examining the combination of different feature selection methods. To the best of our knowledge...
The k-Nearest Neighbor (KNN) algorithm is an effective text categorization algorithm in terms of recall and accuracy, but its computational overhead is directly proportional to the sample size, so classification is slow on large-scale sample data. To address this problem, the paper presents a density-based method for reducing training data; the method clusters each class of sample data into...
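The scaling issue mentioned in this abstract can be illustrated with a minimal brute-force KNN text classifier. This is only a sketch under stated assumptions, not the paper's method: the bag-of-words representation, the cosine similarity measure, the function names `cosine` and `knn_classify`, and the toy training set are all hypothetical choices made for illustration. Each query is compared against every training document, which is why the cost grows linearly with the number of training samples.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(train, query, k=3):
    """Label `query` by majority vote among its k most similar training texts.

    Every training document is scored, so one query costs O(len(train)).
    """
    q = Counter(query.split())
    scored = sorted(
        ((cosine(q, Counter(text.split())), label) for text, label in train),
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy training set (not taken from the paper).
train = [
    ("team wins match", "sport"),
    ("goal scored in match", "sport"),
    ("stock market falls", "finance"),
    ("bank raises rates", "finance"),
    ("market rally lifts bank stock", "finance"),
]

print(knn_classify(train, "match goal team"))
print(knn_classify(train, "bank stock market"))
```

Shrinking `train` before classification, as the abstract proposes, reduces the number of similarity computations per query; the density-based reduction itself is not shown here.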
In this paper, we investigate the use of Text Classification techniques to extract contextual information from user reviews for Context Aware Recommendation. We conduct several experiments to identify the best Text Representation settings and the best classification algorithm for our dataset. We carry out our experiments on hotel reviews. We focus on extracting the trip type, as contextual information,...
To resolve graduate students' difficulties in understanding the theory and implementation of Chinese text classification in the “The principle and application of pattern recognition” curriculum, this paper introduces a Chinese text classification experiment into teaching practice. According to the characteristics of text classification, we design the experiment scheme about Chinese text...
Many methods, such as mutual information (MI), document frequency (DF), information gain (IG) and the χ² statistic (CHI), have been discussed and applied to the study of meta feature selection. This paper gives a brief review of the recent approaches to this topic. By summarizing and synthesizing these approaches, we propose a framework of the application of meta feature selections, where the...
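Two of the criteria named in this abstract, document frequency and information gain, can be sketched in a few lines. This uses the standard textbook definition IG(t) = H(C) − P(t)·H(C|t) − P(¬t)·H(C|¬t) and is not the framework proposed in the paper; the function names and the toy corpus are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def document_frequency(docs, term):
    """Number of documents that contain the term."""
    return sum(1 for tokens in docs if term in tokens)

def information_gain(docs, labels, term):
    """Reduction in class entropy from knowing whether `term` is present."""
    with_term, without_term = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        (with_term if term in tokens else without_term)[label] += 1
    n = len(docs)
    gain = entropy(list(Counter(labels).values()))
    for part in (with_term, without_term):
        size = sum(part.values())
        if size:
            gain -= (size / n) * entropy(list(part.values()))
    return gain

# Hypothetical toy corpus: each document is a set of tokens.
docs = [
    {"goal", "match"},
    {"match", "team"},
    {"stock", "market"},
    {"market", "bank"},
]
labels = ["sport", "sport", "finance", "finance"]

for term in ("match", "goal", "market"):
    print(term, document_frequency(docs, term), round(information_gain(docs, labels, term), 4))
```

In this toy corpus, "match" separates the two classes perfectly and so receives the maximum gain of 1 bit, while "goal", which occurs in only one document, scores lower.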
In the field of Text Classification/Categorization, the k Nearest Neighbor algorithm (kNN) has been to date one of the oldest and most popular methods. It has been experimented upon, implemented and tested by many researchers all over the world. There have been variations in the implementation of this algorithm, and this paper presents another. As the name suggests the method is dependent on...
Dimension reduction is an important component in automatic text categorization, especially biomedical literature classification. Many studies have shown that statistic-based dimension reduction algorithms, like Information Gain (IG), are very effective in document categorization. However, these algorithms still suffer from major drawbacks. One facet is that they tend to use all the words as features...
Several algorithms are proposed to support the process of automated classification of textual documents. Each of these algorithms has characteristics that influence the classification result. Depending on the amount and nature of the data submitted, the quality of results may vary considerably from one algorithm to another. The generated classes are often noisy. In addition, the number of classes...
Text classification is an important research topic for managing numerous electronic documents. Feature reduction is the key issue for text classification with high dimensional keywords. A document analysis method called discriminant coefficient was proposed to reduce features and achieve high precision text classification. However, the main problem of the discriminant based feature reduction method...
Given the importance of organizing and managing the rapid growth in knowledge of Arabic electronic content, this study introduces the Weirdness Coefficient (W) as a new feature selection method for Arabic special-domain text classification. The proposed method was used to classify a dataset comprising five Islamic topics using Naïve Bayes (NB) and K-nearest neighbor (K-NN) classifiers, and three representation...
In the field of Chinese text classification, the content and size of the feature space have a decisive impact on accuracy and efficiency. These two kinds of feature information from incremental unlabeled training samples are ignored in current incremental learning research. For large-scale, high-dimensional Chinese texts, this paper presents a flexible, effective and universal feature selection strategy. In this...
The text classification problem has received a lot of research based on machine learning, statistical, and information retrieval techniques. In the last decade, associative classification algorithms, which depend on pure data mining techniques, have appeared as an effective method for classification. In this paper, we examine the associative classification approach on the Arabic language to mine knowledge...