In this era of emerging technology, the number of websites is increasing dramatically, and content of every category is flooding the Internet. Finding the right information among almost a billion websites is hard, and finding accurate, high-quality information is even harder. Hence, the demand for website categorization is increasing tremendously. Unfortunately, the...
In the text mining field, KNN (K Nearest Neighbors) is one of the oldest and simplest methods of text classification. However, it is known to be sensitive to the distance (or similarity) function used in classifying a test instance; this disadvantage can cause low classification accuracy and limit the KNN classifier's use in text classification. In this paper, we introduce Mahalanobis...
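The sensitivity of KNN to the distance function, as described in the abstract above, can be illustrated with a minimal sketch. This is not the paper's method: the helper names (`euclidean`, `cosine_distance`, `knn_predict`) and the toy term-count vectors are assumptions made for illustration only.

```python
import math
from collections import Counter

def euclidean(a, b):
    # Straight-line distance between two term-count vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cosine similarity; ignores vector length (document size).
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

def knn_predict(train, query, k, distance):
    # Majority vote among the k training items closest to the query.
    nearest = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical data: a long "sports" document and a short "finance" one.
train = [((10, 0), "sports"), ((1, 1), "finance")]
query = (2, 0)

by_euclidean = knn_predict(train, query, 1, euclidean)
by_cosine = knn_predict(train, query, 1, cosine_distance)
```

With this toy data the two distance functions pick different nearest neighbors for the same query, because Euclidean distance is dominated by document length while cosine distance compares only direction.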
Feature selection has a significant role in the precision of text classification algorithms. In this regard, various approaches exist, such as Information Gain, Chi-Square, Document Frequency, Mutual Information, etc. Combining some input features may considerably improve classification effectiveness. In this paper, a new approach based on Ordered-Weighted Averaging (OWA) is proposed for...
Current search engines are not very effective in filtering out harmful information since the technology they use for filtering is often based on traditional text classification in which texts are often classified according to feature words. To improve the effectiveness of filtering, in this paper, we propose a new filtering scheme in which we combine the neural network and ontology categorization...
Feature Selection (FS) is one of the most important issues in Text Classification (TC). A good feature selection can improve the efficiency and accuracy of a text classifier. Based on the analysis of the feature's distributional information, this paper presents a feature selection method named DIFS. In DIFS a new estimation mechanism is proposed to measure the relevance between feature's distribution...
Feature reduction is one of the core technologies of automatic text categorization. Under the scatter difference criterion, categorization is poor when the between-class distance is small and the class density is high. To solve this problem, the paper presents a weighted method based on the sample distribution, which will make the between-class and within-class scatter matrices...
Since traditional classification algorithms do not work well for short-text classification, we propose a search-based method employing the Naive Bayes classification algorithm. This paper describes the whole process, including the classification algorithms, training, and evaluation. The results indicate that the classifier performs better than other methods.
Text classification is the key technology for topic tracking, and the vector space model (VSM) is one of the simplest and most effective models for topic representation. On the basis of the K-nearest neighbor (KNN) and support vector machine (SVM) algorithms for text classification, we have studied how they affect topic tracking. Then we get the variation law that they affect...
Stemming is a fundamental step in processing textual data, preceding the tasks of text mining, Information Retrieval (IR), and natural language processing (NLP). The common goal of stemming is to standardize words by reducing each word to its base (root or stem), so it can also be considered a feature reduction technique. This paper aims at presenting a new dictionary-free, content-based Arabic stemmer...
Aiming at the ever-present problem of imbalanced data in text classification, the authors study several forms of imbalanced data, such as text number, class size, subclass and class fold. Some useful conclusions are drawn from a series of related experiments: first, when the two classes contain almost the same number of texts, the difference in word count becomes the major factor affecting the accuracy...
For higher text classification precision, a general fusion classification model and algorithm are proposed, based on the model theory of information fusion and adopting multimedia information on the network. The model includes two layers. One is the feature layer, which handles different media information with different classification algorithms and feeds the classification results into the higher...
Text classification is an important field of research, and there are a number of approaches to classifying text documents. However, improving computational efficiency and recall remains an important challenge. In this paper, we propose a novel framework to segment Chinese words, generate word vectors, train the corpus and make predictions. Based on the text classification technology, we successfully...
The traditional TF-IDF algorithm is a common method for measuring feature weight in text categorization. However, the algorithm does not take the inter-class and intra-class distribution of feature terms into consideration. Consequently, it cannot effectively weigh the distribution proportion of feature items. In order to solve this problem, information entropy in inter-class...
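For reference, the traditional TF-IDF weighting mentioned in the abstract above can be sketched as follows. This is a minimal illustration of one common variant (relative term frequency times the natural log of N/df), not the paper's improved algorithm; the function name `tf_idf` and the toy documents are assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Relative term frequency times inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical corpus: two sports-like documents and one finance-like one.
docs = [
    ["goal", "match", "team"],
    ["team", "score", "goal"],
    ["stock", "market", "team"],
]
weights = tf_idf(docs)
```

Note that class labels never enter the computation: a term's weight depends only on how many documents contain it, not on how those documents are spread across categories, which is the limitation the abstract points to.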
This paper analyzes the concentration and dispersion of the integrated feature selection algorithm (TFFS), and finds their shortcomings: it is difficult for concentration to measure the weight of low-frequency terms, and dispersion ignores the impact of terms whose mutual information is negative. A modified feature selection algorithm (TFFSL) is proposed, which makes certain improvements on concentration...
This paper presents a k-nearest neighbor text classification algorithm based on fuzzy integral. It regards the k nearest training samples as k pieces of evidence and fuses them using the fuzzy integral, which avoids the independence requirement of D-S theory and improves text classification performance. Experiments compare the new method with improved kNN algorithms and other text classification algorithms, which result...
This paper compares the performance of linear and nonlinear kernels of Support Vector Machines (SVM) used for text classification. The study is motivated by the previous viewpoint that linear SVM performs better than the nonlinear one, and that, although many investigations have shown that SVM performs well in text classification, there is no serious investigation on the comparison between...
In standard EM-based semi-supervised text classification, classification performance is poor when only a few initial labeled samples are available. How to improve the performance is an important issue. In view of this, a semi-supervised method based on the incremental EM algorithm is proposed. This method makes full use of the useful information of the intermediate classifier. On the one hand, this method...
Text classification is the key technology for topic tracking, and the vector space model (VSM) is one of the simplest and most effective models for topic representation. On the basis of VSM and support vector machines (SVM), we have studied how the feature space dimension in VSM as well as linearly separable and non-separable SVM affect topic tracking. Then we get the variation law that they affect topic tracking,...
The explosive growth of the Internet inevitably leads to the proliferation of harmful information such as pornography, drugs, and violence. Many filtering techniques based on image and text categorization have been proposed in the literature. Among them, text-based filtering plays a leading role because of its good performance. Existing text filtering algorithms can be seen as a classical text categorization...
Feature extraction is an important prerequisite for classifying text effectively and automatically. TF·IDF is widely used to express text feature weights, but it has some problems: it cannot reflect the distribution of terms in the text, and therefore cannot reflect their degree of importance or the difference between categories. This paper proposes a new feature weighting method, TF·IDF·Ci, to which a...