A huge amount of data in today's world is stored in the form of electronic documents. Text mining is the process of extracting information from those textual documents. Text classification is the process of classifying text documents into a fixed number of predefined classes. Applications of text classification include spam filtering, email routing, sentiment analysis, language identification...
Document indexing is the main step in a conventional document classification or information retrieval framework. This study aims to highlight the influence of feature type on the efficiency of a classification system. Empirical results on an Arabic dataset reveal that the choice of extracted feature type has a significant impact on conserving semantic information and improving classification accuracy,...
In this paper, we implement a systematic approach to text categorization using latent semantic indexing (LSI). A novel feature of our approach is that we iteratively refine the LSI space used for categorization. Using a verification set, we also employ LSI to determine the values of all parameters controlling the steps of the categorization process. Our approach is designed to scale to enterprise-level...
Day by day the number of text documents in digital form is increasing. Text classification is used to organize these text documents. However, text classification suffers from the high dimensionality of the feature space. Feature selection and feature extraction methods address this high dimensionality and improve the performance of text categorization. The feature selection and...
In this research, we propose table-based KNN as an approach to text categorization. In previous works, we discovered that encoding texts into tables improved performance in text categorization, so in this research we consider the possibility of encoding words into tables as well as texts. In this research, we encode words into tables where entries are texts and their weights,...
This research proposes the table-based AHC algorithm as an approach to the word clustering task. Encoding texts into tables was successful in previous works on text categorization and text clustering; if, conversely to the text-encoding case, texts are assumed to be elements of each word, it becomes possible to encode words into tables. In this research,...
A major challenge in topic classification (TC) is the high dimensionality of the feature space. Therefore, feature extraction (FE) plays a vital role in topic classification in particular and text mining in general. FE based on cosine similarity score is commonly used to reduce the dimensionality of datasets with tens or hundreds of thousands of features, which can be impossible to process further...
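As a point of reference for the cosine similarity score mentioned in the abstract above, the following is a minimal sketch of the standard cosine similarity between two sparse term-count vectors. It is not the feature extraction method of the cited work, whose details are truncated here; the function name `cosine_similarity` and the dictionary-based vector representation are assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-count vectors.

    u and v map terms to counts (hypothetical representation).
    Returns 0.0 when either vector is empty.
    """
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


doc_a = Counter("text mining extracts information from text".split())
doc_b = Counter("text classification assigns labels to text".split())
score = cosine_similarity(doc_a, doc_b)
```

Scores close to 1 indicate vectors pointing in nearly the same direction, and 0 indicates no shared terms.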
Text mining has been gaining significant importance over the last few years, since knowledge is now available to users through a variety of sources, e.g. electronic media, digital media, print media, and many more. Due to the huge availability of text in numerous forms, a lot of unstructured data has been recorded, and research experts have found numerous ways in the literature to convert this scattered text...
Today, as more and more businesses and individuals engage in the study of cloud computing, data storage on cloud platforms is also growing. How a cloud environment can quickly and effectively store, manage and use these data has therefore become a very important and challenging issue. This paper mainly discusses the storage model based on Map/Reduce text categorization, at the same time combining forecasting data...
Data mining extracts novel and useful knowledge from large repositories of data and has become an effective means of analysis and decision-making in corporations. In many information processing tasks, labels are usually expensive and unlabeled data points are abundant. To reduce the cost of collecting labels, it is crucial to predict which unlabeled examples are the most informative, i.e., improve the classifier...
A good text classifier is a classifier that efficiently categorizes large sets of text documents in a reasonable time frame and with an acceptable accuracy. Most of the text classification approaches are based on the statistical analysis of a term, either a word or a phrase. Though statistical term analysis shows the importance of the term, it is tedious to analyze when more than one term has the...
Text mining is a new field that attempts to bring together meaningful information from natural language text. Automatic Text categorization and summarization is the process of assigning pre-defined class labels to incoming, unclassified documents. The class labels are defined based on a set of examples of pre-classified documents used as a training corpus. This research work comprises an automatic...
This paper proposes a model of text categorization named Alida, which combines a model of categorization inspired by the classical cognitive models of categorization of Nosofsky with a semantic space model as the system of semantic knowledge representation. The model addresses large-scale text categorization applications in opinion mining in different domains and different languages. The performance...
Web page classification plays an essential role in facilitating more efficient information retrieval and information processing. Conventionally, web text documents are represented by a term frequency matrix for classification purposes. However, considering the limitations of representing documents using terms or keywords, we propose to represent web pages using information extraction patterns that are...
Stemming is a fundamental step in processing textual data preceding the tasks of text mining, Information Retrieval (IR), and natural language processing (NLP). The common goal of stemming is to standardize words by reducing a word to its base (root or stem), so it can also be considered a feature reduction technique. This paper aims at presenting a new dictionary-free, content-based Arabic stemmer...
In the context of text clustering, global feature selection tries to identify a single subset of features which are relevant to all clusters. However, the clustering process might be improved by considering different subsets of features for locally describing each cluster. In experiments with local feature selection, it was observed that the resulting partitions were unstable but there were cohesive...
The rapid growth of biomedical literature is evident in the increasing size of the MEDLINE research database. Medical Subject Headings (MeSH), a controlled set of keywords, are used to index all the citations contained in the database to facilitate search and retrieval. This volume of citations calls for efficient tools to assist indexers at the US National Library of Medicine (NLM). Currently, the...
Word Sense Disambiguation (WSD) is a main task in the area of natural language processing (NLP). Supervised WSD methods have been shown to be more effective than other WSD methods, but they are limited by the size of the manually annotated learning set. On the other hand, a concept graph is a weighted graph with each of its edges representing the relationships between concepts (relevancy of each pair of concepts)....
Feature selection for text classification is a well-studied problem whose goals are improving classification effectiveness, computational efficiency, or both. In this paper, we propose a two-stage feature selection algorithm based on a kind of feature selection method and latent semantic indexing. Traditional word-matching-based text categorization systems use the vector space model to represent the...
In text categorization, one well-known document representation is bag-of-words. Although it is simple and popular, it ignores semantics, underlying linguistic information, and word correlations. In this paper, a new representation for text data is proposed which is called Bag-Of-Queries (BOQ). First, a taxonomy of the terms in the local vocabulary is extracted. Extracting a taxonomy is performed by...
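To illustrate the limitation of bag-of-words noted in the abstract above, here is a minimal sketch of the plain bag-of-words representation, not of the proposed Bag-Of-Queries method. The function name `bag_of_words` and the whitespace tokenization are assumptions chosen for brevity.

```python
from collections import Counter


def bag_of_words(text):
    """Map a text to its term counts, discarding word order.

    Tokenization is a simple lowercase whitespace split (an assumption
    for this sketch; real systems usually tokenize more carefully).
    """
    return Counter(text.lower().split())


# Two sentences with opposite meanings but the same words.
first = bag_of_words("dog bites man")
second = bag_of_words("man bites dog")
same_representation = first == second
```

Because only counts are kept, the two sentences receive identical representations, which is the kind of lost semantic and word-order information the abstract refers to.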