Clustering is one of the prime topics in data mining. Clustering partitions the data and classifies it into meaningful subgroups. Document clustering divides a set of documents into groups such that different groups show different characteristics with respect to likeness. In this paper, an experimental exploration of a similarity-based method, HSC, for measuring the similarity between data objects particularly...
Aiming at the multiclass text categorization problem, a classification algorithm based on multiconlitron and the 1-a-r method is presented. The 1-a-r method is used to convert a multiclass categorization problem into several binary problems. A multiconlitron is constructed for each binary problem in the input space. For the text to be classified, its class is decided by the multiconlitrons. The classification experiments are...
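The conversion from one multiclass problem to several binary problems can be illustrated with a minimal sketch. It assumes that "1-a-r" stands for one-against-rest; the multiconlitron itself is not implemented here, and the function names and the highest-score decision rule are hypothetical choices for illustration, not the paper's method.

```python
from typing import Dict, Hashable, List, Sequence


def one_vs_rest_labels(labels: Sequence[Hashable]) -> Dict[Hashable, List[int]]:
    """Build one binary label vector per class: +1 for that class, -1 for the rest."""
    classes = sorted(set(labels), key=str)
    return {c: [1 if y == c else -1 for y in labels] for c in classes}


def predict_one_vs_rest(scores: Dict[Hashable, float]) -> Hashable:
    """Assumed decision rule: pick the class whose binary classifier scores highest."""
    return max(scores, key=scores.get)


# Toy labels (hypothetical data, not from the paper).
labels = ["sports", "politics", "sports", "tech"]
binary_problems = one_vs_rest_labels(labels)
for cls, targets in binary_problems.items():
    print(cls, targets)
```

Each entry of `binary_problems` is the target vector for one binary classifier; any binary learner could be trained on it in place of a multiconlitron.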
Nowadays, the exponential growth in the generation of textual documents and the emerging need to structure them are drawing increasing attention to the automated classification of documents into predefined categories. There is a wide range of supervised learning algorithms that deal with text classification. This paper deals with an approach for building a machine learning system in R that uses K-Nearest Neighbors...
There is a constantly growing interest in evaluating music information retrieval (MIR) systems that can provide effective management of music resources. A crucial characteristic of music is its emotion, which reflects human perception. To make the automatic classification of Chinese music emotions more effective, we use the lyrics of music to analyze and classify music based on emotion....
Automatic text classification is a key technology for processing and organizing large-scale text data. It is well known that the high dimensionality of the feature space is a major challenge for text classification. To mitigate this problem, and inspired by existing work, we propose an effective text feature selection algorithm that fuses in a novel way the classical methodologies of Gini index and...
Millions of file uploads and downloads happen every minute, resulting in big data creation, and manual text categorization is not possible. Hence, there is a need for automatic categorization of documents that makes storage and retrieval more efficient. This research paper proposes a hybrid text categorization model that combines the Rocchio algorithm and the Random Forest algorithm to perform Multi-label...
Social media generates a large volume of data through tweets and text messages during and after any disaster. The analysis and classification of the data obtained at the time of a disaster is essential for conveying the information to the appropriate rescue personnel. In this paper, an automated text classification system is proposed in order to classify the data effectively. The classification of...
In text classification, feature selection is essential to improve classification effectiveness. This paper provides an empirical study of a feature selection method based on genetic algorithms for different text representation methods. This feature selection algorithm can accomplish two goals: on the one hand, the search for a feature subset such that the performance of the classifier is best; on the other...
Document classification has attracted considerable attention from researchers due to the increase in digital documents and the need to organize them. One of the most popular approaches to this problem is based on machine learning techniques [1]. However, the result of classification depends heavily on the linguistic preprocessing and the document representation. The dependence...
The purpose of this study is to show how n-grams are used for author recognition in the Azerbaijani language. Monograms and digrams are taken as attribute vectors for authorship analysis. We have developed a new approach to determining the attribute vectors for recognizing the author of an unknown text.
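As a rough illustration of monogram and digram attribute vectors, the sketch below computes relative frequencies of character n-grams. It assumes that the n-grams are character-level and that non-letters are simply dropped; the abstract does not say either, and `ngram_frequencies` is a hypothetical helper, not the authors' approach.

```python
from collections import Counter
from typing import Dict


def ngram_frequencies(text: str, n: int) -> Dict[str, float]:
    """Relative frequencies of character n-grams (n=1: monograms, n=2: digrams).

    Simplification (assumption): non-letter characters are removed, so
    n-grams may span word boundaries.
    """
    letters = [ch for ch in text.lower() if ch.isalpha()]
    grams = ["".join(letters[i:i + n]) for i in range(len(letters) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


# Toy input (hypothetical, not from the study).
monograms = ngram_frequencies("abab", 1)
digrams = ngram_frequencies("abab", 2)
print(monograms)
print(digrams)
```

Vectors built this way for known authors could then be compared with the vector of an unknown text using any distance measure; the choice of measure is left open here.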
With the rapid spread of smartphones, binary short text classification (STC) is becoming a basic and challenging issue, and relevant STC algorithms can be successfully used in spam filtering for short message service (SMS), WeChat, microblogging, and so on. In this manuscript, we address the structural feature of SMS documents and propose a structural learning framework, which decomposes...
The advent of social networking and open health web forums such as PatientsLikeMe, WebMD, ehealth forum, etc. has provided avenues for social user data that can prove instrumental in suggesting futuristic trends in healthcare. Homophily in social networks is a vital contributor to analyzing patterns for medical conditions, diagnosis and treatment options. Since members with similar medical issues...
With the rapid development of the Web and the rapid expansion of text information, how to effectively organize and manage this information is a great challenge for current information science. Automatic text classification technology can effectively organize a large number of texts and help people improve the efficiency of information retrieval. It has become one of the most important research...
This paper proposes an approach using a MapReduce-based Rocchio relevance feedback algorithm, which improves the traditional Rocchio algorithm in the MapReduce paradigm, to resolve the problem of massive information filtering. Traditional text classification algorithms have a vital impact on information filtering.
In this era of emerging technology, the number of websites is increasing dramatically. The content and categories of information are flooding the Internet. Finding the right information among almost a billion websites is considerably hard, but finding accurate and high-quality information is even harder. Hence, the demand for website categorization is increasing tremendously. Unfortunately, the...
With the development of weblogs and social networks, many news providers share their news headlines on different websites and weblogs. One of the main text mining topics is how to classify news into different groups. This study aims to classify news into various groups so that users can identify the most popular news group in the desired country at any given time. Based on Term Frequency-Inverse Document...
Document classification can be defined as the task of automatically categorizing collections of electronic documents into their annotated classes, based on their contents. It is an important problem in data mining. Due to the exponential growth of documents on the Internet and the emerging need to organize them, developing an efficient document classification method to automatically manipulate web...
Social media such as Twitter create space to express thoughts and opinions on various topics and events, and millions of users can share their ideas on this microblog. Twitter has therefore become a source for information exploration, decision making, and sentiment analysis. There is a sense in all texts, but it is more important to provide strategies for obtaining suitable...
Text categorization with machine learning algorithms usually assumes a flat set of categories. Such classifiers are very domain specific and not reusable for other generic text classification tasks. It is quite possible that a hierarchically structured set of categories might have a higher impact on the way classifiers are used and built. As presented in this document, the list of most common...
Feature selection plays an important role in text categorization and contributes directly to the accuracy of the categorization. In the process of feature selection, because the traditional expected cross entropy algorithm does not take document frequency into account, we first improve the traditional expected cross entropy formula, and then propose an improved text feature selection...