Clustering is one of the prime topics in data mining. Clustering partitions the data into meaningful subgroups. Document clustering divides a set of documents into groups such that different groups show different characteristics with respect to similarity. In this paper, an experimental exploration of a similarity-based method, HSC, for measuring the similarity between data objects particularly...
Huge amounts of data in today's world are stored in the form of electronic documents. Text mining is the process of extracting information from those textual documents. Text classification is the process of classifying text documents into a fixed number of predefined classes. The applications of text classification include spam filtering, email routing, sentiment analysis, language identification...
YouTube is one of the most popular video sharing platforms in Indonesia. A person can react to a video by commenting on it. A comment may contain an emotion that can be identified automatically. In this study, we conducted experiments on emotion classification of Indonesian YouTube comments. A corpus containing 8,115 YouTube comments was collected and manually labelled using 6 basic emotion label...
Classification of text documents is commonly carried out using various bag-of-words models that are generated using feature selection methods. In these models, selected features are used as input to well-known classifiers such as Support Vector Machines (SVM) and neural networks. In recent years, a technique called word embeddings has been developed for text mining, and deep learning models using...
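As a rough illustration of the bag-of-words representation mentioned in this abstract, the sketch below counts how often each word of a fixed vocabulary occurs in a text. The function name `bag_of_words`, the tokenization rule, and the three-word vocabulary are assumptions made for the example, not details from the paper; in practice the vocabulary would come from a feature selection step and the vector would be passed to a classifier such as an SVM.

```python
import re
from collections import Counter
from typing import List


def bag_of_words(text: str, vocabulary: List[str]) -> List[int]:
    """Return the count of each vocabulary word in `text` (hypothetical helper)."""
    # Lowercase the text and split it into alphanumeric tokens.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    # Counter returns 0 for words that never occur.
    return [counts[word] for word in vocabulary]


# Illustrative vocabulary, standing in for the output of feature selection.
vocabulary = ["spam", "offer", "meeting"]
vector = bag_of_words("Special offer: reply to this offer today", vocabulary)
print(vector)
```

Each position of the resulting vector corresponds to one selected feature, so the vector length is fixed by the vocabulary size rather than by the length of the text.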
Feature representation plays an important role in text classification. Feature mapping based on label information is an algorithm suitable for Binary Relevance. Compared with conventional text representation, it keeps the dimensionality of the text representation under control by means of word embedding. More importantly, it takes full advantage of the general characteristics of the label on text representation...
With the emergence of Internet social shopping platforms, a large sentiment corpus is accumulating rapidly. Sentiment classification, which is a specific application of sentiment analysis, has received a lot of attention from researchers in the field of natural language processing. The traditional method of classifying sentiment text is usually limited to the content of the text. However,...
The basic idea behind classifier ensembles is to use more than one classifier in the expectation of improving overall accuracy. It is known that classifier ensembles boost overall classification performance depending on two factors, namely the individual success of the base learners and diversity. One way of providing diversity is to use the same or different types of base learners. When the...
In this information era, the number of websites on the Internet has increased dramatically within a few years. Any information and services can be retrieved from a website. However, the most valuable content of a website is still the text related to its topic or category. Only a few studies have focused on categorizing Thai-language information. The rest of the research...
Aiming at the multiclass text categorization problem, a classification algorithm based on multiconlitron and the 1-a-r method is presented. The 1-a-r method is used to convert a multiclass categorization problem into several binary problems. A multiconlitron is constructed for each binary problem in the input space. For the text to be classified, its class is decided by the multiconlitrons. The classification experiments are...
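The decomposition of a multiclass problem into several binary problems can be sketched as follows, assuming that "1-a-r" denotes the usual one-against-rest scheme. The helper names and the toy labels are hypothetical, and the per-class scoring functions merely stand in for the binary classifiers; this is not an implementation of the multiconlitron itself.

```python
from typing import Callable, Dict, List, Sequence


def one_against_rest_labels(labels: Sequence[str]) -> Dict[str, List[int]]:
    """Build one binary label vector per class: 1 for that class, 0 for the rest."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}


def predict_one_against_rest(scorers: Dict[str, Callable[[str], float]], x: str) -> str:
    """Pick the class whose binary scorer gives the highest score for x."""
    return max(scorers, key=lambda c: scorers[c](x))


# Toy multiclass labels, for illustration only.
labels = ["sports", "politics", "sports", "tech"]
binary_problems = one_against_rest_labels(labels)
print(binary_problems["sports"])

# Hypothetical per-class scorers standing in for trained binary classifiers.
scorers = {
    "sports": lambda text: float(text.count("goal")),
    "politics": lambda text: float(text.count("vote")),
    "tech": lambda text: float(text.count("chip")),
}
predicted = predict_one_against_rest(scorers, "a late goal decided the match")
print(predicted)
```

One binary problem is created per class, so a task with k classes needs k binary classifiers; choosing the highest-scoring class is one common decision rule, and the paper may combine the binary outputs differently.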
Nowadays, the exponential growth in the generation of textual documents and the emerging need to structure them increase the attention paid to the automated classification of documents into predefined categories. There is a wide range of supervised learning algorithms that deal with text classification. This paper deals with an approach for building a machine learning system in R that uses K-Nearest Neighbors...
AdaBoost is one of the most popular algorithms for classification and has been successfully used for text classification, face detection and tracking. However, noise sensitivity is regarded as a major disadvantage, and previous work shows that AdaBoost overfits when dealing with noisy data sets. To improve the noise tolerance of conventional AdaBoost, this paper proposed a preprocessing...
With the development of computer and network techniques and the explosion of digital Chinese news texts, a better way of extracting and storing knowledge from massive unstructured news data can, on the one hand, help readers understand the core content of news and, on the other hand, support reporting through accumulated news knowledge. In recent years, information extraction technology...
There is a constantly growing interest in evaluating music information retrieval (MIR) systems that can provide effective management of music resources. The crucial characteristic of music is its emotion, which reflects human perception. To make the automatic classification of Chinese music emotions more effective, we use the lyrics of music to analyze and classify music based on emotion....
Automatic text classification is a key technology for processing and organizing large-scale text data. It is well known that the high dimensionality of the feature space is a main challenge for text classification. To mitigate this problem, and inspired by existing work, we propose an effective text feature selection algorithm that fuses the classical methodologies of the Gini index and...
A class imbalance problem often appears in many real world applications, e.g. fault diagnosis, text categorization, fraud detection. When dealing with a large-scale imbalanced dataset, feature selection becomes a great challenge. To confront it, this work proposes a feature selection approach based on a decision tree rule. The effectiveness of the proposed approach is verified by classifying a large-scale...
In text classification, the similarity between texts needs to be calculated, but existing classification methods consider only the similarity between feature words and categories and do not involve the semantic similarity between feature words. In this paper, a new classification model, LDA (Latent Dirichlet Allocation)-KNN (K-Nearest Neighbor), is proposed. LDA is used to solve the problem...
From the perspective of mobile data security detection, a text classification model can be implemented in the application layer to detect malicious attacks. Since the traditional C4.5 decision tree has the disadvantage of not considering the interaction between attributes during attribute selection, an improved C4.5 decision tree model based on the AdaBoost algorithm is put forward. The problem in measuring the...
Nowadays, large volumes of text data are being produced in real time due to the expansion of communication. It is necessary to organize this data for exploitation and extraction of useful information. Topic-based text classification is one of the efficient solutions to this problem. Efficient algorithms are applied for text classification if they address high-dimensional data. In this paper, a...
The huge expansion of the World Wide Web has brought about a contemporary way of conveying people's attitudes and viewpoints. It is a channel where anybody can view the opinions and sentiments of different customers. It is also possible to see opinions classified into different categories and ratings given to different products. This information plays a major role in the sentiment classification task...
Millions of file uploads and downloads happen every minute, resulting in the creation of big data, and manual text categorization is not possible. Hence, there is a need for automatic categorization of documents that makes storage and retrieval more efficient. This research paper proposes a hybrid text categorization model that combines the Rocchio algorithm and the Random Forest algorithm to perform Multi-label...