The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
Nowadays, text classification (TC) becomes the main applications of NLP (natural language processing). Actually, we have a lot of researches in classifying text documents, such as Random Forest, Support Vector Machines and Naive Bayes. However, most of them are applied for English documents. Therefore, the text classification researches on Vietnamese still are limited. By using a Vietnamese news corpus,...
Text classification (TC) is a task that assigns a text to one or more classes and predefined categories. Constructing text classifiers with high accuracy is a vital task in biomedical field, given the wealth of information hidden in unlabelled documents. Because of large feature spaces, traditionally discriminative approaches, such as logistic regression and support vector machines with n-gram and...
The increased use of the Internet and the ease of access to online communities like social media have provided an avenue for cybercrimes. Cyberbullying, which is a kind of cybercrime, is defined as an aggressive, intentional action against a defenseless person by using the Internet, social media, or other electronic contents. Researchers have found that many of the bullying cases have tragically ended...
Today, it is not possible to use human power alone to cope with the increasing amount of data. For this reason, some automated methods are needed to group similar documents together or to place documents in predefined categories according to certain rules. The use of automated classification techniques is becoming increasingly important for this reason. In this study, a database consisting of 22 thousand...
Natural Language Processing (NLP) is a prominent subject which includes various subcategories such as text classification, error correction, machine translation, etc. Unlike other languages, there are limited number of Turkish NLP studies in literature. In this study, we apply text classification on Turkish documents by using n-gram features. Our algorithm applies different preprocessing techniques,...
The aim of this study is to compare the successes of various distance metrics and to determine the most appropriate methods in order to detect similarities among textual documents written in Turkish. Computing similarities between text documents is the basic step of plagiarism detection, and text mining methods like author detection, text classification and clustering. Therefore, plagiarism detection...
Social media offer abundant information for studying people's behaviors, emotions and opinions during the evolution of various rare events such as natural disasters. It is useful to analyze the correlation between social media and human-affected events. This study uses Hurricane Sandy 2012 related Twitter text data to conduct information extraction and text classification. Considering that the original...
Clustering is one of the prime topics in data mining. Clustering partitions the data and classifies the data into meaningful subgroups. Document clustering is a set of the document into groups such that two groups show different characteristics with respect to likeness. In this paper, an experimental exploration of similarity based method, HSC for measuring the similarity between data objects particularly...
Authorship attribution has been well studied in terms of text classification with many diverse feature sets. However, finding topic independent features is hard and trained models with hand crafted features in one domain may not work in another domain. In this study we used a semi-supervised neural language model which is known as document embeddings for authorship attribution problem. This method...
In this paper, we present a novel method to classify directions of capital flows in Internet finance. Our method is different from previous text classification methods in that extracts key sentences which may directly reflect the semantics of input text before classification. We use the Bi-LSTM model as a classifier to process input sentences. In this paper, we represent the matrix of key sentences...
Since short text is characterized of the short length, sparse features and strong context dependency, the traditional models have a limited precision. Motivated by this, this article offers an empirical exploration on a character-level model which implements a combination of convolutional neural network(CNN) and recurrent neural networks(RNN) for short text classification. Including the highway networks...
In daily life, we use the internet for many purposes. The Internet makes easier our life and it has led to the providing to occur new technologies. Several smart devices that use the Internet infrastructure generates digital data in different formats and with different generation speeds. The evaluation of the generated data is carried out by the algorithms associated with the field of machine learning...
With the development of technology, people are entering the virtual world more and more. Parallel to this, the internet becomes a bigger network every day and it gets a complex structure depending on this growth. Achieving the desired information with structred data becomes an increasingly important problem. One of the useful ways to find solution for this problem is to divide this complex data into...
Recently, big data have special concerns from researchers, this due to the valuable information can be collected from it. LSA has an effective performance in classification, and information retrieval, since it deals with the semantics of the words. In this paper, we proposed a distributed text classification approach based on LSA, and Cosine Similarity, and can be applied to big data. The proposed...
Long short-term memory (LSTM) is a significant approach to capture the long-range temporal context in sequences of arbitrary length. This had shown astonishing performance in sentence and document modeling. To leverage this, we use LSTM network to the encrypted text categorization at character and word level of texts. These texts are transformed in to dense word-vectors by using bag-of-words embedding...
Dealing with high dimensional data is a challenging and computationally complex task in the data pre-processing phase of text clustering. Conventionally, union and intersection approaches have been used to combine results of different feature selection methods to optimize relevant feature space for document collection. Union method selects all features from considered sub-models, whereas, intersection...
Huge amount of data in today's world are stored in the form of electronic documents. Text mining is the process of extracting the information out of those textual documents. Text classification is the process of classifying text documents into fixed number of predefined classes. The application of text classification includes spam filtering, email routing, sentiment analysis, language identification...
The grouped structure has successfully been embedded in sparse models for feature selection; however, some groups generated by clustering method might be difficult to interpret their semantic information if the number of words in the group is very large. This paper proposes a novel approach in which a group structure is constructed and its corresponding sparse model is used to select features for...
Youtube is one of the most popular video sharing platform in Indonesia. A person can react to a video by commenting on the video. A comment may contain an emotion that can be identified automatically. In this study, we conducted experiments on emotion classification on Indonesian Youtube comments. A corpus containing 8,115 Youtube comments is collected and manually labelled using 6 basic emotion label...
Classification of text documents is commonly carried out using various models of bag-of-words that are generated using feature selection methods. In these models, selected features are used as input to well-known classifiers such as Support Vector Machines (SVM) and neural networks. In recent years, a technique called word embeddings has been developed for text mining and, deep learning models using...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.