In a language, the occurrence of words indicates the meaning of the content in a text. Generative models for text, such as the topic model, have the potential to make important contributions to the statistical analysis of large document collections, and the development of a deeper understanding of human language learning and processing. In this paper, we propose a novel method for building Vietnamese...
Text classification is an important field of research, and there are a number of approaches to classifying text documents. However, improving computational efficiency and recall remains an important challenge. In this paper, we propose a novel framework to segment Chinese words, generate word vectors, train on the corpus, and make predictions. Based on the text classification technology, we successfully...
Web text classification is the process of automatically determining text types under a given classification scheme, according to the text content. A web text categorization system uses machine learning, knowledge engineering, and other related fields of knowledge to access text on the web and, after text preprocessing, Chinese word segmentation, and classifier training, uses a classification algorithm...
We propose a classification model for the cognitive level of question items in examinations based on Bloom's taxonomy. The model implements the artificial neural network approach, which is trained using the scaled conjugate gradient learning algorithm. Several data preprocessing techniques such as word extraction, stop word removal, stemming, and vector representation are applied to a feature set...
LJParser is a development platform for web search and mining. It is middleware by LING-JOIN Software, which is well known for over ten years of expertise in natural language understanding and web search. LJParser provides powerful modules including precise search for multiple languages, new word detection, Chinese word segmentation and POS tagging, language modeling and term translation, text clustering,...
Which features are the most important for text classification tasks? In the automatic text categorization area, several studies seek answers to this question. In this paper, a feature extraction tool for Turkish texts (Text2arff) is presented. The toolbox automatically extracts several features such as the frequencies of words and n-grams, word clustering, latent semantic indexing, etc. The...
With the rapid growth of internet communication, many types of text are produced. They can convey meanings that contribute to text categorization. Emotion classification is also attracting more interest, but emotion in Thai text still cannot be classified correctly. Thus, this paper proposes a novel approach that takes advantage of bi-word occurrence to classify emotion...
Classification and clustering are frequently used methods in data mining. This paper introduces the idea of text clustering into the study of categorization algorithms. The authors also attempt to use the text categorization pattern of self-initiated learning to design a clustering-based text categorization algorithm, for the purpose of reducing the dimension of the training set and raising...
In real-world information systems, unlabeled data are abundant but labeled data are sparse. It is challenging to construct an adaptive model to classify a large number of documents spanning different domains. Classifiers trained on a source domain perform poorly on test data in a target domain because of the domain mismatch. In this study, we build a topic-bridged latent Dirichlet...
In e-commerce transactions, goods are classified according to a hierarchical structure, that is, a category tree. In the process of classification, special features must be considered. When brand names are used for categorization, for instance, the brand is a highly distinguishing feature. Based on this, we prepare a dictionary of brands for Chinese word segmentation on one hand...
When traditional text classification technologies classify academic dissertations, the dimension of the extracted feature terms is high, and the terms cannot represent the theme of the thesis. As a result, efficiency is very low and accuracy is poor. Topic words are few in number and reflect the theme of the thesis well. Accordingly, the paper proposes to extract the topic words with topic...
Text classification refers to determining the class of an unknown text according to its content in the given classification system. In order to express as much of the information in the text as possible with fewer features, the paper analyzes the statistical properties of the various features and extracts the global features according to Zipf's law; and then, based on the statistical analysis of the...
Hierarchical text categorization refers to assigning one or more suitable categories from a hierarchical category space to a document. In this paper, we used a hierarchical feature selection method and multiple classifiers for the hierarchical text categorization task. Experiments showed that the methods we used were effective, compared with flat classification, top-down level-based approach with the...
This research proposes the application of NTC (neural text categorizer) for categorizing news articles. Even though research on text categorization has progressed considerably, documents must still be encoded into numerical vectors. This encoding causes two main problems: huge dimensionality and sparse distribution. The solution proposed in this research is to encode documents...
Automatic content generation aims at developing an intelligent tutoring system in the Tamil language. This system focuses on delivering personalized content in Tamil, tailored to individual users' needs based on their learning abilities and interests. This paper deals with the automatic classification of Tamil documents and with information extraction from those documents to construct the knowledge base...
This paper proposes new text summarization approaches based on textual unit association networks. Textual units refer to words, phrases, sentences, or paragraphs. Intuitively, textual units containing much co-occurrence information are semantically more salient in a document. We construct two kinds of textual association networks, namely, word-based association network and sentence-based association...
Most Chinese text classification systems are based on the bag-of-words (BW) technique, which is a valid probability tool for text representation and can provide a better semantic architecture. However, its weakness in classification accuracy has not yet been overcome. Support vector machine (SVM) has become a popular classification tool and can be applied in the scheme, but the main disadvantages...
Feature selection is a valid method for reducing vector dimensionality in text categorization systems. After analyzing several common evaluation functions for feature selection, we applied a term weighting function to feature selection. A new evaluation function based on an improved TFIDF method is presented; in this function the category information is introduced to feature items, and the feature items of...
This paper examines the role of various linguistic structures in text classification, applying the study to the Portuguese language. Besides using a bag-of-words representation, where we evaluate different measures and use linguistic knowledge for term selection, we run several experiments using syntactic information, representing documents as strings of words and strings of syntactic parse trees. To...
This paper introduces a methodology for determining the polarity of text within a multilingual framework. The method leverages lexical resources for sentiment analysis available in English (SentiWordNet). First, a document in a language other than English is translated into English using standard translation software. Then, the translated document is classified according to its sentiment into one...