The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
This article proposes such a question classification approach that integrates multiple semantic features. It is aimed at these two questions in Chinese question classification models: inaccurate semantic information extraction and too slow processing speed caused by too high Eigenvector dimension. With the help of HowNet and the support vector machine and syntactic and semantic information of question...
Text Classification is an important field of research. There are a number of approaches to classify text documents. However, there is an important challenge to improve the computational efficiency and recall. In this paper, we propose a novel framework to segment Chinese words, generate word vectors, train the corpus and make prediction. Based on the text classification technology, we successfully...
When browsing news on the web, various emotions may be evoked in readers and furthermore cause different influence on their minds and life. We expect that emotional analysis and classification of text may provide good performance and significance to users surfing the Internet. Most previous research only focus on bi-emotion classification, that is, Positive and Negative, e.g., identifying whether...
In this paper, we present an approach to automatically extract and classify opinions in texts. We propose a similarity measurement calculating semantically distances between a word and predefined subgroups of seed words. We have evaluated our algorithm on the semantic evaluation company “SemEval 2007” corpus, and we obtained the best value of Precision and F1 62% and 61%. As an improvement of 20 %...
Web text classification is the process of determine the text types automatically under a given classification, according to the text content. Web text categorization system is the use of machine learning, knowledge engineering and other related fields of knowledge, access to the web on the text, after text preprocessing, Chinese word segmentation and training classifier, using classification algorithm...
Word vectors and sets of words are used in a wide range of text-based applications. Yet these word sets are often chosen on an ad hoc basis. In this study, we examine two text-based applications that use word sets and in both cases find that classification performance can be optimised using a fairly simple genetic algorithm. The first study is in authorship attribution, the second one is sentiment...
Discrimination of machine-printed and hand-written words is deemed as a major problem in the recognition of the mixed texts. To present a new method to distinguish between machine-printed words and hand-written words using a novel statistical feature on base legibility and discriminator threshold are objectives of this study. Because of the hand trembling, sudden uncontrollable movement of hand and...
With a rapid growth of the internet communication, many types of text are produced. They can convey the meanings that can contribute to text categorization. Emotion classification also becomes more interesting, but emotion classification in Thai text is still not able to be correctly classified. Thus, this paper proposes a novel approach that takes advantage of bi-words occurrence to classify emotion...
In real-world information systems, there are abundant unlabeled data but sparse labeled data. It is challenging to construct an adaptive model to classify a large amount of documents containing different domains. The classifiers trained from a source domain shall perform poorly for the test data in a target domain due to the domain mismatch. In this study, we build a topic-bridged latent Dirichlet...
In e-commerce transactions, goods are classified according to the hierarchical structure, which refers to a tree category. In the process of classification, we shall consider the special features. While using brand name for category, for instance, the degree of distinction characteristic of brand is higher. Based on this, we prepare a dictionary of brands for Chinese words segamentatin on one hand...
When the traditional text classification technologies classify academic dissertations, the dimension of extracted feature terms is high, and they can't represent the theme of thesis. it makes the efficiency is very low and the accuracy rate is not high. The topic words are small in quantity and can reflect the theme of thesis well. Accordingly, the paper proposes to extract the topic words with topic...
Word matching problem is to find all the occurrences of a pattern P[0...m-1] in the text T[0...n-1], where P neither contains any white space nor preceded and followed by space. In this paper, we assume that our text is offline. Ibrahiem et al. in 2008 have proposed an algorithm (WSA) for solving the word matching problem by splitting the offline text into number of tables in the preprocessing phase...
Word matching problem is to find all the occurrences of a pattern P[0...m-1] in the text T[0...n-1], where P neither contains any white space nor preceded and followed by space. In the multi-patterns word matching problem, all the occurrences of multiple word P0, P1, P2 ...Pr-1, (rges1) in the given text T are to be reported. In the present discussion, we assume that all the patterns have equal size...
According to the high-dimensional sparse features of the storage of the textual document, this paper puts forward a novel model through 3-dimensional space to express text data, in this model, one dimension registers the count of feature words, another denotes the part of speech of the feature words, and the third one records the count of textual documents, that is, the 3-dimensional space model expresses...
Text classification refers to determine the class of an unknown text according to its content in the given classification system. In order to extract fewer features to express the information in the text as much as possible, the paper analysis the various features' statistical properties and to extract the global features according to Zipf's law; and then, based on the statistical analysis of the...
Automatic content generation aims on developing an intelligent tutoring system in Tamil language. This system focuses on delivering personalized content in Tamil language to an individual user needs based on their learning abilities and interests. This paper deals with automatic classification of Tamil documents and also the information extraction from those documents to construct the knowledge base...
Most of the Chinese text classification systems are all based on the technology of bag of words (BW) which is a valid probability tool for text representation and can provide a better semantic architecture. But the weakness in classification accuracy is still unconquerable. Support vector machine (SVM) has become a popular classification tool and can be applied in the scheme, but the main disadvantages...
Feature selection is a valid method to reduce the dimension of vector in text categorization system. After analyzed several common evaluation functions for feature selection, we applied terms weight function to feature selection. A new evaluation function based on improved TFIDF method is presented; in this function the category information is introduced to feature items, and the feature items of...
This paper examines the role of various linguistic structures on text classification applying the study to the Portuguese language. Besides using a bag-of-words representation where we evaluate different measures and use linguistic knowledge for term selection, we do several experiments using syntactic information representing documents as strings of words and strings of syntactic parse trees. To...
With the rapid growth in computer technology and popularization of Internet, e-mail has become one economical and convenient form of communication. But different types of crime and civil action involving e-mail documents appear which do harm to people's life and social's stabilization. So the criminal e-mail's authorship has to be identified automatically for the purpose of computer forensic. To solve...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.