Supervised learning is a popular approach to text classification in both the research community and the software development industry. It enables intelligent systems to solve various text analysis problems such as document organization, spam detection, and report scoring. However, the extremely difficult and time-intensive process of creating a training corpus makes it inapplicable to many...
Spam mails are among the greatest challenges faced by internet service providers, organizations, and internet users alike. Spam mails may be targeted with malicious intent or sent merely as a commercial marketing activity; either way, they are unwanted by everyone except the dispatcher. Spam filters must continuously evolve as spammers become more tech-savvy and creative. Machine learning algorithms have been popularly...
Markov Model/Artificial Neural Network (HMM/ANN) keyword spotting framework. The feature extraction method used was Mel-Frequency Cepstral Coefficients (MFCC). The ANN is a three-layer feedforward Multi-Layer Perceptron (MLP). For word recognition, an HMM decoder was used that implemented the Viterbi
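The Viterbi decoding mentioned above can be sketched in a few lines. This is a minimal toy implementation over a discrete HMM; the states, transition, and emission tables below are illustrative, and in a hybrid HMM/ANN system the emission probabilities would be replaced by MLP posterior scores.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for an observation list."""
    # V[t][s] = (best probability of reaching state s at time t, predecessor)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p][0] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            V[t][s] = (prob, prev)
    # Backtrack from the most probable final state.
    best = max(states, key=lambda s: V[-1][s][0])
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        best = V[t][best][1]
        path.append(best)
    return path[::-1]
```

The dynamic program keeps only the best-scoring predecessor per state per frame, which is what makes decoding linear in the utterance length rather than exponential.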
Most mobile platforms provide keyword-based full-text search (FTS) for users to find what they want. However, FTS has difficulty with cases where a user cannot remember the exact keywords for the target data, or where the number of search results is too large. To overcome these limitations of FTS, we
This paper proposes a lattice-based method for keyword spotting in online Chinese handwriting that improves the trade-off between accuracy and speed and overcomes the out-of-vocabulary (OOV) problem of lexicon-driven approaches. Using a character string recognition algorithm, the lattice-based method generates a
This paper presents a text query-based method for keyword spotting in online Chinese handwritten documents. The similarity between a text word and handwriting is obtained by combining the character similarity scores given by a character classifier. To overcome the ambiguity of character segmentation, multiple
This paper presents a corpus-based approach for extracting keywords from a text written in a language that has no word boundaries. Based on the concept of the Thai character cluster, a Thai running text is preliminarily segmented into a sequence of inseparable units, called TCCs. To enable the handling of a large-scale
Consider an information repository whose content is categorized. A data item in the repository can belong to multiple categories, and new data is continuously added to the system. In this paper, we describe a system, CS*, which takes a keyword query and returns the relevant top-K categories. In contrast, traditional
This paper presents our recent attempt to build a super-large-scale spoken-term detection system that can detect any keyword uttered in a 2,000-hour speech database within a few seconds. Three problems must be solved to achieve such a system. The system must be able to detect out-of-vocabulary (OOV) terms (OOV problem
This paper compares the performance of keyword-based and machine learning-based chest x-ray report classification for Acute Lung Injury (ALI). ALI mortality is approximately 30 percent. This high mortality is, in part, a consequence of delayed manual chest x-ray classification. An automated system could reduce the time to
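The keyword-based side of such a comparison is typically a trigger-phrase rule: a report is flagged positive if any phrase from a lexicon appears. A minimal sketch follows; the phrase list here is illustrative, not the study's actual lexicon.

```python
# Illustrative trigger phrases; a real system would use a curated clinical lexicon.
ALI_TRIGGERS = ("bilateral infiltrates", "pulmonary edema", "diffuse opacities")

def keyword_classify(report):
    """Flag a report as ALI-positive if any trigger phrase appears in it."""
    text = report.lower()
    return any(phrase in text for phrase in ALI_TRIGGERS)
```

Such rules are fast and interpretable but brittle to negation and paraphrase, which is the usual motivation for comparing them against machine-learned classifiers.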
In search engines, keyword extraction is an important technique. In this paper, addressing the defects of traditional keyword extraction algorithms, we propose an improved weight computation strategy. The experimental results show that the improved method's results are significantly better than the
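The traditional weight computation baseline in keyword extraction is usually TF-IDF. A minimal sketch, with an illustrative whitespace tokenizer and toy corpus (a real system would normalize and filter stopwords):

```python
import math
from collections import Counter

def tfidf_keywords(docs, doc_index, top_k=3):
    """Rank the terms of docs[doc_index] by TF-IDF against the whole corpus."""
    tokenized = [d.lower().split() for d in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for doc in tokenized:
        df.update(set(doc))
    n = len(tokenized)
    tf = Counter(tokenized[doc_index])
    total = len(tokenized[doc_index])
    scores = {
        term: (count / total) * math.log(n / df[term])
        for term, count in tf.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Terms frequent in one document but rare across the corpus score highest, which is the property improved weighting schemes try to refine (e.g. with position or length features).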
In this paper, we study the problem of data redundancy in XML keyword search by SLCA and propose a new model to resolve it. We begin by introducing the notion of SLCA and analyzing its shortcomings. Then we propose the concept of Indirect-SLCA (ISLCA) to reduce the redundancy, based on the notion of the Heterogeneous node
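For reference, the baseline SLCA semantics can be sketched with Dewey-encoded nodes (tuples of child indices), where the lowest common ancestor is the longest common prefix. This is a brute-force two-keyword sketch with illustrative inverted lists, not the paper's ISLCA method.

```python
def lca(p, q):
    """Longest common prefix of two Dewey paths = lowest common ancestor."""
    out = []
    for a, b in zip(p, q):
        if a != b:
            break
        out.append(a)
    return tuple(out)

def slca(list1, list2):
    """Smallest LCAs: pairwise LCAs with no other LCA as a proper descendant."""
    lcas = {lca(p, q) for p in list1 for q in list2}
    return {
        c for c in lcas
        if not any(d != c and d[: len(c)] == c for d in lcas)
    }
```

Discarding any candidate that has another candidate in its subtree is exactly what makes the result "smallest", and it is also the source of the redundancy and missed-answer issues that refinements such as ISLCA target.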
developed by implementing keyword stripping using the Porter Stemmer algorithm. This makes the keyword search more efficient, as only the root or stem word is considered. Experimental results on two public spam corpora are also discussed at the end.
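The effect of stemming on keyword search can be illustrated with a crude suffix-stripping sketch. This toy function is a stand-in only; the real Porter algorithm applies five ordered rule phases with measure conditions rather than a flat suffix list.

```python
def simple_stem(word):
    """Strip common inflectional suffixes (toy stand-in for Porter stemming)."""
    word = word.lower()
    for suffix in ("ingly", "edly", "ing", "ed", "es", "s"):
        # Require a minimal stem length so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

The point for spam filtering is that "filtering", "filtered", and "filters" all collapse to one stem, so a single keyword entry covers every inflected form.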
Language model adaptation using text data downloaded from the WWW is an efficient way to train a topic-specific LM. We are developing an unsupervised LM adaptation method using data on the Web. One key point of unsupervised Web-based LM adaptation is how to select keywords to compose the search query. In this
Complex network theory is widely used in the field of keyword extraction. Through analyzing the insufficiencies of keyword extraction algorithms based on traditional complex networks, this paper proposes a new method to extract Chinese keywords based on a semantically weighted network. On the basis of K-nearest neighbor
The Internet is becoming an increasingly important platform for everyday life and work. Keyword extraction is expected to help people quickly find hot spots on the Web, since the keywords in a document provide important information about its content. In this paper, we propose to use text clustering
This paper presents a keyword extraction method for web pages based on a domain thesaurus. The method extracts keywords from web pages based on traditional statistical features, such as frequency and location, and it also evaluates the weight of candidate keywords by combining them with their relations in the domain thesaurus. This
Audio mining is a speaker-independent speech processing technique related to data mining. Keyword spotting plays an important role in audio mining. Keyword spotting is the retrieval of all instances of a given keyword in spoken utterances. It is well suited to data mining tasks that process large amounts of speech
Based on an analysis of the insufficiencies of present Chinese matching algorithms, and by examining the characteristics of approximately duplicate records, this paper proposes a method of duplicate record cleaning based on a reformative keyword matching algorithm. Experiments show that this method improves Recall
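A keyword-matching duplicate check can be framed as a set-similarity test over the keywords extracted from two records. The whitespace tokenization and the 0.6 threshold below are illustrative choices, not the paper's algorithm.

```python
def keyword_similarity(rec_a, rec_b):
    """Jaccard similarity between the keyword sets of two records."""
    a, b = set(rec_a.lower().split()), set(rec_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0

def is_duplicate(rec_a, rec_b, threshold=0.6):
    """Treat two records as approximate duplicates above the threshold."""
    return keyword_similarity(rec_a, rec_b) >= threshold
```

Tuning the threshold trades Recall against Precision: lowering it catches more true duplicates at the cost of more false merges.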
approach has a limitation in that only the annotations of images found during the interaction are updated. In this paper we introduce a novel method of semi-automatic annotation. The method uses visual feature representations of keywords, which are improved during region-based relevance feedback. The experiments show that this
Financed by the National Centre for Research and Development under grant No. SP/I/1/77065/10 within the strategic scientific research and experimental development program:
SYNAT - "Interdisciplinary System for Interactive Scientific and Scientific-Technical Information".