The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
In this paper, we are proposing a new semantic and contextual based document image classification framework. The framework is composed of two main modules. The first one is the text analysis module (TAM) which processes document images and extracts words from the image, and second one is the SEMCON, which is a semantic and contextual objective metric. From the list of extracted words by TAM, SEMCON...
This paper presents a Document Image Analysis (DIA) system able to extract homogeneous typed and handwritten text regions from complex layout documents of various types. The method is based on two connected component classification stages that successively discriminate text/non text and typed/handwritten shapes, followed by an original block segmentation method based on white rectangles detection...
We present an affective text analysis model that can directly estimate and combine affective ratings of multi-word terms, with application to the problem of sentence polarity/semantic orientation detection. Starting from a hierarchical compositional method for generating sentence ratings, we expand the model by adding multi-word terms that can capture non-compositional semantics. The method operates...
Recognizing mathematical expressions in PDF documents is a new and important field in document analysis. It is quite different from extracting mathematical expressions in image-based documents. In this paper, we propose a novel method by combining rule-based and learning-based methods to detect both isolated and embedded mathematical expressions in PDF documents. Moreover, various features of formulas,...
The emotion tendency of sentiment word is divided into two types: static emotion tendency and dynamic emotion tendency. Basic semantic lexicon is static emotion tendency, in the real context, but it is different between static emotion tendency and dynamic emotion tendency. The paper proposes a method based on degree lexicon, negative lexicon and dependence relationship of sentence elements. The experimental...
Most web text clustering is based on the space vector text representation model. This results in a high dimension in the terms; and it leads to an increase in time complexity and a loss of text semantics due to the fact that the semantic relationship of the terms is not considered. In this paper, a new approach is taken where a concept lattice is generated with text treated as object and terms of...
The alignment of text line images with text transcript is a crucial step of handwritten document annotation. Handwritten text alignment is prone to errors due to the difficulty of character segmentation and the variability of character shape, size and position. In this paper, we propose to incorporate the geometric context of character strings to improve the alignment accuracy for offline handwritten...
In the context of text clustering, global feature selection tries to identify a single subset of features which are relevant to all clusters. However, the clustering process might be improved by considering different subsets of features for locally describing each cluster. In experiments with local feature selection, it was observed that the resulting partitions were unstable but there were cohesive...
Having effective methods to access the desired images is essential nowadays with the availability of huge amount of digital images. The proposed approach is based on an analogy between image retrieval containing desired objects (object-based image retrieval) and text retrieval. We propose a higher-level visual representation, for object-based image retrieval beyond visual appearances. The proposed...
Feature selection is a process which chooses a subset from the original feature set according to some rules. The selected feature retains original physical meaning and provides a better understanding for the data and learning process. However, few modern feature selection approaches take the advantage of features' context information. Based on this analysis, we propose a novel feature selection method...
A common strategy to assign keywords to documents is to select the most appropriate words from the document text. One of the most important criteria for a word to be selected as keyword is its relevance for the text. The tf.idf score of a term is a widely used relevance measure. While easy to compute and giving quite satisfactory results, this measure does not take (semantic) relations between words...
We present a study of the application of Computer Assisted Transcription of Text Images (CATTI) to a task which is much closer to real applications than other tasks previously studied. The new task consists in the transcription of a new publicly available historic handwritten document, called GERMANA. A detailed analysis of the main factors influencing the system performance are exposed and some strategies...
Digital ink texts in Chinese can neither be converted into users' desired layouts nor be recognized until their characters, lines, and paragraphs are correctly extracted. There are many errors in automatically segmented digital ink texts in Chinese because they are free forms and mixed with other languages, as well as their Chinese characters have small gaps and complex structures. Paragraphs, lines,...
We propose a semi-supervised model which segments and annotates images using very few labeled images and a large unaligned text corpus to relate image regions to text labels. Given photos of a sports event, all that is necessary to provide a pixel-level labeling of objects and background is a set of newspaper articles about this sport and one to five labeled images. Our model is motivated by the observation...
The existing software engineering literature has empirically shown that a proper choice of identifiers influences software understandability and maintainability. Researchers have noticed that identifiers are one of the most important source of information about program entities and that the semantic of identifiers guide the cognitive process. Recognizing the words forming identifiers is not an easy...
Pursuing on the analysis of product reviews, an unsupervised product features categorization method is proposed. Morphemes as smallest linguistic meaningful unit are induced in measuring the intra relationship among product features instead of words. Opinion words around product features are chosen to represent the inter relationship among product features instead of full context information. The...
Word Sense Disambiguation in text is still a difficult problem as the best supervised methods require laborious and costly manual preparation of training data. Thus, this work focuses on evaluation of a few selected clustering algorithms in task of Word Sense Disambiguation for Polish. We tested 6 clustering algorithms (K-Means, K-Medoids, hierarchical agglomerative clustering, hierarchical divisive...
This paper reports experiments on topic extraction in Chinese documents using a feature set enriched with Word Sense Disambiguation (WSD) as semantic information. The results of these experiments suggest that incorporating WSD information into Chinese topic extraction tasks may yield improvements over models which do not use WSD information.
This paper presents a new method to acquire domain-ontology relations from semi-structured data sources. First, obtain Web documents according to the co-occurrence of concept instance and attribute value. Further, define formats of relation patterns, and extract pattern instances from Web documents, including pattern clustering and pattern combining in each cluster. Finally, relation pattern instances...
Multi-instance learning (MIL) has many applications, including image and text categorization. One of the most effective approaches to MIL is by using support vector machines with multi-instance kernels. In this paper we propose a multi-instance kernel, called MIR-kernel, that takes into account the relational information of instances when computing similarities between bags. The relational information...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.