The volume of academic paper submissions and publications is growing at an ever increasing rate. While this flood of research promises progress in various fields, the sheer volume of output inherently increases the amount of noise. We present a system to automatically separate papers with a high from those with a low likelihood of gaining citations as a means to quickly find high impact, high quality...
We constructed a system infrastructure capable of processing unstructured data, with the aim of practical application of the system for document data analysis in the manufacturing industry. Using past ISSM research paper data, papers were classified and verified. Using morphological analysis, the extracted parts of speech were used as feature quantities, and machine learning was executed. Since effective...
With the advent of online social media such as Facebook, Twitter and blogs, the way people perceive things around them has dramatically changed. One simple example is how people buy a mobile phone today. Whereas in the past shopping involved moving from one store to another, these days one cares more about the opinions people express in product reviews. There is an increasing tendency...
Separating handwritten and machine-printed (H&P) text in document images is a precursor to improving the performance of an OCR system. This paper demonstrates the competence of frequency domain features for the classification of H&P text words. We propose wavelet-like discrete cosine transform (WDCT) based features. We conduct an experiment on a large dataset of 2000 text words of popular...
Text plays a vital role in visual content analysis and understanding. Videos contain text with diverse patterns and complex backgrounds. In this paper, we propose an approach based on a compass operator for detecting edges. We obtain the edge maps by convolving the preprocessed video frame with the Kirsch directional masks along eight different directions. The resultant images are...
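As an illustration of the edge-map step described in the abstract above, the following is a minimal sketch of applying the eight Kirsch directional masks to a grayscale frame. It is not the authors' implementation: the function names (`kirsch_masks`, `apply_mask`, `kirsch_edge_maps`), the plain list-of-lists image format, and leaving border pixels at zero are all assumptions made for this example. How the eight maps are combined afterwards is not stated in the truncated abstract, so the sketch stops at producing them.

```python
# Clockwise order of the eight border cells of a 3x3 window (row, col).
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
# Border coefficients of the base Kirsch mask (three 5s, five -3s).
BASE = [5, 5, 5, -3, -3, -3, -3, -3]


def kirsch_masks():
    """Build the eight 3x3 Kirsch masks by rotating the border in 45-degree steps."""
    masks = []
    for shift in range(8):
        mask = [[0] * 3 for _ in range(3)]  # centre coefficient stays 0
        for i, (r, c) in enumerate(RING):
            mask[r][c] = BASE[(i - shift) % 8]
        masks.append(mask)
    return masks


def apply_mask(image, mask):
    """Slide one 3x3 mask over the image; border pixels are left at 0 (assumption).

    This computes a correlation (no kernel flip). Flipping a Kirsch mask gives
    another mask of the same family, so the set of eight responses is the same.
    """
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = sum(
                mask[dy][dx] * image[y + dy - 1][x + dx - 1]
                for dy in range(3)
                for dx in range(3)
            )
    return out


def kirsch_edge_maps(image):
    """Return one response map per direction (eight maps in total)."""
    return [apply_mask(image, m) for m in kirsch_masks()]


if __name__ == "__main__":
    # Hypothetical 5x5 frame with a vertical step edge between columns 2 and 3.
    frame = [[0, 0, 0, 10, 10] for _ in range(5)]
    maps = kirsch_edge_maps(frame)
    print([m[2][2] for m in maps])  # responses of the eight masks at the centre pixel
```

Because the coefficients of each mask sum to zero, flat regions give a zero response in all eight maps, and only intensity changes aligned with a mask's direction produce a large value.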
To add more value to YouTube, a popular portal for social media clips, it is worth automatically recognizing the mood of a media clip from the comments posted on it. This paper presents a method to classify the emotion of a Thai media clip on YouTube using the comments given to the clip. The six basic emotions considered are Anger, Disgust, Fear, Happiness, Sadness and Surprise. Performances using three...
Sentiment analysis, or opinion mining, draws on many different fields such as natural language processing, text mining, decision making and linguistics. It is a type of text analysis that classifies text and makes decisions by extracting and analyzing it. Opinions can be categorized as positive or negative, and the analysis measures the degree of positivity or negativity associated with the event (people, organization,...
A critical issue in the recognition of mathematical expressions is the identification of the spatial relations of the symbols and/or sub-expressions that comprise the entire mathematical formula. This paper addresses the problem of structural analysis of mathematical expressions by constructing appropriate feature vectors to represent the spatial affinity of the objects (mathematical symbols or sub-expressions)...
Work on sentiment analysis has thus far been limited in the news article domain. This has mainly been caused by 1) news articles lacking a clearly defined target, 2) the difficulty in separating good and bad news from positive and negative sentiment, and 3) the seeming necessity of, and complexity in, relying on domain-specific interpretations and background knowledge. In this paper we propose, define,...
Gender prediction based on handwritten text is gaining considerable importance in the document analysis community. It is helpful for person identification as well as in some situations where one needs to classify a population into women and men. However,...
In this paper we present a two-level method to detect text in natural scene images. In the first level, connected components (referred to as CCs) are extracted from the images. Then candidate text lines are extracted, and groups of connected components that align in the horizontal or vertical direction are obtained. We assume that CCs in these groups have a high probability of being text. To validate which CC is text, an SVM is...
In this paper, we present a method for removing ruling lines from handwritten documents without damaging the existing characters. It is argued that ruling lines have a predictable position on the page, but their thickness and the distance between them may differ from one document to another; these are estimated with a simple algorithm. Another important challenge in this regard is detecting the edge...
This paper presents a comparison between three classifiers, based on Support Vector Machines, Multi-Layer Perceptrons and Gaussian Mixture Models respectively, for detecting the physical structure of historical documents. Each classifier segments a scaled image of a historical document into four classes, i.e., areas of periphery, background, text and decoration. We evaluate them on three data sets of historical...
Multi-oriented characters, characters connected with graphical lines, and intersections of text and symbols with graphical lines or curves are very common in graphical documents. As a result, word spotting in graphical documents is still a challenging task, which we try to solve (partially) in this paper. The proposed approach proceeds in two stages. In the first stage, recognition of isolated...
Traditionally, page images undergo pre-processing before the later stages of document analysis are applied. One common pre-processing step is to calculate and correct for the presence of simple page skew through a compensating rotation. Such operations modify the original input image, however, and in doing so may discard or obscure useful information. In this paper, we examine the impact of page deskewing...
Recognizing mathematical expressions in PDF documents is a new and important field in document analysis. It is quite different from extracting mathematical expressions from image-based documents. In this paper, we propose a novel method that combines rule-based and learning-based methods to detect both isolated and embedded mathematical expressions in PDF documents. Moreover, various features of formulas,...
Table detection can be a valuable step in the analysis of unstructured documents. Although much work has been conducted in the domain of machine-print including books, scientific papers, etc., little has been done to address the case of handwritten inputs. In this paper, we study table detection in scanned handwritten documents subject to challenging artifacts and noise. First, we separate text components...
Arabic writer identification is a very active research field. However, no standard benchmark is available for researchers in this field. The aim of this competition is to gather researchers and compare recent advances in Arabic writer identification. This competition was hosted by Kaggle and attracted thirty participants from both academia and industry. This paper gives details on this competition,...
In this paper, we present a fast and effective method for removing pre-printed rule-lines in handwritten document images. We use an integral-image representation which allows fast computation of features and apply techniques for large scale Support Vector learning using a data selection strategy to sample a small subset of training data. Results on both constructed and real-world data sets show that...
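The integral-image representation mentioned in the abstract above can be sketched as follows. This is a generic summed-area table, shown only to illustrate why it allows fast feature computation; it is not taken from the paper, and the function names (`integral_image`, `box_sum`) and the half-open rectangle convention are assumptions made for this example.

```python
def integral_image(image):
    """Build a summed-area table with one extra row and column of zeros.

    ii[y][x] holds the sum of image[0:y][0:x], so any rectangle sum can
    later be read off with four lookups.
    """
    h, w = len(image), len(image[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0  # running sum of the current row up to column x
        for x in range(w):
            row_sum += image[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, top, left, bottom, right):
    """Sum of pixels in rows [top, bottom) and columns [left, right)."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


if __name__ == "__main__":
    # Hypothetical 3x3 patch of pixel values.
    patch = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    ii = integral_image(patch)
    print(box_sum(ii, 1, 1, 3, 3))  # sum of the lower-right 2x2 block
```

The table is built in a single pass over the image, after which the sum over any axis-aligned rectangle costs four lookups regardless of the rectangle's size.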
In this paper, we propose a novel framework for the segmentation of documents with complex layouts. The document segmentation is performed by a combination of clustering and conditional random field (CRF) based modeling. The bottom-up approach for segmentation assigns each pixel to a cluster plane based on color intensity. A CRF-based discriminative model is learned to extract the local neighborhood information...