We present a simple yet effective LSTM-based approach for recognizing machine-print text from raw pixels. We use a fully-connected feed-forward neural network for feature extraction over a sliding window, the output of which is directly fed into a stacked bi-directional LSTM. We train the network using the CTC objective function and use a WFST language model during recognition. Experimental results...
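As a rough illustration of the CTC-trained output layer mentioned in the abstract above, the sketch below shows best-path (greedy) CTC decoding: take the highest-scoring label per frame, merge consecutive repeats, then drop blanks. This is only a minimal sketch and not the paper's decoder, which uses a WFST language model during recognition. The function name `ctc_greedy_decode`, the blank index `0`, and the toy scores are assumptions made for illustration.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank label (hypothetical choice)


def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Best-path decoding: argmax per frame, merge repeats, drop blanks."""
    # Pick the highest-scoring label index for every frame.
    best_path = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    # Merge runs of identical consecutive labels into a single label.
    collapsed = [label for label, _ in groupby(best_path)]
    # Remove blank labels, which only separate repeated characters.
    return [label for label in collapsed if label != blank]


# Toy per-frame scores over three labels (0 = blank, 1 and 2 = characters).
toy_scores = [
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7],
    [0.1, 0.2, 0.7],
]
decoded = ctc_greedy_decode(toy_scores)
```

The blank between the two runs of label 1 is what allows the same character to be emitted twice in a row.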
The goal of document image quality assessment (DIQA) is to build a computational model which can predict the degree of degradation for document images. Based on the estimated quality scores, immediate feedback can be provided by document processing and analysis systems, which helps to maintain, organize, recognize and retrieve the information from document images. Recently, the bag-of-visual-words...
Optical character recognition (OCR) accuracy of document images is an important factor for the success of many document processing and analysis tasks, especially for unconstrained captured document images. Although several document image OCR capability assessment methods have been proposed, they mostly model the problem based on the empirically defined rules of image degradation, which cause the existing...
Automatically accessing information from unconstrained image documents has important applications in business and government operations. These real-world applications typically combine optical character recognition (OCR) with language and information technologies, such as machine translation (MT) and keyword spotting. OCR output has errors and presents unique challenges to late-stage processing. This...
This paper presents the most recent progress and state-of-the-art results obtained from BBN's Arabic offline handwriting recognition research. Our system is based on a left-to-right hidden Markov model and integrates discriminative learning methods including discriminative MPE and n-best rescoring using the scores of glyph classifiers (SVM, DNN) and the RNNLM. Arabic-related features for n-best rescoring...
The recurrent neural network language model (RNNLM) is a discriminative, non-Markovian model that can capture long-span word history in natural language. It has proven successful in automatic speech recognition and machine translation. In this work, we applied RNNLM to the n-best rescoring stage of the state-of-the-art BBN Byblos OCR (optical character recognition) system for handwriting...
This paper presents a new framework for OCR error detection, which uses a conditional random field model to combine rich features from multiple sources, including confusion networks (c-nets), lexical local context, and a recurrent neural network language model (RNNLM). We propose a novel, efficient method for computing character-level c-net based RNNLM scores by using dynamic programming and c-net partial...
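The n-best rescoring stage mentioned in the abstract above can be sketched generically: each recognizer hypothesis keeps its original log score, a language-model log score is added with a weight, and the list is re-sorted. This is a minimal sketch under stated assumptions, not the BBN Byblos implementation. The function `rescore_nbest`, the weight `lm_weight`, and the lookup-table stand-in for a language model are all hypothetical.

```python
def rescore_nbest(hypotheses, lm_score, lm_weight=0.5):
    """Re-rank (text, recognizer_log_score) pairs using an extra LM log score."""
    # Combine the recognizer score with the weighted language-model score.
    rescored = [(text, base + lm_weight * lm_score(text)) for text, base in hypotheses]
    # Highest combined log score first.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Toy stand-in for a language model: a lookup table of log scores (assumption).
toy_lm = {"the cat": -0.5, "the cot": -3.0}

# The recognizer alone slightly prefers the second hypothesis.
nbest = [("the cat", -1.0), ("the cot", -0.9)]
reranked = rescore_nbest(nbest, lambda text: toy_lm[text])
best_text = reranked[0][0]
```

In this toy case the language-model term reverses the recognizer's original ranking, which is the effect rescoring is meant to achieve when the recognizer's top hypothesis is linguistically unlikely.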
We propose an end-to-end system for text detection and recognition in natural scenes and consumer videos. Maximally Stable Extremal Regions, which are robust to illumination and viewpoint variations, are selected as text candidates. Rich shape descriptors such as Histogram of Oriented Gradients, Gabor filter, corners and geometrical features are used to represent the candidates and classified using...
In this paper, we address the problem of text classification: classifying modern machine-printed text, handwritten text and historical typewritten text from degraded noisy documents. We propose a novel text classification approach based on iVector, a newly developed concept in speaker verification. For a given text line, the iVector is a fixed-length feature vector representation, transformed from...
We present a novel binarization method that is especially effective on historical documents with the following characteristics: (a) the documents contain free-form cursive handwritten text with significant but consistent slant, (b) scanning artifacts resulting in the text and background pixels not having uniform intensity even within the same page, and (c) pages containing significant amount of bleeds...
This paper presents a novel approach to detect Arabic OOV names from OCR'ed handwritten documents. In our approach, OOV names are searched for using approximate string match on character consensus networks (cnets). The retrieved regions are re-ranked using novel features representing the quality of the match and the likelihood of the detected region to be an OOV name. Our features that encode word...
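The approximate string matching step mentioned in the abstract above can be illustrated on plain strings with the classic dynamic-programming search for the substring closest to a query in edit distance. This is a simplified sketch: the paper searches character consensus networks rather than a single string, and the function `best_approximate_match` and the example strings are hypothetical.

```python
def best_approximate_match(pattern, text):
    """Return (distance, end_index) of the substring of text closest to pattern."""
    # Row 0 is all zeros: a match may start at any position in text for free.
    prev = [0] * (len(text) + 1)
    for i, p in enumerate(pattern, start=1):
        curr = [i] + [0] * len(text)
        for j, t in enumerate(text, start=1):
            curr[j] = min(
                prev[j] + 1,              # pattern character left unmatched
                curr[j - 1] + 1,          # extra character in text
                prev[j - 1] + (p != t),   # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    # The best match ends where the last row is smallest (first index on ties).
    end = min(range(len(text) + 1), key=prev.__getitem__)
    return prev[end], end


distance, end = best_approximate_match("salim", "mr selim said")
```

Here the query matches the region ending at `end` with one substitution, so a name can still be retrieved when the recognized text contains a character error.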
Feature extraction is an important step in off-line handwriting recognition systems to represent raw handwriting in a low-dimensional, tractable feature space. Traditionally, linear feature transforms such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are commonly used. The assumptions they make, however, usually cannot be satisfied in practice and thus the best performance...
We propose a contour-based shape decomposition approach that provides local segmentation of touching characters. The shape contour is linearized into edgelets and edgelets are merged into boundary fragments. The connection cost between boundary fragments is obtained by considering local smoothness, connection length and a stroke-level property called the Same Stroke Rate. Samples of connections...
We describe an end-to-end system for translating real-world Arabic field documents that contain a mix of handwritten and printed content into English. These documents are extremely challenging to recognize due to the presence of noise, poor image capture quality, and variations in writing style, writing device, font, layout, genre, etc. Furthermore, no off-the-shelf machine translation (MT) engine is...
In this paper, we describe our approach for extracting salient information from US census form images. These forms present several challenges including variations in individual form templates, skew, writing device, writing style, etc. We describe an innovative registration algorithm that is robust to scale variations for segmenting the input image into cells. Following registration, the borders of...
Camera-captured optical character recognition (OCR) is a challenging area because of artifacts introduced during image acquisition with consumer-domain hand-held and smartphone cameras. Critical information is lost if the user does not get immediate feedback on whether the acquired image meets the quality requirements for OCR. To avoid such information loss, we propose a novel automated image quality...
Handwritten text line segmentation on real-world data presents significant challenges that cannot be overcome by any single technique. Given the diversity of approaches and the recent advances in ensemble-based combination for pattern recognition problems, it is possible to improve the segmentation performance by combining the outputs from different line finding methods. In this paper, we propose...
In this paper, we describe an approach to extract text from broadcast videos. Candidate blocks are detected based on edge extraction results. Corners and geometrical features are used for the purpose of initial classification which is carried out by using a support vector machine (SVM). Considering the spatial inter-dependencies of different regions in the image, we propose a novel conditional random...
We present an OCR-driven writer identification algorithm in this paper. Our algorithm learns writer-specific characteristics more precisely from explicit character alignment using the Viterbi algorithm and shows significant reduction of closed-set writer identification error rates, compared with the GMM-based method. With writers' identities retrieved, we improve the performance of handwriting recognition...
We present a system for identification and recognition of handwritten and typewritten text from document images using hidden Markov models (HMMs) in this paper. Our text type identification uses OCR decoding to generate word boundaries followed by word-level handwritten/typewritten identification using HMMs. We show that the contextual constraints from the HMM significantly improve the identification...