The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
Since Google became a celebrity in the early noughties, many people with the power to control and direct research resources have taken the view that there is no more research to be done on the problem of information retrieval. In reality, there are so many variants of “the search problem” that not all have been catalogued, and few have been solved to the point where we can rely absolutely on the quality...
The problem of finding repeats within a string is an important computational problem with applications in data compression and in the field of molecular biology. Both exact and inexact repeats occur frequently in the genome, and certain repeats are known to be related to human diseases. A multiple tandem repeat in a sequence S is a (periodic) substring r of S of the form r = u...
We describe new implementations of MSD radix sort for efficiently sorting large collections of strings. Our implementations are significantly faster than previous MSD radix sort implementations, and in fact faster than any other string sorting algorithm on several data sets. We also describe a new variant that achieves high space-efficiency at a small additional cost on runtime.
Let s = s1 .. sn be a text (or sequence) on a finite alphabet Σ. A fingerprint in s is the set of distinct characters contained in one of its substrings. Fingerprinting a text consists in computing the set of all fingerprints of all its substrings. A fingerprint, , admits a number of maximal locations 〈i,j 〉 in S, that...
A framework of context-sensitive grammar transform is proposed. A greedy compression algorithm with the transform model is presented as well as a Knuth-Morris-Pratt (KMP)-type compressed pattern matching (CPM) algorithm. The compression performance is a match for gzip and Re-Pair. The search speed of our CPM algorithm is almost twice faster than the KMP type CPM algorithm on Byte-Pair-Encoding by...
Though many compression methods are based on the use of variable length codes, there has recently been a trend to search for alternatives in which the lengths of the codewords are more restricted, which can be useful for fast decoding and compressed searches. This paper explores the construction of variable-to-fixed length codes, which have been suggested long ago by Tunstall. Using a new heuristic...
The BM25 similarity computation has been shown to provide effective document retrieval. In operational terms, the formulae which form the basis for BM25 employ both term frequency and document length normalization. This paper considers an alternative form of normalization using document-centric impacts, and shows that the new normalization simplifies BM25 and reduces the number of tuning parameters...
Probabilistic latent semantic analysis (PLSA) is a method of calculating term relationships within a document set using term frequencies. It is well known within the information retrieval community that raw term frequencies contain various biases that affect the precision of the retrieval system. Weighting schemes, such as BM25, have been developed in order to remove such biases and hence improve...
Classified s-grams have been successfully used in cross-language information retrieval (CLIR) as an approximate string matching technique for translating out-of-vocabulary (OOV) words. For example, s-grams have consistently outperformed other approximate string matching techniques, like edit distance or n-grams. The Jaccard coefficient has traditionally been used as an s-gram based string proximity...
We introduce a novel alphabet sampling technique for speeding up both online and indexed string matching. We choose a subset of the alphabet and select the corresponding subsequence of the text. Online or indexed searching is then carried out on that subsequence, and candidate matches are verified in the full text. We show that this speeds up online searching, especially for moderate to long patterns,...
We consider the well known problem of pattern matching under the Hamming distance. Previous approaches have shown how to count the number of mismatches efficiently, especially when a bound is known for the maximum Hamming distance. Our interest is different in that we wish collect a random sample of mismatches of fixed size at each position in the text. Given a pattern p of length m and a text t of...
The Compact Directed Acyclic Word Graph (CDAWG) is a well-known suffix data structure designed for an efficient solution to problems on strings. Some applications, especially those from the data compression field, require maintaining a CDAWG over a sliding window. The fastest known solution to this problem is an approximation algorithm that slides a CDAWG in an amortized constant time. However, the...
Self-indexing is a concept developed for indexing arbitrary strings. It has been enormously successful to reduce the size of the large indexes typically used on strings, namely suffix trees and arrays. Self-indexes represent a string in a space close to its compressed size and provide indexed searching on it. On natural language, a compressed inverted index over the compressed text already provides...
In this paper we consider the of a string in which and, for i > 1, iff k is the largest integer such that . The prefix array is closely related to the : an integer array [1..n] such that iff the length of the longest border of is k. Border arrays or their variants are used in many string algorithms and prefix arrays can be used directly for pattern-matching...
We present a new search procedure for approximate string matching over suffix trees. We show that hierarchical verification, which is a well-established technique for on-line searching, can also be used with an indexed approach. For this, we need that the index supports bidirectionality, meaning that the search for a pattern can be updated by adding a letter at the right or at the left. This turns...
We discuss the following variant of incremental edit distance computation: Given strings A and B with lengths m and n, respectively, the task is to compute, in n successive iterations j = n ...1, an encoding of the edit distances between A and all prefixes of Bj..n. Here Bj..n is the suffix of B that begins at its jth character...
A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. This paper is devoted to studying ways to store massive sets of...
We present a practical study on the compact representation of sequences supporting rank, select, and access queries. While there are several theoretical solutions to the problem, only a few have been tried out, and there is little idea on how the others would perform, especially in the case of sequences with very large alphabets. We first present a new practical implementation of the compressed representation...
In this paper we propose a method for the analysis of very large graphs obtained from query logs, using query coverage inspection. The goal is to extract semantic relations between queries and their terms. We take a new approach to successfully and efficiently cluster these large graphs by analyzing clique overlap and a priori induced cliques. The clustering quality is evaluated with an extension...
We present a method for optimizing phrase search based on inverted indexes. Our approach adds selected (two-term) phrases to an existing index. Whereas competing approaches are often based on the analysis of query logs, our approach works out of the box and uses only the information contained in the index. Also, our method is competitive in terms of query performance and can even improve on other...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.