String Processing and Information Retrieval

chapter

“Search Is a Solved Problem” and Other Annoying Fallacies

David Hawking

Lecture Notes in Computer Science > String Processing and Information Retrieval > 1-1

Since Google became a celebrity in the early noughties, many people with the power to control and direct research resources have taken the view that there is no more research to be done on the problem of information retrieval. In reality, there are so many variants of “the search problem” that not all have been catalogued, and few have been solved to the point where we can rely absolutely on the quality...

chapter

Approximate Runs - Revisited

Gad M. Landau

Lecture Notes in Computer Science > String Processing and Information Retrieval > 2-2

The problem of finding repeats within a string is an important computational problem with applications in data compression and in the field of molecular biology. Both exact and inexact repeats occur frequently in the genome, and certain repeats are known to be related to human diseases. A multiple tandem repeat in a sequence S is a (periodic) substring r of S of the form r = u ...

chapter

Engineering Radix Sort for Strings

Juha Kärkkäinen, Tommi Rantala

Lecture Notes in Computer Science > String Processing and Information Retrieval > 3-14

We describe new implementations of MSD radix sort for efficiently sorting large collections of strings. Our implementations are significantly faster than previous MSD radix sort implementations, and in fact faster than any other string sorting algorithm on several data sets. We also describe a new variant that achieves high space-efficiency at a small additional cost on runtime.

chapter

Faster Text Fingerprinting

Roman Kolpakov, Mathieu Raffinot

Lecture Notes in Computer Science > String Processing and Information Retrieval > 15-26

Let s = s ₁ .. s _n be a text (or sequence) on a finite alphabet Σ. A fingerprint in s is the set of distinct characters contained in one of its substrings. Fingerprinting a text consists in computing the set ${\cal F}$ of all fingerprints of all its substrings. A fingerprint, $f \in {\cal F}$ , admits a number of maximal locations 〈i,j 〉 in S, that...

chapter

Context-Sensitive Grammar Transform: Compression and Pattern Matching

Shirou Maruyama, Yohei Tanaka, Hiroshi Sakamoto, Masayuki Takeda

Lecture Notes in Computer Science > String Processing and Information Retrieval > 27-38

A framework of context-sensitive grammar transform is proposed. A greedy compression algorithm with the transform model is presented as well as a Knuth-Morris-Pratt (KMP)-type compressed pattern matching (CPM) algorithm. The compression performance is a match for gzip and Re-Pair. The search speed of our CPM algorithm is almost twice faster than the KMP type CPM algorithm on Byte-Pair-Encoding by...

chapter

Improved Variable-to-Fixed Length Codes

Shmuel T. Klein, Dana Shapira

Lecture Notes in Computer Science > String Processing and Information Retrieval > 39-50

Though many compression methods are based on the use of variable length codes, there has recently been a trend to search for alternatives in which the lengths of the codewords are more restricted, which can be useful for fast decoding and compressed searches. This paper explores the construction of variable-to-fixed length codes, which have been suggested long ago by Tunstall. Using a new heuristic...

chapter

Term Impacts as Normalized Term Frequencies for BM25 Similarity Scoring

Vo Ngoc Anh, Raymond Wan, Alistair Moffat

Lecture Notes in Computer Science > String Processing and Information Retrieval > 51-62

The BM25 similarity computation has been shown to provide effective document retrieval. In operational terms, the formulae which form the basis for BM25 employ both term frequency and document length normalization. This paper considers an alternative form of normalization using document-centric impacts, and shows that the new normalization simplifies BM25 and reduces the number of tuning parameters...

chapter

The Effect of Weighted Term Frequencies on Probabilistic Latent Semantic Term Relationships

Laurence A. F. Park, Kotagiri Ramamohanarao

Lecture Notes in Computer Science > String Processing and Information Retrieval > 63-74

Probabilistic latent semantic analysis (PLSA) is a method of calculating term relationships within a document set using term frequencies. It is well known within the information retrieval community that raw term frequencies contain various biases that affect the precision of the retrieval system. Weighting schemes, such as BM25, have been developed in order to remove such biases and hence improve...

chapter

Comparison of s-gram Proximity Measures in Out-of-Vocabulary Word Translation

Anni Järvelin, Antti Järvelin

Lecture Notes in Computer Science > String Processing and Information Retrieval > 75-86

Classified s-grams have been successfully used in cross-language information retrieval (CLIR) as an approximate string matching technique for translating out-of-vocabulary (OOV) words. For example, s-grams have consistently outperformed other approximate string matching techniques, like edit distance or n-grams. The Jaccard coefficient has traditionally been used as an s-gram based string proximity...

chapter

Speeding Up Pattern Matching by Text Sampling

Francisco Claude, Gonzalo Navarro, Hannu Peltola, Leena Salmela, more

Lecture Notes in Computer Science > String Processing and Information Retrieval > 87-98

We introduce a novel alphabet sampling technique for speeding up both online and indexed string matching. We choose a subset of the alphabet and select the corresponding subsequence of the text. Online or indexed searching is then carried out on that subsequence, and candidate matches are verified in the full text. We show that this speeds up online searching, especially for moderate to long patterns,...

chapter

Mismatch Sampling

Raphaël Clifford, Klim Efremenko, Benny Porat, Ely Porat, more

Lecture Notes in Computer Science > String Processing and Information Retrieval > 99-108

We consider the well known problem of pattern matching under the Hamming distance. Previous approaches have shown how to count the number of mismatches efficiently, especially when a bound is known for the maximum Hamming distance. Our interest is different in that we wish collect a random sample of mismatches of fixed size at each position in the text. Given a pattern p of length m and a text t of...

chapter

Sliding CDAWG Perfection

Martin Senft, Tomáš Dvořák

Lecture Notes in Computer Science > String Processing and Information Retrieval > 109-120

The Compact Directed Acyclic Word Graph (CDAWG) is a well-known suffix data structure designed for an efficient solution to problems on strings. Some applications, especially those from the data compression field, require maintaining a CDAWG over a sliding window. The fastest known solution to this problem is an approximation algorithm that slides a CDAWG in an amortized constant time. However, the...

chapter

Self-indexing Natural Language

Nieves R. Brisaboa, Antonio Fariña, Gonzalo Navarro, Angeles S. Places, more

Lecture Notes in Computer Science > String Processing and Information Retrieval > 121-132

Self-indexing is a concept developed for indexing arbitrary strings. It has been enormously successful to reduce the size of the large indexes typically used on strings, namely suffix trees and arrays. Self-indexes represent a string in a space close to its compressed size and provide indexed searching on it. On natural language, a compressed inverted index over the compressed text already provides...

chapter

New Perspectives on the Prefix Array

W. F. Smyth, Shu Wang

Lecture Notes in Computer Science > String Processing and Information Retrieval > 133-143

In this paper we consider the of a string in which and, for i > 1, iff k is the largest integer such that . The prefix array is closely related to the : an integer array [1..n] such that iff the length of the longest border of is k. Border arrays or their variants are used in many string algorithms and prefix arrays can be used directly for pattern-matching...

chapter

Indexed Hierarchical Approximate String Matching

Luís M. S. Russo, Gonzalo Navarro, Arlindo L. Oliveira

Lecture Notes in Computer Science > String Processing and Information Retrieval > 144-154

We present a new search procedure for approximate string matching over suffix trees. We show that hierarchical verification, which is a well-established technique for on-line searching, can also be used with an indexed approach. For this, we need that the index supports bidirectionality, meaning that the search for a pattern can be updated by adding a letter at the right or at the left. This turns...

chapter

An Efficient Linear Space Algorithm for Consecutive Suffix Alignment under Edit Distance (Short Preliminary Paper)

Heikki Hyyrö

Lecture Notes in Computer Science > String Processing and Information Retrieval > 155-163

We discuss the following variant of incremental edit distance computation: Given strings A and B with lengths m and n, respectively, the task is to compute, in n successive iterations j = n ...1, an encoding of the edit distances between A and all prefixes of B _j..n. Here B _j..n is the suffix of B that begins at its jth character...

chapter

Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence Collections

Jouni Sirén, Niko Välimäki, Veli Mäkinen, Gonzalo Navarro

Lecture Notes in Computer Science > String Processing and Information Retrieval > 164-175

A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. This paper is devoted to studying ways to store massive sets of...

chapter

Practical Rank/Select Queries over Arbitrary Sequences

Francisco Claude, Gonzalo Navarro

Lecture Notes in Computer Science > String Processing and Information Retrieval > 176-187

We present a practical study on the compact representation of sequences supporting rank, select, and access queries. While there are several theoretical solutions to the problem, only a few have been tried out, and there is little idea on how the others would perform, especially in the case of sequences with very large alphabets. We first present a new practical implementation of the compressed representation...

chapter

Clique Analysis of Query Log Graphs

Alexandre P. Francisco, Ricardo Baeza-Yates, Arlindo L. Oliveira

Lecture Notes in Computer Science > String Processing and Information Retrieval > 188-199

In this paper we propose a method for the analysis of very large graphs obtained from query logs, using query coverage inspection. The goal is to extract semantic relations between queries and their terms. We take a new approach to successfully and efficiently cluster these large graphs by analyzing clique overlap and a priori induced cliques. The clustering quality is evaluated with an extension...

chapter

Out of the Box Phrase Indexing

Frederik Transier, Peter Sanders

Lecture Notes in Computer Science > String Processing and Information Retrieval > 200-211

We present a method for optimizing phrase search based on inverted indexes. Our approach adds selected (two-term) phrases to an existing index. Whereas competing approaches are often based on the analysis of query logs, our approach works out of the box and uses only the information contained in the index. Also, our method is competitive in terms of query performance and can even improve on other...

INFONA - science communication portal

String Processing and Information Retrieval
15th International Symposium, SPIRE 2008, Melbourne, Australia, November 10-12, 2008. Proceedings

“Search Is a Solved Problem” and Other Annoying Fallacies

Approximate Runs - Revisited

Engineering Radix Sort for Strings

Faster Text Fingerprinting

Context-Sensitive Grammar Transform: Compression and Pattern Matching

Improved Variable-to-Fixed Length Codes

Term Impacts as Normalized Term Frequencies for BM25 Similarity Scoring

The Effect of Weighted Term Frequencies on Probabilistic Latent Semantic Term Relationships

Comparison of s-gram Proximity Measures in Out-of-Vocabulary Word Translation

Speeding Up Pattern Matching by Text Sampling

Mismatch Sampling

Sliding CDAWG Perfection

Self-indexing Natural Language

New Perspectives on the Prefix Array

Indexed Hierarchical Approximate String Matching

An Efficient Linear Space Algorithm for Consecutive Suffix Alignment under Edit Distance (Short Preliminary Paper)

Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence Collections

Practical Rank/Select Queries over Arbitrary Sequences

Clique Analysis of Query Log Graphs

Out of the Box Phrase Indexing

Filter options

Publication date

Keywords

INFONA - science communication portal

String Processing and Information Retrieval 15th International Symposium, SPIRE 2008, Melbourne, Australia, November 10-12, 2008. Proceedings $("#expandableTitles").expandable();

Add recipient

Sending message cancelled

Are you sure you want to cancel sending this message?

Send message

Filter options

Publication date

Date range setting

Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.

Keywords

Reporting an error / abuse

Sending the report failed

Accessibility options

String Processing and Information Retrieval
15th International Symposium, SPIRE 2008, Melbourne, Australia, November 10-12, 2008. Proceedings