Information Retrieval Needs More Theoreticians
The juxtaposition of “theory” and “Information Retrieval” (IR) often invites much eyebrow-raising and chin-stroking. That is because a trove of exciting applied work in IR—particularly since the dawn of deep learning—along with a constant drift of the literature towards empirical studies and heuristics, tempt one to conclude that “experimental” is a more apt adjective to describe much of the discipline. The reality, however, is more nuanced: experimental explorations rest on foundational algorithmic and data structure research that is routinely overlooked.
Embarking on a quick journey through time helps unearth the theoretical underpinnings of IR. Historical examples abound, from compactly indexing massive amounts of text with compressed inverted indexes; to efficient set intersection algorithms and mechanisms for the dynamic pruning of inverted lists for fast top-k retrieval; to compressed representations of decision forests and their efficient inference for learning-to-rank. One can effortlessly draw a direct line from these developments to a range of applications such as search and recommendation systems that have flourished at massive scales.
As the IR landscape evolves and the field becomes more intertwined with deep learning, vectors, and matrices replace text as units of information. That, in turn, gives way to fascinating new algorithmic problems for which the existing tools are not only insufficient but are also wrong. In fact, a strategy of avoiding new problems and resorting instead to the existing arsenal of data structures and algorithms holds us back and restrains us from looking beyond the lamppost for new ideas.
That is the argument I will make in this keynote: For IR research to remain innovative, more theoreticians must go back to the drawing board, reexamine, and possibly reinvent much of what is at its core. To illustrate this point, I will give concrete examples of some of today’s open problems. These include developing (non-linear) sketching algorithms for vectors and matrices for the purposes of fast retrieval at scale; building compact indexes; designing approximate top-k retrieval by maximal in- ner product, possibly subject to additional constraints; and generalizing retrieval from inner product to matrix norms. I hope that this talk will convince us of the need for more theoretical work and entice this particular community to explore new challenges in IR.
Inge Li Gørtz
Regular Expression Matching
A regular expression specifies a set of strings formed by single characters combined with concatenation, union, and Kleene star operators. Given a regu- lar expression R and a string Q, the regular expression matching problem is to decide if Q matches any of the strings specified by R. Regular expressions are a fundamental concept in formal languages and regular expression matching is a basic primitive for searching and processing data. A standard textbook solution [Thompson, CACM 1968] constructs and simulates a nondeterministic finite au- tomaton, leading to an O(nm) time algorithm, where n is the length of Q and m is the length of R. Despite considerable research efforts only polylogarithmic improvements of this bound are known. Recently, conditional lower bounds pro- vided evidence for this lack of progress when Backurs and Indyk [FOCS 2016] proved that, assuming the strong exponential time hypothesis (SETH), regular expression matching cannot be solved in O((nm)1−ϵ), for any constant ϵ > 0. Hence, the complexity of regular expression matching is essentially settled in terms of n and m.
In this talk we will give an overview of the main standard techniques for regular expression matching, and show a new approach that goes beyond worst- case analysis in n and m. We introduce a density parameter, ∆, that captures the amount of nondeterminism in the NFA simulation on Q. The density is at most nm+1 but can be significantly smaller. Our main result is a new algorithm that solves regular expression matching in time
Recent Results on the Longest Common Substring Problem
In the longest common substring (LCS) problem, we are given two strings S and T , each of length at most n, and are to find a longest string occurring as a fragment of both S and T . An O(n)- time solution for this classic problem, assuming that S and T are over a constant-sized alphabet, dates back to Weiner’s seminal paper on suffix trees (SWAT 1973). This solution can be extended to strings over an integer alphabet using Farach’s suffix tree construction algorithm (FOCS 1997) and to computing the LCS of many strings of total length n (Hui, CPM 1992).
Recently it was shown that, under the word RAM model of computation, the LCS of two strings over a constant-sized alphabet can be computed even faster. More precisely, Charalampopoulos, Kociumaka, Pissis, and Radoszewski (ESA 2021) showed that assuming the alphabet size is σ and that the strings are represented in a packed form of size O(n log σ/ log n), the LCS of S and T can be computed in O(n log σ/ √ log n) time.
In the last 10 years, several further results on the LCS problem and its variants have been presented. These include computing the LCS of two strings using only O(s) space, where s < n, maintaining the LCS of two dynamic strings, internal queries for the LCS of two fragments of a given text, approximate variants of the LCS in which one searches for the longest fragment of string S that resembles a fragment of string T (the two fragments may be required to have a small Hamming or edit distance), and computing the LCS with the assumption that one of the strings is given in a compressed form (a less efficient algorithm assuming that both S and T are compressed was known earlier).
In the talk, I will present a survey of known results on the LCS problem, with an emphasis on the result from ESA 2021 and the underlying technique by Charalampopoulos et al. (CPM 2018), and list open problems related to the LCS computation.