New Algorithmic Guarantees for Searching in Large Datasets

Umang Bhaskar
Monday, 20 May 2019, 14:00 to 15:00
A-201 (STCS Seminar Room)
Abstract: Many recent advancements in computation have come from harnessing the power of very large sets of data.  This leads to an algorithmic question: how can we store a large dataset so that we can quickly query it later on?  The speed of these queries is a performance bottleneck for many modern big data applications.
One approach to speed up these searches is to adapt to structure in the datasets.  This achieves the best of both worlds: the worst-case guarantees of classical algorithms, on top of the heuristic performance possible for many datasets seen in practice.  My research focuses on algorithms with asymptotic performance bounds that improve on their classic counterparts by taking advantage of structure.
In my talk I will discuss two fundamental search problems.  The first, membership search, queries if a given item is a member of the dataset. I will talk about a new result that speeds up these queries adaptively: the data structure changes its structure after each query, greatly improving the performance on future queries while still retaining optimal worst-case bounds.
I will also talk about similarity search under edit distance, in which we want to find the most similar item in the dataset.  Edit distance is a notion of similarity for texts, so these queries have applications in areas such as text processing and bioinformatics.  I will talk about a search algorithm that uses an approximate notion of similarity for edit distance.  This provides a smooth performance curve: easy datasets containing a wide variety of texts can be searched extremely quickly, whereas worst-case datasets with densely clustered texts require more expensive searches.