Welcome to WCTA 2023
The 18th Workshop on Compression, Text, and Algorithms (WCTA 2023) will be held at the Department of Computer Science of the University of Pisa, Italy, on September 29th, 2023, just after SPIRE.
WCTA is a forum primarily intended for early-stage researchers to present their work. There are no published proceedings, so the results presented can also be submitted to most other workshops and conferences.
Call for Papers
We encourage junior members of our community to submit a one-page abstract (not including references) presenting, preferably, unpublished work, work in progress, surveys of interest, or open problems, among others. WCTA has no published proceedings, so the results presented can also be submitted to other workshops and conferences.
Important Dates
- Abstract deadline: August 28th, 2023 (AoE), extended from August 20th, 2023
- Notification: August 29th, 2023
- Workshop: September 29th, 2023
Submission
Please submit the one-page abstract (excluding references) in PDF format by email to wcta2023pisa@gmail.com
Program of WCTA 2023
September 29th 2023 | University of Pisa – Italy
09:00 – Welcome from WCTA chairs
09:10 – Keynote | Pierre Peterlongo | “Indexing Large Metagenomic Projects. Application to the Tara Oceans Datasets” | SLIDES | Chair: Giulio Ermanno Pibiri
10:30 – Coffee break
Session 1: Chair Veronica Guerrini
11:00 – Patrick Dinklage, “Lempel-Ziv-like compression in parameterized space via top-k frequent pattern estimation”
11:30 – Giulia Bernardini, “Inferring Phylogenetic Networks from Multiple Multipartite Trees”
12:00 – Adrián Goga, “Wheeler maps” (remote)
12:30 – Lunch
Session 2: Chair Adrián Gómez Brandón
14:00 – Andrea Guerra, “NEAT: A nonlinear learned lossless compressor for time series”
14:30 – Martita Muñoz, “Clustering-based compression for raster time series”
15:00 – Aaron Hong, “Another virtue of wavelet forests?” (remote)
15:30 – Coffee break
Session 3: Chair Bojana Kodric
16:00 – Lorenzo Carfagna, “A linear time construction algorithm for the Two-Dimensional Block Trees”
16:30 – Nicola Cotumaccio, “The lexicographic structure of graphs: algorithms, recent results and open problems”
17:00 – Xing Lyu, “Sum-of-Local-Effects Data Structures for Separable Graphs” (remote)
17:30 – Closing remarks from chairs
Pierre Peterlongo
Title
Indexing Large Metagenomic Projects. Application to the Tara Oceans Datasets
Abstract
Despite their wealth of biological information, public sequencing databases are largely underutilized. One cannot efficiently search for a sequence of interest in these immense resources. Sophisticated computational methods such as approximate membership query data structures allow searching for fixed-length words (k-mers) in large datasets. Yet they face scalability challenges when applied to thousands of complex sequencing experiments. In this context we propose kmindex, a new approach that uses inverted indexes based on Bloom filters. Thanks to its algorithmic choices and its fine-tuned implementation, kmindex offers the possibility to index thousands of highly complex metagenomes into an index that answers sequence queries in a tenth of a second. Index construction is one order of magnitude faster than previous approaches, and query time is two orders of magnitude faster. Based on Bloom filters, kmindex achieves negligible false positive rates, below 0.01% on average. Its average false positive rate is four orders of magnitude lower than existing approaches, for similar index sizes. It has been successfully used to index 1,393 complex marine seawater metagenome samples of raw sequences from the Tara Oceans project, demonstrating its effectiveness on large and complex datasets. This level of scaling was previously unattainable. Building on the kmindex results, we provide a public web server named “Ocean Read Atlas” (ORA) at https://ocean-read-atlas.mio.osupytheas.fr/ that can answer queries against the entire Tara Oceans dataset in real time. kmindex is open-source software available at https://github.com/tlemane/kmindex.
Keynote Slides: PDF
Short Bio
Pierre Peterlongo is a research director at Inria and heads the “Genscale” research team at IRISA/Inria, the Computer Science laboratory of Rennes. He studies algorithms for DNA sequencing data. He obtained his PhD in Computer Science in 2006 at the University of Marne-la-Vallée and then did a postdoc at IRISA in Rennes. During his PhD and postdoc, he worked on text algorithms applied to DNA sequences, for various problems of pattern matching, searching for approximate repeats, and sequence alignment. He was then recruited as an Inria junior researcher in 2008 in Rennes. This coincided with the advent of high-throughput sequencing technologies, and his research interests moved towards the design of efficient data structures and algorithms able to scale to the huge amounts of data generated by such technologies. He is notably recognized for his work and software for reference-free variant discovery, de novo comparative metagenomics, and large-scale sequence indexing using k-mers.
- Veronica Guerrini, University of Pisa
- Giulio Ermanno Pibiri, Ca’ Foscari University of Venice