Bio Signal Processing and Communication

Compression of Genomic Sequences
Jens-Rainer Ohm, Christian Rohlfing

Next Generation Sequencing (NGS) technologies are making the use of genomic information everyday practice in several fields, but the growing volume of generated data is becoming a serious obstacle to widespread adoption. Efficient compression of genomic data is therefore a critical element, and its absence currently limits the application potential of genomic information.

Several master theses are offered in this area, and further ones are planned as follow-ups. Candidates should have solid knowledge and skills in mathematics for signal analysis and processing, with emphasis on coding/compression, as well as in similarity analysis of discrete sample patterns and letter strings. (Note: this background can also come from related areas such as audio or video signal processing and compression.)

The following concepts are planned to be investigated in detail:

  • Complex (amplitude/phase) signal representations for compression of nucleotide sequences
  • Statistical analysis of nucleotide sequences at local and global levels in signal domain and Fourier domain, identification of “coding regions”
  • Methods of grouping nucleotides into “codons” for optimum compression
  • Fast matching procedures for read pairs
  • Joint compression of genomic information and associated quality values of the reads
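
As a rough illustration of the first two concepts, the sketch below maps a nucleotide string to a complex-valued signal and measures how much of its spectral energy falls on the period-3 frequency, a property often associated with protein-coding regions. The mapping table, the function names and the energy-ratio measure are assumptions chosen for this example only; they are not the methods to be developed in the theses.

```python
import numpy as np

# Hypothetical complex mapping (an assumption for illustration only):
# each nucleotide is placed on one corner of a square in the complex plane.
COMPLEX_MAP = {"A": 1 + 1j, "C": -1 - 1j, "G": -1 + 1j, "T": 1 - 1j}


def to_complex(seq):
    """Map a nucleotide string (A, C, G, T) to a complex-valued signal."""
    return np.array([COMPLEX_MAP[base] for base in seq.upper()], dtype=complex)


def period3_ratio(seq):
    """Fraction of the mean-removed spectral energy at the period-3 bins.

    Values near 1 indicate a strong period-3 pattern; values near 0
    indicate none. Returns 0.0 if the signal has no energy left after
    mean removal or is shorter than three samples.
    """
    x = to_complex(seq)
    n = len(x) - len(x) % 3          # trim so that N/3 is an integer DFT bin
    if n < 3:
        return 0.0
    x = x[:n]
    x = x - x.mean()                 # remove the DC component
    power = np.abs(np.fft.fft(x)) ** 2
    total = power.sum()
    if total < 1e-12:
        return 0.0
    # A complex signal is not conjugate-symmetric in the Fourier domain,
    # so both the N/3 bin and the 2N/3 bin carry period-3 energy.
    return float((power[n // 3] + power[2 * n // 3]) / total)


if __name__ == "__main__":
    print(period3_ratio("ATG" * 20))   # strictly periodic with period 3
    print(period3_ratio("AC" * 30))    # periodic with period 2
```

In practice such a measure would be evaluated over a sliding window along the sequence, so that local changes in the period-3 energy can be observed; the window length would be a further design parameter.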

The developed genomic compression methods are planned to be compared against the emerging MPEG-G standard (ISO/IEC 23092-2) as a benchmark.