The Burrows-Wheeler Transform

https://news.ycombinator.com/rss Hits: 13
Summary

What's the dollar sign for? The $ marks the end of the string, and is needed to make the BWT reversible. Without that marker, you could still regenerate the matrix in Step , but you wouldn't know which row contains the original string. If it's an English word, you might guess it's banana and not nabana, but that's harder to do with DNA because most rotations will look reasonable. Intuition behind the BWT In Step , the sorting causes rows that start the same to be more likely to be grouped together. As a result, the character that comes right before (i.e. the character in the last column) is also likely to be similar, based on repeated patterns in the English language, and also in DNA sequences! For example, in the BWT of coconut, the letter c is grouped because it's always followed by an o. Although o is followed by either c or n, it still clusters in the BWT because its corresponding rows end up being sorted next to each other. If there was a row in Step that started with a letter in between c and n, the o's would no longer cluster. For example, what happens to the o's if you add an i to coconut: icoconut. Would the o's cluster if you tried ucoconut? Now it's your turn: try encoding your name or a repetitive string. Which characters can you add or remove to make the BWT cluster more or less? Decoding the BWT Starting from the BWT string, we can recover the original string from the BWT by more or less doing the opposite of the encoding algorithm: Start with an empty matrix, prepend the BWT string, sort the strings, and repeat until the matrix is filled. Once the matrix is filled, we can read off the string from any of the rows since we have the $ marker. How to use BWT for sequence alignment So far, we've seen how to use the Burrows-Wheeler Transform to encode and decode strings. That's nice and all, but how can we use the BWT for sequence alignment, i.e. looking for a small string in a much larger string? To do that, I first need to introduce yet another magical pr...

First seen: 2025-10-09 21:21

Last seen: 2025-10-10 09:29