Section The Levenshtein Metric
The Levenshtein metric is one measure of distance that researchers use to understand DNA. DNA is composed of double chains of nucleotides, which wind together to form a double helix. The nucleotides come in four types: adenine (A), cytosine (C), guanine (G), and thymine (T). The nucleotides in the two chains of a DNA strand pair together, (A with T, and C with G), so the nucleotides in one chain determine the nucleotides in the other. Therefore, we can represent a DNA strand with a string of letters from the alphabet \(\{A, C, G, T\}\text{.}\) One problem DNA researchers have is how to compare two strands of DNA, and the Levenshtein metric is one way that the distance between strands can be measured. Other metrics could be used, but the Levenshtein metric is appropriate to the task for several reasons. During evolution, changes in DNA sequences arise due to nucleotide substitution, or the insertion or deletion of nucleotides. These evolutionary changes can be modeled by the operations that determine the Levenshtein distance better than other metrics. In addition, the Levenshtein metric can be used to calculate distances between strings of different lengths. The Levenshtein metric also has applications in spell checkers, speech recognition, and automated plagiarism detection. To understand how the Levenshtein metric is calculated, consider the question of how far apart the words “green” and “grease” are.
To compare these words, we have to be able to change letters, and add or delete letters. If \(x = x_1x_2 \cdots x_n\) is a string of letters, we allow the following operations:
- a deletion:
replace \(x\) with \(x_1 \cdots x_{i-1}x_{i+1} \cdots x_n\) for some \(i\text{,}\)
- an insertion:
replace \(x\) with \(x_1 \cdots x_{i}yx_{i+1} \cdots x_n\text{,}\) where \(y\) is an allowable letter and \(0 \leq i \leq n\text{,}\)
- a substitution:
replace \(x\) with \(x_1 \cdots x_{i-1}yx_{i+1} \cdots x_n\text{,}\) where \(y\) is an allowable letter and \(1 \leq i \leq n\text{.}\)
Activity 4.2.
(a)
Using the allowable operations, change the word “green” into the word “grease”. Specifically identify each operation you use. (Note: the intermediate strings of letters do not have to form recognizable words.) How many operations did you use?
(b)
If it took three operations to transform “green” into “grease”, we could say that the distance between “green” and “grease” is at most 3. However, it may be possible to transform “green” into “grease” in fewer than 3 operations, which might change our opinion of the distance between these words. In general, to define the Levenshtein distance
\(d_L\) between a sting
\(x\) and a string
\(y\text{,}\) let
\(m_d\) denote the number of deletions,
\(m_i\) the number of insertions, and
\(m_s\) the number of substitutions we use to get from
\(x\) to
\(y\text{.}\) There may be many different combinations of
\(m_d\text{,}\) \(m_i\text{,}\) and
\(m_s\) that get us from
\(x\) to
\(y\text{,}\) so we want the smallest number.
Definition 4.2.
The Levenshtein distance \(d_L(x,y)\) between strings \(x\) and \(y\) is
\begin{equation*}
d_L(x,y) = \min\{m_d+m_i+m_s\}\text{.}
\end{equation*}
Prove that the Levenshtein distance function is really a metric on the set of all possible words (sensical or nonsensical).
(c)
A spell checker corrects the misspelled word “tupotagry”. Using the Levenshtein metric, which word would the spell checker use as the closest to “tupotagry”? Why?
\begin{equation*}
\text{"topography} \qquad \text{"topology"} \qquad \text{"tautology"}
\end{equation*}