A common operation in computational genomics is finding the number of k-mers in a DNA sequence, or the number of all possible substrings of length $k$ in a given sequence (this is similar to counting n-grams in computational linguistics). For example, given this sequence:
AAGAGT
It contains the following 2-mers:
AA
AG
(twice)GA
GT
And the following 3-mers:
AAG
AGA
GAG
AGT
Ask: How can we solve this problem if we only use lists and strings?
We need to generate all possible substrings of length $k$ (using only the DNA bases: A
, C
, T
, and G
), and then count how many times they appear in a DNA sequence.
Ask: What functions would you define?
Let's start with a function to generate all possible DNA k-mers.
def generate_dna_kmers(k):
'''
Return a list of all possible substrings of
length k using only characters A, C, T, and G
'''
bases = ["A", "C", "T", "G"]
last = bases
current = []
for i in range(k-1):
for b in bases:
for l in last:
current.append(l+b)
last = current
current= []
return last
generate_dna_kmers(1)
generate_dna_kmers(2)
generate_dna_kmers(3)
Now, we need an additional function that, given a single substring, will count the number of (potentially overlapping) occurrences of that substring in a DNA sequence.
def count_mer(mer, seq):
'''
Counts the number of times a substring mer
ocurrs in the sequence seq (including overlapping
occurrences)
sample use: count_mer("GGG", "AGGGCGGG") => 2
'''
k = len(mer)
count = 0
for i in range(0, len(seq)-k+1):
if mer == seq[i:i+k]:
count = count + 1
return count
count_mer("GGG", "AGGGCGGG")
Finally, we just need a function that, given a sequence and $k$ will use the previous two functions to count the number of occurrences of every possible k-mer in the sequence.
def kmer_count(k, seq):
'''
Return a list of the number of times each possible k-mer appears
in seq, including overlapping occurrences.
'''
mers = generate_dna_kmers(k)
rv = []
for m in mers:
cnt = count_mer(m, seq)
if cnt != 0:
rv.append((m, cnt))
return rv
kmer_count(2, "AAGAGT")
kmer_count(3, "AAGAGT")
seq1 = "AAGGCCTT"
seq2 = "AAAGGGCCCTTT"
seq3 = "ACTGTGCAAACCTTGGGGTTCCAA"
seq4 = "AAGGAAGGCGCGCCGGAAAGAG"
seq5 = "GGGGAGGG"
kmer_count(1, seq4)
kmer_count(2, seq4)
kmer_count(3, seq4)
Ask: This algorithm is correct but it is very inefficient. Why?
First of all, we generate all possible k-mers ($4^k$), even though they may not all appear in the sequence.
Then, in count_mer
we iterate over a sequence of length $n$ and check every substring of length $k$ to see if it matches the provided mer
. This requires $n\cdot k$ operations.
Finally, in kmer_count
we call count_mer
$4^k$ times (the number of possible k-mers). So, we have to perform a total of $n\cdot k \cdot 4^k$ operations. As $k$ grows, the number of operations grows exponentially!
This is terrible performance: any time you find yourself writing an algorithm where the number of "steps" grows exponentially with the size of the input, you probably need to look for a better solution (caveat: there are some types of problems where it is unknown whether there are solutions better than exponential)
from dnafile import read
seq_drosofila = read("drosofila-small.txt")
len(seq_drosofila)
kmer_count(1, seq_drosofila)
kmer_count(2, seq_drosofila)
%time kmers = kmer_count(1, seq_drosofila)
%time kmers = kmer_count(2, seq_drosofila)
%time kmers = kmer_count(3, seq_drosofila)
%time kmers = kmer_count(4, seq_drosofila)
Notice how the time with $k$ is roughly 4 times greater than the time with $k-1$
How can we improve on the k-mer counting algorithm using dictionaries?
Instead of generating all possible k-mers and counting the occurrence of each of them in the sequence, we make a single pass through the sequence and maintain a dictionary of k-mers. This dictionary is initially empty but, when we find a k-mer we add a (kmer, 1) entry in the dictionary (and, if we encounter that k-mer again, we increment the value of that entry)
def kmer_count_better(k, seq):
'''
Return a list of the number of times each possible k-mer appears
in seq, including overlapping occurrences.
'''
rv = {}
for i in range(0, len(seq)-k+1):
subseq = seq[i:i+k]
v = rv.get(subseq, 0)
rv[subseq] = v + 1
return list(rv.items())
seq4
seq4[3:3+4]
kmer_count_better(1, seq4)
kmer_count_better(2, seq4)
kmer_count_better(3, seq4)
Note: we use %timeit instead of %time because the running times are going to be pretty small. %timeit automatically runs the statement multiple times and takes an average.
%timeit kmers = kmer_count_better(1, seq_drosofila)
%timeit kmers = kmer_count_better(2, seq_drosofila)
%timeit kmers = kmer_count_better(3, seq_drosofila)
%timeit kmers = kmer_count_better(4, seq_drosofila)
The running time is essentially the same regardless of $k$! Why? Accessing/updating a dictionary is done in amortized constant time, so the number of operations in this algorithm is proportional only to $n$ (the length of the sequence). Notice, however, that we're making a copy of the substring in each step, so it's actually proportional to $n\cdot k$ (this is still much much much better than $n\cdot k \cdot 4^k$). So, you'll start to see (slightly) longer running times as $k$ increases.
%timeit kmers = kmer_count_better(100, seq_drosofila)
%timeit kmers = kmer_count_better(500, seq_drosofila)
%timeit kmers = kmer_count_better(1000, seq_drosofila)