%load_ext autoreload
%autoreload 2
Mutations
How do we represent mutations computationally?
In SeqLike, we use the family of Mutation objects (Substitution, Deletion, and Insertion) as primitives, as well as their MutationSet collection.
Later on in the notebook, we will show the APIs built on top of these primitive objects
that enable fluent sequence design workflows.
First off, let's see a few examples in action to get a feel for how they are used.
Here's a seqlike object:
from seqlike.SeqLike import aaSeqLike
from seqlike.Mutation import Mutation
s1 = aaSeqLike("MKAILV")
And here's a Substitution object, created by a generic call to Mutation.
sub1 = Mutation("3K")
sub1
type(sub1)
They can be added together:
s1 + sub1
For comparison:
s1
If a mutation string specifies the expected wild-type (WT) residue, that residue is validated against the sequence when the mutation is applied:
sub_with_wt = Mutation("K1R")
# s1 + sub_with_wt # will raise an error.
sub_with_wt = Mutation("K2R")
s1 + sub_with_wt # will NOT raise an error.
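To make the validation step concrete, here is a minimal pure-Python sketch of how a mutation string could be parsed and checked against a sequence. It is not SeqLike's implementation: the names `MUTATION_RE`, `ParsedMutation`, `parse_mutation` and `check_wt` are hypothetical, and the grammar (optional `^` for insertions, optional WT letter, 1-indexed position, new letter or `-`) is inferred from the examples in this notebook.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical grammar, inferred from the examples in this notebook:
# optional "^" (insertion), optional WT letter, 1-indexed position,
# then the new letter, or "-" for a deletion.
MUTATION_RE = re.compile(r"^(\^?)([A-Z]?)(\d+)([A-Z-])$")


class ParsedMutation(NamedTuple):
    kind: str
    wt: Optional[str]
    position: int
    mutant: str


def parse_mutation(text: str) -> ParsedMutation:
    """Split a mutation string such as 'K2R', '^4D' or '2-' into its parts."""
    match = MUTATION_RE.match(text)
    if match is None:
        raise ValueError(f"Cannot parse mutation string: {text!r}")
    caret, wt, position, mutant = match.groups()
    if caret:
        kind = "insertion"
    elif mutant == "-":
        kind = "deletion"
    else:
        kind = "substitution"
    return ParsedMutation(kind, wt or None, int(position), mutant)


def check_wt(sequence: str, mutation: ParsedMutation) -> None:
    """Raise if the mutation names a WT residue that the sequence does not have."""
    if mutation.wt is None:
        return  # nothing to validate
    actual = sequence[mutation.position - 1]  # positions are 1-indexed
    if actual != mutation.wt:
        raise ValueError(
            f"Expected {mutation.wt} at position {mutation.position}, found {actual}"
        )


check_wt("MKAILV", parse_mutation("K2R"))  # passes: position 2 is K
# check_wt("MKAILV", parse_mutation("K1R"))  # would raise: position 1 is M
```

The sketch mirrors the two cases above: K1R fails because position 1 of the sequence is M, while K2R passes.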
Here's an Insertion object:
ins1 = Mutation("^4D")
type(ins1)
It, too, can be added to a seqlike:
s1 + ins1
Finally, a Deletion object:
del1 = Mutation("2-")
del1
type(del1)
Deletions behave like a special case of substitution, in which the residue is replaced by a gap character:
s1 + del1
Finally, if you really don't like the gap, you can always ungap the resulting SeqLike:
(s1 + del1).ungap()
Just be aware that you lose the original reference length and coordinate system.
Mutation Sets
MutationSets allow for collections of one or more mutations to be housed together. For example, let's combine one of our substitutions and insertions together.
sub1, ins1
from seqlike.MutationSet import MutationSet
ms1 = MutationSet([sub1, ins1])
ms1
# ms3 = ms1 + ms2
MutationSet objects have a special property that shows which positions are represented.
ms1.positions
We can add a MutationSet to a SeqLike object.
s1 + ms1
These operations don't modify internal state, so re-running them gives identical results:
s1 + ms1
s1 + ms1
Mutations in a MutationSet are applied from left to right. What happens, though, if an Insertion (which changes the indexing) is added before a Substitution (which doesn't)?
sub1
ms1_swapped = MutationSet([ins1, sub1])
ms1_swapped
s1 + ms1_swapped
Within a MutationSet, indexing is preserved w.r.t. the original sequence: the position shifts caused by insertions are propagated internally to the remaining mutations. If you chain multiple MutationSets, however, each one is indexed w.r.t. the SeqLike produced by the previous addition, from left to right:
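One way to picture indexing against the original sequence is to give every original position its own slot and let insertions grow a slot instead of shifting its neighbours. The sketch below works on plain strings and is only an illustration under stated assumptions, not SeqLike's code: `apply_mutations` is a hypothetical helper, an insertion is assumed to land immediately before the original residue at that position, and WT prefixes such as the K in K2R are not handled.

```python
from typing import List


def apply_mutations(sequence: str, mutations: List[str]) -> str:
    """Apply mutations whose positions all refer to the ORIGINAL sequence."""
    # One slot per original position; the last item of a slot is always the
    # original (or substituted) residue, inserted residues sit before it.
    slots = [[residue] for residue in sequence]
    for text in mutations:
        is_insertion = text.startswith("^")
        body = text.lstrip("^")
        position, letter = int(body[:-1]), body[-1]
        slot = slots[position - 1]  # positions are 1-indexed
        if is_insertion:
            # Assumption: insert just before the original residue.
            slot.insert(len(slot) - 1, letter)
        else:
            # Substitution, or deletion when the letter is "-".
            slot[-1] = letter
    return "".join("".join(slot) for slot in slots)


# Inside one set, the order of insertion and substitution does not matter.
first = apply_mutations("MKAILV", ["3K", "^4D"])
swapped = apply_mutations("MKAILV", ["^4D", "3K"])

# Chaining two sets re-indexes against the intermediate result.
chained = apply_mutations(first, ["^4D", "3K"])
```

Because every mutation looks up its slot by original position, `first` and `swapped` are identical, whereas `chained` is one residue longer because the second set is applied to the already extended sequence.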
# THIS:
s1 + ms1 + ms1_swapped
# IS EQUIVALENT TO THIS:
intermediate = s1 + ms1
s2 = intermediate + ms1_swapped
s2
Be careful when adding two MutationSets together, because their mutations are simply pooled, even when two of them target the same position.
combined_ms = MutationSet(mutations="2A;3C".split(";")) + MutationSet(mutations="2D;4K".split(";"))
combined_ms
s1 + combined_ms
Note how, of the two substitutions at position 2 (2A and 2D), only 2D ends up being retained.
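Before combining sets, it can help to check whether they collide. The following plain-Python sketch flags positions targeted more than once and shows a "last one wins" application order that is consistent with the result above. Both helpers (`find_conflicts`, `apply_substitutions`) are hypothetical, handle substitution strings of the form `2A` only, and assume that later mutations overwrite earlier ones.

```python
from collections import defaultdict
from typing import Dict, List


def find_conflicts(mutations: List[str]) -> Dict[int, List[str]]:
    """Return the positions that are targeted by more than one mutation."""
    by_position = defaultdict(list)
    for text in mutations:
        by_position[int(text[:-1])].append(text)
    return {pos: muts for pos, muts in by_position.items() if len(muts) > 1}


def apply_substitutions(sequence: str, mutations: List[str]) -> str:
    """Apply substitutions in order; a later one overwrites an earlier one."""
    residues = list(sequence)
    for text in mutations:
        residues[int(text[:-1]) - 1] = text[-1]  # positions are 1-indexed
    return "".join(residues)


combined = "2A;3C".split(";") + "2D;4K".split(";")
conflicts = find_conflicts(combined)
result = apply_substitutions("MKAILV", combined)
```

Here `conflicts` reports that position 2 is hit by both 2A and 2D, and `result` carries D at position 2 because 2D is applied last.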
Magical Mutation Set string parsing
It's really tedious to specify multiple mutations as specific objects, so we have a magical parser that allows us to parse mutation strings:
s1
mutations = "^2A;^4D;5-" # insertion, insertion, deletion
ms2 = MutationSet(mutations=mutations.split(";"))
# The rest of the mutations in the set are offset by the correct amount
s1 + ms2
import pandas as pd
series = pd.Series(["2A;3C;4D", "2A;3C"], name="mutations")
series.apply(lambda mutation_str: MutationSet(mutation_str.split(";"))).apply(lambda mutset: s1 + mutset)
# Get back ;-delimited string:
str(ms1)
ms1.to_str()
Mutational Scanning
Mutational scanning, such as an Alanine scan, looks like this:
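from typing import List
from seqlike import SeqLike
from seqlike.Mutation import Substitution

def alanine_scan(s: SeqLike) -> List[SeqLike]:
    """Return one variant per position, each mutated to alanine at that position."""
    mutants = []
    # Mutation positions are 1-indexed, so iterate over 1..len(s) inclusive.
    for i in range(1, len(s) + 1):
        mutants.append(s + Substitution(f"{i}A"))
    return mutants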
We've wrapped that functionality in the SeqLike class.
s1.scan("A")
We can do arbitrary scans too:
s1.scan("W")
Finally, we can always back-mutate sequences into their original.
# Do an Alanine Scan but ensure wanted mutations w.r.t. WT are preserved.
s1.scan("A").apply(lambda seq: seq + MutationSet("1M;6C".split(";")))
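The scan-then-pin pattern above can also be sketched on plain strings, which makes the 1-indexed bookkeeping explicit. This is an illustration only and does not reproduce what SeqLike's scan returns: `scan` and `pin` are hypothetical helpers, and positions that already carry the scanned letter are kept as identity variants for simplicity.

```python
from typing import Dict


def scan(sequence: str, letter: str) -> Dict[str, str]:
    """Map labels such as 'K2A' to the sequence with that one substitution."""
    variants = {}
    for position, wt in enumerate(sequence, start=1):
        label = f"{wt}{position}{letter}"
        variants[label] = sequence[: position - 1] + letter + sequence[position:]
    return variants


def pin(variant: str, pinned: Dict[int, str]) -> str:
    """Force the given 1-indexed positions to fixed residues."""
    residues = list(variant)
    for position, letter in pinned.items():
        residues[position - 1] = letter
    return "".join(residues)


variants = scan("MKAILV", "A")
# Re-apply the residues that must be present in every variant (1M and 6C).
pinned_variants = {label: pin(seq, {1: "M", 6: "C"}) for label, seq in variants.items()}
```

Pinning after the scan means the variant that mutated position 1 is restored to M there, while every variant receives C at position 6.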
Differencing SeqLikes
The __sub__ operator has been overloaded such that if we subtract one seqlike from another seqlike, we get back a mutation set w.r.t. the left seqlike that can be added back to the left seqlike to obtain the right seqlike.
For example, with a SeqLike s1:
s1
And a particular mutation:
sub1
We can obtain the difference between s1 and s1 + sub1:
diff1 = s1 - (s1 + sub1)
diff1
The resulting MutationSet is an inferred set of mutations needed to reconstruct the sequence on the right side of the minus sign from the one on the left side. It may not always be the same as the original mutation set. Numbering is always with respect to an ungapped reference (left-hand side) sequence. Swapping the operands therefore gives the mutations for the opposite direction:
diff2 = (s1 + sub1) - s1
diff2
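For two sequences of equal length, the idea behind differencing can be sketched as a position-by-position comparison. This is a simplified illustration and not SeqLike's algorithm: `infer_mutations` is a hypothetical helper that only handles substitutions and gap characters, whereas detecting insertions would require an alignment step that is left out here.

```python
from typing import List


def infer_mutations(reference: str, target: str) -> List[str]:
    """List the mutations that turn `reference` into `target`, 1-indexed."""
    if len(reference) != len(target):
        raise ValueError("This sketch only handles sequences of equal length.")
    if "-" in reference:
        raise ValueError("The reference is expected to be ungapped.")
    return [
        f"{position}{new}"  # a "-" in the target reads as a deletion, e.g. "2-"
        for position, (old, new) in enumerate(zip(reference, target), start=1)
        if old != new
    ]


forward = infer_mutations("MKAILV", "MKKILV")
backward = infer_mutations("MKKILV", "MKAILV")
```

The direction matters: `forward` describes how to reach the mutant from the original, while `backward` describes how to revert it.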
We can apply the inferred mutation set and verify that we get back the same mutated sequence:
s1 + sub1
which can be compared to:
s1 + diff1
We can verify equality of the two strings below:
(s1 + sub1).to_str() == (s1 + diff1).to_str()
Let's try with a mutation set, one that is a bit more complicated. Firstly, here's the first mutation set we used:
ms1
Let's check the diff of the two sequences:
diff1 = s1 - (s1 + ms1)
diff1
s1 + ms1
s1 + diff1
Likewise, we can check their equality:
(s1 + ms1).to_str() == (s1 + diff1).to_str()
Next, let's try the mutation set from the parsing example, which contains insertions and a deletion.
ms2
diff = s1 - (s1 + ms2)
diff
Equality is preserved when we ungap the sequences.
(s1 + diff).ungap().to_str() == (s1 + ms2).ungap().to_str()
Finally, let's do a really complicated one with 3 substitutions:
ms3 = MutationSet("2A;3F;4Q".split(";"))
diff = s1 - (s1 + ms3)  # parentheses needed: otherwise this evaluates as (s1 - s1) + ms3
diff
As you can see, this is pretty trivial, not actually complicated ;).
(s1 + diff).ungap().to_str() == (s1 + ms3).ungap().to_str()