%load_ext autoreload
%autoreload 2

Mutations

How do we represent mutations computationally? In SeqLike, we use the family of Mutation objects (Substitution, Deletion, and Insertion) as primitives, as well as their MutationSet collection. Later on in the notebook, we will show the APIs built on top of these primitive objects that enable fluent sequence design workflows.

First off, let's see a few examples in action to get a feel for how it is used.

Here's a seqlike object:

from seqlike.SeqLike import aaSeqLike
from seqlike.Mutation import Mutation

s1 = aaSeqLike("MKAILV")

/usr/share/miniconda3/envs/seqlike-dev/lib/python3.9/site-packages/Bio/Application/__init__.py:40: BiopythonDeprecationWarning: The Bio.Application modules and modules relying on it have been deprecated.

Due to the on going maintenance burden of keeping command line application
wrappers up to date, we have decided to deprecate and eventually remove these
modules.

We instead now recommend building your command line and invoking it directly
with the subprocess module.
  warnings.warn(

Loading BokehJS ...

And here's a Substitution object, created by a generic call to Mutation.

sub1 = Mutation("3K")
sub1

3K

type(sub1)

seqlike.Mutation.Substitution

They can be added together:

s1 + sub1

NT: None 

*** AA: SeqRecord(seq=Seq('MKKILV'), id='ea987063-d422-4cb8-83a5-0f7913c23336', name='<unknown name>', description='<unknown description>', dbxrefs=[])

For comparison:

s1

NT: None 

*** AA: SeqRecord(seq=Seq('MKAILV'), id='ea987063-d422-4cb8-83a5-0f7913c23336', name='<unknown name>', description='<unknown description>', dbxrefs=[])

Built-in validation of the WT sequence happens if the expected WT sequence is specified:

sub_with_wt = Mutation("K1R")
# s1 + sub_with_wt  # will raise an error.

sub_with_wt = Mutation("K2R")
s1 + sub_with_wt  # will NOT raise an error.

NT: None 

*** AA: SeqRecord(seq=Seq('MRAILV'), id='ea987063-d422-4cb8-83a5-0f7913c23336', name='<unknown name>', description='<unknown description>', dbxrefs=[])

Here's an Insertion object:

ins1 = Mutation("^4D")
type(ins1)

seqlike.Mutation.Insertion

It, too, can be added to a seqlike:

s1 + ins1

NT: None 

*** AA: SeqRecord(seq=Seq('MKADILV'), id='ea987063-d422-4cb8-83a5-0f7913c23336', name='<unknown name>', description='<unknown description>', dbxrefs=[])

Finally, a Deletion object:

del1 = Mutation("2-")
del1

2-

type(del1)

seqlike.Mutation.Deletion

Deletions behave like a special case substitution:

s1 + del1

NT: None 

*** AA: SeqRecord(seq=Seq('M-AILV'), id='ea987063-d422-4cb8-83a5-0f7913c23336', name='<unknown name>', description='<unknown description>', dbxrefs=[])

Finally, if you really don't like the gap, you can always ungap the resulting SeqLike:

(s1 + del1).ungap()

NT: None 

*** AA: SeqRecord(seq=Seq('MAILV'), id='ea987063-d422-4cb8-83a5-0f7913c23336', name='<unknown name>', description='<unknown description>', dbxrefs=[])

Just be aware that you lose the original reference length and coordinate system.

Mutation Sets

MutationSets allow for collections of one or more mutations to be housed together. For example, let's combine one of our substitutions and insertions together.

sub1, ins1

(3K, ^4D)

from seqlike.MutationSet import MutationSet 

ms1 = MutationSet([sub1, ins1])
ms1
# ms3 = ms1 + ms2

[3K, ^4D]

MutationSet objects have a special property that shows which positions are represented.

ms1.positions

[3, 4]

We can add a MutationSet to a SeqLike object.

s1 + ms1

NT: None 

*** AA: SeqRecord(seq=Seq('MKKDILV'), id='ea987063-d422-4cb8-83a5-0f7913c23336', name='<unknown name>', description='<unknown description>', dbxrefs=[])

The operations are don't modify internal state, so re-running them again guarantees identical results:

s1 + ms1

NT: None 

*** AA: SeqRecord(seq=Seq('MKKDILV'), id='ea987063-d422-4cb8-83a5-0f7913c23336', name='<unknown name>', description='<unknown description>', dbxrefs=[])

s1 + ms1

NT: None 

*** AA: SeqRecord(seq=Seq('MKKDILV'), id='ea987063-d422-4cb8-83a5-0f7913c23336', name='<unknown name>', description='<unknown description>', dbxrefs=[])

Mutations in a MutationSet are applied from left to right. What happens, though, if an Insertion (which changes the indexing), is added before a Substitution (which doesn't)?

sub1

3K

ms1_swapped = MutationSet([ins1, sub1])
ms1_swapped
s1 + ms1_swapped

NT: None 

*** AA: SeqRecord(seq=Seq('MKKDILV'), id='ea987063-d422-4cb8-83a5-0f7913c23336', name='<unknown name>', description='<unknown description>', dbxrefs=[])

Within a MutationSet, we preserve indexing w.r.t. the original sequence, and internally propagate insertions of positions throughout the sequence. If you have multiple MutationSets, however, then indexing is preserved w.r.t. the previous SeqLike, from left to right:

# THIS:
s1 + ms1 + ms1_swapped

NT: None 

*** AA: SeqRecord(seq=Seq('MKKDDILV'), id='ea987063-d422-4cb8-83a5-0f7913c23336', name='<unknown name>', description='<unknown description>', dbxrefs=[])

# IS EQUIVALENT TO THIS:
intermediate = s1 + ms1
s2 = intermediate + ms1_swapped
s2

NT: None 

*** AA: SeqRecord(seq=Seq('MKKDDILV'), id='ea987063-d422-4cb8-83a5-0f7913c23336', name='<unknown name>', description='<unknown description>', dbxrefs=[])

Be careful when adding two mutationsets together, because they simply get added up.

combined_ms = MutationSet(mutations="2A;3C".split(";")) + MutationSet(mutations="2D;4K".split(";"))
combined_ms

[2A, 2D, 3C, 4K]

s1 + combined_ms

NT: None 

*** AA: SeqRecord(seq=Seq('MDCKLV'), id='ea987063-d422-4cb8-83a5-0f7913c23336', name='<unknown name>', description='<unknown description>', dbxrefs=[])

Note here how only mutation 2D ended up being retained.

Magical Mutation Set string parsing

It's really tedious to specify multiple mutations as specific objects, so we have a magical parser that allows us to parse mutation strings:

s1

NT: None 

*** AA: SeqRecord(seq=Seq('MKAILV'), id='ea987063-d422-4cb8-83a5-0f7913c23336', name='<unknown name>', description='<unknown description>', dbxrefs=[])

mutations = "^2A;^4D;5-" # insertion, substitution, deletion
ms2 = MutationSet(mutations=mutations.split(";"))

# The rest of the mutations in the set are offset by the correct amount
s1 + ms2

NT: None 

*** AA: SeqRecord(seq=Seq('MAKADI-V'), id='ea987063-d422-4cb8-83a5-0f7913c23336', name='<unknown name>', description='<unknown description>', dbxrefs=[])

import pandas as pd 

series = pd.Series(["2A;3C;4D", "2A;3C"], name="mutations")
series.apply(lambda mutation_str: MutationSet(mutation_str.split(";"))).apply(lambda mutset: s1 + mutset)

0    (M, A, C, D, L, V)
1    (M, A, C, I, L, V)
Name: mutations, dtype: object

# Get back ;-delimited string:
str(ms1)

'3K;^4D'

ms1.to_str()

'3K;^4D'

Mutational Scanning

Mutational scanning, such as an Alanine scan, looks like this:

from seqlike import SeqLike 
from typing import List 

def alanine_scan(s: SeqLike) -&gt; List[SeqLike]:
    mutants = []
    for i in range(len(s)):
        mutants.append(s + Substitution(f"{i}A"))
    return mutants

We've wrapped that functionality in the SeqLike class.

s1.scan("A")

0    (A, K, A, I, L, V)
1    (M, A, A, I, L, V)
2    (M, K, A, I, L, V)
3    (M, K, A, A, L, V)
4    (M, K, A, I, A, V)
5    (M, K, A, I, L, A)
dtype: object

We can do arbitrary scans too:

s1.scan("W")

0    (W, K, A, I, L, V)
1    (M, W, A, I, L, V)
2    (M, K, W, I, L, V)
3    (M, K, A, W, L, V)
4    (M, K, A, I, W, V)
5    (M, K, A, I, L, W)
dtype: object

Finally, we can always back-mutate sequences into their original.

# Do an Alanine Scan but ensure wanted mutations w.r.t. WT are preserved.
s1.scan("A").apply(lambda seq: seq + MutationSet("1M;6C".split(";")))

0    (M, K, A, I, L, C)
1    (M, A, A, I, L, C)
2    (M, K, A, I, L, C)
3    (M, K, A, A, L, C)
4    (M, K, A, I, A, C)
5    (M, K, A, I, L, C)
dtype: object

Differencing SeqLikes

The __sub__ operator has been overloaded such that if we subtract one seqlike from another seqlike, we get back a mutation set w.r.t. the left seqlike that can be added back to the left seqlike to obtain the right seqlike.

For example, with a SeqLikes s1:

s1

NT: None 

*** AA: SeqRecord(seq=Seq('MKAILV'), id='ea987063-d422-4cb8-83a5-0f7913c23336', name='<unknown name>', description='<unknown description>', dbxrefs=[])

And a particular mutation:

sub1

3K

We can obtain the difference between s1 and s1 + ms1:

diff1 = s1 - (s1 + sub1)
diff1

[3K]

The resulting MutationSet is an inferred set of mutations needed to reconstruct the sequence on the right side of the plus sign from the left side. It may not always be the same as the original mutation set. Numbering is always going to be with respect to an ungapped reference (left hand side) sequence.

diff2 = (s1 + sub1) - s1
diff2

[3A]

We can apply the mutation inferred mutation sets and verify that we get back the same mutated sequence:

s1 + sub1

NT: None 

*** AA: SeqRecord(seq=Seq('MKKILV'), id='ea987063-d422-4cb8-83a5-0f7913c23336', name='<unknown name>', description='<unknown description>', dbxrefs=[])

which can be compared to:

s1 + diff1

NT: None 

*** AA: SeqRecord(seq=Seq('MKKILV'), id='ea987063-d422-4cb8-83a5-0f7913c23336', name='<unknown name>', description='<unknown description>', dbxrefs=[])

We can verify equality of the two strings below:

(s1 + sub1).to_str() ==  (s1 + diff1).to_str()

True

Let's try with a mutation set, one that is a bit more complicated. Firstly, here's the first mutation set we used:

ms1

[3K, ^4D]

Let's check the diff of the two sequences:

diff1 = s1 - (s1 + ms1)
diff1

[1K, ^1M, 3D]

s1 + ms1

NT: None 

*** AA: SeqRecord(seq=Seq('MKKDILV'), id='ea987063-d422-4cb8-83a5-0f7913c23336', name='<unknown name>', description='<unknown description>', dbxrefs=[])

s1 + diff1

NT: None 

*** AA: SeqRecord(seq=Seq('MKKDILV'), id='ea987063-d422-4cb8-83a5-0f7913c23336', name='<unknown name>', description='<unknown description>', dbxrefs=[])

Likewise, we can check their equality:

(s1 + ms1).to_str() == (s1 + diff1).to_str()

True

Let's try with a mutationset that is a bit more complicated.

ms2

[^2A, ^4D, 5-]

diff = s1 - (s1 + ms2)
diff

[1A, ^1M, 4D, 5I]

Equality is preserved when we ungap the sequences.

(s1 + diff).ungap().to_str() ==  (s1 + ms2).ungap().to_str()

True

Finally, let's do a really complicated one with 3 substitutions:

ms3 = MutationSet("2A;3F;4Q".split(";"))

diff = s1 - s1 + ms3
diff

[2A, 3F, 4Q]

As you can see, this is pretty trivial, not actually complicated ;).

(s1 + diff).ungap().to_str() == (s1 + ms3).ungap().to_str()

True