On interview preparation: nuances of Python

From my perspective, mastering a particular language is still a secondary skill for a programmer.
It certainly affects the architecture of your software and may give you an edge in some data-handling use cases, but overall, in the long run, a decent system design coupled with carefully chosen algorithms matters much more.

Nonetheless, in this article I want to highlight several topics that usually distinguish a Python veteran from a novice. Some of these questions I was asked during recent interviews; others I have asked myself to separate people who call themselves Python experts from those who really are, based on long, exciting hours of debugging sessions on real products.

Why do I want to start with Python and not with "serious" languages like C++ or Scala?
Currently it is the de-facto standard second language accompanying a more performant one, and
in many domains it is considered a primary language because of its low barrier to entry,
rich ecosystem and variety of libraries.

Interpreters:

  • CPython – the default one
  • PyPy – if you need a somewhat faster Python (a JIT compiler with a faster garbage collector)
  • Jython / IronPython – in case you need integration with the Java or .NET world
  • Cython – when you need more speed, and numpy, pandas and scikit-learn cannot fulfil your needs,
    but you are still afraid to accept that C++ may be your best friend.

Dependency management:

The standard practice is to isolate dependencies within a dedicated sandbox – a virtual environment – with all required packages and the Python interpreter itself installed in such a way that they do not conflict with system-wide installed libraries.

virtualenv is the usual tool for this, but virtualenvwrapper may be more handy. If you need to maintain several Python versions simultaneously, pyenv can help.

requirements.txt should contain the list of all dependencies.
Using --index-url you can point to your own private package server instead of the default one (PyPI), including private repos accessed through ssh, or just a local file path:

file:///path/to/package#egg=my_package

Distribution and Installation:

  • setup.py – during automatic package installation requirements.txt is not analysed
    at all – what really matters is the setup.py file used by setuptools. It is worth being aware of the install_requires argument, which should contain the list of all dependencies (see the sketch below).
  • wheel (or egg) – the common distribution formats.
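
For context, a minimal setup.py might look like the sketch below (the package name and the dependencies are made up for illustration):

from setuptools import setup, find_packages

setup(
    name="my_package",        # hypothetical package name
    version="0.1.0",
    packages=find_packages(),
    install_requires=[        # this list is what pip actually reads during installation
        "requests>=2.0",
    ],
)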

Common caveats of the language:

Mutable vs immutable datatypes

You can't change the value of an int or a string bound to some variable – you have to replace it with a new value. But you can add an element to a list or remove a key from a dict. Custom objects are usually mutable – you can go crazy altering their fields or adding new methods.

Why does it matter?

Mutating default arguments is a perfect recipe for hard-to-find bugs:

def some_method(b=[]):     # WRONG, use b=None instead
    b.append(1)
    print(b)
    return sum(b)

for x in xrange(3):
    some_method()
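
The usual fix is to default to None and create a fresh list inside the function – a minimal sketch:

def some_method(b=None):
    if b is None:
        b = []        # a new list is created on every call
    b.append(1)
    print(b)
    return sum(b)

for x in xrange(3):
    some_method()     # prints [1] every time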

Arguments are passed into functions by reference, i.e. the function does not receive a copy of the object.
If an argument is immutable and you try to modify it inside the function, a new local instance is created – this is called rebinding. If it is mutable, you can end up modifying the originally passed value.

How can we make sure a value is actually being mutated rather than rebound?

x = 100500
y = 100500

id(x) == id(y) # False, values are the same but objects are different

The id() function returns the identity of an object (a sort of virtual address of the variable, if you are used to C terminology).

z = x
id(z) == id(x) # True, both names now reference the same object
z = 127
id(z) == id(x) # False, as we re-assigned z to another value

Note however:

x = 250
y = 250

id(x) == id(y) # will be True 
"""because of optimisation of interpreter for small numbers:
https://docs.python.org/2/c-api/int.html"""

x = "STR1"
y = "STR1"

id(x) == id(y) # may be True as well
"""
this is interpreter-implementation specific;
for immutable types, operations that compute new values may 
actually return a reference to any existing object with 
the same type and value:
https://docs.python.org/2/reference/datamodel.html#objects-values-and-types
"""

The is operator does exactly this – it compares the ids of two objects, i.e. whether they both reference exactly the same object in memory.

So, with this brief idea of values and references in mind, we can tackle the equality operation:

class A():
    def __init__(self, a):
        self.a = a

x1 = A(1)
x2 = A(1)

# Both will be False
x1 == x2
x1 is x2 # equal to id(x1) == id(x2)

By default the comparison operator == invokes the __eq__ method, whose default implementation falls back to comparing object ids.
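
To get value-based comparison you have to define __eq__ yourself – a small sketch:

class A(object):
    def __init__(self, a):
        self.a = a

    def __eq__(self, other):
        # compare by value instead of by identity
        return isinstance(other, A) and self.a == other.a

x1 = A(1)
x2 = A(1)
x1 == x2 # True now
x1 is x2 # still False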

Truthiness vs “is None”

class A:
    def __init__(self, a):
        self.a = a

class B(object):
    def __bool__(self):  # in Python 2 the method is called __nonzero__
        return False

l = [float('nan'), 0, 13, 0.0001, "", "False", [], A(None), B(), False]

for entry in l:
    if entry:
        print("{} - True".format(entry))
    else:
        print("{} - False".format(entry))

is None is nothing more than a straightforward check of whether a variable references the same object as None. But what checks are performed during a truthiness check of some_object? Under the hood the interpreter just evaluates the object – it looks for an implementation of __bool__ (__nonzero__ in Python 2) or __len__; if neither is present, the object is treated as True.
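
The practical difference is easy to show with an empty list – falsy, but definitely not None:

data = []
if not data:       # truthiness check: True, the list is empty
    print("empty")
if data is None:   # identity check: False, the list exists
    print("missing")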

Hashing in Python

A common question to confuse people: why is hash defined by default for immutable objects only?
The reasons become obvious if you think about the example below:

class A():
    def __init__(self, a):
        self.a = a

    def __hash__(self):
        return hash(self.a)

    def __eq__(self, other):
        print("__eq__ invoked!")
        return self.__class__ == other.__class__ and self.a == other.a

container = set()
c = A(1)
d = A(10)
container.add(c)
c in container # True
d in container # False
c.a = 123
c in container # False

I.e. after we change the value of the object's attribute, we can no longer find it in the container. That makes sense! But what if we do this:

d.a = 1
A(1) in container # False, and __eq__ is invoked
d in container # False, and __eq__ is invoked

I.e. such checks invoke __eq__ whenever hashes match, in order to deal with possible hash collisions.

What does it mean from a practical point of view:

  • if object1 equals object2, their hashes should also be equal
  • in theory it is possible to use object identity for hash computation and compare values only in case of hash collisions – an illustration follows below
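
As a quick illustration of the default behaviour (in CPython the default hash is derived from the object's id):

class C(object):
    def __init__(self, a):
        self.a = a

c1 = C(1)
c2 = C(1)
hash(c1) == hash(c2) # False - default hash is based on identity
c1 == c2             # False - default equality is identity comparison
c1 in {c2}           # False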

What will happen if we try to compare lists?

a = [1, 2, 3]
b = [3, 2, 1]
c = [1, 2, 3]

a == b # False
a == c # True

Equality is checked for every pair of elements at the same index – so order matters!

Copy – shallow and deep

What if we add this to our equation?

d = a
a == d # True
print("{} = {}".format(id(a), id(b)))

I.e. d and a reference the same address.
Because of that, if we add an element to a, d will also contain it.

It means that when we need two independent instances of the same list we have to use
deepcopy (see the sketch below). And the same applies to every mutable type (or collection).
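
A minimal sketch of the difference between a shallow and a deep copy of a nested list:

import copy

a = [[1, 2], [3, 4]]
shallow = copy.copy(a)      # new outer list, but the inner lists are shared
deep = copy.deepcopy(a)     # everything is duplicated recursively

a[0].append(99)
shallow[0] # [1, 2, 99] - still shares the inner list with a
deep[0]    # [1, 2]     - fully independent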

Python's sort-of name mangling:

Python offers an interesting approach to preventing name collisions in case of inheritance.
Knowing this trick can save you time when working with someone else's code.

class B():
    __attr1 = None

u = B()
u.__attr1 # AttributeError: B instance has no attribute '__attr1'
u._B__attr1 # None - access through the mangled name works
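
This is what the mangling buys you with inheritance – a "private" attribute of the parent does not clash with one of the same name in the child (class names here are made up for illustration):

class Parent(object):
    __token = "parent"      # stored as _Parent__token

class Child(Parent):
    __token = "child"       # stored as _Child__token, no collision

c = Child()
c._Parent__token # 'parent'
c._Child__token  # 'child'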

Slicing tricks:

Reversing a string, the pythonic way:

a = "Acapulco"
a[::-1]

And familiarity with slice notation

array[start:end:step]

can be beneficial, allowing very compact code:

g = [x for x in xrange(10)] # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
g[::2] # Even only [0, 2, 4, 6, 8]
g[-3:] # Last three elements [7, 8, 9]

Initialising a list with identical values:

b = [-13] * 10

But don't fall into the trap of using:

b2d = [[-13]*10]*10 # WRONG! will create 10 references to the same list

This is a valid way of doing it:

b2d = [[-13] * 10 for _ in range(10)]

How to work with resources using the with statement:

with open("in.txt", "r") as input, open("out.txt", "w") as output:
    output.write(input.read())

The benefit: implicit handling of resource release in case of any exception.

And yeah, nothing stops you from adding similar functionality to your own objects –
just give them __enter__ and __exit__ methods, as in the sketch below.
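
A minimal sketch of such an object (the class name is made up for illustration):

class ManagedResource(object):
    def __enter__(self):
        print("resource acquired")
        return self                  # this is what "as" binds to

    def __exit__(self, exc_type, exc_value, traceback):
        print("resource released")   # runs even if an exception was raised
        return False                 # do not suppress exceptions

with ManagedResource() as resource:
    print("working with the resource")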

Talking about exceptions:

Keep in mind:

  • the possibility to rely on the exception inheritance hierarchy
  • using else for code that should run only when no exception is raised within the try block
  • finally is similar to what you may have seen in other languages – it will be executed no matter what.

import random


class MySimpleException(Exception):
    """Base class"""


class MyMediumException(MySimpleException):
    pass


class MyCriticalException(MySimpleException):
    pass


class MyCustomException(Exception):
    pass


def some_method():
    res = random.randint(0, 10)

    if res == 0:
        return res

    if 1 <= res < 3:
        raise MySimpleException("MySimpleException")
    elif 3 <= res < 6:
        raise MyMediumException("MyMediumException")
    elif 6 <= res < 9:
        raise MyCriticalException("MyCriticalException")

    raise MyCustomException("Bad luck today!")


def exception_handling_example():
    try:
        some_method()
    except MySimpleException as e:
        print("Any exception within the class hierarchy: " + str(e))
    except Exception as e:
        print("Any exception: " + str(e))
    else:
        print("No exception at all")
    finally:
        print("This will be executed no matter what")


for x in xrange(10):
    exception_handling_example()

Iterators, generators & yield:

x = range(1000000) # Python 2.7.*
import sys
sys.getsizeof(x) # 8000072

def my_generator(n):
    num = 0
    while num < n:
        yield num
        num += 1

x = my_generator(1000000)
sys.getsizeof(x) # 80

range() creates a list with all elements in memory (in Python 2).
my_generator(1000000) creates a special object – a generator – that can be used in a loop;
in our example it will lazily produce values from 0 up to 999999:

next(x) # 0
next(x) # 1
next(x) # 2

Many are familiar with list comprehension syntax but may find this one a bit confusing:

a = (u * 2 for u in g)

but this is a comprehension as well – a generator expression – it just returns a generator.

This, however, is not an iterator (only an iterable):

x = xrange(1000000) # <type 'xrange'>
next(x)             # TypeError: xrange object is not an iterator
x = iter(x)
next(x)             # 0

An iterator is an object that implements both the __next__ (next in Python 2) and __iter__ methods.

Generators are just one particular kind of coroutine:

  • within the method we have a list of instructions,
  • when we hit return, all subsequent calls of this method re-execute all those instructions,
  • when we hit yield, the current state of the method (the last executed instruction and the values of local variables) is saved, and the next call to the method continues from that statement – see the sketch after this list.
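
A small sketch of that saved state in action – send() pushes a value back into the paused generator, which resumes right after the yield:

def running_total():
    total = 0
    while True:
        value = yield total    # execution pauses here until next()/send()
        total += value

acc = running_total()
next(acc)      # prime it: runs up to the first yield, returns 0
acc.send(10)   # 10
acc.send(5)    # 15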

A few words about imports in Python:

First of all, a bit of terminology:

  • module – just a *.py file (or a C extension – *.pyd or *.so; built-in modules are usually C extensions as well)
  • package – a set of modules under some folder (for Python 2.7.* it must contain an __init__.py file)

When we write:

import X
# just a shortcut so you can write xyz.method(args):
import X.Y.Z as xyz
# brings into the current scope the name Y (a method, a variable or a whole module) from the current package:
from . import Y
# brings into the current scope EVERYTHING from module X, including its sub-imports:
from X import *

When an import statement is executed:

  1. The import machinery first searches the sys.modules cache, which contains previously imported modules.
  2. If the module is not found there, it searches among the built-in modules.
  3. And as the last hope, it searches by file name in every directory mentioned in sys.path, in order of appearance (if two paths contain different versions of the same module, the first one found wins).
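
Both the cache and the search path are easy to inspect from a REPL:

import os
import sys

"os" in sys.modules # True - os is cached after its first import
print(sys.path)     # directories searched for modules, in order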

When the module to be loaded is finally found, it is parsed; the interpreter sets a few variables before execution – for example __name__ – and then executes it. Hence the famous check for main:

# executed when we imported this module from other code:
# import x.y.z.name_of_this_script, 
# __name__ = "x.y.z.name_of_this_script"
print("module loaded")

if __name__ == '__main__':
    # executed when we run "python ./x/y/z/name_of_this_script.py"
    print("module executed as main")

Parsing may be slow, so the interpreter first checks for pre-compiled cached versions – *.pyc files (comparing the source's last-modified time against the timestamp of the cached version).

Importing changes global state and therefore involves some locking to ensure that nothing
gets screwed up – even if the module to be imported is already in the sys.modules cache.

Usually two types of import are distinguished:

from a.b.c.d import Z   # absolute - can be a bit verbose
from . import Z         # relative - adds Z from the current package

Both forms are just syntactic sugar for more compact code:

import X.Y.Z as xyz
xyz.some_method()

from X.Y import Z
Z.some_method()

Either way the chain is still loaded fully – not just Z, but X, all imports of X, then Y, all imports of Y, and finally Z (with all of its imports as well).

Circular imports

Imagine the following project structure:

project
    module1
        __init__.py
        script1.py – contains the line: from module2.script2 import method2
    module2
        __init__.py
        script2.py – contains the line: from module1.script1 import method1

python -m module1.script1

When you try to run script1, Python will try to resolve all of the imports first:

  • ok, we need to import module2.script2 – let's resolve all of its imports first,
  • within that script we have to import module1.script1 – let's resolve all of its imports first,
  • within that script we have to import module2.script2 again – this time it is already in sys.modules, but only partially initialised, so the from ... import fails because the requested name does not exist yet.

In most cases this is a matter of proper design, but nonetheless changing the import to

import module2.script2

lets the import complete – the module is loaded once and method2 is looked up later, at call time.

Another common approach to fighting import issues is deferred importing:

# imports
# some method

def very_important_method():
    from another_module import smth
    # code

Lambdas

f = lambda x: x.some_attribute
type(f) # <type 'function'>

A lambda is limited to a single expression – it must fit into one expression whose value is returned; it cannot contain statements. Typical usage: short key or predicate functions for map, filter, sorted – see the examples below.
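
Typical usage as short key or predicate functions (the data here is made up; note that in Python 2 map and filter return lists, in Python 3 they return lazy iterators):

pairs = [(1, "b"), (3, "a"), (2, "c")]

sorted(pairs, key=lambda p: p[1])  # [(3, 'a'), (1, 'b'), (2, 'c')] - sort by the second element
map(lambda p: p[0] * 2, pairs)     # [2, 6, 4]
filter(lambda p: p[0] > 1, pairs)  # [(3, 'a'), (2, 'c')]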

Memory optimisation

Python is not the most efficient language when it comes to speed or RAM consumption:

import sys

class A(object):
    pass

a = A()

sys.getsizeof(A) # 904 0_o
A.__dict__
dict_proxy({'__dict__': <attribute '__dict__' of 'A' objects>, '__module__': '__main__', '__weakref__': <attribute '__weakref__' of 'A' objects>, '__doc__': None})

sys.getsizeof(a) # 64
sys.getsizeof(a.__dict__)   # 280
# but we are free to add new attributes and methods to the instance
a.new_int_attribute = 123

The main approach is to explicitly declare the attributes that an object may have, using the __slots__ specifier:

class B(object):
    __slots__ = ['a']
    def __init__(self, a):
        self.a = a

sys.getsizeof(B) # 944
B.__dict__
dict_proxy({'a': <member 'a' of 'B' objects>, '__module__': '__main__', '__slots__': ['a'], '__init__': <function __init__ at 0x109a287d0>, '__doc__': None})

b = B(1)
sys.getsizeof(b)  # 56
sys.getsizeof(b.__dict__) # AttributeError: 'B' object has no attribute '__dict__'

b.new_attr = 23 # AttributeError: 'B' object has no attribute 'new_attr'

Note that apart from __dict__, another attribute will be missing from instances of classes that define __slots__: __weakref__, the object containing ALL weak references to the current object.

Additionally, it is possible to use something like namedlist to create objects that do not participate in garbage collection – a full comparison of various approaches to saving a bit of memory is discussed here. Usually, for profiling it is sufficient to use a dedicated library like profilehooks.

But nothing is stopping you from diving into the byte code as well:

def a():
    print("a")
print(a.__code__)    # code object a at 0x109b2bc30
import dis
dis.dis(a)
  2           0 LOAD_CONST               1 ('a')
              3 PRINT_ITEM
              4 PRINT_NEWLINE
              5 LOAD_CONST               0 (None)
              8 RETURN_VALUE

compile("print('Hello world')", '', 'single')   # code object &amp;lt;module&amp;gt; at 0x10db313b0
exec compile("print('Hello world')", '', 'single')   # Hello world

Python is a language with automatic garbage collection. It has two key components: reference counting and a generational garbage collector. When del is invoked or a variable goes out of scope, nothing is actually deleted right away – first the reference count is decremented, and when it reaches zero the memory occupied by the object becomes available to the interpreter again. A programmer can disable the generational GC completely, or exclude a particular class from it by defining a __del__ method (in Python 2, cycles containing objects with __del__ are not collected):

import gc
gc.disable()   # For everything
# OR
class A():
    # ...
    def __del__(self):
        pass  # clean-up details here
    # ...

The gc module allows you to manually trigger garbage collection or hunt down possible leaks. This is implementation specific, but generally memory will not be returned to the OS.
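
A quick sketch of manual interaction with the collector:

import gc

gc.collect()          # force a full collection, returns the number of unreachable objects found
print(gc.garbage)     # objects the collector could not free (e.g. cycles with __del__ in Python 2)
print(gc.get_count()) # allocation counters for the three generations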

Threading

The CPython interpreter allows only one thread to execute Python bytecode at a time – the famous GIL. So it may not be the perfect choice for compute-bound applications. But for I/O-bound load, threads in Python are good enough, especially with gevent in Python 2.7 or asyncio in Python 3. The alternative is the multiprocessing module – see the sketch below. Or some more suitable language!
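
For CPU-bound work the usual escape hatch is the multiprocessing module – a rough sketch (cpu_heavy is a made-up example task):

from multiprocessing import Pool

def cpu_heavy(n):
    # hypothetical CPU-bound task
    return sum(i * i for i in xrange(n))

if __name__ == "__main__":
    pool = Pool(processes=4)    # separate processes side-step the GIL
    print(pool.map(cpu_heavy, [10 ** 6] * 4))
    pool.close()
    pool.join()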

Function definitions:

Positional and named arguments – that should be simple:

def method(a, b, c):
    print("a:{a}b:{b}c:{c}".format(a=a, b=b, c=c))

method(b=1, c=5, a="whatever")   # named: will print a:whateverb:1c:5
method(1, 5, "whatever")         # positional: will print a:1b:5c:whatever

*args and **kwargs are used to accept a variable number of arguments: *args collects the positional ones, **kwargs the keyword (named) ones.

def method(*args, **kwargs):
    for arg in args:
        print(arg)

    for k, v in kwargs.items():
        print("{} = {}".format(k, v))


method(1,2,3, a="a", b="b", c="c")
1
2
3
a = a
c = c
b = b

Other

A few useful built-in modules to be aware of, so you don't re-invent the wheel:

  • argparse, configparser
  • itertools – permutations, combinations, product, imap
  • collections – defaultdict, Counter, OrderedDict, namedtuple
  • re – regular expressions
  • os, sys, random

Not so standard, but handy:

  • requests – to talk to the web via POST & GET
  • scapy – for playing with TCP packets
  • scrapy / Beautiful Soup – to crawl the web
  • numpy / pandas – for memory-optimised collections
  • nose – if the standard unittest is not enough
  • dash – visualisation

P.S.: Which version to use – 2 vs 3?

As of now – the middle of 2019 – this is more of a historical remark. A few things Python 3 brings:

  • chained exceptions in stack traces
  • comparing unorderable types raises TypeError (e.g. 1 > '2')
  • yield from generator_method()
  • faulthandler – tracebacks on fatal signals (except SIGKILL)
  • more handy modules: ipaddress, functools.lru_cache, enum, pathlib, dataclasses
  • unicode strings by default
  • type hints & annotations
  • iterable unpacking
  • dict insertion order guaranteed (3.7+)
  • speed
