The Python Almanac
The world of Python packages is an adventurous place and can be confusing at times. Here, I aggregate and showcase a diverse set of Python packages that have proven useful to me at some point.
Introduction
Installing Python
Normally you should use your system’s package manager. In case of problems, try pyenv:
$ pyenv versions
$ pyenv install <foo>
$ pyenv global <foo>
This will install the specified Python version to $(pyenv root)/versions.
Installing packages
Python packages can be easily installed from PyPI (Python Package Index):
$ pip install --user <package>
Using --user installs the package only for the current user, so it does not clash with system packages. This is good if multiple users need different package versions, but can lead to redundant installations.
To install from a git repository, use the following command:
$ pip install --user -U git+https://github.com/<user>/<repository>@<branch>
Package management
While packages can be installed globally or user-specific, it often makes sense to create project-specific virtual environments.
This can be easily accomplished using venv:
$ python -m venv my_venv
$ . my_venv/bin/activate
$ pip install <package>
Software development
Package Distribution
Use setuptools. Alternatively, poetry handles many otherwise slightly annoying things:
$ poetry init/add/install/run/publish
CI encapsulation: tox.
Keeping track of version numbers can be achieved using bump2version.
Transform between various project file formats using dephell.
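For illustration, a minimal tox.ini for the tox setup mentioned above could look like this (a sketch; adjust envlist and deps to your project):

[tox]
envlist = py38

[testenv]
deps = pytest
commands = pytest

Given a .bumpversion.cfg, releasing a new patch version with bump2version is then a single command: $ bump2version patch.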
Testing
Setup testing using pytest. It has a wide range of useful features, such as fixtures (modularized per-test setup code) and test parametrization (quickly execute the same test for multiple inputs).
[1]:
%%writefile /tmp/tests.py
import os

import pytest


@pytest.fixture(scope='session')
def custom_directory(tmp_path_factory):
    return tmp_path_factory.mktemp('workflow_test')


def test_fixture_execution(custom_directory):
    assert os.path.isdir(custom_directory)


@pytest.mark.parametrize('expression_str,result', [
    ('2+2', 4), ('2*2', 4), ('2**2', 4)
])
def test_expression_evaluation(expression_str, result):
    assert eval(expression_str) == result
Writing /tmp/tests.py
[2]:
!pytest -v /tmp/tests.py
============================= test session starts ==============================
platform darwin -- Python 3.8.6, pytest-6.2.2, py-1.10.0, pluggy-0.13.1 -- /Users/kimja/.pyenv/versions/3.8.6/bin/python3.8
cachedir: .pytest_cache
rootdir: /tmp
plugins: anyio-2.0.2
collected 4 items
../../../../../../../../tmp/tests.py::test_fixture_execution PASSED [ 25%]
../../../../../../../../tmp/tests.py::test_expression_evaluation[2+2-4] PASSED [ 50%]
../../../../../../../../tmp/tests.py::test_expression_evaluation[2*2-4] PASSED [ 75%]
../../../../../../../../tmp/tests.py::test_expression_evaluation[2**2-4] PASSED [100%]
============================== 4 passed in 0.03s ===============================
Linting/Formatting
Linters and code formatters improve the quality of your Python code by conducting a static analysis and flagging issues.
flake8: Catch various common errors and adhere to PEP8. Supports many plugins.
pylint: Looks for even more sources of code smell.
black: “the uncompromising Python code formatter”.
While there can be a considerable overlap between the tools’ outputs, each offers its own advantages and they can typically be used together.
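All three come with command-line entry points; a typical invocation (my_package/ being a placeholder for your source directory) looks like:

$ flake8 my_package/
$ pylint my_package/
$ black my_package/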
Profiling
Code profiling tools are a great way of finding parts of your code which can be optimized. They come in various flavors:
line_profiler: which parts of the code require most execution time
memory_profiler: which parts of the code consume the most memory
Consider the following script (note the @profile decorator):
[3]:
%%writefile /tmp/script.py
@profile
def main():
    # takes a long time
    for _ in range(100_000):
        1337**42

    # requires a lot of memory
    arr = [1] * 1_000_000


main()
Writing /tmp/script.py
line_profiler
[4]:
!kernprof -l -v -o /tmp/script.py.lprof /tmp/script.py
Wrote profile results to /tmp/script.py.lprof
Timer unit: 1e-06 s
Total time: 0.13024 s
File: /tmp/script.py
Function: main at line 2
Line # Hits Time Per Hit % Time Line Contents
==============================================================
2 @profile
3 def main():
4 # takes a long time
5 100001 33213.0 0.3 25.5 for _ in range(100_000):
6 100000 93762.0 0.9 72.0 1337**42
7
8 # requires a lot of memory
9 1 3265.0 3265.0 2.5 arr = [1] * 1_000_000
memory_profiler
[5]:
!python3 -m memory_profiler /tmp/script.py
Filename: /tmp/script.py
Line # Mem usage Increment Occurences Line Contents
============================================================
2 37.668 MiB 37.668 MiB 1 @profile
3 def main():
4 # takes a long time
5 37.668 MiB 0.000 MiB 100001 for _ in range(100_000):
6 37.668 MiB 0.000 MiB 100000 1337**42
7
8 # requires a lot of memory
9 45.301 MiB 7.633 MiB 1 arr = [1] * 1_000_000
Debugging
Raw Python
ipdb is a useful Python command-line debugger. To invoke it, simply put import ipdb; ipdb.set_trace() in your code. Starting with Python 3.7, you can also write breakpoint(), which honors the PYTHONBREAKPOINT environment variable. To automatically start the debugger when an error occurs, run your script with python -m ipdb -c continue <script>.
The debugger supports various commands:
p: print expression
pp: pretty print expression
n: execute next line in current function
s: execute current line and stop at next possible location (e.g. in a function call)
c: continue execution
unt: execute until a line with a greater number is reached
l: list source code (l . to list around the current line)
ll: list whole source code of current function
b: set breakpoint ([([filename:]lineno | function) [, condition]])
w/bt: print stack trace
u: move up the stack trace
d: move down the stack trace
h: help
q: quit
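For example, a minimal sketch (buggy_mean is a made-up function) which drops into the debugger right before a suspicious line:

def buggy_mean(values):
    total = 0
    for value in values:
        breakpoint()  # enters pdb, or ipdb if PYTHONBREAKPOINT=ipdb.set_trace is set
        total += value
    return total / len(values)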
C++ extensions
To debug a C++ extension, open two windows: one running ipython, one running lldb (or gdb). Attach the debugger to the IPython process, set a breakpoint in the extension source, and then run the script:

In [1]: !ps aux | grep -i ipython

(lldb) attach --pid 1234
(lldb) breakpoint set -f myfile.cpp -l 400
(lldb) continue

In [2]: run myscript.py
Documentation
Use sphinx to generate your project's documentation; nbsphinx allows including Jupyter notebooks in it.
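Bootstrapping a project follows the standard sphinx workflow (docs is an arbitrary directory name):

$ sphinx-quickstart docs
$ sphinx-build -b html docs docs/_build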
Logging
There are various built-in and third-party logging modules available.
[6]:
from loguru import logger
[7]:
logger.debug('Helpful debug message')
logger.error('oh no')
2021-05-01 11:27:35.525 | DEBUG | __main__:<module>:1 - Helpful debug message
2021-05-01 11:27:35.526 | ERROR | __main__:<module>:2 - oh no
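loguru also makes logging to files painless. A small sketch (app.log is an arbitrary file name):

logger.add('app.log', level='INFO')  # additionally write INFO and above to a file
logger.info('this ends up on stderr and in app.log')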
Data Science
SciPy
SciPy comprises various popular Python modules for scientific computing.
Numpy provides fast N-dimensional arrays and can be used for a multitude of numerical tasks.
[8]:
import numpy as np
[9]:
data = np.random.normal(size=(100, 3))
[10]:
data[:10, :]
[10]:
array([[ 0.12591488, -0.4218942 , -2.12752191],
[ 1.27252206, 0.22487169, -0.96513306],
[ 0.42667086, -0.61475518, -0.11270731],
[ 0.52754573, 0.29209191, -0.03715688],
[-1.4418503 , -0.45623167, -0.56817326],
[ 0.53189076, 0.55974594, -0.98371494],
[ 1.00713651, 0.2159477 , 0.14132138],
[ 0.5308628 , -1.82106195, -0.44155486],
[-1.04296029, 0.16444262, -0.2541122 ],
[ 0.66799592, -1.62949468, 1.29769595]])
Dataframes
Organizing your data in dataframes using pandas makes nearly everything easier.
[11]:
import pandas as pd
[12]:
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
df['group'] = np.random.choice(['G1', 'G2'], size=df.shape[0])
[13]:
df.head()
[13]:
A | B | C | group | |
---|---|---|---|---|
0 | 0.125915 | -0.421894 | -2.127522 | G1 |
1 | 1.272522 | 0.224872 | -0.965133 | G1 |
2 | 0.426671 | -0.614755 | -0.112707 | G1 |
3 | 0.527546 | 0.292092 | -0.037157 | G1 |
4 | -1.441850 | -0.456232 | -0.568173 | G1 |
Python’s Tidyverse
The philosophy of R’s tidyverse makes working with dataframes a joy. The following packages try to bring these concepts to the world of Python.
siuba as dplyr
siuba implements a grammar of data manipulation inspired by dplyr. It starts by introducing various piping operations to pandas dataframes.
dfply and plydata do (more or less) the same thing.
[14]:
from siuba import group_by, summarize, _
from siuba.data import mtcars
[15]:
(mtcars >> group_by(_.cyl) >> summarize(hp_mean=_.hp.mean()))
[15]:
cyl | hp_mean | |
---|---|---|
0 | 4 | 82.636364 |
1 | 6 | 122.285714 |
2 | 8 | 209.214286 |
plotnine as ggplot2
plotnine implements a grammar of graphics and is (in ideology) based on R’s ggplot2.
[16]:
import plotnine
[17]:
# first convert dataframe from wide to long format
df_long = pd.melt(df, id_vars=['group'])
df_long.head()
[17]:
group | variable | value | |
---|---|---|---|
0 | G1 | A | 0.125915 |
1 | G1 | A | 1.272522 |
2 | G1 | A | 0.426671 |
3 | G1 | A | 0.527546 |
4 | G1 | A | -1.441850 |
[18]:
(
    plotnine.ggplot(df_long, plotnine.aes(x='variable', y='value', fill='group'))
    + plotnine.geom_boxplot()
    + plotnine.facet_wrap('~group')
    + plotnine.theme_minimal()
)
[18]:
<ggplot: (334842766)>
pandas-datareader
What fun is data science without data?
pandas-datareader gives you direct access to a diverse set of data sources.
[19]:
import pandas_datareader as pdr
For example, databases from Eurostat are readily available:
[20]:
# ilc_pw01: Average rating of satisfaction by domain, sex, age and educational attainment level
df_eurostat = pdr.data.DataReader('ilc_pw01', 'eurostat')
df_eurostat.T.head()
[20]:
ISCED11 | UNIT | SEX | INDIC_WB | AGE | GEO | FREQ | TIME_PERIOD 2018-01-01
---|---|---|---|---|---|---|---
Less than primary, primary and lower secondary education (levels 0-2) | Rating (0-10) | Females | Satisfaction with accommodation | From 16 to 24 years | Austria | Annual | NaN
| | | | | Belgium | Annual | NaN
| | | | | Bulgaria | Annual | NaN
| | | | | Switzerland | Annual | NaN
| | | | | Cyprus | Annual | NaN
[21]:
df_eurostat.columns.get_level_values('SEX')
[21]:
Index(['Females', 'Females', 'Females', 'Females', 'Females', 'Females',
'Females', 'Females', 'Females', 'Females',
...
'Total', 'Total', 'Total', 'Total', 'Total', 'Total', 'Total', 'Total',
'Total', 'Total'],
dtype='object', name='SEX', length=58050)
[22]:
df_sub = (
    df_eurostat.T.xs('All ISCED 2011 levels', level='ISCED11')
    .xs('Total', level='SEX')
    .xs('Rating (0-10)', level='UNIT')
    .xs('Annual', level='FREQ')
    .reset_index()
    .drop('GEO', axis=1)
    .rename(
        columns={
            pd.Timestamp('2018-01-01 00:00:00'): 'satisfaction',
            'INDIC_WB': 'type',
            'AGE': 'age_range',
        }
    )
)
df_sub = df_sub[
    df_sub['age_range'].isin(
        [
            'From 16 to 24 years',
            'From 25 to 34 years',
            'From 35 to 49 years',
            'From 50 to 64 years',
            'From 65 to 74 years',
        ]
    )
]
df_sub.dropna().head()
[22]:
TIME_PERIOD | type | age_range | satisfaction |
---|---|---|---|
774 | Satisfaction with financial situation | From 16 to 24 years | 7.7 |
775 | Satisfaction with financial situation | From 16 to 24 years | 7.1 |
776 | Satisfaction with financial situation | From 16 to 24 years | 4.1 |
777 | Satisfaction with financial situation | From 16 to 24 years | 7.1 |
778 | Satisfaction with financial situation | From 16 to 24 years | 6.4 |
[23]:
(
    plotnine.ggplot(
        df_sub.dropna(), plotnine.aes(x='type', y='satisfaction', fill='age_range')
    )
    + plotnine.geom_boxplot()
    + plotnine.theme_minimal()
    + plotnine.theme(axis_text_x=plotnine.element_text(rotation=45, hjust=1))
)
[23]:
<ggplot: (334895665)>
Networkx
Networkx is a wonderful library for conducting network analysis.
[24]:
import networkx as nx
[25]:
graph = nx.watts_strogatz_graph(100, 4, 0.1)
print(nx.info(graph))
Name:
Type: Graph
Number of nodes: 100
Number of edges: 200
Average degree: 4.0000
[26]:
pos = nx.drawing.nx_agraph.graphviz_layout(graph, prog='neato', args='-Goverlap=scale')
list(pos.items())[:3]
[26]:
[(0, (2800.5, 870.29)), (1, (2479.1, 777.36)), (2, (2464.1, 408.03))]
[27]:
node_clustering = nx.clustering(graph)
list(node_clustering.items())[:3]
[27]:
[(0, 0.5), (1, 0.5), (2, 0.3333333333333333)]
[28]:
nx.draw(
    graph,
    pos,
    node_size=100,
    nodelist=list(node_clustering.keys()),
    node_color=list(node_clustering.values()),
)
Plotting
Matplotlib
Matplotlib is the de facto standard plotting library for Python.
[29]:
import matplotlib.pyplot as plt
[30]:
fig, ax = plt.subplots()
ax.scatter(data[:, 0], data[:, 1])
fig.tight_layout()
Axis ticks can be formatted in a multitude of different ways. The most versatile way is probably FuncFormatter.
[31]:
from matplotlib.ticker import FuncFormatter
[32]:
@FuncFormatter
def my_formatter(x, pos):
    return f'{x=}, {pos=}'
[33]:
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(data[:, 0], data[:, 1])
ax.xaxis.set_major_formatter(my_formatter)
ax.yaxis.set_major_formatter(my_formatter)
Seaborn
Seaborn makes working with dataframes and creating commonly used plots accessible and comfortable.
[34]:
import seaborn as sns
[35]:
df_long.head()
[35]:
group | variable | value | |
---|---|---|---|
0 | G1 | A | 0.125915 |
1 | G1 | A | 1.272522 |
2 | G1 | A | 0.426671 |
3 | G1 | A | 0.527546 |
4 | G1 | A | -1.441850 |
[36]:
sns.boxplot(data=df_long, x='variable', y='value', hue='group')
[36]:
<AxesSubplot:xlabel='variable', ylabel='value'>
Statannot
Statannot can be used to quickly add markers of significance to comparison plots.
[37]:
import statannot
[38]:
ax = sns.boxplot(
    data=df_long,
    x='variable',
    y='value',
    hue='group',
    order=['A', 'B', 'C'],
    hue_order=['G1', 'G2'],
)
statannot.add_stat_annotation(
    ax,
    plot='barplot',
    data=df_long,
    x='variable',
    y='value',
    hue='group',
    order=['A', 'B', 'C'],
    hue_order=['G1', 'G2'],
    box_pairs=[(('B', 'G1'), ('B', 'G2'))],
    text_format='simple',
    test='Mann-Whitney',
)
B_G1 v.s. B_G2: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=9.415e-01 U_stat=1.207e+03
[38]:
(<AxesSubplot:xlabel='variable', ylabel='value'>,
[<statannot.StatResult.StatResult at 0x1401d3ca0>])
Brokenaxes
Brokenaxes can be used to include outliers in a plot without messing up the axis range. Note that this can be quite misleading.
[39]:
import brokenaxes
[40]:
bax = brokenaxes.brokenaxes(ylims=((0, 20), (90, 110)))
bax.boxplot([np.random.normal(10, size=100), np.random.normal(100, size=100)])
[40]:
[{'whiskers': [<matplotlib.lines.Line2D at 0x14037ba30>,
<matplotlib.lines.Line2D at 0x14038b3d0>,
<matplotlib.lines.Line2D at 0x140399850>,
<matplotlib.lines.Line2D at 0x140399b50>],
'caps': [<matplotlib.lines.Line2D at 0x14038b730>,
<matplotlib.lines.Line2D at 0x14038ba90>,
<matplotlib.lines.Line2D at 0x140399eb0>,
<matplotlib.lines.Line2D at 0x1403a5250>],
'boxes': [<matplotlib.lines.Line2D at 0x140385490>,
<matplotlib.lines.Line2D at 0x1403994f0>],
'medians': [<matplotlib.lines.Line2D at 0x14038bdf0>,
<matplotlib.lines.Line2D at 0x1403a55b0>],
'fliers': [<matplotlib.lines.Line2D at 0x140399190>,
<matplotlib.lines.Line2D at 0x1403a5910>],
'means': []},
{'whiskers': [<matplotlib.lines.Line2D at 0x1403af040>,
<matplotlib.lines.Line2D at 0x1403af3a0>,
<matplotlib.lines.Line2D at 0x1403ba6d0>,
<matplotlib.lines.Line2D at 0x1403baa30>],
'caps': [<matplotlib.lines.Line2D at 0x1403af5b0>,
<matplotlib.lines.Line2D at 0x1403af910>,
<matplotlib.lines.Line2D at 0x1403bad90>,
<matplotlib.lines.Line2D at 0x1403c7100>],
'boxes': [<matplotlib.lines.Line2D at 0x1403a5ca0>,
<matplotlib.lines.Line2D at 0x1403ba370>],
'medians': [<matplotlib.lines.Line2D at 0x1403afc70>,
<matplotlib.lines.Line2D at 0x1403c7460>],
'fliers': [<matplotlib.lines.Line2D at 0x1403affd0>,
<matplotlib.lines.Line2D at 0x1403c77c0>],
'means': []}]
Adjusttext
Adjusttext can help for plots with many labels which potentially overlap.
[41]:
from adjustText import adjust_text
[42]:
data_sub = data[:40, :]
fig, (ax_raw, ax_adj) = plt.subplots(nrows=1, ncols=2, figsize=(16, 6))
ax_raw.scatter(data_sub[:, 0], data_sub[:, 1])
[
    ax_raw.annotate(f'{round(x, 1)},{round(y, 1)}', xy=(x, y))
    for x, y in data_sub[:, [0, 1]]
]
ax_adj.scatter(data_sub[:, 0], data_sub[:, 1])
adjust_text(
    [
        ax_adj.annotate(f'{round(x, 1)},{round(y, 1)}', xy=(x, y))
        for x, y in data_sub[:, [0, 1]]
    ],
    arrowprops=dict(arrowstyle='->'),
)
[42]:
182
Folium
Folium is a Python wrapper of the Leaflet.js library to visualize dynamic maps.
[43]:
import folium
[44]:
folium.Map(
    location=[np.random.uniform(40, 70), np.random.uniform(10, 30)],
    zoom_start=7,
    width=500,
    height=500,
)
[44]:
Ahlive
Creating animated visualizations becomes easy and fun with ahlive.
[45]:
import ahlive as ah
[46]:
adf = ah.DataFrame(pd.DataFrame({'A': [1, 2, 3], 'B': [1, 2, 3]}), xs='A', ys='B')
adf.render()
[########################################] | 100% Completed | 6.5s
[46]:
Color palettes
Choosing the right color palette for your visualization can be tricky. palettable provides many useful ones.
[47]:
import palettable
You can easily display a multitude of colorscales…
[48]:
palettable.tableau.Tableau_10.show_discrete_image()
…and access them in a variety of ways.
[49]:
print(palettable.tableau.Tableau_10.name)
print(palettable.tableau.Tableau_10.type)
print(palettable.tableau.Tableau_10.hex_colors)
print(palettable.tableau.Tableau_10.mpl_colormap)
Tableau_10
qualitative
['#1F77B4', '#FF7F0E', '#2CA02C', '#D62728', '#9467BD', '#8C564B', '#E377C2', '#7F7F7F', '#BCBD22', '#17BECF']
<matplotlib.colors.LinearSegmentedColormap object at 0x1426fcac0>
High performance
When dealing with large amounts of data or many computations, it can make sense to optimize hotspots in C++ or use specialized libraries.
Dask
Dask provides a pandas-like interface to high-performance dataframes which support out-of-memory processing, distribution across a cluster, and more. It is particularly useful when a dataframe no longer fits into RAM. Common operations work on chunks of the dataframe and are only executed when explicitly requested.
[50]:
import dask.dataframe as dd
[51]:
df = pd.DataFrame(np.random.normal(size=(1_000_000, 2)), columns=['A', 'B'])
[52]:
ddf = dd.from_pandas(df, npartitions=4)
ddf.head()
[52]:
A | B | |
---|---|---|
0 | 1.209945 | -0.829657 |
1 | -1.041410 | 0.138946 |
2 | 0.497872 | 0.628439 |
3 | -0.418077 | 0.039312 |
4 | -0.353732 | 0.500529 |
[53]:
ddf['A'] + ddf['B']
[53]:
Dask Series Structure:
npartitions=4
0 float64
250000 ...
500000 ...
750000 ...
999999 ...
dtype: float64
Dask Name: add, 16 tasks
[54]:
(ddf['A'] + ddf['B']).compute()
[54]:
0 0.380288
1 -0.902464
2 1.126310
3 -0.378765
4 0.146798
...
999995 -1.198871
999996 0.195255
999997 1.707589
999998 -1.505217
999999 -1.828147
Length: 1000000, dtype: float64
Vaex
Vaex fills a similar niche to dask and makes working with out-of-core dataframes easy. It has a slightly more intuitive interface and offers many cool visualizations right out of the box.
[55]:
import vaex as vx
[56]:
vdf = vx.from_pandas(df)
vdf.head()
[56]:
# | A | B |
---|---|---|
0 | 1.20994 | -0.829657 |
1 | -1.04141 | 0.138946 |
2 | 0.497872 | 0.628439 |
3 | -0.418077 | 0.0393116 |
4 | -0.353732 | 0.500529 |
5 | 0.218921 | -0.140769 |
6 | 1.44677 | 0.177711 |
7 | -0.929196 | -0.306118 |
8 | 1.23181 | 1.37377 |
9 | -0.354355 | -1.17795 |
[57]:
vdf['A'] + vdf['B']
[57]:
Expression = (A + B)
Length: 1,000,000 dtype: float64 (expression)
---------------------------------------------
0 0.380288
1 -0.902464
2 1.12631
3 -0.378765
4 0.146798
...
999995 -1.19887
999996 0.195255
999997 1.70759
999998 -1.50522
999999 -1.82815
[58]:
vdf.plot(vdf['A'], vdf['B'])
[58]:
<matplotlib.image.AxesImage at 0x141ea3310>
Joblib
Joblib makes executing functions in parallel very easy and removes boilerplate code.
[59]:
import time
import random
import joblib
[60]:
def heavy_function(i):
    print(f'{i=}')
    time.sleep(random.random())
    return i**i
[61]:
joblib.Parallel(n_jobs=2)([joblib.delayed(heavy_function)(i) for i in range(10)])
[61]:
[1, 1, 4, 27, 256, 3125, 46656, 823543, 16777216, 387420489]
Swifter
Choosing the correct way of parallelizing your computations can be non-trivial. Swifter tries to automatically select the most suitable one.
[62]:
import swifter
[63]:
df_big = pd.DataFrame({'A': np.random.randint(0, 100, size=1_000_000)})
df_big.head()
[63]:
A | |
---|---|
0 | 29 |
1 | 84 |
2 | 61 |
3 | 63 |
4 | 39 |
[64]:
%%timeit
df_big['A'].apply(lambda x: x**2)
526 ms ± 4.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
[65]:
%%timeit
df_big['A'].swifter.apply(lambda x: x**2)
2.84 ms ± 139 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Bioinformatics
PyRanges
PyRanges makes working with genomic ranges easy as pie.
[66]:
import pyranges as pr
[67]:
df_exons = pr.data.exons()
df_exons
[67]:
+--------------+-----------+-----------+-------+
| Chromosome | Start | End | +3 |
| (category) | (int32) | (int32) | ... |
|--------------+-----------+-----------+-------|
| chrX | 135721701 | 135721963 | ... |
| chrX | 135574120 | 135574598 | ... |
| chrX | 47868945 | 47869126 | ... |
| chrX | 77294333 | 77294480 | ... |
| ... | ... | ... | ... |
| chrY | 15409586 | 15409728 | ... |
| chrY | 15478146 | 15478273 | ... |
| chrY | 15360258 | 15361762 | ... |
| chrY | 15467254 | 15467278 | ... |
+--------------+-----------+-----------+-------+
Stranded PyRanges object has 1,000 rows and 6 columns from 2 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
3 hidden columns: Name, Score, Strand
[68]:
df_locus = pr.PyRanges(
    pd.DataFrame({'Chromosome': ['chrX'], 'Start': [1_400_000], 'End': [1_500_000]})
)
df_locus
[68]:
+--------------+-----------+-----------+
| Chromosome | Start | End |
| (category) | (int32) | (int32) |
|--------------+-----------+-----------|
| chrX | 1400000 | 1500000 |
+--------------+-----------+-----------+
Unstranded PyRanges object has 1 rows and 3 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome.
[69]:
df_exons.overlap(df_locus).df
[69]:
Chromosome | Start | End | Name | Score | Strand | |
---|---|---|---|---|---|---|
0 | chrX | 1475113 | 1475229 | NM_001267713_exon_4_0_chrX_1475114_f | 0 | + |
1 | chrX | 1419383 | 1419519 | NM_001161531_exon_9_0_chrX_1419384_f | 0 | + |
2 | chrX | 1424338 | 1424420 | NM_006140_exon_11_0_chrX_1424339_f | 0 | + |
3 | chrX | 1407651 | 1407781 | NM_001161532_exon_3_0_chrX_1407652_f | 0 | + |
4 | chrX | 1404670 | 1404813 | NM_172245_exon_3_0_chrX_1404671_f | 0 | + |
5 | chrX | 1424338 | 1424420 | NM_001161530_exon_10_0_chrX_1424339_f | 0 | + |
6 | chrX | 1414319 | 1414349 | NM_172245_exon_8_0_chrX_1414320_f | 0 | + |
7 | chrX | 1407411 | 1407535 | NM_172249_exon_4_0_chrX_1407412_f | 0 | + |
Obonet
Obonet is a library for working with (OBO-formatted) ontologies.
[70]:
import obonet
[71]:
url = 'https://github.com/DiseaseOntology/HumanDiseaseOntology/raw/main/src/ontology/HumanDO.obo'
graph = obonet.read_obo(url)
[72]:
list(graph.nodes(data=True))[0]
[72]:
('DOID:0001816',
{'name': 'angiosarcoma',
'alt_id': ['DOID:267', 'DOID:4508'],
'def': '"A vascular cancer that derives_from the cells that line the walls of blood vessels or lymphatic vessels." [url:http\\://en.wikipedia.org/wiki/Hemangiosarcoma, url:https\\://en.wikipedia.org/wiki/Angiosarcoma, url:https\\://ncit.nci.nih.gov/ncitbrowser/ConceptReport.jsp?dictionary=NCI_Thesaurus&ns=ncit&code=C3088, url:https\\://www.ncbi.nlm.nih.gov/pubmed/23327728]',
'subset': ['DO_cancer_slim', 'NCIthesaurus'],
'synonym': ['"hemangiosarcoma" EXACT []'],
'xref': ['ICDO:9120/3',
'MESH:D006394',
'NCI:C3088',
'NCI:C9275',
'SNOMEDCT_US_2020_09_01:39000009',
'UMLS_CUI:C0018923',
'UMLS_CUI:C0854893'],
'is_a': ['DOID:175']})
Statistics/Machine Learning
Statsmodels
Statsmodels helps with statistical modelling.
[73]:
import statsmodels.formula.api as smf
[74]:
df_data = pd.DataFrame({'X': np.random.normal(size=100)})
df_data['Y'] = 1.3 * df_data['X'] + 4.2
df_data.head()
[74]:
X | Y | |
---|---|---|
0 | 0.927521 | 5.405777 |
1 | 2.110368 | 6.943479 |
2 | 1.517580 | 6.172854 |
3 | -0.069583 | 4.109542 |
4 | -0.644390 | 3.362293 |
[75]:
mod = smf.ols('Y ~ X', data=df_data)
res = mod.fit()
[76]:
res.params
[76]:
Intercept 4.2
X 1.3
dtype: float64
[77]:
res.summary().tables[1]
[77]:
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 4.2000 | 2.03e-16 | 2.07e+16 | 0.000 | 4.200 | 4.200 |
X | 1.3000 | 1.93e-16 | 6.73e+15 | 0.000 | 1.300 | 1.300 |
Pingouin
Pingouin provides additional statistical methods.
[78]:
import pingouin as pg
[79]:
pg.normality(np.random.normal(size=100))
[79]:
W | pval | normal | |
---|---|---|---|
0 | 0.985349 | 0.336443 | True |
[80]:
pg.normality(np.random.uniform(size=100))
[80]:
W | pval | normal | |
---|---|---|---|
0 | 0.946713 | 0.000507 | False |
Scikit-learn
Scikit-learn facilitates machine learning in Python.
[81]:
from sklearn import svm
from sklearn import datasets
from sklearn.model_selection import train_test_split
[82]:
X, y = datasets.load_iris(return_X_y=True)
X.shape, y.shape
[82]:
((150, 4), (150,))
[83]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
[84]:
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)
[84]:
0.9666666666666667
Scikit-learn offers various plugins which deal with common issues encountered while modeling.
Imbalanced-learn provides various re-sampling techniques when the dataset has annoying class imbalances.
[85]:
import collections
from imblearn.over_sampling import RandomOverSampler
[86]:
ros = RandomOverSampler(random_state=0)
[87]:
X_sub, y_sub = X[:60, :], y[:60]
X_resampled, y_resampled = ros.fit_resample(X_sub, y_sub)
[88]:
print('sub:', sorted(collections.Counter(y_sub).items()))
print('resampled:', sorted(collections.Counter(y_resampled).items()))
sub: [(0, 50), (1, 10)]
resampled: [(0, 50), (1, 50)]
Category_encoders helps with converting categorical variables to numerical ones.
[89]:
import category_encoders
[90]:
tmp = np.random.choice(['A', 'B'], size=10)
df_cat = pd.DataFrame({'original_class': tmp, 'feature01': tmp})
df_cat.head()
[90]:
original_class | feature01 | |
---|---|---|
0 | A | A |
1 | B | B |
2 | A | A |
3 | A | A |
4 | B | B |
[91]:
category_encoders.OneHotEncoder(cols=['feature01']).fit_transform(df_cat)
[91]:
original_class | feature01_1 | feature01_2 | |
---|---|---|---|
0 | A | 1 | 0 |
1 | B | 0 | 1 |
2 | A | 1 | 0 |
3 | A | 1 | 0 |
4 | B | 0 | 1 |
5 | B | 0 | 1 |
6 | B | 0 | 1 |
7 | B | 0 | 1 |
8 | B | 0 | 1 |
9 | B | 0 | 1 |
Yellowbrick makes a multitude of visual diagnostic tools readily accessible.
[92]:
from yellowbrick.classifier import ROCAUC
[93]:
clf.fit(X, y)
[93]:
SVC(C=1, kernel='linear')
[94]:
visualizer = ROCAUC(clf)
# visualizer.score(X, y) # TODO: uncomment once scikit-learn is fixed
visualizer.show()
[94]:
<AxesSubplot:title={'center':'ROC Curves for SVC'}, xlabel='False Positive Rate', ylabel='True Positive Rate'>
Language Bindings
Pybind11
Pybind11 makes writing bindings between Python and C++ enjoyable. In combination with cppimport some might even call it fun. It is possible to implement custom typecasters to support bindings for arbitrary objects.
[95]:
%%writefile cpp_source.cpp
#include <pybind11/pybind11.h>

namespace py = pybind11;

int square(int x) {
    return x * x;
}

PYBIND11_MODULE(cpp_source, m) {
    m.def(
        "square", &square,
        py::arg("x") = 1
    );
}

/*
<%
setup_pybind11(cfg)
cfg['compiler_args'] = ['-std=c++11']
%>
*/
Overwriting cpp_source.cpp
[96]:
import cppimport
[97]:
cpp_source = cppimport.imp('cpp_source')
[98]:
cpp_source.square(5)
[98]:
25
Jupyter
Nbstripout
Committing Jupyter notebooks to a VCS (e.g. git) can be annoying due to non-code properties (outputs, execution counts, metadata) being saved. Nbstripout strips all of those away and can be run automatically for each committed notebook by executing nbstripout --install once.
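It can also be applied to a single notebook manually (notebook.ipynb being a placeholder):

$ nbstripout notebook.ipynb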
Miscellaneous
humanize
humanize formats numbers with the goal of making them more human-readable.
[99]:
import datetime
import humanize
This is, of course, highly context-dependent:
[100]:
big_number = 4578934
print('Actual number:', big_number)
print('Readable number:', humanize.intword(big_number))
print('Time difference:', humanize.precisedelta(datetime.timedelta(seconds=big_number)))
print('File size:', humanize.naturalsize(big_number))
Actual number: 4578934
Readable number: 4.6 million
Time difference: 1 month, 21 days, 23 hours, 55 minutes and 34 seconds
File size: 4.6 MB
rich
Making sure that terminal applications have nicely formatted output makes using them a much better experience.
rich makes it easy to add colors and other fluff.
[101]:
import rich
[102]:
rich.print('[green]Hello[/green] [bold red]World[/bold red] :tada:')
Hello World 🎉
It is also useful in the everyday life of a Python developer:
[103]:
rich.inspect(rich.print)
╭──────────────────────────── <function print at 0x14c517d30> ────────────────────────────╮
│ def print(*objects: Any, sep=' ', end='\n', file: IO[str] = None, flush: bool = False): │
│                                                                                         │
│ Print object(s) supplied via positional arguments.                                      │
│ This function has an identical signature to the built-in print.                         │
│ For more advanced features, see the :class:`~rich.console.Console` class.               │
│                                                                                         │
│ 35 attribute(s) not shown. Run inspect(inspect) for options.                            │
╰─────────────────────────────────────────────────────────────────────────────────────────╯