pyskim: Quick summary statistics for dataframes
When starting a new data analysis project, familiarizing yourself with the data set is crucial (alongside acquiring domain-specific knowledge, etc.).
This typically involves a set of somewhat repetitive steps such as value counts, histograms, scatterplots and so on.
pyskim helps you achieve that goal as quickly and comfortably as possible.
Simply locate your CSV file (or a file using whichever delimiter you prefer) and call pyskim from the command line:
$ pyskim iris.csv
── Data Summary ────────────────────────────────────────────────────────────────────────────────────
type                 value
-----------------  -------
Number of rows         150
Number of columns        5
──────────────────────────────────────────────────
Column type frequency:
           Count
-------  -------
float64        4
string         1
── Variable type: number ───────────────────────────────────────────────────────────────────────────
    name            na_count    mean     sd    p0    p25    p50    p75    p100  hist
--  ------------  ----------  ------  -----  ----  -----  -----  -----  ------  ----------
 0  sepal_length           0    5.84  0.828   4.3    5.1   5.8     6.4     7.9  ▂▆▃▇▄▇▅▁▁▁
 1  sepal_width            0    3.06  0.436   2      2.8   3       3.3     4.4  ▁▁▄▅▇▆▂▂▁▁
 2  petal_length           0    3.76  1.77    1      1.6   4.35    5.1     6.9  ▇▃▁▁▂▅▆▄▃▁
 3  petal_width            0    1.2   0.762   0.1    0.3   1.3     1.8     2.5  ▇▂▁▂▂▆▁▄▂▃
── Variable type: string ───────────────────────────────────────────────────────────────────────────
    name               na_count    n_unique  top_counts
--  ---------------  ----------  ----------  -----------------------------------------
 0  species                   0           3  versicolor: 50, setosa: 50, virginica: 50
The output shows the most relevant dataframe properties at a glance and then provides statistics for each column. Which statistics are computed depends on the column’s datatype, and this can be customized to support your own datatypes.
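
To make the idea of datatype-dependent statistics concrete, here is a minimal sketch of how such a summary could be computed with plain pandas and NumPy. It is not pyskim's implementation or API: the function names (`sparkline`, `summarize_numeric`, `summarize_string`, `summarize`), the choice of block characters, and the limit of three entries in `top_counts` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Block characters for the text histogram (an illustrative choice).
BARS = "▁▂▃▄▅▆▇"


def sparkline(col: pd.Series, bins: int = 10) -> str:
    """Render a histogram of a numeric column as one line of block characters."""
    values = col.dropna().to_numpy(dtype=float)
    if values.size == 0:
        return ""
    counts, _ = np.histogram(values, bins=bins)
    # Scale each bin count to an index into BARS; non-empty bins round up.
    levels = np.ceil(counts / counts.max() * (len(BARS) - 1)).astype(int)
    return "".join(BARS[i] for i in levels)


def summarize_numeric(col: pd.Series) -> dict:
    """Statistics for a numeric column: missing values, moments, quantiles."""
    stats = {
        "na_count": int(col.isna().sum()),
        "mean": col.mean(),
        "sd": col.std(),  # sample standard deviation (ddof=1)
    }
    for p in (0.0, 0.25, 0.5, 0.75, 1.0):
        stats[f"p{int(p * 100)}"] = col.quantile(p)
    stats["hist"] = sparkline(col)
    return stats


def summarize_string(col: pd.Series) -> dict:
    """Statistics for a non-numeric column: missing values and value counts."""
    counts = col.value_counts()  # sorted by frequency, NaN excluded
    return {
        "na_count": int(col.isna().sum()),
        "n_unique": int(col.nunique()),
        "top_counts": ", ".join(f"{k}: {v}" for k, v in counts.head(3).items()),
    }


def summarize(df: pd.DataFrame) -> dict:
    """Pick the summary function for each column based on its dtype."""
    result = {}
    for name in df.columns:
        col = df[name]
        if pd.api.types.is_numeric_dtype(col):
            result[name] = summarize_numeric(col)
        else:
            result[name] = summarize_string(col)
    return result
```

Calling `summarize(pd.read_csv("iris.csv"))` would return one dictionary of statistics per column, keyed by column name. Supporting an additional datatype in this sketch would amount to adding another branch in `summarize`.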