Diglett: Tools for data-munging

diglett.eda

Tools for exploratory data analysis (EDA).

These functions help you understand the nature of a dataset. Most of them display their results rather than returning them.

diglett.eda.show_top_n(data, n=10, show_output=True, other_val='…')

Cleanly display the top N values from a “group by count” SQL output.

Parameters

data (Union[Series, DataFrame]) – A series or dataframe, with columns (<dim>, num_).
n (int) – The number of rows to show, before grouping remainder as “other”.
show_output (bool) – If True, display the formatted result; otherwise return the underlying DataFrame.
other_val (str) – The value to which values beyond the top N are grouped.

Return type

Union[Series, DataFrame, None]

diglett.eda.summarize(df, return_output=False)

Show a better summary than df.info().

Includes the number of rows, memory usage, null counts, unique counts, min/mean/max of numeric columns, and the mode of string columns.

Return type: Optional[DataFrame]
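
To make the shape of such a summary concrete, here is a minimal sketch built with plain pandas. It is not diglett's implementation: summarize_sketch is a hypothetical name, and the sketch omits the memory usage and string-mode columns for brevity.

```python
import pandas as pd


def summarize_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: one row per column of df, with basic statistics."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "nulls": df.isna().sum(),
        "unique": df.nunique(),  # NaN is not counted as a unique value
    })
    # Numeric statistics align on column name; non-numeric rows stay NaN.
    numeric = df.select_dtypes(include="number")
    summary["min"] = numeric.min()
    summary["mean"] = numeric.mean()
    summary["max"] = numeric.max()
    return summary


df = pd.DataFrame({"age": [20.0, 30.0, None], "city": ["a", "b", "a"]})
summary = summarize_sketch(df)
print(f"{len(df)} rows")
print(summary)
```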

diglett.eda.tabulate(df, normalize=False, sorted=True, return_output=False)

Pivot a DataFrame to show a numeric column across each of two dimensions.

Parameters

df (DataFrame) – DataFrame with columns: (dim_A, dim_B, num_).
normalize (bool) – Whether to return absolute counts (default) or normalize by total.
sorted (bool) – Whether to sort index (and columns) to have highest row (or column) sums first.
return_output (bool) – By default, output is displayed, but can also be returned.

Return type

Optional[DataFrame]
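
The pivot described above can be approximated with pandas alone. The following is a rough sketch under stated assumptions, not the library's code: tabulate_sketch is a hypothetical name, the first three columns are taken to be (dim_A, dim_B, num_), and duplicate (dim_A, dim_B) pairs are summed.

```python
import pandas as pd


def tabulate_sketch(df: pd.DataFrame, normalize: bool = False, sort: bool = True) -> pd.DataFrame:
    """Hypothetical stand-in: pivot (dim_A, dim_B, num_) into a dim_A x dim_B table."""
    dim_a, dim_b, num = df.columns[:3]
    table = df.pivot_table(index=dim_a, columns=dim_b, values=num,
                           aggfunc="sum", fill_value=0)
    if normalize:
        # Divide every cell by the grand total so that all cells sum to 1.
        table = table / table.to_numpy().sum()
    if sort:
        # Put the rows and columns with the highest sums first.
        row_order = table.sum(axis=1).sort_values(ascending=False).index
        col_order = table.sum(axis=0).sort_values(ascending=False).index
        table = table.reindex(index=row_order, columns=col_order)
    return table


df = pd.DataFrame({
    "country": ["US", "US", "DE", "DE", "FR"],
    "device": ["mobile", "desktop", "mobile", "desktop", "mobile"],
    "num_users": [50, 30, 5, 20, 10],
})
table = tabulate_sketch(df)
normalized = tabulate_sketch(df, normalize=True)
print(table)
```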

diglett.insist

Insist (assert) assumptions about data quality.

These functions each verify a specific assumption.
They are “non-breaking”: they display HTML warnings instead of raising assertion errors.
They each return the input object, allowing them to be used in DataFrame.pipe() chains.
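
The pattern described here (warn, then hand the input back so a pipe chain can continue) can be sketched roughly as follows. This is an illustration only: no_nulls_sketch is a hypothetical stand-in, and it uses Python's warnings module where the library displays HTML.

```python
import warnings

import pandas as pd


def no_nulls_sketch(df: pd.DataFrame, cols=None) -> pd.DataFrame:
    """Hypothetical stand-in: warn about nulls, then return df unchanged."""
    cols = list(df.columns) if cols is None else cols
    null_counts = df[cols].isna().sum()
    offending = null_counts[null_counts > 0]
    if not offending.empty:
        # Assumption: a plain Python warning replaces the HTML alert.
        warnings.warn(f"Columns with nulls: {offending.to_dict()}")
    return df  # returning the input is what makes .pipe() chaining work


df = pd.DataFrame({"user_id": [1, 2, 3], "score": [0.5, None, 0.9]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = (
        df.pipe(no_nulls_sketch, cols=["score"])
        .assign(score_x2=lambda d: d["score"] * 2)
    )
```

Because the check does not raise, the chain runs to completion even when the assumption fails, and the warning is the only trace of the problem.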

diglett.insist.average_greater_than(df, col, threshold, return_alert=False)

Check that average of specified column is greater than some value.

Return type: Union[DataFrame, HTML]

diglett.insist.less_than_pct_null(df, cols=None, pct=0.01, return_alert=False)

Check that specified (or all) columns contain less than a given proportion (pct) of null values.

Return type: Union[DataFrame, HTML]

diglett.insist.more_than_pct_unique(df, col, pct=0.99, return_alert=False)

Check that at least a minimum proportion (pct) of the values in the specified column are unique.

Return type: Union[DataFrame, HTML]

diglett.insist.no_nulls(df, cols=None, **kwargs)

Check that specified (or all) columns do not contain null values.

Return type: Union[DataFrame, HTML]

diglett.transform

Tools for transforming input data into more usable form.

diglett.transform.fillnas(input_df, subset=None, value=0)

Apply fillna to a subset of columns.

Roughly equivalent to df[subset] = df[subset].fillna(value)

Parameters

input_df (DataFrame) – The DataFrame to operate on.
subset (Optional[List[str]]) – A list of the columns to operate on.
value (Any) – The value with which to fill nulls.

Return type

DataFrame
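
The "roughly equivalent" expression in the description can be wrapped into a small function for illustration. This is a sketch, not the library's code: fillnas_sketch is a hypothetical name, and two behaviours are assumptions the documentation does not confirm, namely that the input is copied rather than modified and that subset=None means all columns.

```python
import pandas as pd


def fillnas_sketch(input_df: pd.DataFrame, subset=None, value=0) -> pd.DataFrame:
    """Hypothetical stand-in: fill nulls in a subset of columns only."""
    df = input_df.copy()  # assumption: leave the caller's DataFrame untouched
    cols = list(df.columns) if subset is None else subset  # assumption: None = all
    df[cols] = df[cols].fillna(value)
    return df


df = pd.DataFrame({"a": [1.0, None], "b": [None, 2.0], "c": ["x", None]})
filled = fillnas_sketch(df, subset=["a", "b"])
print(filled)
```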

diglett.transform.multi_moving_average(df, window=7, min_periods=1)

Apply a moving average to a DataFrame with a 2-level index, where the second level is a dimension.

Parameters

df (DataFrame) – The DataFrame to operate on.
window (int) – Passed directly as argument to df.rolling()
min_periods (int) – Passed directly as argument to df.rolling()

Return type

DataFrame
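
One way to picture this operation is a rolling mean computed separately for each value of the second index level. The sketch below assumes exactly that reading. multi_moving_average_sketch is a hypothetical name and may differ from the actual implementation.

```python
import pandas as pd


def multi_moving_average_sketch(df: pd.DataFrame, window: int = 7,
                                min_periods: int = 1) -> pd.DataFrame:
    """Hypothetical stand-in: rolling mean within each second-level group."""
    ordered = df.sort_index()  # rows must be ordered along the first level
    return ordered.groupby(level=1).transform(
        lambda g: g.rolling(window, min_periods=min_periods).mean()
    )


idx = pd.MultiIndex.from_product(
    [pd.date_range("2024-01-01", periods=3), ["A", "B"]],
    names=["date", "segment"],
)
df = pd.DataFrame({"num_events": [1, 10, 3, 20, 5, 30]}, index=idx)
smoothed = multi_moving_average_sketch(df, window=2)
print(smoothed)
```

With window=2, each segment is averaged only against its own previous date, so segment A's values never mix with segment B's.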

diglett.transform.reindex_by_sum(df, axis=1, margin_col=None)

Reindex an axis of a DataFrame according to its sum.

Parameters

df (DataFrame) – The DataFrame to reindex.
axis (int) – The axis upon which to reindex.
margin_col (Optional[str]) – This column/index value is excluded from the sort and placed at the end.

Return type

DataFrame
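
A minimal sketch of this reordering follows, under two assumptions the documentation does not spell out: labels are ordered by descending sum, and axis=1 means the columns are reordered by their column totals. reindex_by_sum_sketch is a hypothetical name.

```python
import pandas as pd


def reindex_by_sum_sketch(df: pd.DataFrame, axis: int = 1, margin_col=None) -> pd.DataFrame:
    """Hypothetical stand-in: order the labels on `axis` by their totals."""
    # Summing along the other axis yields one total per label on `axis`.
    totals = df.sum(axis=1 - axis)
    order = totals.sort_values(ascending=False).index.tolist()
    if margin_col is not None and margin_col in order:
        # Keep the margin (e.g. a "Total" column) out of the sort, at the end.
        order.remove(margin_col)
        order.append(margin_col)
    return df.reindex(order, axis=axis)


df = pd.DataFrame({"a": [1, 1], "b": [5, 5], "Total": [6, 6]})
reordered = reindex_by_sum_sketch(df, axis=1, margin_col="Total")
print(reordered)
```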

diglett.transform.winsorize(srs, lower=0, upper=0.99, verbose=True)

Winsorize a series at specified quantiles.

Parameters

srs (Series) – The pandas Series to be winsorized.
lower (Union[int, float]) – The lower quantile; values below it get winsorized.
upper (Union[int, float]) – The upper quantile; values above it get winsorized.
verbose (bool) – Whether to print info to stdout for debugging.

Return type

Series
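
Winsorizing at quantiles amounts to clipping the series to the values found at those quantiles. The sketch below shows that idea with pandas' quantile and clip. winsorize_sketch is a hypothetical name, and the verbose output is left out.

```python
import pandas as pd


def winsorize_sketch(srs: pd.Series, lower: float = 0, upper: float = 0.99) -> pd.Series:
    """Hypothetical stand-in: clip a Series to its lower/upper quantile values."""
    lower_bound = srs.quantile(lower)
    upper_bound = srs.quantile(upper)
    # Values outside the bounds are replaced by the bounds, not dropped.
    return srs.clip(lower=lower_bound, upper=upper_bound)


srs = pd.Series(range(1, 102))  # the integers 1..101
clipped = winsorize_sketch(srs)
print(clipped.describe())
```

Unlike trimming, the series keeps its length: extreme values are pulled in to the boundary rather than removed.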

diglett.group

Tools related to df.groupby().

diglett.group.group_other(data, n=10, other_val='…', sort_by=None)

Group the “long-tail” dimensions (beyond top N) of a DataFrame or Series together.

Assumptions:

Index does not matter
All non-numeric columns are dimensions
If multiple numeric columns, sort by right-most before grouping

Parameters

data (Union[Series, DataFrame]) – A series or dataframe as input.
n (int) – Values beyond this point get grouped as “other”.
other_val (str) – The string with which to represent “other” values.
sort_by (Optional[str]) – The column by which to sort the dataframe, before grouping.
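
The long-tail grouping can be illustrated for the simplest case of one dimension column and one numeric column. This is a sketch only: group_other_sketch and its dim_col/num_col parameters are hypothetical and not part of the library's signature.

```python
import pandas as pd


def group_other_sketch(df: pd.DataFrame, dim_col: str, num_col: str,
                       n: int = 10, other_val: str = "…") -> pd.DataFrame:
    """Hypothetical stand-in: keep the top n rows, sum the rest into one row."""
    ranked = df.sort_values(num_col, ascending=False).reset_index(drop=True)
    top = ranked.head(n)
    if len(ranked) <= n:
        return top  # nothing beyond the top n, so no "other" row is needed
    tail_total = ranked[num_col].iloc[n:].sum()
    other = pd.DataFrame({dim_col: [other_val], num_col: [tail_total]})
    return pd.concat([top, other], ignore_index=True)


df = pd.DataFrame({
    "country": ["US", "DE", "FR", "ES", "IT"],
    "num_users": [50, 30, 10, 5, 5],
})
grouped = group_other_sketch(df, "country", "num_users", n=2)
print(grouped)
```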

diglett.join

Functions related to joining/merging datasets.

diglett.join.verbose_merge(left, right, left_on=None, right_on=None, left_index=False, right_index=False, *args, **kwargs)

Wraps pd.merge to provide a visual overview of cardinality between datasets.

Specify either both left_on and right_on, or both left_index and right_index.

Return type: DataFrame
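
To give a feel for the kind of information such an overview might contain, the sketch below uses pd.merge with indicator=True to count matched and unmatched keys and checks whether each key column is unique. What verbose_merge actually reports is not specified here, so merge_overview_sketch and its output format are assumptions.

```python
import pandas as pd


def merge_overview_sketch(left: pd.DataFrame, right: pd.DataFrame,
                          left_on: str, right_on: str) -> dict:
    """Hypothetical stand-in: summarize how the keys of two frames overlap."""
    probe = pd.merge(left[[left_on]], right[[right_on]],
                     left_on=left_on, right_on=right_on,
                     how="outer", indicator=True)
    counts = probe["_merge"].value_counts()
    return {
        "left_only": int(counts.get("left_only", 0)),
        "right_only": int(counts.get("right_only", 0)),
        "both": int(counts.get("both", 0)),
        # Unique keys on one side and duplicates on the other suggest one-to-many.
        "left_key_unique": bool(left[left_on].is_unique),
        "right_key_unique": bool(right[right_on].is_unique),
    }


left = pd.DataFrame({"key": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 3, 4], "amount": [10, 20, 30, 40]})
overview = merge_overview_sketch(left, right, left_on="key", right_on="key")
print(overview)
```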

diglett.output

Functions which produce some sort of output from a DataFrame.

diglett.output.display_side_by_side(*args)

Output an array of pandas DataFrames side-by-side in a Jupyter notebook to conserve vertical space.

Return type: None

diglett.output.format_helper(df, int_cols=None, pct_cols=None, delta_cols=None, monospace=True, hide_index=True, return_output=False)

Apply common formatting using pandas.DataFrame.style methods.

Parameters

df (Union[DataFrame, Styler]) – The pandas DataFrame to be displayed.
int_cols (Optional[List[str]]) – Optional hard-coded list of columns to display as integers.
pct_cols (Optional[List[str]]) – Optional hard-coded list of columns to display as percentages.
delta_cols (Optional[List[str]]) – Optional hard-coded list of columns to display as “deltas”.
monospace (bool) – Whether to display with monospace font.
hide_index (bool) – Whether to hide the index of the DataFrame.
return_output (bool) – This is only used for testing purposes.

Return type

Optional[Styler]