diglett package

Submodules

diglett.eda module

Tools for exploratory data analysis (EDA).

These tools help you understand the nature of a dataset, mostly by displaying output rather than returning values.

diglett.eda.show_top_n(data, n=10, show_output=True, other_val='…')

Cleanly display the top N values from a “group by count” SQL output.

Parameters
  • data (Union[Series, DataFrame]) – A series or dataframe, with columns (<dim>, num_).

  • n (int) – The number of rows to show, before grouping remainder as “other”.

  • show_output (bool) – If true, display result “nicely”, else return the actual df.

  • other_val (str) – The value to which values beyond the top N are grouped.

Return type

Union[Series, DataFrame, None]

diglett.eda.summarize(df, return_output=False)

Show a better summary than df.info().

Includes the number of rows, memory usage, null and unique counts, min/mean/max of numeric columns, and the mode of string columns.

Return type

Optional[DataFrame]
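
For orientation, the kind of per-column summary described above can be approximated with plain pandas. This is a minimal sketch, not diglett's implementation: the sample data and the layout of the summary table are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data: one string column with a null, one numeric column.
df = pd.DataFrame({"city": ["Oslo", "Oslo", None], "temp": [1.5, 3.5, 4.0]})

# One row per column of the input: dtype, null count, number of distinct values.
summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "nulls": df.isna().sum(),
    "unique": df.nunique(),
})

# min/mean/max only make sense for numeric columns, so compute them separately
# and join them on; non-numeric columns get NaN in these three fields.
numeric = df.select_dtypes("number").agg(["min", "mean", "max"]).T
summary = summary.join(numeric)
print(summary)
```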

diglett.eda.tabulate(df, normalize=False, sorted=True, return_output=False)

Pivot a DataFrame to show a numeric column across each of two dimensions.

Parameters
  • df (DataFrame) – DataFrame w/ columns: (dim_A, dim_B, num_).

  • normalize (bool) – Whether to return absolute counts (default) or normalize by total.

  • sorted (bool) – Whether to sort index (and columns) to have highest row (or column) sums first.

  • return_output (bool) – By default, output is displayed, but can also be returned.

Return type

Optional[DataFrame]
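
The pivot described above can be sketched with pandas alone. The column names follow the (dim_A, dim_B, num_) convention from the parameter list; the sample values, the use of pivot_table, and the descending sort order are assumptions, not a description of how tabulate is implemented.

```python
import pandas as pd

df = pd.DataFrame({
    "dim_A": ["x", "x", "y", "y"],
    "dim_B": ["p", "q", "p", "q"],
    "num_": [1, 2, 3, 14],
})

# Spread dim_B across the columns, keeping dim_A as the row index.
table = df.pivot_table(
    index="dim_A", columns="dim_B", values="num_", aggfunc="sum", fill_value=0
)

# Put the rows and columns with the highest sums first.
row_order = table.sum(axis=1).sort_values(ascending=False).index
col_order = table.sum(axis=0).sort_values(ascending=False).index
table = table.loc[row_order, col_order]

# Normalizing by the grand total turns counts into shares that sum to 1.
normalized = table / table.to_numpy().sum()
print(table)
print(normalized)
```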

diglett.group module

Tools related to df.groupby().

diglett.group.group_other(data, n=10, other_val='…', sort_by=None)

Group the “long-tail” dimensions (beyond top N) of a DataFrame or Series together.

Assumptions:

  • Index does not matter

  • All non-numeric columns are dimensions

  • If multiple numeric columns, sort by right-most before grouping

Parameters
  • data (Union[Series, DataFrame]) – A series or dataframe as input.

  • n (int) – Values beyond this point get grouped as “other”.

  • other_val (str) – The string with which to represent “other” values.

  • sort_by (Optional[str]) – The column by which to sort the dataframe, before grouping.
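
To illustrate the long-tail grouping idea, here is a minimal pandas sketch. The helper group_other_sketch and its dim/num arguments are hypothetical and exist only for this example; the actual group_other infers dimensions and the sort column as described in the assumptions above.

```python
import pandas as pd

def group_other_sketch(df, dim, num, n=10, other_val="…"):
    """Hypothetical helper: keep the top-n rows by `num`, sum the rest into one row."""
    ranked = df.sort_values(num, ascending=False).reset_index(drop=True)
    top, tail = ranked.iloc[:n], ranked.iloc[n:]
    if tail.empty:
        return top  # nothing beyond the top n, so there is no "other" row
    other = pd.DataFrame({dim: [other_val], num: [tail[num].sum()]})
    return pd.concat([top, other], ignore_index=True)

counts = pd.DataFrame(
    {"color": ["red", "blue", "green", "pink"], "num_": [50, 30, 15, 5]}
)
grouped = group_other_sketch(counts, "color", "num_", n=2)
print(grouped)
```

With n=2, the two smallest rows (15 and 5) are collapsed into a single row labeled with other_val.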

diglett.insist module

Insist (assert) assumptions about data quality.

  • These functions each verify a specific assumption.

  • They are “non-breaking” by using HTML warnings instead of actual assertions.

  • They each return the input object, allowing them to be used in DataFrame.pipe() chains.
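
The pattern in the list above (check, warn without raising, return the input) can be sketched as follows. This is an illustration under stated assumptions: no_nulls_sketch is a hypothetical function, and Python's warnings module stands in for the HTML alerts that diglett.insist displays.

```python
import warnings
import pandas as pd

def no_nulls_sketch(df, cols=None):
    """Hypothetical check: warn (do not raise) if any selected column has nulls."""
    cols = list(df.columns) if cols is None else cols
    null_counts = df[cols].isna().sum()
    offenders = null_counts[null_counts > 0]
    if not offenders.empty:
        # Stand-in for an HTML alert: the check reports but never breaks the chain.
        warnings.warn(f"Null values found in: {offenders.to_dict()}")
    return df  # returning the input keeps the .pipe() chain going

raw = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, None, 0.9]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = (
        raw.pipe(no_nulls_sketch)
           .assign(score=lambda d: d["score"].fillna(0))
    )
```

Because the check returns its input unchanged, later steps in the chain still run even when the assumption fails.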

diglett.insist.average_greater_than(df, col, threshold, return_alert=False)

Check that average of specified column is greater than some value.

Return type

Union[DataFrame, HTML]

diglett.insist.less_than_pct_null(df, cols=None, pct=0.01, return_alert=False)

Check that specified (or all) columns contain less than some % of null values.

Return type

Union[DataFrame, HTML]

diglett.insist.more_than_pct_unique(df, col, pct=0.99, return_alert=False)

Check that a minimum pct. of values in a Series are unique.

Return type

Union[DataFrame, HTML]

diglett.insist.no_nulls(df, cols=None, **kwargs)

Check that specified (or all) columns do not contain null values.

Return type

Union[DataFrame, HTML]

diglett.join module

Functions related to joining/merging datasets.

diglett.join.verbose_merge(left, right, left_on=None, right_on=None, left_index=False, right_index=False, *args, **kwargs)

Wraps pd.merge function to provide a visual overview of cardinality between datasets.

Specify both (left_on, right_on) or (left_index, right_index) arguments.

Return type

DataFrame
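
One way to get a comparable overview of how two datasets overlap with pandas alone is the indicator argument of pd.merge. The sketch below is not the verbose_merge implementation; the sample frames and the user_id key are assumptions for illustration.

```python
import pandas as pd

left = pd.DataFrame({"user_id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"user_id": [2, 3, 4], "spend": [10, 20, 30]})

# indicator=True adds a "_merge" column marking each row as
# "left_only", "right_only", or "both".
merged = pd.merge(
    left, right, left_on="user_id", right_on="user_id", how="outer", indicator=True
)

# Counting the markers shows how many keys matched and how many did not.
overlap = merged["_merge"].value_counts()
print(overlap)
```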

diglett.output module

Functions which produce some sort of output from a DataFrame.

diglett.output.display_side_by_side(*args)

Output an array of pandas DataFrames side-by-side in a Jupyter notebook to conserve vertical space.

Return type

None

diglett.output.format_helper(df, int_cols=None, pct_cols=None, delta_cols=None, monospace=True, hide_index=True, return_output=False)

Apply common formatting using pandas.DataFrame.style methods.

Parameters
  • df (Union[DataFrame, Styler]) – The pandas DataFrame to be displayed.

  • int_cols (Optional[List[str]]) – Optional hard-coded list of columns to display as integers.

  • pct_cols (Optional[List[str]]) – Optional hard-coded list of columns to display as percentages.

  • delta_cols (Optional[List[str]]) – Optional hard-coded list of columns to display as “deltas”.

  • monospace (bool) – Whether to display with monospace font.

  • hide_index (bool) – Whether to hide the index of the DataFrame.

  • return_output (bool) – This is only used for testing purposes.

Return type

Optional[Styler]

diglett.transform module

Tools for transforming input data into more usable form.

diglett.transform.fillnas(input_df, subset=None, value=0)

Apply fillna to a subset of columns.

Roughly equivalent to df[subset] = df[subset].fillna(value)

Parameters
  • input_df (DataFrame) – The DataFrame to operate on.

  • subset (Optional[List[str]]) – A list of the columns to operate on.

  • value (Any) – The value with which to fill nulls.

Return type

DataFrame
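
The "roughly equivalent" expression above, written out with plain pandas on hypothetical sample data. The copy is an assumption made here so the original frame stays untouched; whether fillnas itself copies is not stated in this reference.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, None], "b": [None, 2.0], "c": [None, "x"]})
subset = ["a", "b"]

filled = df.copy()                          # leave the original frame untouched
filled[subset] = filled[subset].fillna(0)   # only columns in `subset` are filled
print(filled)
```

Column c is outside the subset, so its null is left in place.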

diglett.transform.multi_moving_average(df, window=7, min_periods=1)

Apply a moving average to a DataFrame with a 2-level index, where the second is a dimension.

Parameters
  • df (DataFrame) – The DataFrame to operate on.

  • window (int) – Passed directly as argument to df.rolling()

  • min_periods (int) – Passed directly as argument to df.rolling()

Return type

DataFrame
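
A per-dimension moving average over a 2-level index can be sketched with groupby and rolling. This is one possible approach, not necessarily the one multi_moving_average uses; the index names (date, dim), the column name num_, and the window of 2 are assumptions for the example.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [pd.date_range("2024-01-01", periods=3), ["a", "b"]], names=["date", "dim"]
)
df = pd.DataFrame({"num_": [1.0, 10.0, 3.0, 30.0, 5.0, 50.0]}, index=idx)

# Group by the second index level so each dimension is averaged on its own,
# then roll over that dimension's rows in date order.
smoothed = df.groupby(level="dim")["num_"].transform(
    lambda s: s.rolling(window=2, min_periods=1).mean()
)
print(smoothed)
```

With min_periods=1, the first row of each dimension is returned as-is instead of becoming NaN.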

diglett.transform.reindex_by_sum(df, axis=1, margin_col=None)

Reindex an axis of a DataFrame according to its sum.

Parameters
  • df (DataFrame) – The DataFrame to reindex.

  • axis (int) – The axis upon which to reindex.

  • margin_col (Optional[str]) – This column or index value is excluded from the sort and kept at the end.

Return type

DataFrame
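
As an illustration of reordering by sum, the sketch below sorts columns by their totals while keeping a margin column last. The descending order, the "All" margin name, and the choice to reorder columns are assumptions for this example rather than documented behavior of reindex_by_sum.

```python
import pandas as pd

df = pd.DataFrame(
    {"small": [1, 2], "big": [10, 20], "mid": [5, 5]}, index=["r1", "r2"]
)
df["All"] = df.sum(axis=1)  # hypothetical margin column holding row totals

# Rank the non-margin columns by their sums, then append the margin at the end.
body = df.drop(columns="All")
order = body.sum(axis=0).sort_values(ascending=False).index.tolist() + ["All"]
reordered = df[order]
print(reordered)
```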

diglett.transform.winsorize(srs, lower=0, upper=0.99, verbose=True)

Winsorize a series at specified quantiles.

Parameters
  • srs (Series) – The pandas Series to be winsorized.

  • lower (Union[int, float]) – Values below this point get winsorized.

  • upper (Union[int, float]) – Values above this point get winsorized.

  • verbose (bool) – Whether to print info to stdout for debugging.

Return type

Series
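
Winsorizing at quantiles can be sketched with Series.quantile and Series.clip. The quantile levels match the defaults in the signature above; the sample data and the use of clip are assumptions, not a statement about how winsorize is implemented.

```python
import pandas as pd

srs = pd.Series(range(1, 101), dtype="float64")

# Find the values sitting at the lower and upper quantiles.
lo = srs.quantile(0.0)
hi = srs.quantile(0.99)

# Values outside [lo, hi] are pulled in to the boundary rather than dropped.
clipped = srs.clip(lower=lo, upper=hi)
print(f"max before: {srs.max()}, max after: {clipped.max()}")
```

Unlike trimming, the series keeps its length; only the extreme values change.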

diglett.utils module

Various utility functions, mostly used within the rest of the diglett package.

diglett.utils.describe(func)

Describe the input shape, output shape, and execution time of a pandas pipe function.

Return type

Callable
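
A decorator of this kind might look like the sketch below. The name describe_sketch, the printed format, and the assumption that the wrapped function takes a DataFrame as its first argument are all illustrative; the actual output of describe may differ.

```python
import functools
import time
import pandas as pd

def describe_sketch(func):
    """Hypothetical decorator: report shapes and runtime of a pipe step."""
    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        start = time.perf_counter()
        out = func(df, *args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__}: {df.shape} -> {out.shape} in {elapsed:.3f}s")
        return out
    return wrapper

@describe_sketch
def drop_missing(df):
    return df.dropna()

clean = pd.DataFrame({"a": [1, None, 3]}).pipe(drop_missing)
```

functools.wraps preserves the wrapped function's name, so the printed line identifies the step in a longer chain.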

diglett.utils.display_header(size, text)

Display an HTML header of a specified level.

Return type

None

diglett.utils.h2(text: str) None

Display an HTML header of a specified level.

diglett.utils.h3(text: str) None

Display an HTML header of a specified level.

diglett.utils.h4(text: str) None

Display an HTML header of a specified level.

diglett.utils.h5(text: str) None

Display an HTML header of a specified level.

diglett.utils.h6(text: str) None

Display an HTML header of a specified level.

diglett.utils.h7(text: str) None

Display an HTML header of a specified level.

diglett.utils.text_header(text, line_char='-')

Sandwich a given string between lines of separator characters of equal length, one above and one below it.

Return type

None
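
The sandwiching described above amounts to repeating line_char to the length of the text. A minimal sketch, with hypothetical helper names; the string-building step is split out here only so it can be inspected without capturing stdout.

```python
def build_text_header(text, line_char="-"):
    """Hypothetical helper: return the text with a separator line above and below."""
    line = line_char * len(text)
    return f"{line}\n{text}\n{line}"

def text_header_sketch(text, line_char="-"):
    """Print the header, returning None like the documented function."""
    print(build_text_header(text, line_char))

text_header_sketch("Sales report")
```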

Module contents

Boilerplate tools for routine data analysis.