Diglett: Tools for data-munging

diglett.eda

Tools for exploratory data analysis (EDA).

These functions help you understand the nature of a dataset. Most of them display their results rather than returning them.

diglett.eda.show_top_n(data, n=10, show_output=True, other_val='…')

Cleanly display the top N values from a “group by count” SQL output.

Parameters
  • data (Union[Series, DataFrame]) – A series or dataframe, with columns (<dim>, num_).

  • n (int) – The number of rows to show, before grouping remainder as “other”.

  • show_output (bool) – If True, display the formatted result; otherwise return the underlying Series or DataFrame.

  • other_val (str) – The value to which values beyond the top N are grouped.

Return type

Union[Series, DataFrame, None]

diglett.eda.summarize(df, return_output=False)

Show a more detailed summary than df.info().

Includes the number of rows, memory usage, null counts, unique counts, min/mean/max of numeric columns, and the mode of string columns.

Return type

Optional[DataFrame]
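To make the listed statistics concrete, here is a minimal sketch in plain pandas of the kind of per-column table described above. It is not diglett's implementation: the function name summarize_sketch, the output column names, and the example data are all hypothetical.

```python
import pandas as pd


def summarize_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Build a per-column summary table (hypothetical stand-in, not diglett's code)."""
    rows = []
    for col in df.columns:
        srs = df[col]
        numeric = pd.api.types.is_numeric_dtype(srs)
        modes = srs.mode()  # nulls are ignored by default
        rows.append({
            "column": col,
            "dtype": str(srs.dtype),
            "nulls": int(srs.isna().sum()),
            "unique": int(srs.nunique()),
            # min/mean/max only make sense for numeric columns
            "min": srs.min() if numeric else None,
            "mean": srs.mean() if numeric else None,
            "max": srs.max() if numeric else None,
            # mode is reported for non-numeric columns only
            "mode": None if numeric or modes.empty else modes.iloc[0],
        })
    return pd.DataFrame(rows).set_index("column")


example = pd.DataFrame({"city": ["Oslo", "Oslo", None], "sales": [10.0, 20.0, 30.0]})
summary = summarize_sketch(example)

# Row count and memory are frame-level figures, so they are printed separately.
print(f"{len(example)} rows, {example.memory_usage(deep=True).sum()} bytes")
print(summary)
```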

diglett.eda.tabulate(df, normalize=False, sorted=True, return_output=False)

Pivot a DataFrame to show a numeric column across each of two dimensions.

Parameters
  • df (DataFrame) – DataFrame with columns: (dim_A, dim_B, num_).

  • normalize (bool) – Whether to return absolute counts (default) or normalize by total.

  • sorted (bool) – Whether to sort index (and columns) to have highest row (or column) sums first.

  • return_output (bool) – By default the output is displayed; set to True to have it returned.

Return type

Optional[DataFrame]
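The pivot described above can be approximated with pandas alone. The following is a sketch under assumptions, not the library's code: tabulate_sketch is a hypothetical name, the keyword is spelled sort here to avoid shadowing the built-in, and the first three columns are assumed to be (dim_A, dim_B, num_) in that order.

```python
import pandas as pd


def tabulate_sketch(df: pd.DataFrame, normalize: bool = False, sort: bool = True) -> pd.DataFrame:
    """Pivot (dim_A, dim_B, num_) into a dim_A x dim_B table (hypothetical stand-in)."""
    dim_a, dim_b, num = df.columns[:3]
    table = df.pivot_table(index=dim_a, columns=dim_b, values=num,
                           aggfunc="sum", fill_value=0)
    if normalize:
        # Divide by the grand total so that all cells sum to 1.
        table = table / table.to_numpy().sum()
    if sort:
        # Highest row sums first, then highest column sums first.
        row_order = table.sum(axis=1).sort_values(ascending=False).index
        col_order = table.sum(axis=0).sort_values(ascending=False).index
        table = table.loc[row_order, col_order]
    return table


counts = pd.DataFrame({
    "country": ["NO", "NO", "SE", "SE"],
    "device": ["mobile", "desktop", "mobile", "desktop"],
    "num_users": [10, 5, 30, 20],
})
table = tabulate_sketch(counts)
normalized = tabulate_sketch(counts, normalize=True)
print(table)
```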

diglett.insist

Insist (assert) assumptions about data quality.

  • These functions each verify a specific assumption.

  • They are “non-breaking”: they emit HTML warnings instead of raising actual assertion errors.

  • They each return the input object, allowing them to be used in DataFrame.pipe() chains.
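The pattern in the list above (check, warn, pass the input through) can be sketched as follows. This is an illustration only: no_nulls_sketch is a hypothetical stand-in, and it uses Python's warnings module where diglett is described as emitting HTML warnings.

```python
import warnings

import pandas as pd


def no_nulls_sketch(df: pd.DataFrame, cols=None) -> pd.DataFrame:
    """Warn (rather than fail) if any of `cols` contains nulls, then return `df` unchanged."""
    cols = list(df.columns) if cols is None else cols
    null_counts = df[cols].isna().sum()
    offenders = null_counts[null_counts > 0]
    if not offenders.empty:
        # Assumption: a plain Python warning stands in for the HTML alert.
        warnings.warn(f"Null values found: {offenders.to_dict()}")
    return df


raw = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, None, 0.9]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Because the check returns its input, the chain continues after the warning.
    result = (
        raw
        .pipe(no_nulls_sketch, cols=["score"])
        .assign(score_pct=lambda d: d["score"] * 100)
    )

print([str(w.message) for w in caught])
print(result)
```

Returning the input is what makes such checks usable mid-chain: a failed check is reported, but later steps still run.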

diglett.insist.average_greater_than(df, col, threshold, return_alert=False)

Check that average of specified column is greater than some value.

Return type

Union[DataFrame, HTML]

diglett.insist.less_than_pct_null(df, cols=None, pct=0.01, return_alert=False)

Check that specified (or all) columns contain less than some % of null values.

Return type

Union[DataFrame, HTML]

diglett.insist.more_than_pct_unique(df, col, pct=0.99, return_alert=False)

Check that at least a minimum percentage of the values in a column are unique.

Return type

Union[DataFrame, HTML]

diglett.insist.no_nulls(df, cols=None, **kwargs)

Check that specified (or all) columns do not contain null values.

Return type

Union[DataFrame, HTML]

diglett.transform

Tools for transforming input data into more usable form.

diglett.transform.fillnas(input_df, subset=None, value=0)

Apply fillna to a subset of columns.

Roughly equivalent to df[subset] = df[subset].fillna(value)

Parameters
  • input_df (DataFrame) – The DataFrame to operate on.

  • subset (Optional[List[str]]) – A list of the columns to operate on.

  • value (Any) – The value with which to fill nulls.

Return type

DataFrame
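As a rough illustration of the equivalence stated above, here is a sketch in plain pandas. The name fillnas_sketch is hypothetical, and copying the input rather than modifying it in place is an assumption, since the source does not say which the library does.

```python
import pandas as pd


def fillnas_sketch(input_df: pd.DataFrame, subset=None, value=0) -> pd.DataFrame:
    """Fill nulls in `subset` only (hypothetical stand-in, not diglett's code)."""
    df = input_df.copy()  # assumption: leave the caller's frame untouched
    cols = list(df.columns) if subset is None else subset
    df[cols] = df[cols].fillna(value)
    return df


orders = pd.DataFrame({"qty": [1.0, None], "note": ["ok", None]})
filled = fillnas_sketch(orders, subset=["qty"])
print(filled)  # "qty" is filled with 0; "note" keeps its null
```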

diglett.transform.multi_moving_average(df, window=7, min_periods=1)

Apply a moving average to a DataFrame with a 2-level index, where the second is a dimension.

Parameters
  • df (DataFrame) – The DataFrame to operate on.

  • window (int) – Passed directly as argument to df.rolling()

  • min_periods (int) – Passed directly as argument to df.rolling()

Return type

DataFrame
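One way to picture this is a rolling mean computed separately for each value of the second index level. The sketch below assumes that reading. multi_moving_average_sketch is a hypothetical name and the index names and data are invented for the example.

```python
import pandas as pd


def multi_moving_average_sketch(df: pd.DataFrame, window: int = 7, min_periods: int = 1) -> pd.DataFrame:
    """Rolling mean along the first index level, computed per value of the second level."""
    ordered = df.sort_index()  # make sure each group's rows are in first-level order
    return ordered.groupby(level=1).transform(
        lambda srs: srs.rolling(window=window, min_periods=min_periods).mean()
    )


index = pd.MultiIndex.from_product(
    [pd.date_range("2024-01-01", periods=3), ["a", "b"]],
    names=["date", "dim"],
)
daily = pd.DataFrame({"num_": [1.0, 10.0, 3.0, 20.0, 5.0, 30.0]}, index=index)

smoothed = multi_moving_average_sketch(daily, window=2)
print(smoothed)
```

With window=2, each value is averaged with the previous value of the same dimension only, so the "a" and "b" series never mix.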

diglett.transform.reindex_by_sum(df, axis=1, margin_col=None)

Reindex an axis of a DataFrame according to its sum.

Parameters
  • df (DataFrame) – The DataFrame to reindex.

  • axis (int) – The axis upon which to reindex.

  • margin_col (Optional[str]) – This column/index value is excluded from the sort and kept at the end.

Return type

DataFrame
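A minimal sketch of this reordering in plain pandas follows. It assumes descending order (largest sums first) and assumes that margin_col is moved to the end. reindex_by_sum_sketch is a hypothetical name, not the library's function.

```python
import pandas as pd


def reindex_by_sum_sketch(df: pd.DataFrame, axis: int = 1, margin_col=None) -> pd.DataFrame:
    """Order the labels on `axis` by their totals (hypothetical stand-in)."""
    # Summing along the other axis yields one total per label on `axis`.
    totals = df.sum(axis=1 - axis)
    order = totals.sort_values(ascending=False).index.tolist()
    if margin_col is not None and margin_col in order:
        # Assumption: the margin label is kept out of the ranking and placed last.
        order.remove(margin_col)
        order.append(margin_col)
    return df.reindex(order, axis=axis)


table = pd.DataFrame({"a": [1, 2], "b": [10, 20], "All": [11, 22]}, index=["x", "y"])
reordered = reindex_by_sum_sketch(table, axis=1, margin_col="All")
print(reordered)
```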

diglett.transform.winsorize(srs, lower=0, upper=0.99, verbose=True)

Winsorize a series at specified quantiles.

Parameters
  • srs (Series) – The pandas Series to be winsorized.

  • lower (Union[int, float]) – The lower quantile; values below it get winsorized.

  • upper (Union[int, float]) – The upper quantile; values above it get winsorized.

  • verbose (bool) – Whether to print info to stdout for debugging.

Return type

Series
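Winsorizing at quantiles means capping values at the quantile boundaries instead of dropping them. The sketch below shows that idea with Series.quantile and Series.clip. It assumes lower and upper are fractions between 0 and 1, and winsorize_sketch is a hypothetical name, not the library's code.

```python
import pandas as pd


def winsorize_sketch(srs: pd.Series, lower: float = 0, upper: float = 0.99) -> pd.Series:
    """Cap values outside the [lower, upper] quantile range (hypothetical stand-in)."""
    low, high = srs.quantile(lower), srs.quantile(upper)
    return srs.clip(lower=low, upper=high)


values = pd.Series([1, 2, 3, 4, 1000])
capped = winsorize_sketch(values, lower=0, upper=0.75)
print(capped.tolist())  # the outlier 1000 is pulled down to the 75th percentile
```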

diglett.group

Tools related to df.groupby().

diglett.group.group_other(data, n=10, other_val='…', sort_by=None)

Group the “long-tail” dimensions (beyond top N) of a DataFrame or Series together.

Assumptions:

  • Index does not matter

  • All non-numeric columns are dimensions

  • If multiple numeric columns, sort by right-most before grouping

Parameters
  • data (Union[Series, DataFrame]) – A series or dataframe as input.

  • n (int) – Values beyond this point get grouped as “other”.

  • other_val (str) – The string with which to represent “other” values.

  • sort_by (Optional[str]) – The column by which to sort the dataframe, before grouping.
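The assumptions and parameters above can be sketched in plain pandas as follows. This is an illustration for the DataFrame case only. group_other_sketch is a hypothetical name, and summing the numeric columns of the long tail is an assumption about how the remainder is aggregated.

```python
import pandas as pd


def group_other_sketch(df: pd.DataFrame, n: int = 10, other_val: str = "…", sort_by=None) -> pd.DataFrame:
    """Keep the top `n` rows and collapse the rest into one "other" row."""
    numeric_cols = df.select_dtypes(include="number").columns.tolist()
    # All non-numeric columns are treated as dimensions.
    dim_cols = [c for c in df.columns if c not in numeric_cols]
    # With several numeric columns, rank by the right-most one unless told otherwise.
    sort_by = numeric_cols[-1] if sort_by is None else sort_by

    # The original index is discarded, since it does not matter here.
    ranked = df.sort_values(sort_by, ascending=False).reset_index(drop=True)
    head, tail = ranked.iloc[:n], ranked.iloc[n:]
    if tail.empty:
        return head

    other = {c: other_val for c in dim_cols}
    other.update(tail[numeric_cols].sum().to_dict())  # assumption: the remainder is summed
    combined = pd.concat([head, pd.DataFrame([other])], ignore_index=True)
    return combined[df.columns.tolist()]


browsers = pd.DataFrame({
    "browser": ["chrome", "safari", "firefox", "edge", "opera"],
    "num_users": [50, 30, 10, 6, 4],
})
top = group_other_sketch(browsers, n=2, other_val="other")
print(top)
```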

diglett.join

Functions related to joining/merging datasets.

diglett.join.verbose_merge(left, right, left_on=None, right_on=None, left_index=False, right_index=False, *args, **kwargs)

Wraps pd.merge function to provide a visual overview of cardinality between datasets.

Specify either both of (left_on, right_on) or both of (left_index, right_index).

Return type

DataFrame
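The source does not show what the visual overview looks like. As a rough idea of the underlying information, the sketch below uses only documented pd.merge features (how="outer" with indicator=True) plus a uniqueness check on the keys. The data and variable names are invented, and this is not diglett's output.

```python
import pandas as pd

left = pd.DataFrame({"user_id": [1, 2, 3], "name": ["ann", "bob", "cy"]})
right = pd.DataFrame({"uid": [2, 3, 3, 4], "order": ["a", "b", "c", "d"]})

# indicator=True adds a "_merge" column: left_only, right_only, or both.
merged = pd.merge(left, right, left_on="user_id", right_on="uid",
                  how="outer", indicator=True)
overview = merged["_merge"].value_counts()

# Key uniqueness on each side tells us the cardinality of the join.
left_side = "one" if left["user_id"].is_unique else "many"
right_side = "one" if right["uid"].is_unique else "many"
cardinality = f"{left_side}-to-{right_side}"

print(overview)
print(cardinality)
```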

diglett.output

Functions which produce some sort of output from a DataFrame.

diglett.output.display_side_by_side(*args)

Output an array of pandas DataFrames side-by-side in a Jupyter notebook to conserve vertical space.

Return type

None
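A common way to get this effect is to render each frame with DataFrame.to_html and wrap it in an inline-block container. The sketch below builds that HTML string. side_by_side_html is a hypothetical helper and the styling is an assumption, not what diglett emits.

```python
import pandas as pd


def side_by_side_html(*frames: pd.DataFrame) -> str:
    """Return one HTML string that lays the given frames out horizontally."""
    cell_style = "display:inline-block; vertical-align:top; margin-right:2em"
    cells = [f'<div style="{cell_style}">{frame.to_html()}</div>' for frame in frames]
    return "".join(cells)


first = pd.DataFrame({"x": [1, 2]})
second = pd.DataFrame({"y": ["a", "b"]})
html = side_by_side_html(first, second)

# In a Jupyter notebook the string could then be rendered with:
#   from IPython.display import HTML, display
#   display(HTML(html))
```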

diglett.output.format_helper(df, int_cols=None, pct_cols=None, delta_cols=None, monospace=True, hide_index=True, return_output=False)

Apply common formatting using pandas.DataFrame.style methods.

Parameters
  • df (Union[DataFrame, Styler]) – The pandas DataFrame to be displayed.

  • int_cols (Optional[List[str]]) – Optional hard-coded list of columns to display as integers.

  • pct_cols (Optional[List[str]]) – Optional hard-coded list of columns to display as percentages.

  • delta_cols (Optional[List[str]]) – Optional hard-coded list of columns to display as “deltas”.

  • monospace (bool) – Whether to display with monospace font.

  • hide_index (bool) – Whether to hide the index of the DataFrame.

  • return_output (bool) – Whether to return the Styler; used only for testing purposes.

Return type

Optional[Styler]