Diglett: Tools for data-munging
diglett.eda
Tools for exploratory data analysis (EDA).
These help you understand the nature of a dataset, mostly by displaying results rather than returning them.
- diglett.eda.show_top_n(data, n=10, show_output=True, other_val='…')
Cleanly display the top N values from a “group by count” SQL output.
- Parameters
data (Union[Series, DataFrame]) – A Series or DataFrame with columns (<dim>, num_).
n (int) – The number of rows to show before grouping the remainder as “other”.
show_output (bool) – If true, display the result “nicely”; otherwise return the actual DataFrame.
other_val (str) – The value under which rows beyond the top N are grouped.
- Return type
Union[Series, DataFrame, None]
- diglett.eda.summarize(df, return_output=False)
Show a better summary than df.info().
Includes the number of rows, memory usage, null and unique counts, min/mean/max of numeric columns, and the mode of string columns.
- Return type
Optional[DataFrame]
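As a rough illustration of the kind of per-column summary described above, here is a minimal sketch in plain pandas. The function name `summarize_sketch` and its output layout are hypothetical; this is not diglett's implementation, and it covers only some of the listed statistics.

```python
import pandas as pd


def summarize_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per column with basic summary stats."""
    numeric = df.select_dtypes(include="number")
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "nulls": df.isna().sum(),
        "unique": df.nunique(),
    })
    # min/mean/max are only computed for numeric columns; others stay NaN
    summary["min"] = numeric.min()
    summary["mean"] = numeric.mean()
    summary["max"] = numeric.max()
    return summary


df = pd.DataFrame({"city": ["a", "b", "b", None], "sales": [1.0, 2.0, 3.0, 6.0]})
result = summarize_sketch(df)
print(result)
```

Each input column becomes one row of the result, which keeps the summary readable for wide tables.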
- diglett.eda.tabulate(df, normalize=False, sorted=True, return_output=False)
Pivot a DataFrame to show a numeric column across each of two dimensions.
- Parameters
df (DataFrame) – DataFrame with columns (dim_A, dim_B, num_).
normalize (bool) – Whether to return absolute counts (default) or normalize by the total.
sorted (bool) – Whether to sort the index (and columns) so that the highest row (or column) sums come first.
return_output (bool) – By default the output is displayed, but it can also be returned.
- Return type
Optional[DataFrame]
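The pivot described above can be approximated with `DataFrame.pivot_table`. The sketch below is an assumption about the general technique, not diglett's code; `tabulate_sketch`, the `sort` argument name, and the sample columns are made up for illustration.

```python
import pandas as pd


def tabulate_sketch(df, normalize=False, sort=True):
    """Hypothetical helper: pivot (dim_A, dim_B, num_) into a 2-D table."""
    dim_a, dim_b, num = df.columns[:3]
    table = df.pivot_table(index=dim_a, columns=dim_b, values=num,
                           aggfunc="sum", fill_value=0)
    if normalize:
        # Divide every cell by the grand total so the table sums to 1
        table = table / table.to_numpy().sum()
    if sort:
        # Largest row sums first, then largest column sums first
        table = table.loc[table.sum(axis=1).sort_values(ascending=False).index]
        table = table[table.sum(axis=0).sort_values(ascending=False).index]
    return table


counts = pd.DataFrame({
    "country": ["US", "US", "DE", "DE"],
    "device": ["phone", "laptop", "phone", "laptop"],
    "num_users": [10, 30, 5, 1],
})
table = tabulate_sketch(counts)
shares = tabulate_sketch(counts, normalize=True)
print(table)
```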
diglett.insist
Insist on (assert) assumptions about data quality.
Each of these functions verifies a specific assumption.
They are “non-breaking”: they display HTML warnings instead of raising actual assertion errors.
Each returns the input object, so they can be used in DataFrame.pipe() chains.
- diglett.insist.average_greater_than(df, col, threshold, return_alert=False)
Check that the average of the specified column is greater than some value.
- Return type
Union[DataFrame, HTML]
- diglett.insist.less_than_pct_null(df, cols=None, pct=0.01, return_alert=False)
Check that specified (or all) columns contain less than some % of null values.
- Return type
Union[DataFrame, HTML]
- diglett.insist.more_than_pct_unique(df, col, pct=0.99, return_alert=False)
Check that at least a minimum percentage of values in a Series are unique.
- Return type
Union[DataFrame, HTML]
- diglett.insist.no_nulls(df, cols=None, **kwargs)
Check that specified (or all) columns do not contain null values.
- Return type
Union[DataFrame, HTML]
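The “warn, do not raise, and return the input” pattern used throughout diglett.insist can be sketched as follows. This is a stand-in under stated assumptions: `no_nulls_sketch` is a hypothetical function, and it uses Python's `warnings` module where diglett is described as showing HTML warnings.

```python
import warnings

import pandas as pd


def no_nulls_sketch(df, cols=None):
    """Hypothetical check: warn (do not raise) if any chosen column has nulls."""
    cols = list(df.columns) if cols is None else cols
    null_counts = df[cols].isna().sum()
    offenders = null_counts[null_counts > 0]
    if not offenders.empty:
        warnings.warn(f"Null values found: {offenders.to_dict()}")
    return df  # returning the input keeps the .pipe() chain going


df = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, None, 0.9]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # The first check covers all columns; the second only the "id" column
    result = df.pipe(no_nulls_sketch).pipe(no_nulls_sketch, cols=["id"])

print(len(caught), "warning(s) raised")
```

Because each check hands back the DataFrame unchanged, checks can be interleaved with ordinary transformation steps in the same chain.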
diglett.transform
Tools for transforming input data into a more usable form.
- diglett.transform.fillnas(input_df, subset=None, value=0)
Apply fillna to a subset of columns.
Roughly equivalent to df[subset] = df[subset].fillna(value)
- Parameters
input_df (DataFrame) – The DataFrame to operate on.
subset (Optional[List[str]]) – A list of the columns to operate on.
value (Any) – The value with which to fill nulls.
- Return type
DataFrame
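The “roughly equivalent” expression above can be wrapped into a small runnable sketch. `fillnas_sketch` is a hypothetical name, and two behaviours here are assumptions not stated in the source: `subset=None` is taken to mean all columns, and the input is copied rather than modified in place.

```python
import pandas as pd


def fillnas_sketch(input_df, subset=None, value=0):
    """Hypothetical helper: fill nulls only in the chosen columns."""
    # Assumption: no subset means every column
    subset = list(input_df.columns) if subset is None else subset
    out = input_df.copy()  # assumption: leave the caller's DataFrame untouched
    out[subset] = out[subset].fillna(value)
    return out


df = pd.DataFrame({"a": [1.0, None], "b": [None, 2.0], "c": [None, "x"]})
filled = fillnas_sketch(df, subset=["a", "b"])
print(filled)
```

Column `c` is left alone because it is not in `subset`.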
- diglett.transform.multi_moving_average(df, window=7, min_periods=1)
Apply a moving average to a DataFrame with a 2-level index, where the second level is a dimension.
- Parameters
df (DataFrame) – The DataFrame to operate on.
window (int) – Passed directly as an argument to df.rolling().
min_periods (int) – Passed directly as an argument to df.rolling().
- Return type
DataFrame
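One way to compute a moving average per dimension on a 2-level index is to group by the second level and roll within each group. The sketch below assumes that reading of the description; `multi_moving_average_sketch` is hypothetical and may differ from what diglett does internally.

```python
import pandas as pd


def multi_moving_average_sketch(df, window=7, min_periods=1):
    """Hypothetical helper: rolling mean along level 0, per level-1 value."""
    df = df.sort_index()  # make sure rows are ordered along level 0
    return df.groupby(level=1).transform(
        lambda s: s.rolling(window=window, min_periods=min_periods).mean()
    )


idx = pd.MultiIndex.from_product(
    [pd.date_range("2024-01-01", periods=3), ["a", "b"]],
    names=["date", "dim"],
)
df = pd.DataFrame({"value": [1.0, 10.0, 3.0, 20.0, 5.0, 30.0]}, index=idx)
smoothed = multi_moving_average_sketch(df, window=2)
print(smoothed)
```

Using `transform` keeps the original (date, dim) index, so the smoothed frame lines up row for row with the input.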
- diglett.transform.reindex_by_sum(df, axis=1, margin_col=None)
Reindex an axis of a DataFrame according to its sum.
- Parameters
df (DataFrame) – The DataFrame to reindex.
axis (int) – The axis upon which to reindex.
margin_col (Optional[str]) – This column/index value will be excluded from the sort and placed at the end.
- Return type
DataFrame
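A minimal sketch of reordering an axis by its sums, with an optional margin label kept at the end, might look like this. `reindex_by_sum_sketch` is hypothetical, and the descending sort order is an assumption.

```python
import pandas as pd


def reindex_by_sum_sketch(df, axis=1, margin_col=None):
    """Hypothetical helper: order labels along `axis` by descending sum."""
    # Summing over the *other* axis yields one total per label on `axis`
    totals = df.sum(axis=1 - axis)
    if margin_col is not None:
        totals = totals.drop(margin_col)
    order = list(totals.sort_values(ascending=False).index)
    if margin_col is not None:
        order.append(margin_col)  # the margin label is excluded from the sort
    return df.reindex(order, axis=axis)


df = pd.DataFrame({"x": [1, 2], "y": [10, 20], "total": [11, 22]})
reordered = reindex_by_sum_sketch(df, axis=1, margin_col="total")
print(reordered)
```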
- diglett.transform.winsorize(srs, lower=0, upper=0.99, verbose=True)
Winsorize a series at specified quantiles.
- Parameters
srs (Series) – The pandas Series to be winsorized.
lower (Union[int, float]) – Values below this point get winsorized.
upper (Union[int, float]) – Values above this point get winsorized.
verbose (bool) – Whether to print info to stdout for debugging.
- Return type
Series
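Winsorizing at quantiles means clipping values to the quantile boundaries rather than dropping them. A minimal sketch, assuming `lower` and `upper` are quantiles in [0, 1] (the function name `winsorize_sketch` is hypothetical):

```python
import pandas as pd


def winsorize_sketch(srs, lower=0, upper=0.99):
    """Hypothetical helper: clip a Series to its lower/upper quantile values."""
    lo, hi = srs.quantile(lower), srs.quantile(upper)
    # Values outside [lo, hi] are replaced by the nearest boundary
    return srs.clip(lower=lo, upper=hi)


srs = pd.Series(range(101))  # the integers 0..100
clipped = winsorize_sketch(srs, lower=0.05, upper=0.95)
print(clipped.min(), clipped.max())
```

The Series keeps its length and index; only the extreme values change.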
diglett.group
Tools related to df.groupby().
- diglett.group.group_other(data, n=10, other_val='…', sort_by=None)
Group the “long-tail” dimensions (beyond top N) of a DataFrame or Series together.
Assumptions:
Index does not matter
All non-numeric columns are dimensions
If multiple numeric columns, sort by right-most before grouping
- Parameters
data (Union[Series, DataFrame]) – A Series or DataFrame as input.
n (int) – Values beyond this point get grouped as “other”.
other_val (str) – The string with which to represent “other” values.
sort_by (Optional[str]) – The column by which to sort the DataFrame before grouping.
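For the simplest case (one dimension column and one numeric column), grouping the long tail can be sketched as below. `group_other_sketch` and its explicit `dim` and `num` arguments are hypothetical; diglett is described as inferring dimensions from non-numeric columns, which this sketch does not attempt.

```python
import pandas as pd


def group_other_sketch(df, dim, num, n=10, other_val="…"):
    """Hypothetical helper: keep the top-n rows by `num`, sum the rest as one row."""
    ranked = df.sort_values(num, ascending=False).reset_index(drop=True)
    top, tail = ranked.iloc[:n], ranked.iloc[n:]
    if tail.empty:
        return top[[dim, num]]
    # Collapse everything beyond the top n into a single "other" row
    other = pd.DataFrame({dim: [other_val], num: [tail[num].sum()]})
    return pd.concat([top[[dim, num]], other], ignore_index=True)


counts = pd.DataFrame({
    "browser": ["chrome", "safari", "firefox", "edge", "opera"],
    "num_users": [50, 30, 10, 6, 4],
})
grouped = group_other_sketch(counts, "browser", "num_users", n=2)
print(grouped)
```

The total of the numeric column is preserved, which makes the grouped table safe to use for share-of-total displays.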
diglett.join
Functions related to joining/merging datasets.
- diglett.join.verbose_merge(left, right, left_on=None, right_on=None, left_index=False, right_index=False, *args, **kwargs)
Wraps the pd.merge function to provide a visual overview of cardinality between datasets.
Specify either both of (left_on, right_on) or both of (left_index, right_index).
- Return type
DataFrame
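The source does not say how the cardinality overview is produced. As one plausible way to inspect the overlap between two datasets with plain pandas, the sketch below uses the `indicator` option of `pd.merge` together with a per-key row count; the sample frames and variable names are made up.

```python
import pandas as pd

left = pd.DataFrame({"user_id": [1, 2, 3], "name": ["ann", "bob", "cy"]})
right = pd.DataFrame({"uid": [2, 3, 3, 4], "order": ["x", "y", "z", "w"]})

# indicator=True adds a "_merge" column: left_only, right_only or both
merged = pd.merge(left, right, left_on="user_id", right_on="uid",
                  how="outer", indicator=True)
overlap = merged["_merge"].value_counts().to_dict()

# Largest number of right-hand rows sharing one key (1 would mean unique keys)
max_matches = right.groupby("uid").size().max()

print(overlap, max_matches)
```

A `max_matches` above 1 signals a one-to-many relationship, which is often the cause of unexpected row growth after a merge.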
diglett.output
Functions which produce some sort of output from a DataFrame.
- diglett.output.display_side_by_side(*args)
Output an array of pandas DataFrames side-by-side in a Jupyter notebook to conserve vertical space.
- Return type
None
- diglett.output.format_helper(df, int_cols=None, pct_cols=None, delta_cols=None, monospace=True, hide_index=True, return_output=False)
Apply common formatting using pandas.DataFrame.style methods.
- Parameters
df (Union[DataFrame, Styler]) – The pandas DataFrame to be displayed.
int_cols (Optional[List[str]]) – Optional hard-coded list of columns to display as integers.
pct_cols (Optional[List[str]]) – Optional hard-coded list of columns to display as percentages.
delta_cols (Optional[List[str]]) – Optional hard-coded list of columns to display as “deltas”.
monospace (bool) – Whether to display with monospace font.
hide_index (bool) – Whether to hide the index of the DataFrame.
return_output (bool) – This is only used for testing purposes.
- Return type
Optional[Styler]