hooqu.analyzers

Subpackages

Module contents

class hooqu.analyzers.Analyzer(*args, **kwds)[source]

Bases: abc.ABC, typing.Generic

calculate(data, aggregate_with=None, save_states_with=None)[source]

Runs the preconditions, then calculates and returns the metric.

Parameters
  • data (DataFrameLike) – Data frame being analyzed

  • aggregate_with – Loader for previous states to include in the computation (optional)

  • save_states_with – Where to persist the analyzer's internal states (optional)

Returns

The computed metric. Returns a failure metric in case the preconditions fail.

class hooqu.analyzers.Completeness(column, where=None)[source]

Bases: hooqu.analyzers.analyzer.StandardScanShareableAnalyzer

class hooqu.analyzers.Compliance(instance, predicate, where=None)[source]

Bases: hooqu.analyzers.analyzer.NonScanAnalyzer

Compliance measures the fraction of rows that comply with the given column constraint. For example, if the constraint is “att1>3” and the data frame has 5 rows with an att1 value greater than 3 and 10 rows with a value below 3, a DoubleMetric with the value 0.33 is returned.

Parameters
  • instance (str) – Name of the metric instance. Unlike other column analyzers (e.g. Completeness), this analyzer cannot infer the metric instance name from the column name. The constraint given here can also refer to multiple columns, so a metric instance name describing what the analysis is for must be provided.

  • predicate (str) – Predicate that can be understood by DataFrameLike.eval.

  • where (Optional[str]) – Additional filter to apply before the analyzer is run.
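
The arithmetic behind the 0.33 example can be checked with a few lines of plain Python. This is only a sketch of the ratio itself, not hooqu's implementation: the `compliance` helper and the use of a Python callable in place of a string predicate are assumptions made for illustration.

```python
# Plain-Python illustration of the compliance ratio; not hooqu's implementation.
# 5 rows with att1 > 3 and 10 rows with att1 < 3, as in the example above.
rows = [{"att1": 5}] * 5 + [{"att1": 1}] * 10


def compliance(rows, predicate):
    """Fraction of rows for which `predicate` (a hypothetical callable) holds."""
    if not rows:
        raise ValueError("compliance is undefined for an empty data set")
    matches = sum(1 for row in rows if predicate(row))
    return matches / len(rows)


compliance_ratio = compliance(rows, lambda row: row["att1"] > 3)
print(round(compliance_ratio, 2))  # 5 / 15, printed as 0.33
```

With the real analyzer the predicate is passed as a string such as "att1>3" and evaluated by the data frame, so the callable above only stands in for that step.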

class hooqu.analyzers.FrequenciesAndNumRows(*args, **kwds)[source]

Bases: hooqu.analyzers.analyzer.State

class hooqu.analyzers.MaxState(*args, **kwds)[source]

Bases: hooqu.analyzers.analyzer.DoubledValuedState

class hooqu.analyzers.Maximum(column, where=None)[source]

Bases: hooqu.analyzers.analyzer.StandardScanShareableAnalyzer

class hooqu.analyzers.Mean(column, where=None)[source]

Bases: hooqu.analyzers.analyzer.StandardScanShareableAnalyzer

class hooqu.analyzers.MeanState(*args, **kwds)[source]

Bases: hooqu.analyzers.analyzer.DoubledValuedState

class hooqu.analyzers.MinState(*args, **kwds)[source]

Bases: hooqu.analyzers.analyzer.DoubledValuedState

class hooqu.analyzers.Minimum(column, where=None)[source]

Bases: hooqu.analyzers.analyzer.StandardScanShareableAnalyzer

class hooqu.analyzers.NonScanAnalyzer(*args, **kwds)[source]

Bases: hooqu.analyzers.analyzer.Analyzer

Analyzer that does not need to run any aggregation and can extract the information straight from the dataframe. This is a special implementation of Hooqu for the Size Analyzer.

metric_from_aggregation_result(result, offset, aggregate_with=None, save_states_with=None)[source]

Metrics are not calculated from an aggregation result for this analyzer.

class hooqu.analyzers.NumMatches(*args, **kwds)[source]

Bases: hooqu.analyzers.analyzer.DoubledValuedState

class hooqu.analyzers.NumMatchesAndCount(*args, **kwds)[source]

Bases: hooqu.analyzers.analyzer.DoubledValuedState

A state for computing ratio-based metrics. It contains the number of rows that match a predicate and the overall number of rows.

class hooqu.analyzers.Quantile(column, quantile, where=None)[source]

Bases: hooqu.analyzers.analyzer.StandardScanShareableAnalyzer

Quantile analyzer that computes the quantile using a linear interpolation, i.e. returning a value within the column.

Parameters
  • column – Column in DataFrameLike for which the quantile is analyzed.

  • quantile – The quantile to compute. Must be in the interval [0, 1], where 0.5 would be the median.

  • where – Additional filter to apply before the analyzer is run.
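
As a rough illustration of what a linearly interpolated quantile looks like, the sketch below interpolates between the two nearest ranks of the sorted values, which is the rule pandas uses for its default `interpolation="linear"`. Whether hooqu follows exactly this rule is an assumption, and `linear_quantile` is a hypothetical helper, not part of the library. The result always lies within the range of the column, although it need not be one of the observed values.

```python
import math


def linear_quantile(values, q):
    """Quantile by linear interpolation between the two nearest ranks.

    Hypothetical helper for illustration; not hooqu's implementation.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("quantile must be in the interval [0, 1]")
    ordered = sorted(values)
    if not ordered:
        raise ValueError("cannot compute a quantile of an empty column")
    # Fractional position of the requested quantile in the sorted data.
    position = q * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


median_odd = linear_quantile([1, 2, 3, 4, 5], 0.5)  # middle element
median_even = linear_quantile([4, 1, 3, 2], 0.5)    # halfway between 2 and 3
```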

class hooqu.analyzers.QuantileState(*args, **kwds)[source]

Bases: hooqu.analyzers.analyzer.DoubledValuedState

class hooqu.analyzers.ScanShareableAnalyzer(*args, **kwds)[source]

Bases: hooqu.analyzers.analyzer.Analyzer

An analyzer that runs a set of aggregation functions over the data and can share scans over the data with other analyzers.

class hooqu.analyzers.Size(where=None)[source]

Bases: hooqu.analyzers.analyzer.NonScanAnalyzer

class hooqu.analyzers.StandardDeviation(column, where=None)[source]

Bases: hooqu.analyzers.analyzer.StandardScanShareableAnalyzer

Calculate the population standard deviation (degrees of freedom = 0) on the specified column. NaNs are ignored in the calculations.

Note that, unlike pandas, which defaults to the sample standard deviation (ddof=1), this calculates the population standard deviation (ddof=0).
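
The difference between the two conventions can be seen with the standard library alone. The sketch below is not hooqu code; it uses `statistics.pstdev` (divides by n, i.e. ddof=0) and `statistics.stdev` (divides by n - 1, i.e. ddof=1) on made-up sample values, dropping NaNs first to mirror the behavior described above.

```python
import math
import statistics

# Made-up sample values, including one NaN that should be ignored.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0, float("nan")]
clean = [v for v in values if not math.isnan(v)]

population_std = statistics.pstdev(clean)  # divides by n      (ddof=0)
sample_std = statistics.stdev(clean)       # divides by n - 1  (ddof=1)

print(population_std, sample_std)
```

The population value is always the smaller of the two, and the gap shrinks as the number of rows grows.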

class hooqu.analyzers.StandardDeviationState(*args, **kwds)[source]

Bases: hooqu.analyzers.analyzer.DoubledValuedState

class hooqu.analyzers.Sum(column, where=None)[source]

Bases: hooqu.analyzers.analyzer.StandardScanShareableAnalyzer

class hooqu.analyzers.SumState(*args, **kwds)[source]

Bases: hooqu.analyzers.analyzer.DoubledValuedState

class hooqu.analyzers.Uniqueness(columns, where=None)[source]

Bases: hooqu.analyzers.grouping_analyzers.ScanShareableFrequencyBasedAnalyzer, hooqu.analyzers.uniqueness._UniquenessDataClassMixin

Uniqueness is the fraction of unique values in a column or combination of columns, i.e., values that occur exactly once.
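
A minimal sketch of this definition in plain Python follows. It assumes the denominator is the total number of rows, which the description above does not state explicitly, and `uniqueness` is a hypothetical helper rather than hooqu's implementation.

```python
from collections import Counter


def uniqueness(values):
    """Share of rows whose value occurs exactly once.

    Assumption: the denominator is the total number of rows.
    """
    if not values:
        raise ValueError("uniqueness is undefined for an empty column")
    counts = Counter(values)
    occurring_once = sum(1 for count in counts.values() if count == 1)
    return occurring_once / len(values)


# "a" and "d" occur exactly once among 7 rows.
uniqueness_ratio = uniqueness(["a", "b", "b", "c", "c", "c", "d"])
```

For several columns, the same idea applies to tuples of values, one tuple per row.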