Example 5 - df
is a dataframe containing specific columns¶
For example we want to check that 'foo' and 'bar' are present.
1- Example values to validate¶
import pandas as pd
# Valid
df = pd.DataFrame(data={'foo': [1, 2], 'bar': [None, "hello"]})
df = pd.DataFrame(data={'a': [1, 2], 'foo': ['r', 't'], 'bar': [None, "hello"]})
# Invalid
df = pd.DataFrame(data={'fo': [1, 2], 'bar': [None, "hello"]}) # typo in name
2- Inline validation¶
Principles:
- type can be checked with
instance_of
- required columns can be checked by verifying that the set of actual columns is a superset of the required columns.
Since this validation is simple, we show below how it can be done with valid8
alone. But to go further we rather recommend to combine it with another library
validate
+ built-ins¶
validate
provides both type and superset validation built-in, but they do not apply to the same element so we have to call it twice:
from valid8 import validate
# type validation
validate('df', df, instance_of=pd.DataFrame)
# columns validation
required_cols = {'foo', 'bar'}
validate('df columns', set(df.columns), superset_of=required_cols,
help_msg="DataFrame should contain mandatory columns {c}.
Found {var_value}", c=required_cols)
Note: you see in this example a reminder that the help message is formatted by valid8
using str.format(). You can use in this help message any custom keyword argument (such as c
above) or any of the already-available variables. The best way to see what is available is to write a wrong help message with an unexistent variable name in the string template:
validate('df columns', set(df.columns), superset_of=required_cols,
help_msg="Just kidding {hoho}")
yields:
ValidationError[ValueError]: Error while formatting help msg,
keyword [hoho] was not found in the validation context.
Help message to format was 'Just kidding {hoho}'.
Context elements available: {
'display_prefix_for_exc_outcomes': False,
'append_details': True,
'validator': _QuickValidator<validation_function=validate, none_policy=VALIDATE, exc_type=ValidationError>,
'var_value': {'fo', 'bar'},
'var_name': 'df columns',
'validation_outcome': NotSuperset(append_details=True,wrong_value={'fo', 'bar'},reference_set={'foo', 'bar'},missing={'foo'},help_msg=x superset of {reference_set} does not hold for x={wrong_value}. Missing elements: {missing}),
'help_msg': 'Just kidding {hoho}'
}
with validator
+ built-ins¶
It is relatively straightforward to validate both df
and its columns
- either with a pure "boolean test" approach:
from valid8 import validator
required_cols = {'foo', 'bar'}
with validator('df', df, instance_of=pd.DataFrame) as v:
missing = required_cols - set(df.columns)
v.alid = len(missing) == 0
- or with a "failure raising" approach, less compact (and not really more explicit error messages):
from valid8 import validation
required_cols = {'foo', 'bar'}
with validation('df', df, instance_of=pd.DataFrame):
missing = required_cols - set(df.columns)
if len(missing) > 0:
raise ValueError('missing dataFrame columns: ' + str(missing))
with validator
+ dedicated validation lib¶
Of course in real world examples you will want to validate much more things. So you will typically rely on a dedicated library for dataframe validation, and you will use valid8
only for its primary target: having a strong control about exceptions readability and exceptions types (for i18n). For example:
from my_pandas_validator import assert_df_minimum_size, assert_index_is_unique, \
assert_index_is_sorted, assert_column_present_with_correct_type
with validation('df', df, instance_of=pd.DataFrame,
error_type=InvalidInputDataFrame):
assert_df_minimum_size(df, min_nb_rows=10)
assert_index_is_unique(df)
assert_index_is_sorted(df)
assert_column_present_with_correct_type(df, 'foo', int)
3- Functions/classes validation¶
Function input¶
with built-in validation functions it is not possible, we have to create our custom function:
from valid8 import validate_arg, instance_of
required_cols = {'foo', 'bar'}
def has_required_cols(df):
missing = required_cols - set(df.columns)
if len(missing) > 0:
raise ValueError('missing dataFrame columns: ' + str(missing))
@validate_arg('df', instance_of(pd.DataFrame), has_required_cols)
def my_function(df):
pass
or with mini-lambda
from valid8 import validate_arg, instance_of
from mini_lambda import Set, Len
from mini_lambda.pandas_ import df
@validate_arg('df', instance_of(pd.DataFrame),
Len(required_cols - Set(df.columns)) > 0)
def my_function(df):
pass
Function output¶
identical but with validate_out
, see other examples.
Function ios¶
See other examples
Class fields¶
In the examples below the class fields are defined as constructor arguments but this also works if they are defined as class descriptors/properties, and is compliant with autoclass and attrs
using custom function:
from valid8 import validate_field, instance_of
@validate_field('df', instance_of(pd.DataFrame), has_required_cols)
class Foo:
def __init__(self, df):
self.df = df
or with mini-lambda
from valid8 import validate_field, instance_of
from mini_lambda import Set, Len
from mini_lambda.pandas_ import df
@validate_field('df', instance_of(pd.DataFrame),
Len(required_cols - Set(df.columns)) > 0)
class Foo:
def __init__(self, df):
self.df = df
With PEP484¶
See other examples