Usage¶
Basic¶
Parsing one file¶
Let's parse a single file containing a string: hello_world.txt. First create a file named hello_world.txt containing the word hello, and save it in your current working directory. Then
from parsyfiles import parse_item

result = parse_item('hello_world', str)
print(result)
yields:
--> Successfully parsed a str from hello_world
hello
Never include the file extension!¶
Note that the file extension should never be included when referencing a file with parsyfiles. This is because objects may be parsed from files or folders:
result = parse_item('hello_world.txt', str) # ObjectNotFoundOnFileSystemError
Unsupported extensions¶
Let's rename our file to hello_world.ukn and run the script again. We get a different exception:
parsyfiles.parsing_registries.NoParserFoundForObjectExt: hello_world (singlefile, .ukn) cannot be parsed as a str because no parser supporting that extension (singlefile, .ukn) is registered. If you wish to parse this fileobject in that type, you may replace the file with any of the following extensions currently supported :{'.yml', '.yaml', '<multifile>', '.txt', '.pyc'} (see get_capabilities_for_type(str, strict_type_matching=False) for details). Otherwise, please register a new parser for type str and extension singlefile, .ukn
This is pretty explicit: the framework is not able to parse our file, because there is currently no registered parser for this extension (.ukn) and this desired type (str). This example is a good introduction to the way the parsyfiles parser registry works: parsers are registered for a combination of supported file extensions and supported destination types. Every time you wish to parse some file into some type, the framework checks whether there are any capable parsers, and tries them all in the most logical order. We'll see that in more detail below.
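If you want to check up front what the registry can do for a given type, you can query it directly - this uses the same helper that appears later in this guide and is named in the error message above:

from parsyfiles import RootParser

# print, for each supported file extension, the parsers able to produce a str
RootParser().print_capabilities_for_type(typ=str)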
Log levels¶
By default the library uses a Logger that has an additional handler to print to stdout. The default level is INFO. This is how you change the module's default logging level:
import logging

logging.getLogger('parsyfiles').setLevel(logging.DEBUG)
Running the same example again yields a much more verbose output:
**** Starting to parse single object of type <str> at location hello_world ****
Checking all files under hello_world
hello_world (singlefile, .txt)
File checks done

Building a parsing plan to parse hello_world (singlefile, .txt) into a str
hello_world (singlefile, .txt) > str ------- using <read_str_from_txt>
Parsing Plan created successfully

Executing Parsing Plan for hello_world (singlefile, .txt) > str ------- using <read_str_from_txt>
Parsing hello_world (singlefile, .txt) > str ------- using <read_str_from_txt>
--> Successfully parsed a str from hello_world
Completed parsing successfully
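Conversely, if the INFO output is too chatty for your application, you can raise the level the same way - this is plain standard logging, nothing parsyfiles-specific:

import logging

# only warnings and errors from parsyfiles will be printed from now on
logging.getLogger('parsyfiles').setLevel(logging.WARNING)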
Understanding the debug-level Log messages¶
In the log output above you see a couple of hints on how the parsing framework works:

- First it checks your files. That is the log section beginning with "Checking all files ...". If there is a missing file issue it will appear at this stage, as an ObjectNotFoundOnFileSystemError.
- Then it creates a parsing plan able to produce an object of the required type. That's the section beginning with "Building a parsing plan ...". The parsing plan is the most important concept in the library, and the most complex. A parsing plan describes what the framework will try to do when it actually executes the parsing step (next step, below). If a parsing plan cannot be built at this stage you get a NoParserFoundForObjectExt error like the one we saw previously. In this example we see that the parsing plan is straightforward: the framework will use a single parser, named <read_str_from_txt>.
- Finally it executes the parsing plan. That's the section beginning with "Executing Parsing Plan...". Sometimes you will see in this section that the original plan gets updated as a consequence of parsing errors. For example a 'plan A' will be replaced by a 'plan B'. We will see some examples later on.
It is important to understand these 3 log sections, since the main issue with complex frameworks is debugging when something unexpected happens :-).
Custom logger¶
You may also wish to provide your own logger:
import logging
from logging import FileHandler

my_logger = logging.getLogger('mine')
my_logger.addHandler(FileHandler('hello.log'))
my_logger.setLevel(logging.INFO)

result = parse_item('hello_world', str, logger=my_logger)
TODO refresh remaining sections¶
Part 1 - Parsing collections of known types¶
(a) Example: parsing a list of DataFrame¶
The simplest case of all: you wish to parse a collection of files that all have the same type, and for which a parser is already registered. For example, you wish to parse a list of DataFrame from a data folder that looks like this:
./demo/simple_collection
├── a.csv
├── b.txt
├── c.xls
├── d.xlsx
└── e.xlsm
Note: you may find this example data folder in the project sources
Parsing all of these dataframes is straightforward:
from parsyfiles import parse_collection, pprint_out
from pandas import DataFrame

dfs = parse_collection('./demo/simple_collection', DataFrame)
pprint_out(dfs)
Here is the result:
**** Starting to parse collection of <DataFrame> at location ./demo/simple_collection ****
Checking all files under ./demo/simple_collection
./demo/simple_collection (multifile)
./demo/simple_collection\a (singlefile, .csv)
(...)
./demo/simple_collection\e (singlefile, .xlsm)
File checks done

Building a parsing plan to parse ./demo/simple_collection (multifile) into a Dict[str, DataFrame]
./demo/simple_collection (multifile) > Dict[str, DataFrame] ------- using Multifile Dict parser (based on 'parsyfiles defaults' to find the parser for each item)
./demo/simple_collection\a (singlefile, .csv) > DataFrame ------- using <read_df_or_series_from_csv(stream mode)>
(...)
./demo/simple_collection\e (singlefile, .xlsm) > DataFrame ------- using <read_dataframe_from_xls(file mode)>
Parsing Plan created successfully

Executing Parsing Plan for ./demo/simple_collection (multifile) > Dict[str, DataFrame] ------- using Multifile Dict parser (based on 'parsyfiles defaults' to find the parser for each item)
Parsing ./demo/simple_collection (multifile) > Dict[str, DataFrame] ------- using Multifile Dict parser (based on 'parsyfiles defaults' to find the parser for each item)
Parsing ./demo/simple_collection\a (singlefile, .csv) > DataFrame ------- using <read_df_or_series_from_csv(stream mode)>
--> Successfully parsed a DataFrame from ./demo/simple_collection\a
(...)
Parsing ./demo/simple_collection\e (singlefile, .xlsm) > DataFrame ------- using <read_dataframe_from_xls(file mode)>
--> Successfully parsed a DataFrame from ./demo/simple_collection\e
Assembling all parsed child items into a Dict[str, DataFrame] to build ./demo/simple_collection (multifile)
--> Successfully parsed a Dict[str, DataFrame] from ./demo/simple_collection
Completed parsing successfully

{'a': a b c d 0 1 2 3 4,
 'b': a b c d 0 1 2 3 4,
 'c': c 5 0 d 8 1 e 12 2 f 3,
 'd': c 5 0 d 8 1 e 12 2 f 3,
 'e': c 5 0 d 8 1 e 12 2 f 3}
Note: the above capture was slightly 'improved' for readability, because unfortunately pprint does not display dictionaries of dataframes as nicely as this.
(b) Understanding the log output¶
In the log output you see a couple of hints on how the parsing framework works:

- First it recursively checks your folder to verify that it is entirely compliant with the file mapping format. That is the log section beginning with "Checking all files under ./demo/simple_collection". If the same item appears twice (e.g. a.csv and a.txt) it will throw an error at this stage (an ObjectPresentMultipleTimesOnFileSystemError).
- Then it recursively creates a parsing plan able to produce an object of the required type. That's the section beginning with "Building a parsing plan to parse ./demo/simple_collection (multifile) into a Dict[str, DataFrame]". Here you may note that by default, a collection of items is actually parsed as an object of type dictionary, where the key is the name of the file without extension, and the value is the object that is parsed from the file. If at this stage it does not find a way to parse a given file into the required object type, it will fail. For example if you add a file named unknown_ext_for_dataframe.ukn in the folder, you will get an error (a NoParserFoundForObjectExt).
- Finally it executes the parsing plan. That's the section beginning with "Executing Parsing Plan for ./demo/simple_collection (multifile) > Dict[str, DataFrame] (...)".
It is important to understand these 3 log sections, since the main issue with complex frameworks is debugging when something unexpected happens :-).
(c) Parsing a single file only¶
The following code may be used to parse a single file explicitly:
from pprint import pprint
from parsyfiles import parse_item
from pandas import DataFrame

df = parse_item('./demo/simple_collection/c', DataFrame)
pprint(df)
Important: note that the file extension does not appear in the argument of the parse_item function.
(d) Default collection type and other supported types¶
You might have noticed that the collection example above returned a dict of dataframes, not a list. This is the default behaviour of the parse_collection method - it has the advantage of not making any assumption on the sorting order.

Behind the scenes, parse_collection redirects to the parse_item function. So the following code leads to the exact same results:
from parsyfiles import parse_item
from pandas import DataFrame
from typing import Dict

dfs = parse_item('./demo/simple_collection', Dict[str, DataFrame])
The typing module is used here to entirely specify the type of item that you want to parse (Dict[str, DataFrame]). The parsed item will be a dictionary with string keys (the file names) and DataFrame values (the parsed file contents).

You may parse a list, a set, or a tuple exactly the same way, using the corresponding typing class:
from parsyfiles import parse_item
from pandas import DataFrame
from typing import List, Set, Tuple

dfl = parse_item('./demo/simple_collection', List[DataFrame])
# dfs = parse_item('./demo/simple_collection', Set[DataFrame])
dft = parse_item('./demo/simple_collection', Tuple[DataFrame, DataFrame, DataFrame, DataFrame, DataFrame])
For List and Tuple the implied order is alphabetical on the file names (similar to using sorted() on the items of the dictionary).
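To make that ordering concrete, here is a small sketch reusing the dfs dictionary parsed in the earlier example: the List result contains the same dataframes, taken in alphabetical key order.

# 'dfs' is the Dict[str, DataFrame] parsed in the earlier example
ordered_keys = sorted(dfs.keys())                # ['a', 'b', 'c', 'd', 'e']
dfl_equivalent = [dfs[k] for k in ordered_keys]  # same order as the List[DataFrame] result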
Note that DataFrame objects are mutable and therefore not hashable, so in this particular case the collection cannot be parsed as a Set (hence the commented-out line in the snippet above).

Finally, note that it is not possible to mix collection and non-collection items together (for example, Union[int, List[int]] is not supported).
Part 2 - Simple user-defined types¶
(a) Example: parsing a collection of test cases¶
Suppose that you want to test the following exec_op function, and you want to read your test datasets from a bunch of files.
def exec_op(x: float, y: float, op: str) -> float:
    if op == '+':
        return x + y
    elif op == '-':
        return x - y
    else:
        raise ValueError('Unsupported operation : \'' + op + '\'')
Each test dataset could be represented as an object, containing the inputs and expected outputs for exec_op. For example:
class ExecOpTest(object):
    def __init__(self, x: float, y: float, op: str, expected_result: float):
        self.x = x
        self.y = y
        self.op = op
        self.expected_result = expected_result

    def __str__(self):
        return self.__repr__()

    def __repr__(self):
        return str(self.x) + ' ' + self.op + ' ' + str(self.y) + ' =? ' + str(self.expected_result)
Obviously this class is not known by the parsyfiles framework: there is no registered parser for the ExecOpTest type. However the type is fairly simple, so it can actually fit into a dictionary containing the values for x, y, op, and expected_result. parsyfiles knows a couple of ways to parse dictionaries, using standard python libraries:

- From a .cfg or .ini file using the configparser module
- From a .json file using the json module
- From a .properties or .txt file using the jprops module
- From a .yaml or .yml file using the yaml module
- From a .csv, .txt, .xls, .xlsx, .xlsm file using the pandas module
- etc.
It also knows how to convert a dictionary into an object, as long as the object constructor contains the right information about the expected types. In the example above, the constructor has explicit PEP484 annotations: x: float, y: float, op: str, expected_result: float.
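As a rough mental model, here is a simplified sketch of what that dict-to-object conversion amounts to - an illustration only, not the actual parsyfiles implementation:

# hypothetical dictionary, as it could be read from one of the files shown below
parsed = {'x': '1', 'y': '2', 'op': '+', 'expected_result': '3'}

# each value is converted to the type declared in the constructor annotations,
# then the constructor is called with the resulting keyword arguments
test = ExecOpTest(x=float(parsed['x']), y=float(parsed['y']),
                  op=str(parsed['op']), expected_result=float(parsed['expected_result']))
print(test)  # 1.0 + 2.0 =? 3.0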
So let's try to parse instances of ExecOpTest from various files. Our test data folder looks like this (available in the project sources):
./demo/simple_objects
├── test_diff_1.cfg
├── test_diff_2.ini
├── test_diff_3_csv_format.txt
├── test_diff_4_csv_format2.txt
├── test_sum_1.json
├── test_sum_2.properties
├── test_sum_3_properties_format.txt
├── test_sum_4.yaml
├── test_sum_5.xls
├── test_sum_6.xlsx
└── test_sum_7.xlsm
As usual, we tell the framework that we want to parse a collection of objects of type ExecOpTest:
from pprint import pprint
from parsyfiles import parse_collection

sf_tests = parse_collection('./demo/simple_objects', ExecOpTest)
pprint(sf_tests)
Here is the result:
**** Starting to parse collection of <ExecOpTest> at location ./demo/simple_objects ****
Checking all files under ./demo/simple_objects
(...)
File checks done

Building a parsing plan to parse ./demo/simple_objects (multifile) into a Dict[str, ExecOpTest]
(...)
Parsing Plan created successfully

Executing Parsing Plan for ./demo/simple_objects (multifile) > Dict[str, ExecOpTest] ------- using Multifile Dict parser (based on 'parsyfiles defaults' to find the parser for each item)
(...)
--> Successfully parsed a Dict[str, ExecOpTest] from ./demo/simple_objects
Completed parsing successfully

{'test_diff_1': 1.0 - 1.0 =? 0.0,
 'test_diff_2': 0.0 - 1.0 =? -1.0,
 'test_diff_3_csv_format': 5.0 - 4.0 =? 1.0,
 'test_diff_4_csv_format2': 4.0 - 4.0 =? 0.0,
 'test_sum_1': 1.0 + 2.0 =? 3.0,
 'test_sum_2': 0.0 + 1.0 =? 1.0,
 'test_sum_3_properties_format': 1.0 + 1.0 =? 2.0,
 'test_sum_4': 2.0 + 5.0 =? 7.0,
 'test_sum_5': 56.0 + 12.0 =? 68.0,
 'test_sum_6': 56.0 + 13.0 =? 69.0,
 'test_sum_7': 56.0 + 14.0 =? 70.0}
(b) Under the hood : why does it work, even on ambiguous files?¶
In the example above, three files were actually quite difficult to parse into a dict before being converted to an ExecOpTest: test_diff_3_csv_format.txt, test_diff_4_csv_format2.txt and test_sum_4.yaml. Let's look at both difficulties in detail.
Solved Difficulty 1 - Several formats/parsers for the same file extension¶
test_diff_3_csv_format.txt and test_diff_4_csv_format2.txt are both .txt files that contain csv-format data. But:

- there are several ways to write a dictionary in csv format: one row of header + one row of values, or one column of names + one column of values (see the sketch after this list);
- .txt files may also contain many other formats, such as, for example, the 'properties' format.
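For illustration, here are the two hypothetical csv layouts for the same ExecOpTest dictionary - the first one matches the '5,4,-,1' row that shows up in the error log below:

# layout 1: one row of header + one row of values
row_oriented = "x,y,op,expected_result\n5,4,-,1\n"

# layout 2: one column of names + one column of values
column_oriented = "x,5\ny,4\nop,-\nexpected_result,1\n"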
How does the framework manage to parse these files? Let's look at the log output for test_diff_3_csv_format.txt:
Parsing ./demo/simple_objects\test_diff_3_csv_format (singlefile, .txt) > ExecOpTest ------- using $<read_dict_from_properties> => <dict_to_object>$
!! Caught error during execution !!
  File "C:\W_dev\_pycharm_workspace\python-parsyfiles\parsyfiles\support_for_objects.py", line 273, in dict_to_object
    attr_name)
ParsingException : Error while parsing ./demo/simple_objects\test_diff_3_csv_format (singlefile, .txt) as a ExecOpTest with parser '$<read_dict_from_properties> => <dict_to_object>$' using options=({'MultifileCollectionParser': {'lazy_parsing': False}}) : caught InvalidAttributeNameForConstructorError : Cannot parse object of type <ExecOpTest> using the provided configuration file: configuration contains a property name ('5,4,-,1') that is not an attribute of the object constructor. <ExecOpTest> constructor attributes are : ['y', 'x', 'expected_result', 'op']

Rebuilding local parsing plan with next candidate parser: $<read_str_from_txt> => <base64_ascii_str_pickle_to_object>$
./demo/simple_objects\test_diff_3_csv_format (singlefile, .txt) > ExecOpTest ------- using $<read_str_from_txt> => <base64_ascii_str_pickle_to_object>$
Parsing ./demo/simple_objects\test_diff_3_csv_format (singlefile, .txt) > ExecOpTest ------- using $<read_str_from_txt> => <base64_ascii_str_pickle_to_object>$
!! Caught error during execution !!
  File "C:\Anaconda3\envs\azuremlbricks\lib\base64.py", line 88, in b64decode
    return binascii.a2b_base64(s)
ParsingException : Error while parsing ./demo/simple_objects\test_diff_3_csv_format (singlefile, .txt) as a ExecOpTest with parser '$<read_str_from_txt> => <base64_ascii_str_pickle_to_object>$' using options=({'MultifileCollectionParser': {'lazy_parsing': False}}) : caught Error : Incorrect padding

Rebuilding local parsing plan with next candidate parser: $<read_str_from_txt> => <constructor_with_str_arg>$
./demo/simple_objects\test_diff_3_csv_format (singlefile, .txt) > ExecOpTest ------- using $<read_str_from_txt> => <constructor_with_str_arg>$
Parsing ./demo/simple_objects\test_diff_3_csv_format (singlefile, .txt) > ExecOpTest ------- using $<read_str_from_txt> => <constructor_with_str_arg>$
!! Caught error during execution !!
  File "C:\W_dev\_pycharm_workspace\python-parsyfiles\parsyfiles\support_for_primitive_types.py", line 98, in constructor_with_str_arg
    return desired_type(source)
ParsingException : Error while parsing ./demo/simple_objects\test_diff_3_csv_format (singlefile, .txt) as a ExecOpTest with parser '$<read_str_from_txt> => <constructor_with_str_arg>$' using options=({'MultifileCollectionParser': {'lazy_parsing': False}}) : caught CaughtTypeError : Caught TypeError while calling conversion function 'constructor_with_str_arg'. Note that the conversion function signature should be 'def my_convert_fun(desired_type: Type[T], source: S, logger: Logger, **kwargs) -> T' (unpacked options mode - default) or 'def my_convert_fun(desired_type: Type[T], source: S, logger: Logger, options: Dict[str, Dict[str, Any]]) -> T' (unpack_options = False). Caught error message is : TypeError : __init__() missing 3 required positional arguments: 'y', 'op', and 'expected_result'

Rebuilding local parsing plan with next candidate parser: $<read_df_or_series_from_csv> => <single_row_or_col_df_to_dict> -> <dict_to_object>$
./demo/simple_objects\test_diff_3_csv_format (singlefile, .txt) > ExecOpTest ------- using $<read_df_or_series_from_csv> => <single_row_or_col_df_to_dict> -> <dict_to_object>$
Parsing ./demo/simple_objects\test_diff_3_csv_format (singlefile, .txt) > ExecOpTest ------- using $<read_df_or_series_from_csv> => <single_row_or_col_df_to_dict> -> <dict_to_object>$
--> Successfully parsed a ExecOpTest from ./demo/simple_objects\test_diff_3_csv_format
You can see from the logs that the framework successively tries several ways to parse this file:
- $<read_dict_from_properties> => <dict_to_object>$: the txt file is read in the 'properties' format (using jprops) into a dictionary, and then the dictionary is converted to an ExecOpTest object. This fails.
- $<read_str_from_txt> => <base64_ascii_str_pickle_to_object>$: the txt file is read as a string, and then the string is interpreted as a base64-encoded pickled ExecOpTest object (!). This fails.
- $<read_str_from_txt> => <constructor_with_str_arg>$: the txt file is read as a string, and then the constructor of ExecOpTest is called with that string as its unique argument. This fails again.
- $<read_df_or_series_from_csv> => <single_row_or_col_df_to_dict> -> <dict_to_object>$: the txt file is read as csv into a DataFrame, then the DataFrame is converted to a dictionary, and finally the dictionary is converted into an ExecOpTest object. This finally succeeds.
The same goes for the other file, test_diff_4_csv_format2.txt.
Solved Difficulty 2 - Generic parsers¶
For test_sum_4.yaml, the difficulty is that a yaml file may contain a dictionary directly, but it is also able to contain any typed object thanks to the YAML object directive. Therefore it could directly contain an ExecOpTest.
The parsing logs are the following:
Parsing ./demo/simple_objects\test_sum_4 (singlefile, .yaml) > ExecOpTest ------- using <read_object_from_yaml>
!! Caught error during execution !!
  File "C:\W_dev\_pycharm_workspace\python-parsyfiles\parsyfiles\parsing_core_api.py", line 403, in execute
    res, options)
ParsingException : Error while parsing ./demo/simple_objects\test_sum_4 (singlefile, .yaml) as a <class 'test_parsyfiles.DemoTests.test_simple_objects.<locals>.ExecOpTest'> with parser '<read_object_from_yaml>' using options=({'MultifileCollectionParser': {'lazy_parsing': False}}) : parser returned {'y': 5, 'x': 2, 'op': '+', 'expected_result': 7} of type <class 'dict'> which is not an instance of <class 'test_parsyfiles.DemoTests.test_simple_objects.<locals>.ExecOpTest'>

Rebuilding local parsing plan with next candidate parser: $<read_collection_from_yaml> => <dict_to_object>$
./demo/simple_objects\test_sum_4 (singlefile, .yaml) > ExecOpTest ------- using $<read_collection_from_yaml> => <dict_to_object>$
Parsing ./demo/simple_objects\test_sum_4 (singlefile, .yaml) > ExecOpTest ------- using $<read_collection_from_yaml> => <dict_to_object>$
--> Successfully parsed a ExecOpTest from ./demo/simple_objects\test_sum_4
You can see from the logs that the framework successively tries several ways to parse this file:
- <read_object_from_yaml>: the file is read according to the yaml format, as an ExecOpTest object directly. This fails.
- $<read_collection_from_yaml> => <dict_to_object>$: the file is read according to the yaml format, as a dictionary. Then this dictionary is converted into an ExecOpTest object. This succeeds.
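For illustration, the plain-dictionary variant of test_sum_4.yaml could look like the sketch below (hypothetical content, consistent with the values visible in the log); a yaml file using the object directive would instead tag the mapping with the target class, e.g. a PyYAML !!python/object tag.

# hypothetical content of test_sum_4.yaml, read as a plain dictionary:
plain_dict_yaml = """
x: 2
y: 5
op: '+'
expected_result: 7
"""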
Understanding the inference logic - in which order are the parsers tried?¶
These examples show how parsyfiles intelligently combines all registered parsers and converters to create parsing chains that make sense. These parsing chains are tried in order until a solution is found. Note that the order is deterministic:
- First, all exact match parsers. This includes combinations of {parser + converter chain} that lead to an exact match, sorted by converter chain size: small conversion chains first, large conversion chains last.
- Then, all approximate match parsers. This is similar to the "exact match" case, except that these parsers are able to parse a subclass of what you are asking for.
- Finally, all generic parsers. This includes combinations of {parser + converter chain} that end with a generic converter (for example the 'dict to object' converter seen in the example above).
In order to know in advance which file extensions and formats the framework will be able to parse, you may wish to use the following command to ask the framework:
from parsyfiles import RootParser

RootParser().print_capabilities_for_type(typ=ExecOpTest)
The result is a dictionary where each entry is a file extension:
{'.cfg': {'1_exact_match': [], '2_approx_match': [], '3_generic': [$<read_config> => <merge_all_config_sections_into_a_single_dict> -> <dict_to_object>$, $<read_config> => <config_to_dict_of_dict> -> <dict_of_dict_to_object>$, $<read_config> => <config_to_dict_of_dict> -> <dict_to_object>$]},
 '.csv': {'1_exact_match': [], '2_approx_match': [], '3_generic': [$<read_df_or_series_from_csv> => <single_row_or_col_df_to_dict> -> <dict_to_object>$]},
 '.ini': {'1_exact_match': [], '2_approx_match': [], '3_generic': [$<read_config> => <merge_all_config_sections_into_a_single_dict> -> <dict_to_object>$, $<read_config> => <config_to_dict_of_dict> -> <dict_of_dict_to_object>$, $<read_config> => <config_to_dict_of_dict> -> <dict_to_object>$]},
 '.json': {'1_exact_match': [], '2_approx_match': [], '3_generic': [$<read_dict_or_list_from_json> => <dict_to_object>$]},
 '.properties': {'1_exact_match': [], '2_approx_match': [], '3_generic': [$<read_dict_from_properties> => <dict_to_object>$]},
 '.pyc': {'1_exact_match': [], '2_approx_match': [], '3_generic': [<read_object_from_pickle>]},
 '.txt': {'1_exact_match': [], '2_approx_match': [], '3_generic': [$<read_dict_from_properties> => <dict_to_object>$, $<read_str_from_txt> => <base64_ascii_str_pickle_to_object>$, $<read_str_from_txt> => <constructor_with_str_arg>$, $<read_df_or_series_from_csv> => <single_row_or_col_df_to_dict> -> <dict_to_object>$]},
 '.xls': {'1_exact_match': [], '2_approx_match': [], '3_generic': [$<read_dataframe_from_xls> => <single_row_or_col_df_to_dict> -> <dict_to_object>$]},
 '.xlsm': {'1_exact_match': [], '2_approx_match': [], '3_generic': [$<read_dataframe_from_xls> => <single_row_or_col_df_to_dict> -> <dict_to_object>$]},
 '.xlsx': {'1_exact_match': [], '2_approx_match': [], '3_generic': [$<read_dataframe_from_xls> => <single_row_or_col_df_to_dict> -> <dict_to_object>$]},
 '.yaml': {'1_exact_match': [], '2_approx_match': [], '3_generic': [<read_object_from_yaml>, $<read_collection_from_yaml> => <dict_to_object>$]},
 '.yml': {'1_exact_match': [], '2_approx_match': [], '3_generic': [<read_object_from_yaml>, $<read_collection_from_yaml> => <dict_to_object>$]},
 '<multifile>': {'1_exact_match': [], '2_approx_match': [], '3_generic': [Multifile Object parser (parsyfiles defaults)]}}
Looking at the entries for .txt and .yaml, we can recognize the ordered lists of parsers that were automatically tried in the examples above.
Part 3 - Multifile objects: combining several parsers¶
This is 'the' typical use case for this library. Suppose that you want to test the following exec_op_series function, which uses complex types Series and AlgoConf as inputs and AlgoResults as output:
class AlgoConf(object):
    def __init__(self, foo_param: str, bar_param: int):
        self.foo_param = foo_param
        self.bar_param = bar_param


class AlgoResults(object):
    def __init__(self, score: float, perf: float):
        self.score = score
        self.perf = perf


from pandas import Series

def exec_op_series(x: Series, y: AlgoConf) -> AlgoResults:
    # ... intelligent stuff here...
    pass
Similar to what we did in the previous chapter, each test dataset can be represented as an object containing the inputs and expected outputs. For example with this class:
class ExecOpSeriesTest(object):
    def __init__(self, x: Series, y: AlgoConf, expected_results: AlgoResults):
        self.x = x
        self.y = y
        self.expected_results = expected_results
Our test data folder looks like this:
./demo/complex_objects
├── case1
│   ├── expected_results.txt
│   ├── x.csv
│   └── y.txt
└── case2
    ├── expected_results.txt
    ├── x.csv
    └── y.txt
You may notice that in this case, we chose to represent each instance of ExecOpSeriesTest as a folder. This makes them 'multifile'. The default multifile object parser in the framework will try to parse each attribute of the constructor as an independent file in the folder, as sketched just below.
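Concretely, for case1 the parser establishes something like the following correspondence (a sketch, not parsyfiles code: each constructor argument of ExecOpSeriesTest is looked up as a file of the same name):

case1_mapping = {
    'x': './demo/complex_objects/case1/x.csv',                                # parsed as a Series
    'y': './demo/complex_objects/case1/y.txt',                                # parsed as an AlgoConf
    'expected_results': './demo/complex_objects/case1/expected_results.txt',  # parsed as an AlgoResults
}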
The code for parsing remains the same - we tell the framework that we want to parse a collection of objects of type ExecOpSeriesTest. The rest is handled automatically by the framework:
from pprint import pprint
from parsyfiles import parse_collection

mf_tests = parse_collection('./demo/complex_objects', ExecOpSeriesTest)
pprint(mf_tests)
Here are the results:
**** Starting to parse collection of <ExecOpSeriesTest> at location ./demo/complex_objects ****
Checking all files under ./demo/complex_objects
./demo/complex_objects (multifile)
(...)
File checks done

Building a parsing plan to parse ./demo/complex_objects (multifile) into a Dict[str, ExecOpSeriesTest]
(...)
Parsing Plan created successfully

Executing Parsing Plan for ./demo/complex_objects (multifile) > Dict[str, ExecOpSeriesTest] ------- using Multifile Collection parser (parsyfiles defaults)
(...)
--> Successfully parsed a Dict[str, ExecOpSeriesTest] from ./demo/complex_objects
Completed parsing successfully

{'case1': <ExecOpSeriesTest object at 0x00000000087DDF98>,
 'case2': <ExecOpSeriesTest object at 0x000000000737FBE0>}
Note that multifile objects and singlefile objects may coexist in the same folder, and that parsing is recursive - meaning that multifile objects or collections may contain multifile children as well.
Part 4 - Advanced topics¶
The parse_collection and parse_item functions that we have used in most examples are actually just helpers that build a parser registry (RootParser()) and use it. Most of the advanced topics below use this object directly.
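In other words, the following is roughly equivalent to the parse_collection call used in Part 1 (a small sketch, default arguments assumed):

from parsyfiles import RootParser
from pandas import DataFrame

root_parser = RootParser()
dfs = root_parser.parse_collection('./demo/simple_collection', DataFrame)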
(a) Lazy parsing¶
The multifile collection parser included in the library provides an option to return a lazy collection instead of a standard set, list, dict or tuple. This collection triggers parsing of each element only when that element is actually used. Besides giving better control over parsing time, this feature is especially useful if you want to parse as many items as possible, even if one item in the collection fails to parse.
from pprint import pprint
from parsyfiles import parse_collection
from pandas import DataFrame

dfs = parse_collection('./demo/simple_collection', DataFrame, lazy_mfcollection_parsing=True)
print('dfs length : ' + str(len(dfs)))
print('dfs keys : ' + str(dfs.keys()))
print('Is b in dfs : ' + str('b' in dfs))
pprint(dfs.get('b'))
The result log shows that parse_collection returned without parsing, and that parsing is executed when item 'b' is read from the dictionary:
Executing Parsing Plan for ./demo/simple_collection (multifile) > Dict[str, DataFrame] ------- using Multifile Collection parser (parsyfiles defaults)
Parsing ./demo/simple_collection (multifile) > Dict[str, DataFrame] ------- using Multifile Collection parser (parsyfiles defaults)
Assembling a Dict[str, DataFrame] from all children of ./demo/simple_collection (multifile) (lazy parsing: children will be parsed when used)
--> Successfully parsed a Dict[str, DataFrame] from ./demo/simple_collection
Completed parsing successfully
dfs length : 5
dfs keys : {'a', 'e', 'd', 'c', 'b'}
Is b in dfs : True
Executing Parsing Plan for ./demo/simple_collection\b (singlefile, .txt) > DataFrame ------- using <read_df_or_series_from_csv>
Parsing ./demo/simple_collection\b (singlefile, .txt) > DataFrame ------- using <read_df_or_series_from_csv>
--> Successfully parsed a DataFrame from ./demo/simple_collection\b
Completed parsing successfully
   a  b  c  d
0  1  2  3  4
(b) Passing options to existing parsers¶
Parsers and converters support options. In order to know which options are available for a specific parser, the best way is to identify it and ask it. For example, if you want to know which options are available for the parsers that read DataFrame objects:
from pandas import DataFrame
from parsyfiles import RootParser

# create a root parser
parser = RootParser()

# retrieve the parsers of interest
parsers = parser.get_capabilities_for_type(DataFrame, strict_type_matching=False)

df_csv_parser = parsers['.csv']['1_exact_match'][0]
p_id_csv = df_csv_parser.get_id_for_options()
print('Parser id for csv is : ' + p_id_csv + ', implementing function is ' + repr(df_csv_parser._parser_func))
print(' * ' + df_csv_parser.options_hints())

df_xls_parser = parsers['.xls']['1_exact_match'][0]
p_id_xls = df_xls_parser.get_id_for_options()
print('Parser id for xls is : ' + p_id_xls + ', implementing function is ' + repr(df_xls_parser._parser_func))
print(' * ' + df_xls_parser.options_hints())
The result is:
Parser id for csv is : read_df_or_series_from_csv, implementing function is <function read_df_or_series_from_csv at 0x0000000007391378>
 * read_df_or_series_from_csv: all options from read_csv are supported, see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
Parser id for xls is : read_dataframe_from_xls, implementing function is <function read_dataframe_from_xls at 0x0000000007391158>
 * read_dataframe_from_xls: all options from read_excel are supported, see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html
Then you may set the options accordingly on the root parser before calling it:
from parsyfiles import create_parser_options, add_parser_options

# configure the DataFrame parsers to automatically parse dates and use the first column as index
opts = create_parser_options()
opts = add_parser_options(opts, 'read_df_or_series_from_csv', {'parse_dates': True, 'index_col': 0})
opts = add_parser_options(opts, 'read_dataframe_from_xls', {'index_col': 0})

dfs = parser.parse_collection('./test_data/demo/ts_collection', DataFrame, options=opts)
print(dfs)
Results:
{'a':                      a  b  c  d
 time
 2015-08-28 23:30:00       1  2  3  4
 2015-08-29 00:00:00       1  2  3  5,
 'c':             a  b
 date
 2015-01-01       1  2
 2015-01-02       4  3,
 'b':                      a  b  c  d
 time
 2015-08-28 23:30:00       1  2  3  4
 2015-08-29 00:00:00       1  2  3  5}
(c) Parsing subclasses of existing types - registering converters¶
Imagine that you want to parse a subtype of something the framework already knows how to parse. For example a TimeSeries class of your own, that extends DataFrame:
from pandas import DataFrame, DatetimeIndex

class TimeSeries(DataFrame):
    """ A dummy timeseries class that extends DataFrame """

    def __init__(self, df: DataFrame):
        """
        Constructor from a DataFrame. The DataFrame index should be an instance of DatetimeIndex
        :param df:
        """
        if isinstance(df, DataFrame) and isinstance(df.index, DatetimeIndex):
            if df.index.tz is None:
                df.index = df.index.tz_localize(tz='UTC')  # use the UTC hypothesis in absence of other hints
            self._df = df
        else:
            raise ValueError('Error creating TimeSeries from DataFrame: provided DataFrame does not have a '
                             'valid DatetimeIndex')

    def __getattr__(self, item):
        # Redirects anything that is not implemented here to the base dataframe.
        # this is called only if the attribute was not found the usual way

        # easy version of the dynamic proxy just to save time :)
        # see http://code.activestate.com/recipes/496741-object-proxying/ for "the answer"
        df = object.__getattribute__(self, '_df')
        if hasattr(df, item):
            return getattr(df, item)
        else:
            raise AttributeError('\'' + self.__class__.__name__ + '\' object has no attribute \'' + item + '\'')

    def update(self, other, join='left', overwrite=True, filter_func=None, raise_conflict=False):
        """ For some reason this method was abstract in DataFrame so we have to implement it """
        return self._df.update(other, join=join, overwrite=overwrite, filter_func=filter_func,
                               raise_conflict=raise_conflict)
It is relatively easy to write a converter between a DataFrame and a TimeSeries. parsyfiles provides classes that you should use to define your converters, for example ConverterFunction here, which takes as argument a conversion method with a specific signature - hence the extra unused arguments in df_to_ts:
from typing import Type
from logging import Logger
from parsyfiles.converting_core import ConverterFunction

def df_to_ts(desired_type: Type[TimeSeries], df: DataFrame, logger: Logger) -> TimeSeries:
    """ Converter from DataFrame to TimeSeries """
    return TimeSeries(df)

my_converter = ConverterFunction(from_type=DataFrame, to_type=TimeSeries, conversion_method=df_to_ts)
You have to create the parser manually in order to register your converter:
from parsyfiles import RootParser, create_parser_options, add_parser_options

# create a parser
parser = RootParser('parsyfiles with timeseries')
parser.register_converter(my_converter)
In some cases you may wish to change the underlying parsers options. This is possible provided that you know the identifier of the parser you wish to configure (typically it is the one appearing in the logs):
# configure the DataFrame parsers to read the first column as a datetime index
opts = create_parser_options()
opts = add_parser_options(opts, 'read_df_or_series_from_csv', {'parse_dates': True, 'index_col': 0})
opts = add_parser_options(opts, 'read_dataframe_from_xls', {'index_col': 0})
Finally, parsing is done the same way as before:
dfs = parser.parse_collection('./test_data/demo/ts_collection', TimeSeries, options=opts)
Note: you might have noticed that TimeSeries is a dynamic proxy. The TimeSeries class extends the DataFrame class, but delegates everything to the underlying DataFrame implementation provided in the constructor. This pattern is a good way to create specialized versions of generic objects created by your favourite parsers. For example, two DataFrames might represent a training set and a prediction table. Both objects, although similar (both are tables with rows and columns), might have very different contents (column names, column types, number of rows, etc.). We can make this fundamental difference appear at parsing level, by creating two classes, as sketched below.
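A minimal sketch of that idea - the class and function names below are hypothetical and not part of parsyfiles: two thin DataFrame subclasses carry the domain meaning, and each could get its own converter registered, exactly like TimeSeries above.

class TrainingSet(DataFrame):
    """ A table of labelled examples (hypothetical example class). """
    pass

class PredictionTable(DataFrame):
    """ A table of rows for which predictions are requested (hypothetical example class). """
    pass

def df_to_training_set(desired_type: Type[TrainingSet], df: DataFrame, logger: Logger) -> TrainingSet:
    """ Converter from DataFrame to TrainingSet, same signature as df_to_ts above """
    return TrainingSet(df)

parser.register_converter(ConverterFunction(from_type=DataFrame, to_type=TrainingSet,
                                            conversion_method=df_to_training_set))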
(d) Registering a new parser¶
Parsyfiles offers several ways to register a parser. Here is a simple example, where we register a basic 'singlefile' xml parser:
from typing import Type
from logging import Logger
from xml.etree.ElementTree import ElementTree, parse, tostring

from parsyfiles import RootParser
from parsyfiles.parsing_core import SingleFileParserFunction, T

def read_xml(desired_type: Type[T], file_path: str, encoding: str, logger: Logger, **kwargs):
    """
    Opens an XML file and returns the tree parsed from it as an ElementTree.

    :param desired_type:
    :param file_path:
    :param encoding:
    :param logger:
    :param kwargs:
    :return:
    """
    return parse(file_path)

my_parser = SingleFileParserFunction(parser_function=read_xml,
                                     streaming_mode=False,
                                     supported_exts={'.xml'},
                                     supported_types={ElementTree})

parser = RootParser('parsyfiles with xml')
parser.register_parser(my_parser)

xmls = parser.parse_collection('./test_data/demo/xml_collection', ElementTree)
print({name: tostring(x.getroot()) for name, x in xmls.items()})
For more examples on how the parser API can be used, please have a look at the core and optional plugins.
(e) Contract validation for parsed objects : combo with classtools-autocode and attrs¶
Users may wish to use classtools_autocode or attrs in order to create very compact classes representing their objects, while at the same time ensuring that parsed data is valid according to some contract. parsyfiles is fully compatible with such classes, as shown in the examples below.
classtools-autocode example¶
from classtools_autocode import autoprops, autoargs
from contracts import contract, new_contract
from parsyfiles import parse_collection

# custom PyContract used in the class
new_contract('allowed_op', lambda x: x in {'+', '*'})

@autoprops
class ExecOpTest(object):
    @autoargs
    @contract(x='int|float', y='int|float', op='str,allowed_op', expected_result='int|float')
    def __init__(self, x: float, y: float, op: str, expected_result: float):
        pass

    def __str__(self):
        return self.__repr__()

    def __repr__(self):
        return str(self.x) + ' ' + self.op + ' ' + str(self.y) + ' =? ' + str(self.expected_result)

sf_tests = parse_collection('./demo/simple_objects', ExecOpTest)
The above code defines a custom contract, allowed_op, which checks that the op attribute is in {'+','*'}. When '-' is found in a test file, parsing fails:
ParsingException : Error while parsing ./demo/simple_objects\test_diff_1 (singlefile, .cfg) as a ExecOpTest with parser '$<read_config> => <merge_all_config_sections_into_a_single_dict> -> <dict_to_object>$' using options=({'MultifileCollectionParser': {'lazy_parsing': False}}) : caught
  ObjectInstantiationException : Error while building object of type <ExecOpTest> using its constructor and parsed contents : {'y': 1.0, 'x': 1.0, 'expected_result': 0.0, 'op': '-'} :
  <class 'contracts.interface.ContractNotRespected'> Breach for argument 'op' to ExecOpTest:generated_setter_fun().
  Value does not pass criteria of <lambda>()() (module: test_parsyfiles).
  checking: callable()       for value: Instance of <class 'str'>: '-'
  checking: allowed_op       for value: Instance of <class 'str'>: '-'
  checking: str,allowed_op   for value: Instance of <class 'str'>: '-'
attrs example¶
In order for parsyfiles to find the required type for each attribute declared using attrs, you will have to use attr.validators.instance_of. However, since you may also wish to implement some custom validation logic, we provide (until it is officially added in attrs) a chaining operator. The code below shows how to create an example similar to the previous one:
import attr
from attr.validators import instance_of
from parsyfiles import parse_collection
from parsyfiles.plugins_optional.support_for_attrs import chain

# custom contract used in the class
def validate_op(instance, attribute, value):
    allowed = {'+', '*'}
    if value not in allowed:
        raise ValueError('\'op\' has to be a string, in ' + str(allowed) + '!')

@attr.s
class ExecOpTest(object):
    x = attr.ib(convert=float, validator=instance_of(float))
    y = attr.ib(convert=float, validator=instance_of(float))
    # we use the 'chain' validator here to keep using instance_of
    op = attr.ib(convert=str, validator=chain(instance_of(str), validate_op))
    expected_result = attr.ib(convert=float, validator=instance_of(float))

sf_tests = parse_collection('./test_data/demo/simple_objects', ExecOpTest)
When '-' is found in a test file, it also fails with a nice error message:
ParsingException : Error while parsing ./test_data/demo/simple_objects\test_diff_1 (singlefile, .cfg) as a ExecOpTest with parser '$<read_config> => <merge_all_config_sections_into_a_single_dict> -> <dict_to_object>$' using options=({'MultifileCollectionParser': {'lazy_parsing': False}}) : caught
  ObjectInstantiationException : Error while building object of type <ExecOpTest> using its constructor and parsed contents : {'y': 1.0, 'x': 1.0, 'expected_result': 0.0, 'op': '-'} :
  <class 'ValueError'> 'op' has to be a string, in {'*', '+'}!
Note: unfortunately, as of today (version 16.3), attrs does not validate attribute contents when fields are later modified on the object directly. A pull request is ongoing.
(f) File mappings: Wrapped/Flat and encoding¶
In Part 3 - Multifile objects: combining several parsers we used folders to encapsulate objects. In previous examples we also used the root folder to encapsulate the main item collection. This default setting is known as 'Wrapped' mode and corresponds, behind the scenes, to a WrappedFileMappingConfiguration being used, with the default python encoding.
Alternatively you may wish to use flat mode. In this case the folder structure should be flat, as shown below. Item names and field names are separated by a configurable character string. For example, to parse the same example as in Part 3 - Multifile objects: combining several parsers, but with the following flat tree structure:
.
├── case1--expected_results.txt
├── case1--x.csv
├── case1--y.txt
├── case2--expected_results.txt
├── case2--x.csv
└── case2--y.txt
you'll need to call:
from parsyfiles import FlatFileMappingConfiguration

mf_tests = parse_collection('./demo/complex_objects_flat', ExecOpSeriesTest,
                            file_mapping_conf=FlatFileMappingConfiguration())
Note that FlatFileMappingConfiguration may be configured to use another separator sequence than '--' by passing it to the constructor, e.g. FlatFileMappingConfiguration(separator='_'). A dot '.' may safely be used as a separator too.
Finally you may change the file encoding used by both file mapping configurations: WrappedFileMappingConfiguration(encoding='utf-16') or FlatFileMappingConfiguration(encoding='utf-16').
(g) Recursivity: Multifile children of Multifile objects¶
As said earlier in this tutorial, parsyfiles is able to parse multifile structures recursively - for example multifile collections of multifile objects, multifile objects with multifile attributes, etc.
Example recursivity in flat mode¶
./custom_old_demo_flat_coll
├── case1--input_a.txt
├── case1--input_b.txt
├── case1--output.txt
├── case2--input_a.txt
├── case2--input_b.txt
├── case2--options.txt
├── case2--output.txt
├── case3--input_a.txt
├── case3--input_b.txt
├── case3--input_c--keyA--item1.txt
├── case3--input_c--keyA--item2.txt
├── case3--input_c--keyB--item1.txt
├── case3--options.cfg
└── case3--output.txt
Example recursivity in wrapped mode¶
./custom_old_demo_coll
├── case1
│   ├── input_a.txt
│   ├── input_b.txt
│   └── output.txt
├── case2
│   ├── input_a.txt
│   ├── input_b.txt
│   ├── options.txt
│   └── output.txt
└── case3
    ├── input_a.txt
    ├── input_b.txt
    ├── input_c
    │   ├── keyA
    │   │   ├── item1.txt
    │   │   └── item2.txt
    │   └── keyB
    │       └── item1.txt
    ├── options.cfg
    └── output.txt
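As an illustration of what such a tree could map to, here is a purely hypothetical class whose constructor mirrors the file names above - all names and types here are assumptions made for the example, not part of the project sources:

from typing import Dict, List

class OpTestCase(object):
    def __init__(self, input_a: str, input_b: str, output: str,
                 options: str = None, input_c: Dict[str, List[str]] = None):
        self.input_a = input_a
        self.input_b = input_b
        self.output = output
        self.options = options   # only present for case2 and case3
        self.input_c = input_c   # a multifile child (a dict of collections), only present for case3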
(h) Diversity of formats supported: DataFrames - revisited¶
Now that we've seen that parsyfiles is able to combine parsers and converters, we can try to parse DataFrame objects from many more sources:
./demo/simple_collection_dataframe_inference
├── a.csv
├── b.txt
├── c.xls
├── d.xlsx
├── s_b64_pickle.txt
├── t_pickle.pyc
├── u.json
├── v_properties.txt
├── w.properties
├── x.yaml
├── y.cfg
└── z.ini
Note: once again you may find this example data folder in the project sources
The code is the same:
from pprint import pprint
from parsyfiles import parse_collection
from pandas import DataFrame

dfs = parse_collection('./demo/simple_collection_dataframe_inference', DataFrame)
pprint(dfs)
And here is the result
TODO