sarus_data_spec 4.2.0.dev2
A library to manage Sarus datasets
To start using sarus_data_spec:
pip install sarus_data_spec
You may want to set export SYSTEM_VERSION_COMPAT=1 so that TensorFlow installs on macOS.
What has been added:
iter: allows iterating over arrow batches.
batch: contains utilities to convert sarus_schema to arrow_schema and sarus_types to arrow_types.
So far, only a local manager has been implemented. The manager provides in particular the following methods:
schema: returns the schema of a dataset.
to_arrow: returns an arrow iterator. It first stores the data to parquet if it cannot find it.
to_pandas: returns a dataframe version of the dataset.
size: computes the size of the dataset and of each table.
bounds: computes the bounds of a dataset.
marginals: computes the marginals of the dataset.
The operations are implemented in the manager/ops directory.
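As an illustration only, the manager interface described above can be sketched with a toy in-memory stand-in. All names and shapes below are hypothetical simplifications (tables as lists of dicts, batches as plain lists); the real sarus_data_spec manager works on SQL sources and arrow batches.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple

Row = Dict[str, float]

@dataclass
class LocalManager:
    """Hypothetical stand-in for the local manager; the real API differs."""
    tables: Dict[str, List[Row]] = field(default_factory=dict)

    def schema(self) -> Dict[str, List[str]]:
        # schema: column names of each table
        return {name: sorted(rows[0]) if rows else []
                for name, rows in self.tables.items()}

    def to_arrow(self, batch_size: int = 2) -> Iterator[List[Row]]:
        # to_arrow: yield the data in batches (stands in for arrow record batches)
        for rows in self.tables.values():
            for i in range(0, len(rows), batch_size):
                yield rows[i:i + batch_size]

    def size(self) -> Dict[str, int]:
        # size: number of rows per table
        return {name: len(rows) for name, rows in self.tables.items()}

    def bounds(self) -> Dict[str, Tuple[float, float]]:
        # bounds: (min, max) of each numeric column across tables
        acc: Dict[str, List[float]] = {}
        for rows in self.tables.values():
            for row in rows:
                for col, val in row.items():
                    acc.setdefault(col, []).append(val)
        return {col: (min(vals), max(vals)) for col, vals in acc.items()}

manager = LocalManager(tables={"users": [{"age": 31.0}, {"age": 25.0}, {"age": 40.0}]})
print(manager.size())    # {'users': 3}
print(manager.bounds())  # {'age': (25.0, 40.0)}
```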
The sql folder contains the main methods to derive the schema from a dataset and collect the data.
Some tests are available in tests/units/test_manager/test_sql to help understand each method.
The tests are still a work in progress; to experiment with them, run the tests in the schema folder
first, because they create the database.
In schema.py, one can find the methods to derive the schema:
sql_schema is the main method that performs all the computations: it reflects the dataset and calls the inner functions below.
sarus_schema_from_metadata returns a first protobuf schema. It iterates over each table, transforming it into a struct via the method get_table_types. It also stores two dictionaries in the properties: one contains the list of primary keys of each table, the other the list of foreign keys per table along with the table and column they point at.
direct_paths_to_protected_entity lists, for each table directly connected to the protected entity, all the possible paths.
schema_with_protected_entity takes the schema proto and adds to its properties the list of public tables (the ones that are not directly connected to the protected entity) and the paths.
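The key-dictionary and path-listing steps above can be sketched with plain dictionaries. The metadata shapes and function bodies below are assumptions for illustration (the real code reflects this information from the SQL database):

```python
# Hypothetical reflected metadata: table -> primary key columns,
# and table -> list of (column, referenced_table, referenced_column).
PRIMARY_KEYS = {"users": ["id"], "orders": ["id"], "products": ["id"]}
FOREIGN_KEYS = {
    "orders": [("user_id", "users", "id"), ("product_id", "products", "id")],
}

def direct_paths_to_protected_entity(protected: str):
    """List, per table with a foreign key pointing directly at the protected
    entity, the (column, referenced_column) pairs forming the path."""
    paths = {}
    for table, fks in FOREIGN_KEYS.items():
        direct = [(col, ref_col) for col, ref_table, ref_col in fks
                  if ref_table == protected]
        if direct:
            paths[table] = direct
    return paths

def public_tables(protected: str):
    """Tables not directly connected to the protected entity."""
    linked = set(direct_paths_to_protected_entity(protected)) | {protected}
    return sorted(set(PRIMARY_KEYS) - linked)

print(direct_paths_to_protected_entity("users"))  # {'orders': [('user_id', 'id')]}
print(public_tables("users"))                     # ['products']
```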
The type.py module contains methods to transform sql types into sarus types.
The data collection is done through the methods in the module queries.py:
sql_to_parquet is the main method: it reflects the table and calls different methods to compose the queries, then saves the result to parquet.
get_merged_queries returns a dict where each item queries one table, with one protected entity added to each row (via the path established in the schema).
add_counts modifies each query of the dict to also return the weight of each row (1 / the number of times it has been duplicated).
save_to_parquet iterates over each query, executes it, and stores the data in the same parquet file.
The manager operations to retrieve the schema of, and iterate over, a transformed dataset are in the transformed directory. The manager recursively applies transformations to the parent dataset.
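The weighting logic of the add_counts step can be sketched in a few lines. The row shape below, (row_id, protected_entity), is a hypothetical simplification; the real implementation rewrites SQL queries rather than Python lists:

```python
from collections import Counter

def add_counts(rows):
    """Attach to each row the weight 1 / (number of times the original row
    was duplicated when merging it with the protected entities)."""
    counts = Counter(row_id for row_id, _ in rows)
    return [(row_id, pe, 1.0 / counts[row_id]) for row_id, pe in rows]

# r1 reaches two protected entities, so each of its copies weighs 0.5.
rows = [("r1", "u1"), ("r1", "u2"), ("r2", "u1")]
print(add_counts(rows))
# [('r1', 'u1', 0.5), ('r1', 'u2', 0.5), ('r2', 'u1', 1.0)]
```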
Some tests are available in tests/unit/test_manager/test_transforms to observe the behaviour of each transform over a dataset.
The manager returns the schema of the parent dataset, and samples some indices from the parent dataset to iterate over it.
The manager returns the schema of the parent dataset, and shuffles the indices of the parent dataset to iterate over it.
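The sampling and shuffling of parent indices described above can be sketched as follows; the function names and signatures are assumptions for illustration, not the actual transform API:

```python
import random

def sample_indices(n_parent: int, n_sample: int, seed: int = 0):
    """Sample transform sketch: pick n_sample indices from the parent dataset."""
    rng = random.Random(seed)
    return rng.sample(range(n_parent), n_sample)

def shuffle_indices(n_parent: int, seed: int = 0):
    """Shuffle transform sketch: permute all parent indices."""
    rng = random.Random(seed)
    idx = list(range(n_parent))
    rng.shuffle(idx)
    return idx

parent = ["a", "b", "c", "d"]
print([parent[i] for i in sample_indices(len(parent), 2)])
print([parent[i] for i in shuffle_indices(len(parent))])
```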
Each operation is applied backwards.
To be implemented.
The new schema is retrieved directly in the transform.
The clip_visitor module provides a method to scan the data and clip its values according to the schema's max and min values.
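The clipping itself reduces to bounding each value by the schema's declared range. A minimal sketch, with a hypothetical bounds mapping standing in for the schema:

```python
def clip(value: float, lo: float, hi: float) -> float:
    """Clip a value to the [min, max] range declared in the schema."""
    return max(lo, min(hi, value))

schema_bounds = {"age": (0.0, 120.0)}  # hypothetical schema min/max
row = {"age": 250.0}
clipped = {col: clip(v, *schema_bounds[col]) for col, v in row.items()}
print(clipped)  # {'age': 120.0}
```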
Strong typing using protocols; base classes and mixins for static shared behavior; composition passed at construction for dynamic shared behavior.
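These three design principles can be illustrated together in a short sketch. The class names are invented for the example and do not come from the sarus_data_spec codebase:

```python
import json
from typing import Protocol

class Serializer(Protocol):
    # Strong typing via a protocol: any object with this method conforms,
    # no inheritance required (structural typing).
    def serialize(self, value: object) -> str: ...

class ReprMixin:
    # Mixin providing static shared behavior to every subclass.
    def describe(self) -> str:
        return f"<{type(self).__name__}>"

class JsonSerializer:
    def serialize(self, value: object) -> str:
        return json.dumps(value)

class Dataset(ReprMixin):
    # Dynamic shared behavior injected by composition at construction:
    # the serializer can be swapped without changing the class hierarchy.
    def __init__(self, serializer: Serializer):
        self.serializer = serializer

    def dump(self, value: object) -> str:
        return self.serializer.serialize(value)

ds = Dataset(JsonSerializer())
print(ds.describe(), ds.dump({"n": 1}))  # <Dataset> {"n": 1}
```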
Makefile > Dockerfile > Pipfile > lock