# sarus-dataset

A library to manage Sarus datasets

## Installation

To start using sarus_dataset, install the package:

```
pip install sarus_data_spec
```

On macOS, you may want to set `export SYSTEM_VERSION_COMPAT=1` so that TensorFlow installs correctly.

## Quickstart

## Project status

## Current features

### Dataset

What has been added:

- `iter`: iterates over Arrow batches.
- `_batch_size`: determines the size of each batch when iterating. It defaults to one but can be set via the dataset method `batch`.
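
A minimal sketch of this batching behaviour. The `Dataset` class below is a toy stand-in (plain Python lists instead of Arrow batches), not the actual sarus_data_spec API:

```python
from typing import Iterator, List


class Dataset:
    """Toy stand-in illustrating batched iteration (not the real API)."""

    def __init__(self, rows: List[dict]):
        self._rows = rows
        self._batch_size = 1  # defaults to one, as described above

    def batch(self, size: int) -> "Dataset":
        self._batch_size = size
        return self

    def __iter__(self) -> Iterator[List[dict]]:
        # Yield consecutive slices of `_batch_size` rows,
        # mimicking iteration over Arrow record batches.
        for i in range(0, len(self._rows), self._batch_size):
            yield self._rows[i : i + self._batch_size]


rows = [{"id": n} for n in range(5)]
batches = list(Dataset(rows).batch(2))
# 5 rows with batch size 2 -> batches of 2, 2 and 1 rows
```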

### Arrow

Contains utilities to convert a `sarus_schema` to an `arrow_schema` and `sarus_types` to `arrow_types`.
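
The conversion can be pictured as a type mapping. The type names below are illustrative; the real utilities build `pyarrow` types and schemas rather than strings:

```python
# Hypothetical mapping from sarus type names to Arrow type names.
SARUS_TO_ARROW = {
    "Integer": "int64",
    "Float": "float64",
    "Text": "string",
    "Boolean": "bool",
}


def arrow_type(sarus_type: str) -> str:
    """Return the Arrow equivalent of a sarus type name."""
    try:
        return SARUS_TO_ARROW[sarus_type]
    except KeyError:
        raise ValueError(f"no Arrow equivalent for sarus type {sarus_type!r}")
```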

### Manager

So far, only a local manager has been implemented. In particular, the manager provides the following methods:

- `schema`: returns the schema of a dataset.
- `to_arrow`: returns an Arrow iterator. It first stores the data to Parquet if it cannot find it there.
- `to_pandas`: returns a DataFrame version of the dataset.
- `size`: computes the size of the dataset and of each table.
- `bounds`: computes the bounds of a dataset.
- `marginals`: computes the marginals of the dataset.
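
A toy illustration of some of these methods on an in-memory, dict-of-columns "dataset". This is a sketch of the idea only; the real local manager works on protobuf-described datasets, not plain dicts:

```python
class LocalManager:
    """Toy local manager illustrating the methods listed above."""

    def schema(self, dataset: dict) -> dict:
        # Map each table to a column-name -> type-name mapping.
        return {
            table: {col: type(values[0]).__name__ for col, values in cols.items()}
            for table, cols in dataset.items()
        }

    def size(self, dataset: dict) -> dict:
        # Number of rows per table.
        return {table: len(next(iter(cols.values()))) for table, cols in dataset.items()}

    def bounds(self, dataset: dict) -> dict:
        # (min, max) for every numeric column.
        return {
            table: {
                col: (min(values), max(values))
                for col, values in cols.items()
                if all(isinstance(v, (int, float)) for v in values)
            }
            for table, cols in dataset.items()
        }


ds = {"users": {"id": [1, 2, 3], "age": [25, 40, 33]}}
manager = LocalManager()
```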

### Manager Operations

The operations are implemented in the `manager/ops` directory.

#### SQL

The `sql` folder contains the main methods to derive the schema from a dataset and to collect the data. Tests are available in `tests/units/test_manager/test_sql` to help understand each method. The tests are still work in progress; to play with them, run the tests in the `schema` folder first, because they create the database.

##### Schema

In `schema.py`, one can find the methods to derive the schema:

- `sql_schema` is the main method that performs all the computations: it reflects the dataset and calls the inner functions below.
- `sarus_schema_from_metadata` returns a first protobuf schema. It iterates over each table, transforming it into a struct via the method `get_table_types`. It also stores two dictionaries in the properties: one contains the list of primary keys of each table, the other the list of foreign keys per table along with the table and column they point at.
- `direct_paths_to_protected_entity` lists, for each table directly connected to the protected entity, all the possible paths.
- The schema with protected entity takes the schema proto and adds to its properties the list of public tables (the ones that are not directly connected to the protected entity) and the paths.

The `type.py` module contains methods to transform SQL types into sarus types.
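
The metadata-to-schema step can be sketched as follows. The dict-based table metadata and the output shape are illustrative stand-ins for the reflected SQL metadata and the protobuf schema:

```python
# Hypothetical reflected metadata for two tables.
tables = {
    "users": {
        "columns": {"id": "Integer", "name": "Text"},
        "primary_keys": ["id"],
        "foreign_keys": {},
    },
    "orders": {
        "columns": {"id": "Integer", "user_id": "Integer"},
        "primary_keys": ["id"],
        # column -> (referenced table, referenced column)
        "foreign_keys": {"user_id": ("users", "id")},
    },
}


def sarus_schema_from_metadata(tables: dict) -> dict:
    """Build a schema-like dict: one struct per table, plus key properties."""
    schema = {"tables": {}, "properties": {"primary_keys": {}, "foreign_keys": {}}}
    for name, meta in tables.items():
        schema["tables"][name] = dict(meta["columns"])
        schema["properties"]["primary_keys"][name] = meta["primary_keys"]
        schema["properties"]["foreign_keys"][name] = meta["foreign_keys"]
    return schema


schema = sarus_schema_from_metadata(tables)
```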

##### Data collection / iteration

The data collection is done through the methods in the module `queries.py`:

- `sql_to_parquet` is the main method: it reflects the table and calls different methods to compose the queries, then saves the result to Parquet.
- `get_merged_queries` returns a dict where each item queries one table, with one protected entity added to each row (via the path established in the schema).
- `add_counts` modifies each query of the dict to also return the weight of each row (1 / the number of times it has been duplicated).
- `save_to_parquet` iterates over each query, executes it, and stores the data in the same Parquet file.
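
A sketch of the `add_counts` step as SQL string rewriting. The query text, column names (`id`, `protected_entity_id`), and the window-function formulation are illustrative assumptions, not the actual generated SQL:

```python
def add_counts(queries: dict) -> dict:
    """Wrap each per-table query so every row also carries its weight:
    1 / (number of duplicates of the same original row, identified by q.id)."""
    return {
        table: (
            "SELECT q.*, 1.0 / COUNT(*) OVER (PARTITION BY q.id) AS weight "
            f"FROM ({sql}) AS q"
        )
        for table, sql in queries.items()
    }


queries = {
    "orders": (
        "SELECT o.*, u.id AS protected_entity_id "
        "FROM orders o JOIN users u ON o.user_id = u.id"
    )
}
weighted = add_counts(queries)
```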

#### Transforms

The manager operations to retrieve the schema of, and iterate over, a transformed dataset are in the `transformed` directory. The manager recursively applies transformations to the parent dataset. Tests are available in `tests/unit/test_manager/test_transforms` to observe the behaviour of each transform over a dataset.

##### Sample

The manager returns the schema of the parent dataset, and samples some indices from the parent dataset to iterate over it.

##### Shuffle

The manager returns the schema of the parent dataset, and shuffles the indices of the parent dataset to iterate over it.
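
Both transforms only change which indices of the parent are iterated over, while the schema is passed through unchanged. A toy sketch of the index manipulation (function names are illustrative):

```python
import random


def sample_indices(n_parent: int, size: int, seed: int = 0) -> list:
    """Pick `size` distinct indices of the parent dataset (Sample)."""
    rng = random.Random(seed)
    return rng.sample(range(n_parent), size)


def shuffle_indices(n_parent: int, seed: int = 0) -> list:
    """Permute all indices of the parent dataset (Shuffle)."""
    rng = random.Random(seed)
    indices = list(range(n_parent))
    rng.shuffle(indices)
    return indices


parent = ["row%d" % i for i in range(10)]
sampled = [parent[i] for i in sample_indices(len(parent), 3)]
shuffled = [parent[i] for i in shuffle_indices(len(parent))]
```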

##### Composed

Each operation is applied backwards.
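
In other words, evaluation recurses into the parent first, so the chain of operations is unwound back to the source and then applied forward. A toy sketch, using tuples for transformed datasets and plain lists for sources (the real library uses protobuf-described datasets and transforms):

```python
def evaluate(node):
    """Evaluate a dataset defined by a chain of transforms."""
    if isinstance(node, list):          # a source dataset: plain rows
        return node
    transform, parent = node            # a transformed dataset
    # Recurse backwards into the parent, then apply this operation.
    return transform(evaluate(parent))


double = lambda rows: [2 * r for r in rows]
inc = lambda rows: [r + 1 for r in rows]
composed = (inc, (double, [1, 2, 3]))   # inc(double(source))
result = evaluate(composed)             # -> [3, 5, 7]
```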

##### Filter

To be implemented.

##### UserSettings

The new schema is retrieved directly from the transform. The `clip_visitor` module provides a method to scan the data and clip its values according to the schema's maxima and minima.
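
A minimal sketch of the clipping pass, clamping each value to the `[min, max]` range declared in the schema. The bounds dict and function name are illustrative, not the real `clip_visitor` interface:

```python
# Hypothetical per-column bounds taken from the schema.
schema_bounds = {"age": (0, 120), "income": (0.0, 1e6)}


def clip_row(row: dict, bounds: dict) -> dict:
    """Clamp every bounded column to its [min, max]; leave others untouched."""
    return {
        col: min(max(value, bounds[col][0]), bounds[col][1])
        if col in bounds
        else value
        for col, value in row.items()
    }


clipped = clip_row({"age": 150, "income": -10.0, "name": "a"}, schema_bounds)
# age clamped down to 120, income clamped up to 0.0, name passed through
```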

## Known limitations / Bugs

## Acknowledgements

## Contributors

## Miscellaneous

### Coding choices

- Strong typing using protocols.
- Base classes and mixins for static shared behavior.
- Composition passed at construction for dynamic shared behavior.
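
The three choices can be illustrated together in a few lines. All class names here are invented for the example, not the real sarus_data_spec classes:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class HasSchema(Protocol):
    """Structural type: anything with a schema() method satisfies it."""

    def schema(self) -> dict: ...


class ReprMixin:
    """Static shared behavior: every subclass prints the same way."""

    def __repr__(self) -> str:
        return f"{type(self).__name__}({self.schema()})"


class Dataset(ReprMixin):
    def __init__(self, schema: dict, manager) -> None:
        self._schema = schema
        # Dynamic shared behavior: the manager is injected at construction,
        # so the same Dataset class works with any manager implementation.
        self._manager = manager

    def schema(self) -> dict:
        return self._schema

    def to_pandas(self):
        return self._manager.to_pandas(self)


class FakeManager:
    def to_pandas(self, dataset):
        return f"dataframe for {dataset.schema()}"


ds = Dataset({"id": "int64"}, FakeManager())
```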

### Locking 2024

`Makefile` > `Dockerfile` > `Pipfile` > lock