sarus_data_spec / PKG-INFO
Metadata-Version: 2.1
Name: sarus_data_spec
Version: 4.5.4.dev1
Summary: A library to manage Sarus datasets
Author: Sarus
License: PRIVATE
Classifier: Programming Language :: Python :: 3.9
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: requests
Requires-Dist: importlib-metadata; python_version < "3.10"
Requires-Dist: protobuf==3.20.3
Requires-Dist: numpy>=1.20.0
Requires-Dist: pyarrow~=15.0.0
Requires-Dist: fsspec[gcs,http]>=2021.0
Requires-Dist: pandas~=1.4.0
Requires-Dist: typing_extensions>=4.1.0
Provides-Extra: tests
Requires-Dist: pytest>=6.2; extra == "tests"
Requires-Dist: pytest-mock>=3.6; extra == "tests"
Requires-Dist: pytest-cov>=2.12; extra == "tests"
Requires-Dist: pyspark; extra == "tests"
Requires-Dist: psycopg2-binary; extra == "tests"
Requires-Dist: docker; extra == "tests"
Requires-Dist: types-cachetools; extra == "tests"
Requires-Dist: types-requests; extra == "tests"
Requires-Dist: types-setuptools; extra == "tests"
Requires-Dist: types-python-dateutil; extra == "tests"
Requires-Dist: mypy==1.9.0; extra == "tests"
Requires-Dist: mypy-protobuf==2.10.0; extra == "tests"
Requires-Dist: sqlalchemy~=2.0; extra == "tests"
Requires-Dist: pre-commit; extra == "tests"
Requires-Dist: types-protobuf==3.18.4; extra == "tests"
Requires-Dist: iso3166; extra == "tests"
Provides-Extra: tensorflow
Requires-Dist: tensorflow>=2.0; (sys_platform != "darwin" or platform_machine != "arm64") and extra == "tensorflow"
Requires-Dist: tensorflow-macos>=2.0; (sys_platform == "darwin" and platform_machine == "arm64") and extra == "tensorflow"
Provides-Extra: onboarding
Requires-Dist: sarus-statistics>=4.0.1; extra == "onboarding"
Requires-Dist: sarus-synthetic-data>=4.0.7; extra == "onboarding"
Requires-Dist: clevercsv; extra == "onboarding"
Provides-Extra: external
Requires-Dist: scikit-learn==1.2.2; extra == "external"
Requires-Dist: scipy>=1.9.0; extra == "external"
Requires-Dist: shap==0.42.1; extra == "external"
Requires-Dist: imbalanced-learn; extra == "external"
Requires-Dist: scikit-optimize; extra == "external"
Requires-Dist: ydata-profiling<4.7.0; extra == "external"
Requires-Dist: visions; extra == "external"
Requires-Dist: plotly; extra == "external"
Requires-Dist: optbinning; extra == "external"
Requires-Dist: xgboost~=1.6.1; extra == "external"
Provides-Extra: sql
Requires-Dist: pyqrlew>=0.9.26; extra == "sql"
Provides-Extra: bigquery
Requires-Dist: google-cloud; extra == "bigquery"
Requires-Dist: google-cloud-dataproc; extra == "bigquery"
Requires-Dist: google-cloud-bigquery; extra == "bigquery"
Requires-Dist: google-cloud-bigquery-storage; extra == "bigquery"
Requires-Dist: google-auth-stubs; extra == "bigquery"
Requires-Dist: google-api-python-client-stubs; extra == "bigquery"
Requires-Dist: sqlalchemy-bigquery; extra == "bigquery"
Provides-Extra: dpops
Requires-Dist: sarus-statistics>=4.0.0; extra == "dpops"
Requires-Dist: sarus-differential-privacy>=1.0.1; extra == "dpops"
Provides-Extra: llm
Requires-Dist: sarus-llm~=1.0.0; extra == "llm"

# sarus-dataset

A library to manage Sarus datasets

## Installation
To start using `sarus_data_spec`, install it with pip:

```shell
pip install sarus_data_spec
```

You may want to set `export SYSTEM_VERSION_COMPAT=1` so that TensorFlow installs on macOS.

## Quickstart


## Project status

### Current features

#### Dataset
Features added so far:
- `iter`: iterates over Arrow record batches.
- `_batch_size`: determines the size of each batch when iterating. It defaults to one but can be set via the dataset's `batch` method.
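The batching behaviour described above can be sketched in pure Python (hypothetical names; the real `iter` yields Arrow record batches rather than lists):

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def iter_batches(rows: Iterable[T], batch_size: int = 1) -> Iterator[List[T]]:
    """Yield successive batches of at most `batch_size` rows.

    Mimics a dataset whose `_batch_size` defaults to one and can be
    overridden via a `batch` method.
    """
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```

With `batch_size=3`, seven rows yield batches of sizes 3, 3 and 1; with the default, every batch holds a single row.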

#### Arrow

Contains utilities to convert a Sarus schema to an Arrow schema and Sarus types to Arrow types.
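As an illustration only (the actual conversion operates on protobuf type messages, and the names below are invented), such a type conversion can be as simple as a lookup table:

```python
# Hypothetical mapping from Sarus type names to Arrow type names; the real
# utilities in the `arrow` module work on protobuf schema/type messages.
SARUS_TO_ARROW = {
    "integer": "int64",
    "float": "float64",
    "text": "string",
    "boolean": "bool",
}

def arrow_type_name(sarus_type: str) -> str:
    """Return the Arrow type name for a Sarus type name."""
    try:
        return SARUS_TO_ARROW[sarus_type]
    except KeyError as e:
        raise ValueError(f"Unsupported Sarus type: {sarus_type}") from e
```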

#### Manager

So far, only a local manager has been implemented. The manager provides, in particular, the following methods:
- `schema`: returns the schema of a dataset.
- `to_arrow`: returns an Arrow iterator. It first stores the data to parquet if it cannot find it.
- `to_pandas`: returns a dataframe version of the dataset.
- `size`: computes the size of the dataset and of each table.
- `bounds`: computes the bounds of a dataset.
- `marginals`: computes the marginals of the dataset.
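Consistent with the project's stated use of protocols for strong typing, the manager interface above could be captured as a `typing.Protocol`. This is a sketch: the signatures and return types are assumptions, not the library's actual API.

```python
from typing import Any, Iterator, Protocol, runtime_checkable

@runtime_checkable
class Manager(Protocol):
    """Hypothetical view of the manager interface listed above."""

    def schema(self, dataset: Any) -> Any: ...
    def to_arrow(self, dataset: Any) -> Iterator[Any]: ...
    def to_pandas(self, dataset: Any) -> Any: ...
    def size(self, dataset: Any) -> Any: ...
    def bounds(self, dataset: Any) -> Any: ...
    def marginals(self, dataset: Any) -> Any: ...

class LocalManager:
    """Toy implementation: any class with the same methods conforms."""

    def schema(self, dataset: Any) -> Any: return {"tables": []}
    def to_arrow(self, dataset: Any) -> Iterator[Any]: return iter(())
    def to_pandas(self, dataset: Any) -> Any: return None
    def size(self, dataset: Any) -> Any: return 0
    def bounds(self, dataset: Any) -> Any: return {}
    def marginals(self, dataset: Any) -> Any: return {}
```

Because the protocol is structural, `LocalManager` conforms without inheriting from `Manager`.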

#### Manager Operations
The operations are implemented in the `manager/ops` directory.

#### SQL

The `sql` folder contains the main methods to derive the schema from a dataset and collect its data.
Tests are available in `tests/units/test_manager/test_sql` to illustrate each method.
The tests are still a work in progress; to play with them, run the tests in the schema folder first, because they create the database.

##### Schema
In `schema.py`, one can find the methods that derive the schema:
- `sql_schema` is the main method that performs all the computations: it reflects the dataset and calls the inner functions below.
- `sarus_schema_from_metadata` returns a first protobuf schema. It iterates over each table, transforming it into a struct via the method `get_table_types`. It also stores two dictionaries in the properties: one contains the list of primary keys in each table, the other the list of foreign keys per table along with the table and column they point at.
- `direct_paths_to_protected_entity` lists, for each table directly connected to the protected entity, all the possible paths.
- `schema_with_protected_entity` takes the schema proto and adds to its properties the list of public tables (the ones that are not directly connected to the protected entity) and the paths.

The `type.py` module contains methods to transform SQL types into Sarus types.

##### Data collection / iteration
Data collection is done through the methods of the `queries.py` module:
- `sql_to_parquet` is the main method: it reflects the table and calls different methods to compose the queries, then saves the result to parquet.
- `get_merged_queries` returns a dict where each item queries one table, with one protected entity added to each row (via the path established in the schema).
- `add_counts` modifies each query of the dict to also return the weight of each row (1 / the number of times it has been duplicated).
- `save_to_parquet` iterates over each query, executes it, and stores the data in the same parquet file.
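A hedged sketch of the weighting idea: wrapping a per-table query so each row carries a weight of 1 divided by its duplication count. This is SQL text composition only; the real implementation works on reflected queries, and the column and key names here are invented.

```python
def add_counts(query: str, key: str) -> str:
    """Wrap a per-table query so each row also carries its weight,
    i.e. 1 / the number of times the row has been duplicated."""
    return (
        f"SELECT t.*, 1.0 / COUNT(*) OVER (PARTITION BY t.{key}) AS sarus_weight "
        f"FROM ({query}) AS t"
    )
```

For example, `add_counts("SELECT * FROM users", "id")` wraps the query with a window function that computes the per-row weight.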

#### Transforms
The manager operations to retrieve the schema of, and iterate over, a transformed dataset are in the `transformed` directory. The manager recursively applies transformations to the parent dataset.
Some tests are available in `tests/unit/test_manager/test_transforms` to observe the behaviour of each transform over a dataset.
##### Sample
The manager returns the schema of the parent dataset, and samples some indices from the parent dataset to iterate over it.

##### Shuffle
The manager returns the schema of the parent dataset, and shuffles the indices of the parent dataset to iterate over it.
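Both index-based transforms can be sketched as follows (hypothetical helper names; the real operations live in the `transformed` directory):

```python
import random
from typing import List

def sample_indices(n_parent: int, n: int, seed: int = 0) -> List[int]:
    """Sample `n` distinct row indices from a parent dataset of size `n_parent`."""
    return random.Random(seed).sample(range(n_parent), n)

def shuffle_indices(n_parent: int, seed: int = 0) -> List[int]:
    """Return a permutation of the parent dataset's row indices."""
    indices = list(range(n_parent))
    random.Random(seed).shuffle(indices)
    return indices
```

Iteration then simply reads the parent's rows in the order given by the returned indices.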

##### Composed
Operations are applied backwards: each composed transform first resolves its parent, then applies its own operation to the result.
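A minimal sketch of that backward resolution (invented names): each node first asks its parent for data, then applies its own operation, so the composition unwinds from the final dataset back to the source.

```python
from typing import Callable, List, Optional

class TransformNode:
    """Toy composed-transform node over list-of-rows data."""

    def __init__(
        self,
        op: Callable[[List[int]], List[int]],
        parent: Optional["TransformNode"] = None,
    ) -> None:
        self.op = op
        self.parent = parent

    def collect(self, source: List[int]) -> List[int]:
        # Recurse to the parent first, then apply this node's operation.
        data = self.parent.collect(source) if self.parent else source
        return self.op(data)

root = TransformNode(lambda d: d)                         # source dataset
doubled = TransformNode(lambda d: [x * 2 for x in d], root)
head = TransformNode(lambda d: d[:2], doubled)
```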

##### Filter
To be implemented.

##### UserSettings
The new schema is retrieved directly from the transform.
The `clip_visitor` module provides a method to scan the data and clip its values according to the schema's minimum and maximum values.
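The clipping step itself reduces to bounding each value by the schema's min and max, e.g. (a sketch, not the `clip_visitor` API):

```python
def clip(value: float, minimum: float, maximum: float) -> float:
    """Clip a value to the [minimum, maximum] range given by the schema."""
    return max(minimum, min(maximum, value))
```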


### Known limitations / Bugs


## Acknowledgements

### Contributors

### Miscellaneous

## Coding choices

- Strong typing using protocols.
- Base classes and mixins for static shared behavior.
- Composition passed at construction for dynamic shared behavior.
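These three choices can be illustrated together in a short sketch (all names hypothetical, not the library's actual classes):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class HasUuid(Protocol):
    """Strong typing via a protocol: anything with a `uuid` method conforms."""
    def uuid(self) -> str: ...

class ReprMixin:
    """Static shared behavior via a mixin."""
    def __repr__(self) -> str:
        return f"{type(self).__name__}({self.uuid()})"

class Dataset(ReprMixin):
    """Dynamic shared behavior via composition: a manager is injected at construction."""

    def __init__(self, uid: str, manager: object) -> None:
        self._uid = uid
        self._manager = manager

    def uuid(self) -> str:
        return self._uid
```

Swapping the injected manager changes the dataset's runtime behavior without subclassing, while the mixin and protocol keep the shared surface statically checkable.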

## Locking 2024

The locking chain goes from the Makefile to the Dockerfile to the Pipfile to the lock file: `Makefile > Dockerfile > Pipfile > lock`.