Metadata-Version: 2.1
Name: nanoarrow
Version: 0.7.0.dev132
Summary: Python bindings to the nanoarrow C library
Author-email: Apache Arrow Developers <dev@arrow.apache.org>
Maintainer-email: Apache Arrow Developers <dev@arrow.apache.org>
License: Apache-2.0
Project-URL: Homepage, https://arrow.apache.org
Project-URL: Repository, https://github.com/apache/arrow-nanoarrow
Project-URL: Issues, https://github.com/apache/arrow-nanoarrow/issues
Project-URL: Changelog, https://github.com/apache/arrow-nanoarrow/blob/main/CHANGELOG.md
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Provides-Extra: test
Requires-Dist: pyarrow; extra == "test"
Requires-Dist: python-dateutil; extra == "test"
Requires-Dist: pytest; extra == "test"
Requires-Dist: numpy; extra == "test"
Provides-Extra: verify
Requires-Dist: python-dateutil; extra == "verify"
Requires-Dist: pytest; extra == "verify"
Requires-Dist: numpy; extra == "verify"
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!-- Render with jupyter nbconvert --to markdown README.ipynb -->
# nanoarrow for Python
The nanoarrow Python package provides bindings to the nanoarrow C library. Like
the nanoarrow C library, it provides tools to facilitate the use of the
[Arrow C Data](https://arrow.apache.org/docs/format/CDataInterface.html)
and [Arrow C Stream](https://arrow.apache.org/docs/format/CStreamInterface.html)
interfaces.
## Installation
The nanoarrow Python bindings are available from [PyPI](https://pypi.org/) and
[conda-forge](https://conda-forge.org/):
```shell
pip install nanoarrow
conda install nanoarrow -c conda-forge
```
Development versions (based on the `main` branch) are also available:
```shell
pip install --extra-index-url https://pypi.fury.io/arrow-nightlies/ \
--prefer-binary --pre nanoarrow
```
If you can import the namespace, you're good to go!
```python
import nanoarrow as na
```
## Data types, arrays, and array streams
The Arrow C Data and Arrow C Stream interfaces are comprised of three structures: the `ArrowSchema` which represents a data type of an array, the `ArrowArray` which represents the values of an array, and an `ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common `ArrowSchema`. These concepts map to the `nanoarrow.Schema`, `nanoarrow.Array`, and `nanoarrow.ArrayStream` in the Python package.
```python
na.int32()
```
<Schema> int32
```python
na.Array([1, 2, 3], na.int32())
```
nanoarrow.Array<int32>[3]
1
2
3
The `nanoarrow.Array` can accommodate arrays with any number of chunks, reflecting the reality that many array containers (e.g., `pyarrow.ChunkedArray`, `polars.Series`) support this.
```python
chunked = na.Array.from_chunks([[1, 2, 3], [4, 5, 6]], na.int32())
chunked
```
nanoarrow.Array<int32>[6]
1
2
3
4
5
6
Whereas chunks of an `Array` are always fully materialized when the object is constructed, the chunks of an `ArrayStream` have not necessarily been resolved yet.
```python
stream = na.ArrayStream(chunked)
stream
```
nanoarrow.ArrayStream<int32>
```python
with stream:
for chunk in stream:
print(chunk)
```
nanoarrow.Array<int32>[3]
1
2
3
nanoarrow.Array<int32>[3]
4
5
6
The `nanoarrow.ArrayStream` also provides an interface to nanoarrow's [Arrow IPC](https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc) reader:
```python
url = "https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows"
na.ArrayStream.from_url(url)
```
nanoarrow.ArrayStream<non-nullable struct<commit: string, time: timestamp('us', 'UTC'), files: int3...>
These objects implement the [Arrow PyCapsule interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html) for both producing and consuming and are interchangeable with `pyarrow` objects in many cases:
```python
import pyarrow as pa
pa.field(na.int32())
```
pyarrow.Field<: int32>
```python
pa.chunked_array(chunked)
```
<pyarrow.lib.ChunkedArray object at 0x12a49a250>
[
[
1,
2,
3
],
[
4,
5,
6
]
]
```python
pa.array(chunked.chunk(1))
```
<pyarrow.lib.Int32Array object at 0x11b552500>
[
4,
5,
6
]
```python
na.Array(pa.array([10, 11, 12]))
```
nanoarrow.Array<int64>[3]
10
11
12
```python
na.Schema(pa.string())
```
<Schema> string
## Low-level C library bindings
The nanoarrow Python package also provides lower level wrappers around Arrow C interface structures. You can create these using `nanoarrow.c_schema()`, `nanoarrow.c_array()`, and `nanoarrow.c_array_stream()`.
### Schemas
Use `nanoarrow.c_schema()` to convert an object to an `ArrowSchema` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Schema`, `pyarrow.DataType`, and `pyarrow.Field`).
```python
na.c_schema(pa.decimal128(10, 3))
```
<nanoarrow.c_schema.CSchema decimal128(10, 3)>
- format: 'd:10,3'
- name: ''
- flags: 2
- metadata: NULL
- dictionary: NULL
- children[0]:
Using `c_schema()` is a good fit for testing and for ephemeral schema objects that are being passed from one library to another. To extract the fields of a schema in a more convenient form, use `Schema()`:
```python
schema = na.Schema(pa.decimal128(10, 3))
schema.precision, schema.scale
```
(10, 3)
The `CSchema` object cleans up after itself: when the object is deleted, the underlying `ArrowSchema` is released.
### Arrays
You can use `nanoarrow.c_array()` to convert an array-like object to an `ArrowArray`, wrap it as a Python object, and attach a schema that can be used to interpret its contents. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Array`, `pyarrow.RecordBatch`).
```python
na.c_array(["one", "two", "three", None], na.string())
```
<nanoarrow.c_array.CArray string>
- length: 4
- offset: 0
- null_count: 1
- buffers: (4754305168, 4754307808, 4754310464)
- dictionary: NULL
- children[0]:
Using `c_array()` is a good fit for testing and for ephemeral array objects that are being passed from one library to another. For a higher level interface, use `Array()`:
```python
array = na.Array(["one", "two", "three", None], na.string())
array.to_pylist()
```
['one', 'two', 'three', None]
```python
array.buffers
```
(nanoarrow.c_lib.CBufferView(bool[1 b] 11100000),
nanoarrow.c_lib.CBufferView(int32[20 b] 0 3 6 11 11),
nanoarrow.c_lib.CBufferView(string[11 b] b'onetwothree'))
Advanced users can create arrays directly from buffers using `c_array_from_buffers()`:
```python
na.c_array_from_buffers(
na.string(),
2,
[None, na.c_buffer([0, 3, 6], na.int32()), b"abcdef"]
)
```
<nanoarrow.c_array.CArray string>
- length: 2
- offset: 0
- null_count: 0
- buffers: (0, 5002908320, 4999694624)
- dictionary: NULL
- children[0]:
### Array streams
You can use `nanoarrow.c_array_stream()` to wrap an object representing a sequence of `CArray`s with a common `CSchema` to an `ArrowArrayStream` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.RecordBatchReader`, `pyarrow.ChunkedArray`).
```python
pa_batch = pa.record_batch({"col1": [1, 2, 3]})
reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])
array_stream = na.c_array_stream(reader)
array_stream
```
<nanoarrow.c_array_stream.CArrayStream>
- get_schema(): struct<col1: int64>
You can pull the next array from the stream using `.get_next()` or use it like an iterator. The `.get_next()` method will raise `StopIteration` when there are no more arrays in the stream.
```python
for array in array_stream:
print(array)
```
<nanoarrow.c_array.CArray struct<col1: int64>>
- length: 3
- offset: 0
- null_count: 0
- buffers: (0,)
- dictionary: NULL
- children[1]:
'col1': <nanoarrow.c_array.CArray int64>
- length: 3
- offset: 0
- null_count: 0
- buffers: (0, 2642948588352)
- dictionary: NULL
- children[0]:
Use `ArrayStream()` for a higher level interface:
```python
reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])
na.ArrayStream(reader).read_all()
```
nanoarrow.Array<non-nullable struct<col1: int64>>[3]
{'col1': 1}
{'col1': 2}
{'col1': 3}
## Development
Python bindings for nanoarrow are managed with [setuptools](https://setuptools.pypa.io/en/latest/index.html).
This means you can build the project using:
```shell
git clone https://github.com/apache/arrow-nanoarrow.git
cd arrow-nanoarrow/python
pip install -e .
```
Tests use [pytest](https://docs.pytest.org/):
```shell
# Install dependencies
pip install -e ".[test]"
# Run tests
pytest -vvx
```
CMake is currently required to ensure that the vendored copy of nanoarrow in the Python package stays in sync with the nanoarrow sources in the working tree.