Why Gemfury? Push, build, and install  RubyGems npm packages Python packages Maven artifacts PHP packages Go Modules Bower components Debian packages RPM packages NuGet packages

arrow-nightlies / nanoarrow   python

Repository URL to install this package:

Version: 0.7.0.dev132 

  README.md

nanoarrow for Python

The nanoarrow Python package provides bindings to the nanoarrow C library. Like the nanoarrow C library, it provides tools to facilitate the use of the Arrow C Data and Arrow C Stream interfaces.

Installation

The nanoarrow Python bindings are available from PyPI and conda-forge:

pip install nanoarrow
conda install nanoarrow -c conda-forge

Development versions (based on the main branch) are also available:

pip install --extra-index-url https://pypi.fury.io/arrow-nightlies/ \
    --prefer-binary --pre nanoarrow

If you can import the namespace, you're good to go!

import nanoarrow as na

Data types, arrays, and array streams

The Arrow C Data and Arrow C Stream interfaces are comprised of three structures: the ArrowSchema which represents a data type of an array, the ArrowArray which represents the values of an array, and an ArrowArrayStream, which represents zero or more ArrowArrays with a common ArrowSchema. These concepts map to the nanoarrow.Schema, nanoarrow.Array, and nanoarrow.ArrayStream in the Python package.

na.int32()
<Schema> int32
na.Array([1, 2, 3], na.int32())
nanoarrow.Array<int32>[3]
1
2
3

The nanoarrow.Array can accommodate arrays with any number of chunks, reflecting the reality that many array containers (e.g., pyarrow.ChunkedArray, polars.Series) support this.

chunked = na.Array.from_chunks([[1, 2, 3], [4, 5, 6]], na.int32())
chunked
nanoarrow.Array<int32>[6]
1
2
3
4
5
6

Whereas chunks of an Array are always fully materialized when the object is constructed, the chunks of an ArrayStream have not necessarily been resolved yet.

stream = na.ArrayStream(chunked)
stream
nanoarrow.ArrayStream<int32>
with stream:
    for chunk in stream:
        print(chunk)
nanoarrow.Array<int32>[3]
1
2
3
nanoarrow.Array<int32>[3]
4
5
6

The nanoarrow.ArrayStream also provides an interface to nanoarrow's Arrow IPC reader:

url = "https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows"
na.ArrayStream.from_url(url)
nanoarrow.ArrayStream<non-nullable struct<commit: string, time: timestamp('us', 'UTC'), files: int3...>

These objects implement the Arrow PyCapsule interface for both producing and consuming and are interchangeable with pyarrow objects in many cases:

import pyarrow as pa

pa.field(na.int32())
pyarrow.Field<: int32>
pa.chunked_array(chunked)
<pyarrow.lib.ChunkedArray object at 0x12a49a250>
[
  [
    1,
    2,
    3
  ],
  [
    4,
    5,
    6
  ]
]
pa.array(chunked.chunk(1))
<pyarrow.lib.Int32Array object at 0x11b552500>
[
  4,
  5,
  6
]
na.Array(pa.array([10, 11, 12]))
nanoarrow.Array<int64>[3]
10
11
12
na.Schema(pa.string())
<Schema> string

Low-level C library bindings

The nanoarrow Python package also provides lower level wrappers around Arrow C interface structures. You can create these using nanoarrow.c_schema(), nanoarrow.c_array(), and nanoarrow.c_array_stream().

Schemas

Use nanoarrow.c_schema() to convert an object to an ArrowSchema and wrap it as a Python object. This works for any object implementing the Arrow PyCapsule Interface (e.g., pyarrow.Schema, pyarrow.DataType, and pyarrow.Field).

na.c_schema(pa.decimal128(10, 3))
<nanoarrow.c_schema.CSchema decimal128(10, 3)>
- format: 'd:10,3'
- name: ''
- flags: 2
- metadata: NULL
- dictionary: NULL
- children[0]:

Using c_schema() is a good fit for testing and for ephemeral schema objects that are being passed from one library to another. To extract the fields of a schema in a more convenient form, use Schema():

schema = na.Schema(pa.decimal128(10, 3))
schema.precision, schema.scale
(10, 3)

The CSchema object cleans up after itself: when the object is deleted, the underlying ArrowSchema is released.

Arrays

You can use nanoarrow.c_array() to convert an array-like object to an ArrowArray, wrap it as a Python object, and attach a schema that can be used to interpret its contents. This works for any object implementing the Arrow PyCapsule Interface (e.g., pyarrow.Array, pyarrow.RecordBatch).

na.c_array(["one", "two", "three", None], na.string())
<nanoarrow.c_array.CArray string>
- length: 4
- offset: 0
- null_count: 1
- buffers: (4754305168, 4754307808, 4754310464)
- dictionary: NULL
- children[0]:

Using c_array() is a good fit for testing and for ephemeral array objects that are being passed from one library to another. For a higher level interface, use Array():

array = na.Array(["one", "two", "three", None], na.string())
array.to_pylist()
['one', 'two', 'three', None]
array.buffers
(nanoarrow.c_lib.CBufferView(bool[1 b] 11100000),
 nanoarrow.c_lib.CBufferView(int32[20 b] 0 3 6 11 11),
 nanoarrow.c_lib.CBufferView(string[11 b] b'onetwothree'))

Advanced users can create arrays directly from buffers using c_array_from_buffers():

na.c_array_from_buffers(
    na.string(),
    2,
    [None, na.c_buffer([0, 3, 6], na.int32()), b"abcdef"]
)
<nanoarrow.c_array.CArray string>
- length: 2
- offset: 0
- null_count: 0
- buffers: (0, 5002908320, 4999694624)
- dictionary: NULL
- children[0]:

Array streams

You can use nanoarrow.c_array_stream() to wrap an object representing a sequence of CArrays with a common CSchema to an ArrowArrayStream and wrap it as a Python object. This works for any object implementing the Arrow PyCapsule Interface (e.g., pyarrow.RecordBatchReader, pyarrow.ChunkedArray).

pa_batch = pa.record_batch({"col1": [1, 2, 3]})
reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])
array_stream = na.c_array_stream(reader)
array_stream
<nanoarrow.c_array_stream.CArrayStream>
- get_schema(): struct<col1: int64>

You can pull the next array from the stream using .get_next() or use it like an iterator. The .get_next() method will raise StopIteration when there are no more arrays in the stream.

for array in array_stream:
    print(array)
<nanoarrow.c_array.CArray struct<col1: int64>>
- length: 3
- offset: 0
- null_count: 0
- buffers: (0,)
- dictionary: NULL
- children[1]:
  'col1': <nanoarrow.c_array.CArray int64>
    - length: 3
    - offset: 0
    - null_count: 0
    - buffers: (0, 2642948588352)
    - dictionary: NULL
    - children[0]:

Use ArrayStream() for a higher level interface:

reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])
na.ArrayStream(reader).read_all()
nanoarrow.Array<non-nullable struct<col1: int64>>[3]
{'col1': 1}
{'col1': 2}
{'col1': 3}

Development

Python bindings for nanoarrow are managed with setuptools. This means you can build the project using:

git clone https://github.com/apache/arrow-nanoarrow.git
cd arrow-nanoarrow/python
pip install -e .

Tests use pytest:

# Install dependencies
pip install -e ".[test]"

# Run tests
pytest -vvx

CMake is currently required to ensure that the vendored copy of nanoarrow in the Python package stays in sync with the nanoarrow sources in the working tree.