Gemfury

arrow-nightlies / nanoarrow python

Repository URL to install this package:
Details
src
subprojects
tests
.coveragerc
.gitignore
MANIFEST.in
README.ipynb
README.md
bootstrap.py
generate_dist.py
generate_type_stubs.sh
meson.build
pyproject.toml
PKG-INFO
README.ipynb
{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!---\n",
    "  Licensed to the Apache Software Foundation (ASF) under one\n",
    "  or more contributor license agreements.  See the NOTICE file\n",
    "  distributed with this work for additional information\n",
    "  regarding copyright ownership.  The ASF licenses this file\n",
    "  to you under the Apache License, Version 2.0 (the\n",
    "  \"License\"); you may not use this file except in compliance\n",
    "  with the License.  You may obtain a copy of the License at\n",
    "\n",
    "    http://www.apache.org/licenses/LICENSE-2.0\n",
    "\n",
    "  Unless required by applicable law or agreed to in writing,\n",
    "  software distributed under the License is distributed on an\n",
    "  \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
    "  KIND, either express or implied.  See the License for the\n",
    "  specific language governing permissions and limitations\n",
    "  under the License.\n",
    "-->\n",
    "\n",
    "<!-- Render with jupyter nbconvert --to markdown README.ipynb -->\n",
    "\n",
    "# nanoarrow for Python\n",
    "\n",
    "The nanoarrow Python package provides bindings to the nanoarrow C library. Like\n",
    "the nanoarrow C library, it provides tools to facilitate the use of the\n",
    "[Arrow C Data](https://arrow.apache.org/docs/format/CDataInterface.html) \n",
    "and [Arrow C Stream](https://arrow.apache.org/docs/format/CStreamInterface.html) \n",
    "interfaces.\n",
    "\n",
    "## Installation\n",
    "\n",
    "The nanoarrow Python bindings are available from [PyPI](https://pypi.org/) and\n",
    "[conda-forge](https://conda-forge.org/):\n",
    "\n",
    "```shell\n",
    "pip install nanoarrow\n",
    "conda install nanoarrow -c conda-forge\n",
    "```\n",
    "\n",
    "Development versions (based on the `main` branch) are also available:\n",
    "\n",
    "```shell\n",
    "pip install --extra-index-url https://pypi.fury.io/arrow-nightlies/ \\\n",
    "    --prefer-binary --pre nanoarrow\n",
    "```\n",
    "\n",
    "If you can import the namespace, you're good to go!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nanoarrow as na"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data types, arrays, and array streams\n",
    "\n",
    "The Arrow C Data and Arrow C Stream interfaces are comprised of three structures: the `ArrowSchema` which represents a data type of an array, the `ArrowArray` which represents the values of an array, and an `ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common `ArrowSchema`. These concepts map to the `nanoarrow.Schema`, `nanoarrow.Array`, and `nanoarrow.ArrayStream` in the Python package."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Schema> int32"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "na.int32()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "nanoarrow.Array<int32>[3]\n",
       "1\n",
       "2\n",
       "3"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "na.Array([1, 2, 3], na.int32())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `nanoarrow.Array` can accommodate arrays with any number of chunks, reflecting the reality that many array containers (e.g., `pyarrow.ChunkedArray`, `polars.Series`) support this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "nanoarrow.Array<int32>[6]\n",
       "1\n",
       "2\n",
       "3\n",
       "4\n",
       "5\n",
       "6"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chunked = na.Array.from_chunks([[1, 2, 3], [4, 5, 6]], na.int32())\n",
    "chunked"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Whereas chunks of an `Array` are always fully materialized when the object is constructed, the chunks of an `ArrayStream` have not necessarily been resolved yet."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "nanoarrow.ArrayStream<int32>"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "stream = na.ArrayStream(chunked)\n",
    "stream"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "nanoarrow.Array<int32>[3]\n",
      "1\n",
      "2\n",
      "3\n",
      "nanoarrow.Array<int32>[3]\n",
      "4\n",
      "5\n",
      "6\n"
     ]
    }
   ],
   "source": [
    "with stream:\n",
    "    for chunk in stream:\n",
    "        print(chunk)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `nanoarrow.ArrayStream` also provides an interface to nanoarrow's [Arrow IPC](https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc) reader:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "nanoarrow.ArrayStream<non-nullable struct<commit: string, time: timestamp('us', 'UTC'), files: int3...>"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "url = \"https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows\"\n",
    "na.ArrayStream.from_url(url)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These objects implement the [Arrow PyCapsule interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html) for both producing and consuming and are interchangeable with `pyarrow` objects in many cases:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pyarrow.Field<: int32>"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pyarrow as pa\n",
    "\n",
    "pa.field(na.int32())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<pyarrow.lib.ChunkedArray object at 0x12a49a250>\n",
       "[\n",
       "  [\n",
       "    1,\n",
       "    2,\n",
       "    3\n",
       "  ],\n",
       "  [\n",
       "    4,\n",
       "    5,\n",
       "    6\n",
       "  ]\n",
       "]"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pa.chunked_array(chunked)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<pyarrow.lib.Int32Array object at 0x11b552500>\n",
       "[\n",
       "  4,\n",
       "  5,\n",
       "  6\n",
       "]"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pa.array(chunked.chunk(1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "nanoarrow.Array<int64>[3]\n",
       "10\n",
       "11\n",
       "12"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "na.Array(pa.array([10, 11, 12]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Schema> string"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "na.Schema(pa.string())"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Low-level C library bindings\n",
    "\n",
    "The nanoarrow Python package also provides lower level wrappers around Arrow C interface structures. You can create these using `nanoarrow.c_schema()`, `nanoarrow.c_array()`, and `nanoarrow.c_array_stream()`.\n",
    "\n",
    "### Schemas\n",
    "\n",
    "Use `nanoarrow.c_schema()` to convert an object to an `ArrowSchema` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Schema`, `pyarrow.DataType`, and `pyarrow.Field`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<nanoarrow.c_schema.CSchema decimal128(10, 3)>\n",
       "- format: 'd:10,3'\n",
       "- name: ''\n",
       "- flags: 2\n",
       "- metadata: NULL\n",
       "- dictionary: NULL\n",
       "- children[0]:"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "na.c_schema(pa.decimal128(10, 3))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using `c_schema()` is a good fit for testing and for ephemeral schema objects that are being passed from one library to another. To extract the fields of a schema in a more convenient form, use `Schema()`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(10, 3)"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "schema = na.Schema(pa.decimal128(10, 3))\n",
    "schema.precision, schema.scale"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `CSchema` object cleans up after itself: when the object is deleted, the underlying `ArrowSchema` is released."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Arrays\n",
    "\n",
    "You can use `nanoarrow.c_array()` to convert an array-like object to an `ArrowArray`, wrap it as a Python object, and attach a schema that can be used to interpret its contents. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Array`, `pyarrow.RecordBatch`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<nanoarrow.c_array.CArray string>\n",
       "- length: 4\n",
       "- offset: 0\n",
       "- null_count: 1\n",
       "- buffers: (4754305168, 4754307808, 4754310464)\n",
       "- dictionary: NULL\n",
       "- children[0]:"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "na.c_array([\"one\", \"two\", \"three\", None], na.string())"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using `c_array()` is a good fit for testing and for ephemeral array objects that are being passed from one library to another. For a higher level interface, use `Array()`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['one', 'two', 'three', None]"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "array = na.Array([\"one\", \"two\", \"three\", None], na.string())\n",
    "array.to_pylist()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(nanoarrow.c_lib.CBufferView(bool[1 b] 11100000),\n",
       " nanoarrow.c_lib.CBufferView(int32[20 b] 0 3 6 11 11),\n",
       " nanoarrow.c_lib.CBufferView(string[11 b] b'onetwothree'))"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "array.buffers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Advanced users can create arrays directly from buffers using `c_array_from_buffers()`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<nanoarrow.c_array.CArray string>\n",
       "- length: 2\n",
       "- offset: 0\n",
       "- null_count: 0\n",
       "- buffers: (0, 5002908320, 4999694624)\n",
       "- dictionary: NULL\n",
       "- children[0]:"
      ]
     },
     "execution_count": 63,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "na.c_array_from_buffers(\n",
    "    na.string(),\n",
    "    2,\n",
    "    [None, na.c_buffer([0, 3, 6], na.int32()), b\"abcdef\"]\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Array streams\n",
    "\n",
    "You can use `nanoarrow.c_array_stream()` to wrap an object representing a sequence of `CArray`s with a common `CSchema` to an `ArrowArrayStream` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.RecordBatchReader`, `pyarrow.ChunkedArray`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<nanoarrow.c_array_stream.CArrayStream>\n",
       "- get_schema(): struct<col1: int64>"
      ]
     },
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pa_batch = pa.record_batch({\"col1\": [1, 2, 3]})\n",
    "reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])\n",
    "array_stream = na.c_array_stream(reader)\n",
    "array_stream"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can pull the next array from the stream using `.get_next()` or use it like an iterator. The `.get_next()` method will raise `StopIteration` when there are no more arrays in the stream."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<nanoarrow.c_array.CArray struct<col1: int64>>\n",
      "- length: 3\n",
      "- offset: 0\n",
      "- null_count: 0\n",
      "- buffers: (0,)\n",
      "- dictionary: NULL\n",
      "- children[1]:\n",
      "  'col1': <nanoarrow.c_array.CArray int64>\n",
      "    - length: 3\n",
      "    - offset: 0\n",
      "    - null_count: 0\n",
      "    - buffers: (0, 2642948588352)\n",
      "    - dictionary: NULL\n",
      "    - children[0]:\n"
     ]
    }
   ],
   "source": [
    "for array in array_stream:\n",
    "    print(array)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use `ArrayStream()` for a higher level interface:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "nanoarrow.Array<non-nullable struct<col1: int64>>[3]\n",
       "{'col1': 1}\n",
       "{'col1': 2}\n",
       "{'col1': 3}"
      ]
     },
     "execution_count": 66,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])\n",
    "na.ArrayStream(reader).read_all()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Development\n",
    "\n",
    "Python bindings for nanoarrow are managed with [setuptools](https://setuptools.pypa.io/en/latest/index.html).\n",
    "This means you can build the project using:\n",
    "\n",
    "```shell\n",
    "git clone https://github.com/apache/arrow-nanoarrow.git\n",
    "cd arrow-nanoarrow/python\n",
    "# Build dependencies:\n",
    "# pip install meson meson-python cython\n",
    "pip install -e . --no-build-isolation\n",
    "```\n",
    "\n",
    "Tests use [pytest](https://docs.pytest.org/):\n",
    "\n",
    "```shell\n",
    "# Install dependencies\n",
    "pip install -e \".[test]\"\n",
    "\n",
    "# Run tests\n",
    "pytest -vvx\n",
    "```\n",
    "\n",
    "CMake is currently required to ensure that the vendored copy of nanoarrow in the Python package stays in sync with the nanoarrow sources in the working tree."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
arrow-nightlies / nanoarrow python

Products

About

Resources

Contact Gemfury