Apache Spark

Driver Version: v0.1.0-alpha.2

Tested With: Spark 3.x (3.5-livy, 3.5-thrift), Spark 4.x (4.0-connect, 4.0-thrift, 4.0-thrifthttp)

This driver provides access to Apache Spark (commonly referred to as just “Spark”).

Note

This project is not affiliated with the Apache Software Foundation.

Installation & Quickstart

The driver can be installed with dbc:

dbc install spark --pre

Note

Only prerelease versions of the driver are currently available, so you must use --pre with dbc 0.2.0 or newer to install the driver.

Connecting

To use the driver, provide a connection string as the uri option.

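from adbc_driver_manager import dbapi

# Open a connection through the driver manager and keep a handle to it.
conn = dbapi.connect(
    driver="spark",
    db_kwargs={
        # api=thrift%2Bbinary is the URI-encoded form of thrift+binary.
        "uri": "spark://localhost:10000?auth_type=plain&api=thrift%2Bbinary",
    },
)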

Note: The example above uses Python with the adbc-driver-manager package, but the process is similar for other driver managers. See adbc-quickstarts.

Connection String Format

  • The URI scheme is “spark://”.

  • The host and port should be provided.

  • If not specified, the api defaults to thrift+binary (URI-encoded: thrift%2Bbinary).

  • Options can be specified as query parameters or as driver options.

Note

Reserved characters in URI elements must be URI-encoded. For example, @ becomes %40 and + becomes %2B.
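As a minimal sketch of the encoding rule above, the following Python snippet assembles a connection string with the standard library so that reserved characters in option values are percent-encoded. The helper name build_spark_uri and its parameters are hypothetical and are not part of the driver.

```python
from urllib.parse import quote, urlencode


def build_spark_uri(host: str, port: int, **options: str) -> str:
    """Assemble a spark:// URI, percent-encoding each query parameter.

    Hypothetical helper for illustration; not part of the driver.
    """
    uri = f"spark://{host}:{port}"
    if options:
        # quote_via=quote (with urlencode's default safe="") percent-encodes
        # reserved characters, so "+" becomes %2B and "@" becomes %40.
        uri += "?" + urlencode(options, quote_via=quote)
    return uri


uri = build_spark_uri("localhost", 10000, auth_type="plain", api="thrift+binary")
print(uri)
```

Encoding each value separately, rather than the finished string, keeps the "?", "&", and "=" separators intact.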

Connection Options

These parameters can be specified in the URI as query parameters, or as connection parameters:

spark.api (query parameter: api)

Values: connect, livy, thrift+binary, or thrift+http.

The protocol used to connect to Spark.

Value            Backend
connect          Spark Connect
livy             Apache Livy
thrift+binary    HiveServer2 Thrift (over TCP)
thrift+http      HiveServer2 Thrift (over HTTP)

spark.auth_type (query parameter: auth_type)

Values: aws_sigv4, basic, ldap, kerberos, none, nosasl, plain, or token.

How to authenticate to Spark.

Auth Type    Applicable Backends           Description
aws_sigv4    livy                          Use AWS SDK
basic        livy                          Username/password
ldap         thrift+binary, thrift+http    Not yet implemented
kerberos     thrift+binary, thrift+http    Not yet implemented
none         connect, livy                 No authentication
nosasl       thrift+binary, thrift+http    No authentication
plain        thrift+binary, thrift+http    Username/password
token        connect                       Username/password (token)
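Because each auth type applies only to certain backends, it can help to check the pairing before connecting. The sketch below transcribes the table above into a lookup; the names AUTH_BACKENDS, NOT_IMPLEMENTED, and check_auth are hypothetical, and the driver may report a mismatch differently.

```python
# Applicable backends per auth type, transcribed from the table above.
AUTH_BACKENDS = {
    "aws_sigv4": {"livy"},
    "basic": {"livy"},
    "ldap": {"thrift+binary", "thrift+http"},
    "kerberos": {"thrift+binary", "thrift+http"},
    "none": {"connect", "livy"},
    "nosasl": {"thrift+binary", "thrift+http"},
    "plain": {"thrift+binary", "thrift+http"},
    "token": {"connect"},
}

# Auth types the table lists as not yet implemented.
NOT_IMPLEMENTED = {"ldap", "kerberos"}


def check_auth(api: str, auth_type: str) -> None:
    """Raise ValueError if auth_type cannot be used with the given api.

    Hypothetical client-side check; not part of the driver.
    """
    if auth_type not in AUTH_BACKENDS:
        raise ValueError(f"unknown auth_type: {auth_type!r}")
    if auth_type in NOT_IMPLEMENTED:
        raise ValueError(f"auth_type {auth_type!r} is not yet implemented")
    if api not in AUTH_BACKENDS[auth_type]:
        allowed = ", ".join(sorted(AUTH_BACKENDS[auth_type]))
        raise ValueError(f"auth_type {auth_type!r} applies only to: {allowed}")


check_auth("thrift+binary", "plain")  # a valid pairing, so nothing is raised
```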

spark.livy.session_kind (query parameter: livy.session_kind)

Values: sql, spark, or pyspark.

For the Livy backend, what kind of session to create.

Warning

Currently only sql is tested/supported.

spark.tls (query parameter: tls)

Type: boolean. Default: false.

Whether to use TLS for connecting. Only applies to connect, livy, and thrift+http.

spark.validate_server_certificate (query parameter: validateservercertificate)

Type: boolean. Default: true.

Whether to validate the server’s TLS certificate. Should only be disabled for development/testing.
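Options may also be passed as driver options instead of query parameters. The sketch below collects them into a dictionary keyed by the option names documented above. It makes two assumptions that the text does not confirm: that these options are accepted alongside uri in db_kwargs, and that booleans are spelled "true" and "false". The helper spark_db_kwargs is hypothetical.

```python
from typing import Dict, Optional


def spark_db_kwargs(
    host: str,
    port: int,
    *,
    api: str = "thrift+binary",
    auth_type: Optional[str] = None,
    tls: bool = False,
    validate_server_certificate: bool = True,
) -> Dict[str, str]:
    """Collect connection settings as driver options (hypothetical helper)."""
    kwargs = {
        "uri": f"spark://{host}:{port}",
        "spark.api": api,
        # Assumption: boolean options are written as "true" / "false".
        "spark.tls": "true" if tls else "false",
        "spark.validate_server_certificate": (
            "true" if validate_server_certificate else "false"
        ),
    }
    if auth_type is not None:
        kwargs["spark.auth_type"] = auth_type
    return kwargs


options = spark_db_kwargs("localhost", 10000, api="connect", auth_type="none", tls=True)
# Assumed usage: dbapi.connect(driver="spark", db_kwargs=options)
```

Values passed this way are plain strings, so thrift+binary needs no URI-encoding here.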

Limitations

Different backends and cluster configurations have limitations; some limitations related to data type support are also noted further below.

HiveServer2/Thrift Protocol

  • In Spark 3.x, binary data that does not happen to be valid UTF-8 will be corrupted.

  • The client cannot tell whether a timestamp carries a time zone or not; all timestamps are assumed to be in UTC as a result.
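Since every timestamp arrives labelled as UTC over this protocol, an application that knows a given column holds zone-less values can drop the label after fetching. The following is a minimal sketch using only the standard library; to_wall_clock is a hypothetical helper, and it assumes the caller already knows which columns are zone-less.

```python
from datetime import datetime, timezone


def to_wall_clock(value: datetime) -> datetime:
    """Drop the UTC label from a timestamp, keeping the clock reading.

    Hypothetical helper; apply it only to columns known to be zone-less.
    """
    if value.tzinfo is None:
        return value  # already naive, nothing to strip
    # Normalize to UTC first so the clock reading matches what was received.
    return value.astimezone(timezone.utc).replace(tzinfo=None)


received = datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc)
naive = to_wall_clock(received)
print(naive.isoformat())
```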

Apache Livy

  • Only the first 1000 rows of a result set can be fetched. This can be tuned by configuring Spark with spark.sql.repl.eagerEval.maxNumRows.

  • In general, we have found that performance is worse than with Spark Connect or HiveServer2.

  • Connecting to an Amazon EMR (Serverless) cluster via Livy requires setting the emr-serverless.session.executionRoleArn session config option to an appropriate role ARN.
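Because of the row cap noted above, a Livy result that contains exactly the cap's worth of rows may be incomplete. A client-side heuristic along these lines can flag that case; the names LIVY_ROW_CAP and may_be_truncated are hypothetical, and the value 1000 is the limit stated above, which changes if spark.sql.repl.eagerEval.maxNumRows is reconfigured.

```python
LIVY_ROW_CAP = 1000  # limit stated above; adjust if the cluster is reconfigured


def may_be_truncated(row_count: int, cap: int = LIVY_ROW_CAP) -> bool:
    """Return True when a result is large enough that rows may be missing.

    Hypothetical heuristic: a result with exactly `cap` rows is
    indistinguishable from one that was cut off at the cap.
    """
    return row_count >= cap


if may_be_truncated(1000):
    print("result may be truncated by the Livy row cap")
```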

Spark Connect

  • In our testing, connecting to an Amazon EMR (Serverless) cluster via Spark Connect does not work; we believe it is an incompatibility in the Spark Connect client library and plan to address this in a future version of the driver.

Amazon EMR (Serverless)

  • Bulk ingest with an AWS Glue catalog is not currently supported as there is no way to specify the LOCATION clause.

  • Amazon EMR is not currently enabled in our automated integration testing.

Feature & Type Support

Feature                                        Spark 3.x    Spark 4.x
Bulk Ingestion
    Create
    Append
    Create/Append
    Replace
    Temporary Table
    Target Catalog
    Target Schema
    Non-nullable fields are marked NOT NULL
Catalog (GetObjects)
    depth=catalogs
    depth=db_schemas
    depth=tables
    depth=columns (all)
Get Parameter Schema
Get Table Schema
Prepared Statements
Transactions

Types

Database to Arrow

Database Type    Spark 3.x                                                      Spark 4.x
BIGINT           int64                                                          int64
BINARY           binary ⚠️ [1]                                                   binary ⚠️ [1]
BOOLEAN          bool                                                           bool
DATE             date32[day]                                                    date32[day]
DOUBLE           double                                                         double
INT              int32                                                          int32
NUMERIC          decimal128                                                     decimal128
REAL             float                                                          float
SMALLINT         int16                                                          int16
TIMESTAMP        timestamp[us] (with time zone)                                 timestamp[us] (with time zone)
TIMESTAMP_NTZ    timestamp[us] (with time zone) ⚠️ [3] [4] [5], ❌ [2] [3] [4]    timestamp[us] ⚠️ [3], timestamp[us] (with time zone) ⚠️ [3]
VARCHAR          string                                                         string

Arrow to Database

Arrow Type Spark 3.x Type Spark 4.x Type
Bind Ingest Bind Ingest

binary

VARBINARY

VARBINARY

binary_view

(NA/not tested)

(NA/not tested)

bool

BOOLEAN

BOOLEAN

date32[day]

DATE

DATE

decimal128

NUMERIC

NUMERIC

double

DOUBLE PRECISION

DOUBLE PRECISION

fixed_size_binary

float

REAL

REAL

halffloat

(NA/not tested)

(NA/not tested)

int16

SMALLINT

SMALLINT

int32

INT

INT

int64

BIGINT

BIGINT

large_binary

VARBINARY

VARBINARY

large_string

VARCHAR

VARCHAR

string

VARCHAR

VARCHAR

string_view

(NA/not tested)

(NA/not tested)

time32[ms]

(NA/not tested)

(NA/not tested)

time32[s]

(NA/not tested)

(NA/not tested)

time64[ns]

(NA/not tested)

(NA/not tested)

time64[us]

(NA/not tested)

(NA/not tested)

timestamp[ms]

TIMESTAMP(3)

timestamp[ms] (with time zone)

TIMESTAMP(3) WITH TIME ZONE

timestamp[ns]

timestamp[ns] (with time zone)

timestamp[s]

TIMESTAMP(0)

timestamp[s] (with time zone)

TIMESTAMP(0) WITH TIME ZONE

timestamp[us]

TIMESTAMP(6)

timestamp[us] (with time zone)

TIMESTAMP(6) WITH TIME ZONE

Compatibility

This driver was tested on:

  • Apache Spark 3.5.8 5a48a37b2dbd7b51e3640cd1d947438459556cc6 (Apache Livy)

  • Apache Spark 3.5.8 (HiveServer2+binary)

  • Apache Spark 4.0.0 fa33ea000a0bda9e5a3fa1af98e8e85b8cc5e4d4 (Spark Connect)

  • Apache Spark 4.0.0 (HiveServer2+binary)

  • Apache Spark 4.0.0 (HiveServer2+HTTP)