Apache Spark¶

Driver Version: v0.1.0-alpha.3
Release Date: 2026-07-10
Tested With: Spark 3.x (3.5-livy, 3.5-thrift); Spark 4.x (4.0-connect, 4.0-thrift, 4.0-thrifthttp)

This driver provides access to Apache Spark (commonly referred to as just “Spark”).

Note

This project is not affiliated with the Apache Software Foundation.

Installation & Quickstart¶

The driver can be installed with dbc:

dbc install spark --pre

Note

Only prerelease versions of the driver are currently available, so you must use --pre with dbc 0.2.0 or newer to install the driver.

Connecting¶

To use the driver, provide a connection string as the uri option.

from adbc_driver_manager import dbapi

dbapi.connect(
  driver="spark",
  db_kwargs={
      "uri": "spark://localhost:10000?auth_type=plain&api=thrift%2Bbinary"
  }
)

Note: The example above is for Python using the adbc-driver-manager package, but the process is similar for other driver managers. See adbc-quickstarts.

Connection String Format¶

The URI scheme is “spark://”.
The host and port should be provided.
If not specified, the api defaults to thrift+binary (URI-encoded: thrift%2Bbinary).
Options can be specified as query parameters or as driver options.

Note

Reserved characters in URI elements must be URI-encoded. For example, @ becomes %40 and + becomes %2B.
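The encoding rules above can be sketched with the Python standard library. The helper name build_spark_uri is hypothetical and not part of the driver; passing credentials in the userinfo part of the URI is an assumption based on the EMR example later on this page.

```python
from urllib.parse import quote, urlencode


def build_spark_uri(host, port, user=None, password=None, **options):
    """Assemble a spark:// URI, percent-encoding reserved characters.

    Hypothetical helper for illustration; not part of the driver.
    """
    userinfo = ""
    if user is not None or password is not None:
        # safe="" forces characters such as @, : and + to be encoded.
        userinfo = quote(user or "", safe="")
        if password is not None:
            userinfo += ":" + quote(password, safe="")
        userinfo += "@"
    # quote_via=quote encodes "+" as %2B instead of treating it as a space.
    query = urlencode(options, quote_via=quote)
    uri = f"spark://{userinfo}{host}:{port}"
    return f"{uri}?{query}" if query else uri


print(build_spark_uri("localhost", 10000, auth_type="plain", api="thrift+binary"))
```

Option names that contain a dot, such as livy.session_id, cannot be written as Python keyword arguments directly; pass them with dictionary unpacking, for example `**{"livy.session_id": "3"}`.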

Connection Options¶

These parameters can be specified in the URI as query parameters, or as connection parameters:

spark.api (query parameter: api)

Values: connect, livy, thrift+binary, or thrift+http.

The protocol used to connect to Spark.

Value	Backend
`connect`	Spark Connect
`livy`	Apache Livy
`thrift+binary`	HiveServer2 Thrift (over TCP)
`thrift+http`	HiveServer2 Thrift (over HTTP)

spark.auth_type (query parameter: auth_type)

Values: aws_sigv4, basic, ldap, kerberos, none, nosasl, plain, or token.

How to authenticate to Spark.

Auth Type	Applicable Backends	Description
`aws_sigv4`	`livy`	Use AWS SDK
`basic`	`livy`	Username/password
`ldap`	`thrift+binary`, `thrift+http`	Not yet implemented
`kerberos`	`thrift+binary`, `thrift+http`	Not yet implemented
`none`	`connect`, `livy`	No authentication
`nosasl`	`thrift+binary`, `thrift+http`	No authentication
`plain`	`thrift+binary`, `thrift+http`	Username/password
`token`	`connect`	Username/password (token)
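To catch a mismatched api and auth_type before connecting, the table above can be transcribed into a small lookup. This is only a client-side sketch; the names AUTH_BACKENDS and check_auth are hypothetical, and the driver performs its own validation.

```python
# Transcribed from the table above: auth type -> backends it applies to.
AUTH_BACKENDS = {
    "aws_sigv4": {"livy"},
    "basic": {"livy"},
    "ldap": {"thrift+binary", "thrift+http"},
    "kerberos": {"thrift+binary", "thrift+http"},
    "none": {"connect", "livy"},
    "nosasl": {"thrift+binary", "thrift+http"},
    "plain": {"thrift+binary", "thrift+http"},
    "token": {"connect"},
}

# Listed in the table as "Not yet implemented".
NOT_IMPLEMENTED = {"ldap", "kerberos"}


def check_auth(api, auth_type):
    """Raise ValueError if the table says this pairing cannot be used."""
    backends = AUTH_BACKENDS.get(auth_type)
    if backends is None:
        raise ValueError(f"unknown auth_type: {auth_type!r}")
    if api not in backends:
        raise ValueError(f"auth_type {auth_type!r} does not apply to api {api!r}")
    if auth_type in NOT_IMPLEMENTED:
        raise ValueError(f"auth_type {auth_type!r} is not yet implemented")


check_auth("thrift+binary", "plain")  # a valid pairing; returns without raising
```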

spark.livy.session_kind (query parameter: livy.session_kind)

Values: sql, spark, or pyspark.

For the Livy backend, what kind of session to create.

Warning

Currently only sql is tested/supported.

spark.connect.session_id (query parameter: connect.session_id)

Type: string.

For the Spark Connect backend, reuse this client session.

spark.connect.release_session (query parameter: connect.release_session)

Type: boolean. Default: true.

For the Spark Connect backend, whether to call ReleaseSession when the connection is closed. Set to false to keep a session available after closing the connection.

spark.livy.session_id (query parameter: livy.session_id)

Type: string.

For the Livy backend, reuse this client session.

spark.livy.release_session (query parameter: livy.release_session)

Type: boolean. Default: true.

For the Livy backend, whether to delete the session when the connection is closed. Set to false to keep a session available after closing the connection.

spark.tls (query parameter: tls)

Type: boolean. Default: false.

Whether to use TLS for connecting. Only applies to connect, livy, and thrift+http.

spark.validate_server_certificate (query parameter: validateservercertificate)

Type: boolean. Default: true.

Whether to validate the server’s TLS certificate. Should only be disabled for development/testing.
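The same settings can be supplied as driver options rather than query parameters. The sketch below assumes that the spark.* keys are passed through db_kwargs alongside uri and that boolean options are written as the strings "true" and "false"; neither detail is confirmed on this page. The function names are hypothetical.

```python
def thrift_http_options(host, port, tls=True):
    """Build db_kwargs using driver-option keys instead of query parameters."""
    return {
        "uri": f"spark://{host}:{port}",
        "spark.api": "thrift+http",
        "spark.auth_type": "nosasl",
        # Assumption: booleans are passed as the strings "true"/"false".
        "spark.tls": "true" if tls else "false",
        # Keep certificate validation on outside development/testing.
        "spark.validate_server_certificate": "true",
    }


def open_connection(host, port):
    # Imported lazily so the helper above works without the driver manager.
    from adbc_driver_manager import dbapi

    return dbapi.connect(driver="spark", db_kwargs=thrift_http_options(host, port))
```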

Limitations¶

Different backends and cluster configurations have limitations; some limitations related to data type support are also noted further below.

HiveServer2/Thrift Protocol¶

In Spark 3.x, binary data that does not happen to be valid UTF-8 will be corrupted.
The client cannot tell whether a timestamp carries a time zone or not; all timestamps are assumed to be in UTC as a result.

Apache Livy¶

By default, only the first 1000 rows of a result set can be fetched. The limit can be raised by configuring Spark with spark.sql.repl.eagerEval.maxNumRows.
In general, we have found that performance is worse than with Spark Connect or HiveServer2.
Connecting to an Amazon EMR (Serverless) cluster via Livy requires setting the emr-serverless.session.executionRoleArn session config option to an appropriate role ARN. This can be set via the ADBC option spark.opt.emr-serverless.session.executionRoleArn.
By default, the driver will attempt to start a new Livy session, which tends to take a few minutes, especially when using Amazon EMR. To amortize this time across multiple connections, read the option spark.livy.session_id from an open connection to get the session ID, then provide that ID via the same option when opening later connections, which skips creating a new session.
By default, the driver will close the session when the connection is closed. Setting spark.livy.release_session to false on connection will avoid this, making it easier to reuse the session.
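The session-reuse flow described above might look roughly like the following. The option names come from this page; everything else is an assumption, in particular that the options go in db_kwargs, that booleans are written as strings, and that the session ID is readable from the database handle rather than the connection handle. The function names are hypothetical.

```python
def livy_kwargs(uri, session_id=None, keep_session=True, execution_role_arn=None):
    """Build db_kwargs for a Livy connection that can share one session."""
    kwargs = {"uri": uri, "spark.api": "livy"}
    if keep_session:
        # Leave the Livy session running after this connection closes.
        kwargs["spark.livy.release_session"] = "false"
    if session_id is not None:
        # Attach to an existing session instead of starting a new one.
        kwargs["spark.livy.session_id"] = str(session_id)
    if execution_role_arn is not None:
        # Session config option required for Amazon EMR (Serverless).
        kwargs["spark.opt.emr-serverless.session.executionRoleArn"] = execution_role_arn
    return kwargs


def open_and_get_session_id(uri):
    # Imported lazily so livy_kwargs can be used without the driver manager.
    from adbc_driver_manager import dbapi

    conn = dbapi.connect(driver="spark", db_kwargs=livy_kwargs(uri))
    # Assumption: the ID is exposed on the database handle; it may instead
    # need to be read via conn.adbc_connection.get_option(...).
    session_id = conn.adbc_database.get_option("spark.livy.session_id")
    return conn, session_id
```

Later connections would then pass the saved ID, for example `livy_kwargs(uri, session_id=saved_id)`.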

Spark Connect¶

To connect to Amazon EMR, the connection URI should look like this:

spark://:<AUTH TOKEN>@<SESSION ID>.s.emr-serverless-services.<REGION>.amazonaws.com:443?tls=true&auth_type=token&api=connect

The full hostname can be obtained from the AWS API, e.g. via the CLI:

aws emr-serverless get-session-endpoint --application-id <APPLICATION ID> --session-id <SESSION ID>

This command will also give you the auth token.
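Because the auth token goes in the password position of the URI, any reserved characters in it need to be percent-encoded. A minimal sketch, assuming the hostname and token have already been obtained from the command above (emr_connect_uri is a hypothetical helper name):

```python
from urllib.parse import quote


def emr_connect_uri(auth_token, hostname):
    """Build the Spark Connect URI for an Amazon EMR session endpoint."""
    # safe="" encodes every reserved character, including "/" and "+".
    token = quote(auth_token, safe="")
    # Empty username, token as the password, TLS on port 443.
    return f"spark://:{token}@{hostname}:443?tls=true&auth_type=token&api=connect"
```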

Amazon EMR (Serverless)¶

Amazon EMR is not currently enabled in our automated integration testing.
To use bulk ingest, set spark.ingest.location to a path on S3 where the table data will be stored.

See also the Apache Livy and Spark Connect sections above for the specific ways to connect to EMR.

Feature & Type Support¶

Feature		Spark 3.x
Bulk Ingestion	Create	✅
	Append	✅
	Create/Append	✅
	Replace	✅
	Temporary Table	❌
	Target Catalog	❌
	Target Schema	❌
	Non-nullable fields are marked NOT NULL	❌
Catalog (GetObjects)	depth=catalogs	✅
	depth=db_schemas	✅
	depth=tables	✅
	depth=columns (all)	✅
Get Parameter Schema		❌
Get Table Schema		❌
Prepared Statements		✅
Transactions		❌

Types¶

Database to Arrow¶

Database Type	Spark 3.x	Spark 4.x
BIGINT	int64
BINARY	binary ⚠️ [1]
BOOLEAN	bool
DATE	date32[day]
DOUBLE	double
INT	int32
NUMERIC	decimal128
REAL	float
SMALLINT	int16
TIMESTAMP	timestamp[us] (with time zone)
TIMESTAMP_NTZ	timestamp[us] (with time zone) ⚠️ [3] [4] [5], ❌ [2] [3] [4]	timestamp[us] (with time zone) ⚠️ [3], timestamp[us] ⚠️ [3]
VARCHAR	string

Arrow to Database¶

Arrow Type	Spark 3.x Type		Spark 4.x Type
	Bind	Ingest	Bind	Ingest
binary	❌	VARBINARY	❌	VARBINARY
binary_view	❌	(NA/not tested)	❌	(NA/not tested)
bool	❌	BOOLEAN	❌	BOOLEAN
date32[day]	❌	DATE	❌	DATE
decimal128	❌	NUMERIC ⚠️ [6]	❌	NUMERIC ⚠️ [6]
double	❌	DOUBLE PRECISION	❌	DOUBLE PRECISION
fixed_size_binary	❌
float	❌	REAL	❌	REAL
halffloat	❌	(NA/not tested)	❌	(NA/not tested)
int16	❌	SMALLINT	❌	SMALLINT
int32	❌	INT	❌	INT
int64	❌	BIGINT	❌	BIGINT
large_binary	❌	VARBINARY	❌	VARBINARY
large_string	❌	VARCHAR	❌	VARCHAR
string	❌	VARCHAR	❌	VARCHAR
string_view	❌	(NA/not tested)	❌	(NA/not tested)
time32[ms]	❌	(NA/not tested)	❌	(NA/not tested)
time32[s]	❌	(NA/not tested)	❌	(NA/not tested)
time64[ns]	❌	(NA/not tested)	❌	(NA/not tested)
time64[us]	❌	(NA/not tested)	❌	(NA/not tested)
timestamp[ms]	❌			TIMESTAMP(3)
timestamp[ms] (with time zone)	❌			TIMESTAMP(3) WITH TIME ZONE
timestamp[ns]	❌
timestamp[ns] (with time zone)	❌
timestamp[s]	❌			TIMESTAMP(0)
timestamp[s] (with time zone)	❌			TIMESTAMP(0) WITH TIME ZONE
timestamp[us]	❌			TIMESTAMP(6)
timestamp[us] (with time zone)	❌			TIMESTAMP(6) WITH TIME ZONE

Compatibility¶

This driver was tested on:

Apache Spark 3.5.8 5a48a37b2dbd7b51e3640cd1d947438459556cc6 (Apache Livy)
Apache Spark 3.5.8 (HiveServer2+binary)
Apache Spark 4.0.0 fa33ea000a0bda9e5a3fa1af98e8e85b8cc5e4d4 (Spark Connect)
Apache Spark 4.0.0 (HiveServer2+binary)
Apache Spark 4.0.0 (HiveServer2+HTTP)

Previous Versions¶

To see documentation for previous versions of this driver, see the following:

v0.1.0-alpha.2.md