Apache Spark Driver v0.1.0-alpha.2¶

Driver Version	v0.1.0-alpha.2
Release Date	2026-06-02
Tested With Spark 3.x	3.5-livy, 3.5-thrift
Tested With Spark 4.x	4.0-connect, 4.0-thrift, 4.0-thrifthttp

This driver provides access to Apache Spark (commonly referred to as just “Spark”).

Note

This project is not affiliated with the Apache Software Foundation.

Installation & Quickstart¶

The driver can be installed with dbc:

dbc install spark --pre

Note

Only prerelease versions of the driver are currently available, so you must use --pre with dbc 0.2.0 or newer to install the driver.

Connecting¶

To use the driver, provide a connection string as the uri option.

from adbc_driver_manager import dbapi

dbapi.connect(
    driver="spark",
    db_kwargs={
        "uri": "spark://localhost:10000?auth_type=plain&api=thrift%2Bbinary"
    }
)

Note: The example above uses Python with the adbc-driver-manager package; the process is similar for other driver managers. See adbc-quickstarts.
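As a follow-up to the connection example, the sketch below runs one query and fetches the result as an Arrow table. It assumes a Spark Thrift server is reachable at the URI you pass in and that pyarrow is installed (fetch_arrow_table needs it). The function name run_query is illustrative and not part of the driver.

```python
def run_query(uri, sql):
    """Run one SQL statement and return the result as a pyarrow.Table."""
    # Imported lazily so this module loads even without the driver manager.
    from adbc_driver_manager import dbapi

    # The connection and the cursor are both context managers, so they are
    # closed on exit even if the query raises.
    with dbapi.connect(driver="spark", db_kwargs={"uri": uri}) as conn:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetch_arrow_table()
```

For instance, run_query("spark://localhost:10000?auth_type=plain&api=thrift%2Bbinary", "SELECT 1 AS x") would return a one-row table, provided a server is listening there.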

Connection String Format¶

The URI scheme is “spark://”.
The host and port should be provided.
If not specified, the api defaults to thrift+binary (URI-encoded: thrift%2Bbinary).
Options can be specified as query parameters or as driver options.

Note

Reserved characters in URI elements must be URI-encoded. For example, @ becomes %40 and + becomes %2B.
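The encoding rule above is easy to get wrong by hand, so here is a minimal sketch that builds a connection string with the Python standard library. The helper name build_spark_uri is hypothetical; only the spark:// scheme, the host and port, and the query-parameter names come from this page.

```python
from urllib.parse import quote, urlencode


def build_spark_uri(host, port, **options):
    """Build a spark:// URI with percent-encoded query parameters."""
    # quote_via=quote with safe="" percent-encodes every reserved character,
    # so "+" becomes %2B and "@" becomes %40. The default (quote_plus) would
    # instead turn spaces into "+", which is not wanted here.
    query = urlencode(options, quote_via=quote, safe="")
    uri = f"spark://{host}:{port}"
    return f"{uri}?{query}" if query else uri


uri = build_spark_uri("localhost", 10000, auth_type="plain", api="thrift+binary")
print(uri)
```

With these arguments the helper reproduces the URI used in the connection example.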

Connection Options¶

These parameters can be specified in the URI as query parameters, or as connection parameters:

spark.api (query parameter: api)

Values: connect, livy, thrift+binary, or thrift+http.

The protocol used to connect to Spark.

Value	Backend
`connect`	Spark Connect
`livy`	Apache Livy
`thrift+binary`	HiveServer2 Thrift (over TCP)
`thrift+http`	HiveServer2 Thrift (over HTTP)

spark.auth_type (query parameter: auth_type)

Values: aws_sigv4, basic, ldap, kerberos, none, nosasl, plain, or token.

How to authenticate to Spark.

Auth Type	Applicable Backends	Description
`aws_sigv4`	`livy`	Use AWS SDK
`basic`	`livy`	Username/password
`ldap`	`thrift+binary`, `thrift+http`	Not yet implemented
`kerberos`	`thrift+binary`, `thrift+http`	Not yet implemented
`none`	`connect`, `livy`	No authentication
`nosasl`	`thrift+binary`, `thrift+http`	No authentication
`plain`	`thrift+binary`, `thrift+http`	Username/password
`token`	`connect`	Username/password (token)
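Because each auth type only applies to certain backends, it can help to check a combination before connecting. The sketch below encodes the table above as a lookup; the names AUTH_BACKENDS, NOT_YET_IMPLEMENTED, and check_auth are illustrative, and the driver may report these errors differently.

```python
# Auth type -> backends it applies to, transcribed from the table above.
AUTH_BACKENDS = {
    "aws_sigv4": {"livy"},
    "basic": {"livy"},
    "ldap": {"thrift+binary", "thrift+http"},
    "kerberos": {"thrift+binary", "thrift+http"},
    "none": {"connect", "livy"},
    "nosasl": {"thrift+binary", "thrift+http"},
    "plain": {"thrift+binary", "thrift+http"},
    "token": {"connect"},
}

# Listed in the table as "Not yet implemented".
NOT_YET_IMPLEMENTED = {"ldap", "kerberos"}


def check_auth(api, auth_type):
    """Raise ValueError if auth_type cannot be used with the given api."""
    if auth_type not in AUTH_BACKENDS:
        raise ValueError(f"unknown auth_type: {auth_type!r}")
    if auth_type in NOT_YET_IMPLEMENTED:
        raise ValueError(f"auth_type {auth_type!r} is not yet implemented")
    if api not in AUTH_BACKENDS[auth_type]:
        allowed = ", ".join(sorted(AUTH_BACKENDS[auth_type]))
        raise ValueError(f"auth_type {auth_type!r} applies only to: {allowed}")
```

Calling check_auth("thrift+binary", "plain") passes silently, while check_auth("connect", "plain") raises.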

spark.livy.session_kind (query parameter: livy.session_kind)

Values: sql, spark, or pyspark.

For the Livy backend, what kind of session to create.

Warning

Currently only sql is tested/supported.

spark.tls (query parameter: tls)

Type boolean. Default: false.

Whether to use TLS for connecting. Only applies to connect, livy, and thrift+http.

spark.validate_server_certificate (query parameter: validate_server_certificate)

Type boolean. Default: true.

Whether to validate the server’s TLS certificate. Should only be disabled for development/testing.
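Since every option has both a driver-option name (spark.*) and a query-parameter name, the same settings can in principle be passed outside the URI. The sketch below makes two assumptions that this page does not confirm: that the options go in db_kwargs next to uri, and that boolean values are passed as the strings "true" and "false". The helper name spark_db_kwargs is hypothetical.

```python
def spark_db_kwargs(host, port, *, api="thrift+binary", auth_type=None, tls=None):
    """Build a db_kwargs dict that uses driver options instead of query parameters."""
    # thrift+binary is the documented default for the api option.
    kwargs = {"uri": f"spark://{host}:{port}", "spark.api": api}
    if auth_type is not None:
        kwargs["spark.auth_type"] = auth_type
    if tls is not None:
        # Assumption: booleans are passed as lowercase strings.
        kwargs["spark.tls"] = "true" if tls else "false"
    return kwargs


print(spark_db_kwargs("localhost", 15002, api="connect", auth_type="none", tls=False))
```

No percent-encoding is needed in this form, because values such as thrift+binary are not embedded in a URI.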

Limitations¶

Different backends and cluster configurations have limitations; some limitations related to data type support are also noted further below.

HiveServer2/Thrift Protocol¶

In Spark 3.x, binary values that are not valid UTF-8 are corrupted.
The client cannot tell whether a timestamp carries a time zone, so all timestamps are assumed to be in UTC.
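The binary limitation can be illustrated, and possibly worked around, with a short sketch. The exact corruption mechanism is not described here, so text_round_trip below only simulates a channel that treats bytes as UTF-8 text. The workaround assumes you can change the query to select base64(col), a Spark SQL built-in, and decode on the client.

```python
import base64


def text_round_trip(data: bytes) -> bytes:
    """Simulate a channel that interprets arbitrary bytes as UTF-8 text."""
    # Invalid sequences are replaced with U+FFFD, so the original bytes are lost.
    return data.decode("utf-8", errors="replace").encode("utf-8")


raw = b"\xff\xfe\x00\x01"  # not valid UTF-8

# The simulated text channel does not preserve the value.
corrupted = text_round_trip(raw)

# Base64 output is plain ASCII, so it survives a text channel unchanged.
# This stands in for what a query like SELECT base64(col) would return.
encoded = base64.b64encode(raw).decode("ascii")
restored = base64.b64decode(text_round_trip(encoded.encode("ascii")))

print(corrupted != raw, restored == raw)
```

The cost of this approach is that the column arrives as a string, so the client has to know which columns to decode.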

Apache Livy¶

Only the first 1000 rows of a result set can be fetched. This can be tuned by configuring Spark with spark.sql.repl.eagerEval.maxNumRows.
In general, we have found that performance is worse than with Spark Connect or HiveServer2.
Connecting to an Amazon EMR (Serverless) cluster via Livy requires setting the emr-serverless.session.executionRoleArn session config option to an appropriate role ARN.
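Both Livy settings above are Spark or session configuration keys. Livy's REST API creates sessions with a JSON body that carries kind and conf fields, so the sketch below shows what such a body might look like. How this driver forwards session configuration is not described on this page, and whether spark.sql.repl.eagerEval.maxNumRows can be set per session rather than cluster-wide is an assumption. The helper name and the role ARN are placeholders.

```python
import json


def livy_session_request(kind="sql", max_rows=None, execution_role_arn=None):
    """Build the JSON body of a Livy session-creation request."""
    conf = {}
    if max_rows is not None:
        # Raises the number of rows that can be fetched from a result set.
        conf["spark.sql.repl.eagerEval.maxNumRows"] = str(max_rows)
    if execution_role_arn is not None:
        # Required when the target is an Amazon EMR (Serverless) cluster.
        conf["emr-serverless.session.executionRoleArn"] = execution_role_arn
    return {"kind": kind, "conf": conf}


body = livy_session_request(
    max_rows=100000,
    execution_role_arn="arn:aws:iam::123456789012:role/example-role",  # placeholder
)
print(json.dumps(body, indent=2))
```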

Spark Connect¶

In our testing, connecting to an Amazon EMR (Serverless) cluster via Spark Connect does not work. We believe this is an incompatibility in the Spark Connect client library and plan to address it in a future version of the driver.

Amazon EMR (Serverless)¶

Bulk ingest with an AWS Glue catalog is not currently supported as there is no way to specify the LOCATION clause.
Amazon EMR is not currently enabled in our automated integration testing.

Feature & Type Support¶

Feature		Spark 3.x
Bulk Ingestion	Create	✅
	Append	✅
	Create/Append	✅
	Replace	✅
	Temporary Table	❌
	Target Catalog	❌
	Target Schema	❌
	Non-nullable fields are marked NOT NULL	❌
Catalog (GetObjects)	depth=catalogs	✅
	depth=db_schemas	✅
	depth=tables	✅
	depth=columns (all)	✅
Get Parameter Schema		❌
Get Table Schema		❌
Prepared Statements		✅
Transactions		❌
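The bulk ingestion rows above correspond to the mode argument of cursor.adbc_ingest in the Python adbc-driver-manager package. The sketch below is a thin wrapper that allows the four supported modes and leaves out the temporary, catalog, and schema targets that the table marks as unsupported. The wrapper name ingest is illustrative. The data argument is expected to be Arrow data such as a pyarrow.Table.

```python
# Modes marked as supported in the table above, in adbc_ingest spelling.
SUPPORTED_MODES = {"create", "append", "create_append", "replace"}


def ingest(cursor, table_name, data, mode="create"):
    """Bulk-ingest Arrow data into table_name via an ADBC DB-API cursor."""
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"unsupported ingest mode: {mode!r}")
    # temporary=, catalog_name= and db_schema_name= are deliberately not
    # passed, since the table above marks those targets as unsupported.
    return cursor.adbc_ingest(table_name, data, mode=mode)
```

A call might look like ingest(cur, "events", table, mode="append"), where cur is a cursor from an open connection and table is a pyarrow.Table.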

Types¶

Database to Arrow¶

Database Type	Spark 3.x	Spark 4.x
BIGINT	int64
BINARY	binary ⚠️ [1]
BOOLEAN	bool
DATE	date32[day]
DOUBLE	double
INT	int32
NUMERIC	decimal128
REAL	float
SMALLINT	int16
TIMESTAMP	timestamp[us] (with time zone)
TIMESTAMP_NTZ	timestamp[us] (with time zone) ⚠️ [3] [4] [5], ❌ [2] [3] [4]	timestamp[us] ⚠️ [3], timestamp[us] (with time zone) ⚠️ [3]
VARCHAR	string

Arrow to Database¶

Arrow Type	Spark 3.x Type		Spark 4.x Type
	Bind	Ingest	Bind	Ingest
binary	❌	VARBINARY	❌	VARBINARY
binary_view	❌	(NA/not tested)	❌	(NA/not tested)
bool	❌	BOOLEAN	❌	BOOLEAN
date32[day]	❌	DATE	❌	DATE
decimal128	❌	NUMERIC	❌	NUMERIC
double	❌	DOUBLE PRECISION	❌	DOUBLE PRECISION
fixed_size_binary	❌
float	❌	REAL	❌	REAL
halffloat	❌	(NA/not tested)	❌	(NA/not tested)
int16	❌	SMALLINT	❌	SMALLINT
int32	❌	INT	❌	INT
int64	❌	BIGINT	❌	BIGINT
large_binary	❌	VARBINARY	❌	VARBINARY
large_string	❌	VARCHAR	❌	VARCHAR
string	❌	VARCHAR	❌	VARCHAR
string_view	❌	(NA/not tested)	❌	(NA/not tested)
time32[ms]	❌	(NA/not tested)	❌	(NA/not tested)
time32[s]	❌	(NA/not tested)	❌	(NA/not tested)
time64[ns]	❌	(NA/not tested)	❌	(NA/not tested)
time64[us]	❌	(NA/not tested)	❌	(NA/not tested)
timestamp[ms]	❌			TIMESTAMP(3)
timestamp[ms] (with time zone)	❌			TIMESTAMP(3) WITH TIME ZONE
timestamp[ns]	❌
timestamp[ns] (with time zone)	❌
timestamp[s]	❌			TIMESTAMP(0)
timestamp[s] (with time zone)	❌			TIMESTAMP(0) WITH TIME ZONE
timestamp[us]	❌			TIMESTAMP(6)
timestamp[us] (with time zone)	❌			TIMESTAMP(6) WITH TIME ZONE

Compatibility¶

This driver was tested on:

Apache Spark 3.5.8 5a48a37b2dbd7b51e3640cd1d947438459556cc6 (Apache Livy)
Apache Spark 3.5.8 (HiveServer2+binary)
Apache Spark 4.0.0 fa33ea000a0bda9e5a3fa1af98e8e85b8cc5e4d4 (Spark Connect)
Apache Spark 4.0.0 (HiveServer2+binary)
Apache Spark 4.0.0 (HiveServer2+HTTP)