Apache Spark Driver v0.1.0-alpha.2
Driver version: v0.1.0-alpha.2. Tested with Spark 3.x (3.5-livy, 3.5-thrift) and Spark 4.x (4.0-connect, 4.0-thrift, 4.0-thrifthttp).
This driver provides access to Apache Spark (commonly referred to as just “Spark”).
Note
This project is not affiliated with the Apache Software Foundation.
Installation & Quickstart
The driver can be installed with dbc:

    dbc install spark --pre

Note
Only prerelease versions of the driver are currently available, so you must use --pre with dbc 0.2.0 or newer to install the driver.
Connecting
To use the driver, provide a connection string as the uri option.

    from adbc_driver_manager import dbapi

    # Open a connection through the driver manager using the "spark" driver.
    conn = dbapi.connect(
        driver="spark",
        db_kwargs={
            "uri": "spark://localhost:10000?auth_type=plain&api=thrift%2Bbinary"
        },
    )

Note: The example above is for Python using the adbc-driver-manager package, but the process will be similar for other driver managers. See adbc-quickstarts.
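Once connected, queries go through the standard DB-API cursor. The following is a minimal sketch, not the driver's prescribed usage: run_query and connect_spark are hypothetical helper names, and the URI assumes a HiveServer2 endpoint on localhost:10000 as in the example above.

```python
from contextlib import closing


def run_query(sql, connect):
    """Run one statement and return all rows as a list of tuples.

    `connect` is any zero-argument callable returning a DB-API connection.
    """
    # closing() guarantees the cursor and connection are closed on exit.
    with closing(connect()) as conn, closing(conn.cursor()) as cur:
        cur.execute(sql)
        return cur.fetchall()


def connect_spark():
    """Open a connection with the Spark driver (needs a reachable server)."""
    # Imported lazily so this sketch can be loaded without the package.
    from adbc_driver_manager import dbapi

    return dbapi.connect(
        driver="spark",
        db_kwargs={
            "uri": "spark://localhost:10000?auth_type=plain&api=thrift%2Bbinary"
        },
    )
```

With a server running, `run_query("SELECT 1", connect_spark)` would return the rows of the result. Passing the connection factory as an argument keeps the helper usable with any DB-API connection.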
Connection String Format
The URI scheme is “spark://”.
The host and port should be provided.
If not specified, the api defaults to thrift+binary (URI-encoded: thrift%2Bbinary).
Options can be specified as query parameters or as driver options.
Note
Reserved characters in URI elements must be URI-encoded. For example, @ becomes %40 and + becomes %2B.
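To avoid encoding query parameters by hand, the connection string can be assembled with the Python standard library. This is a sketch only: build_spark_uri is a hypothetical helper, not part of the driver, and it covers just the host, port, and query parameters described above.

```python
from typing import Mapping, Optional
from urllib.parse import quote, urlencode


def build_spark_uri(
    host: str, port: int, options: Optional[Mapping[str, str]] = None
) -> str:
    """Build a spark:// connection string with URI-encoded query parameters."""
    uri = f"spark://{host}:{port}"
    if options:
        # quote_via=quote percent-encodes reserved characters, so "+"
        # becomes %2B and "@" becomes %40.
        uri += "?" + urlencode(options, quote_via=quote)
    return uri


print(build_spark_uri("localhost", 10000, {"auth_type": "plain", "api": "thrift+binary"}))
# prints spark://localhost:10000?auth_type=plain&api=thrift%2Bbinary
```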
Connection Options
These parameters can be specified in the URI as query parameters, or as connection parameters:
spark.api (query parameter: api)
Values: connect, livy, thrift+binary, or thrift+http.
The protocol used to connect to Spark.

| Value | Backend |
|---|---|
| connect | Spark Connect |
| livy | Apache Livy |
| thrift+binary | HiveServer2 Thrift (over TCP) |
| thrift+http | HiveServer2 Thrift (over HTTP) |

spark.auth_type (query parameter: auth_type)
Values: aws_sigv4, basic, ldap, kerberos, none, nosasl, plain, or token.
How to authenticate to Spark.

| Auth Type | Applicable Backends | Description |
|---|---|---|
| aws_sigv4 | livy | Use AWS SDK |
| basic | livy | Username/password |
| ldap | thrift+binary, thrift+http | Not yet implemented |
| kerberos | thrift+binary, thrift+http | Not yet implemented |
| none | connect, livy | No authentication |
| nosasl | thrift+binary, thrift+http | No authentication |
| plain | thrift+binary, thrift+http | Username/password |
| token | connect | Username/password (token) |

spark.livy.session_kind (query parameter: livy.session_kind)
Values: sql, spark, or pyspark.
For the Livy backend, what kind of session to create.
Warning
Currently only sql is tested/supported.
spark.tls (query parameter: tls)
Type boolean. Default: false.
Whether to use TLS for connecting. Only applies to connect, livy, and thrift+http.
spark.validate_server_certificate (query parameter: validateservercertificate)
Type boolean. Default: true.
Whether to validate the server’s TLS certificate. Should only be disabled for development/testing.
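The options above can also be passed as driver options instead of query parameters. The sketch below builds such an options dictionary and rejects combinations that the auth type table rules out. It rests on several assumptions: make_db_kwargs is a hypothetical helper, boolean options are passed as the strings "true"/"false", and raising an error for TLS with thrift+binary is a choice made here for illustration rather than documented driver behavior.

```python
# Auth types and the backends they apply to, copied from the auth type table.
AUTH_BACKENDS = {
    "aws_sigv4": {"livy"},
    "basic": {"livy"},
    "ldap": {"thrift+binary", "thrift+http"},
    "kerberos": {"thrift+binary", "thrift+http"},
    "none": {"connect", "livy"},
    "nosasl": {"thrift+binary", "thrift+http"},
    "plain": {"thrift+binary", "thrift+http"},
    "token": {"connect"},
}
NOT_YET_IMPLEMENTED = {"ldap", "kerberos"}
TLS_BACKENDS = {"connect", "livy", "thrift+http"}


def make_db_kwargs(uri, api, auth_type, tls=False):
    """Return driver options for the given backend, or raise ValueError."""
    if auth_type not in AUTH_BACKENDS:
        raise ValueError(f"unknown auth type: {auth_type}")
    if auth_type in NOT_YET_IMPLEMENTED:
        raise ValueError(f"auth type {auth_type} is not yet implemented")
    if api not in AUTH_BACKENDS[auth_type]:
        raise ValueError(f"auth type {auth_type} does not apply to {api}")
    if tls and api not in TLS_BACKENDS:
        raise ValueError(f"the tls option does not apply to {api}")
    return {
        "uri": uri,
        "spark.api": api,
        "spark.auth_type": auth_type,
        # Assumption: boolean options are given as "true"/"false" strings.
        "spark.tls": "true" if tls else "false",
    }
```

The returned dictionary is meant to be passed as db_kwargs to dbapi.connect, for example `make_db_kwargs("spark://localhost:10001", "thrift+http", "plain", tls=True)`.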
Limitations
Different backends and cluster configurations have limitations; some limitations related to data type support are also noted further below.
HiveServer2/Thrift Protocol
In Spark 3.x, binary data that does not happen to be valid UTF-8 will be corrupted.
The client cannot tell whether a timestamp carries a time zone or not; all timestamps are assumed to be in UTC as a result.
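One possible way around the binary corruption on Spark 3.x is to transfer binary columns as base64 text and decode them on the client. This is an untested sketch, not a documented feature of the driver: it assumes Spark's built-in base64() SQL function is available on the cluster, and base64_projection and decode_base64_column are hypothetical helper names.

```python
import base64


def base64_projection(column):
    """Return a SELECT-list expression that ships a binary column as text."""
    # base64() is a Spark SQL built-in; its output is plain ASCII, which is
    # always valid UTF-8.
    return f"base64({column}) AS {column}"


def decode_base64_column(values):
    """Decode base64 strings fetched from Spark back into bytes."""
    # b64decode skips characters outside the base64 alphabet by default,
    # so line breaks inserted by some encoders do not matter.
    return [None if v is None else base64.b64decode(v) for v in values]


query = f"SELECT id, {base64_projection('payload')} FROM events"
```

Here events, id, and payload are placeholder names. The values fetched for the payload column would then be passed through decode_base64_column.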
Apache Livy
Only the first 1000 rows of a result set can be fetched. This can be tuned by configuring Spark with spark.sql.repl.eagerEval.maxNumRows.
In general, we have found that performance is worse than with Spark Connect or HiveServer2.
Connecting to an Amazon EMR (Serverless) cluster via Livy requires setting the emr-serverless.session.executionRoleArn session config option to an appropriate role ARN.
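Because the row limit applies silently, a client may want to flag results that reach it. The check below is only a heuristic sketch: may_be_truncated is a hypothetical helper, and the default of 1000 is the limit stated above, which should be replaced by whatever value the cluster is configured with.

```python
LIVY_DEFAULT_ROW_LIMIT = 1000


def may_be_truncated(num_rows, row_limit=LIVY_DEFAULT_ROW_LIMIT):
    """Return True if a Livy result is large enough to have been cut off."""
    # A result with exactly row_limit rows may be complete, but it cannot
    # be told apart from a truncated one, so it is flagged as well.
    return num_rows >= row_limit
```

A caller could log a warning, or rerun the query in smaller ranges, whenever this returns True.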
Spark Connect
In our testing, connecting to an Amazon EMR (Serverless) cluster via Spark Connect does not work; we believe it is an incompatibility in the Spark Connect client library and plan to address this in a future version of the driver.
Amazon EMR (Serverless)
Bulk ingest with an AWS Glue catalog is not currently supported as there is no way to specify the LOCATION clause.
Amazon EMR is not currently enabled in our automated integration testing.
Feature & Type Support

| Feature | | Spark 3.x / Spark 4.x |
|---|---|---|
| Bulk Ingestion | Create | ✅ |
| | Append | ✅ |
| | Create/Append | ✅ |
| | Replace | ✅ |
| | Temporary Table | ❌ |
| | Target Catalog | ❌ |
| | Target Schema | ❌ |
| | Non-nullable fields are marked NOT NULL | ❌ |
| Catalog (GetObjects) | depth=catalogs | ✅ |
| | depth=db_schemas | ✅ |
| | depth=tables | ✅ |
| | depth=columns (all) | ✅ |
| Get Parameter Schema | | ❌ |
| Get Table Schema | | ❌ |
| Prepared Statements | | ✅ |
| Transactions | | ❌ |

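The bulk ingestion rows above correspond to the mode argument of the driver manager's Cursor.adbc_ingest method in Python. The wrapper below is a minimal sketch: ingest is a hypothetical helper that accepts only the four supported modes and deliberately leaves out the temporary table, target catalog, and target schema arguments, since the table marks those as unsupported.

```python
# The four ingestion modes marked as supported in the feature table.
SUPPORTED_MODES = {"create", "append", "create_append", "replace"}


def ingest(cursor, table_name, data, mode="create"):
    """Bulk-ingest Arrow data into table_name using a supported mode."""
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"unsupported ingest mode: {mode}")
    # cursor is expected to be an adbc_driver_manager.dbapi cursor; data
    # can be, for example, a pyarrow Table or RecordBatchReader.
    return cursor.adbc_ingest(table_name, data, mode=mode)
```

For example, `ingest(conn.cursor(), "events", table, mode="append")` would append the rows of a pyarrow table to an existing events table (a placeholder name).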
Types
Database to Arrow

| Database Type | Spark 3.x / Spark 4.x |
|---|---|
| BIGINT | int64 |
| BINARY | binary ⚠️ [1] |
| BOOLEAN | bool |
| DATE | date32[day] |
| DOUBLE | double |
| INT | int32 |
| NUMERIC | decimal128 |
| REAL | float |
| SMALLINT | int16 |
| TIMESTAMP | timestamp[us] (with time zone) |
| TIMESTAMP_NTZ | timestamp[us] (with time zone) ⚠️ [3] [4] [5], ❌ [2] [3] [4] |
| VARCHAR | string |

Arrow to Database

| Arrow Type | Spark 3.x Type (Bind) | Spark 3.x Type (Ingest) | Spark 4.x Type (Bind) | Spark 4.x Type (Ingest) |
|---|---|---|---|---|
| binary | ❌ | VARBINARY | ❌ | VARBINARY |
| binary_view | ❌ | (NA/not tested) | ❌ | (NA/not tested) |
| bool | ❌ | BOOLEAN | ❌ | BOOLEAN |
| date32[day] | ❌ | DATE | ❌ | DATE |
| decimal128 | ❌ | NUMERIC | ❌ | NUMERIC |
| double | ❌ | DOUBLE PRECISION | ❌ | DOUBLE PRECISION |
| fixed_size_binary | ❌ | | | |
| float | ❌ | REAL | ❌ | REAL |
| halffloat | ❌ | (NA/not tested) | ❌ | (NA/not tested) |
| int16 | ❌ | SMALLINT | ❌ | SMALLINT |
| int32 | ❌ | INT | ❌ | INT |
| int64 | ❌ | BIGINT | ❌ | BIGINT |
| large_binary | ❌ | VARBINARY | ❌ | VARBINARY |
| large_string | ❌ | VARCHAR | ❌ | VARCHAR |
| string | ❌ | VARCHAR | ❌ | VARCHAR |
| string_view | ❌ | (NA/not tested) | ❌ | (NA/not tested) |
| time32[ms] | ❌ | (NA/not tested) | ❌ | (NA/not tested) |
| time32[s] | ❌ | (NA/not tested) | ❌ | (NA/not tested) |
| time64[ns] | ❌ | (NA/not tested) | ❌ | (NA/not tested) |
| time64[us] | ❌ | (NA/not tested) | ❌ | (NA/not tested) |
| timestamp[ms] | ❌ | TIMESTAMP(3) | | |
| timestamp[ms] (with time zone) | ❌ | TIMESTAMP(3) WITH TIME ZONE | | |
| timestamp[ns] | ❌ | | | |
| timestamp[ns] (with time zone) | ❌ | | | |
| timestamp[s] | ❌ | TIMESTAMP(0) | | |
| timestamp[s] (with time zone) | ❌ | TIMESTAMP(0) WITH TIME ZONE | | |
| timestamp[us] | ❌ | TIMESTAMP(6) | | |
| timestamp[us] (with time zone) | ❌ | TIMESTAMP(6) WITH TIME ZONE | | |

Compatibility
This driver was tested on:
Apache Spark 3.5.8 5a48a37b2dbd7b51e3640cd1d947438459556cc6 (Apache Livy)
Apache Spark 3.5.8 (HiveServer2+binary)
Apache Spark 4.0.0 fa33ea000a0bda9e5a3fa1af98e8e85b8cc5e4d4 (Spark Connect)
Apache Spark 4.0.0 (HiveServer2+binary)
Apache Spark 4.0.0 (HiveServer2+HTTP)