on 01-26-2023 10:46 AM - edited on 01-31-2023 03:21 PM by Tristan
Delta Sharing is a simple REST protocol that securely shares access to part of a cloud dataset. It leverages modern cloud storage systems, such as S3, ADLS, or GCS, to reliably transfer large datasets. There are two parties involved: Data Providers and Data Recipients.
As a Data Provider, you can use Delta Sharing to share existing tables, or parts of them (e.g., specific versions of a table), stored in Delta Lake format on your cloud data lake. The data provider decides what data to share and runs a sharing server in front of it that implements the Delta Sharing protocol and manages access for recipients.
As a Data Recipient, all you need is one of the many Delta Sharing clients that support the protocol. In addition, there are open-source connectors for pandas, Apache Spark, Rust, and Python.
The actual exchange is carefully designed to be efficient by leveraging the functionality of cloud storage systems and Delta Lake.
Delta Sharing is an open-source standard for communication with the following properties:
The protocol works as follows:
The server acts as middleware between GCS and the data recipient. Effectively, it behaves like a consumer of GCS and a producer for the end user.
Delta Sharing uses pre-signed, short-lived URLs; therefore, data is retrieved at the speed of the cloud object storage. As a result, throughput and bandwidth are not limited by the sharing server.
The pre-signed URL is not even visible to the end user. Instead, the user has a share profile file. A client (either the Python connector or the Apache Spark connector) uses this file to request data from the sharing server.
The server responds with a pre-signed URL through which the client fetches the data. The endpoint in the profile file is simply the client's access point to the server hosting the middleware.
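To make the idea of a pre-signed, short-lived URL more concrete, here is a minimal sketch using only the Python standard library. It is a toy HMAC scheme, not the signing algorithm that GCS, S3, or ADLS actually use, and every name in it (`SIGNING_KEY`, `presign`, `is_valid`, the `storage.example.com` host) is a hypothetical chosen for illustration. It only shows the principle: the server issues a URL that carries an expiry time and a signature, and the storage side rejects the URL once it has expired or been altered.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical secret known to the signer and the verifier (illustration only).
SIGNING_KEY = b"demo-secret"


def _signature(path: str, expires: int) -> str:
    """Sign the object path together with its expiry timestamp."""
    message = f"{path}\n{expires}".encode()
    return hmac.new(SIGNING_KEY, message, hashlib.sha256).hexdigest()


def presign(path: str, ttl_seconds: int, now: float) -> str:
    """Sharing-server side: issue a URL that stays valid for ttl_seconds."""
    expires = int(now) + ttl_seconds
    query = urlencode({"expires": expires, "sig": _signature(path, expires)})
    return f"https://storage.example.com{path}?{query}"


def is_valid(url: str, now: float) -> bool:
    """Storage side: accept the URL only if it is unexpired and untampered."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    if now > expires:
        return False
    return hmac.compare_digest(sig, _signature(parsed.path, expires))


issued_at = time.time()
url = presign("/bucket/table1/part-0000.parquet", ttl_seconds=60, now=issued_at)
print(is_valid(url, now=issued_at + 30))   # checked within the lifetime
print(is_valid(url, now=issued_at + 120))  # checked after expiry
```

Because the client downloads directly from the storage URL, the sharing server only has to issue the URL and never has to relay the data itself.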
The Python connector needs a share profile file, which it uses to authenticate with the server. According to the protocol, the profile file has the following format:
{
  "shareCredentialsVersion": 1,
  "endpoint": "http://{delta-sharing-server-address}:{port}/delta-sharing",
  "apikey": "{incorta-user-public-api-key}",
  "instanceName": "{incorta-cluster-instance-name}",
  "tenantName": "{incorta-tenant-name}"
}
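Before handing a profile to a connector, it can help to check that it is complete. The following is a minimal sketch, assuming only the five field names shown in the format above. The `ShareProfile` type and `parse_profile` function are hypothetical helpers and not part of any connector, and the values in `example` are made-up placeholders.

```python
import json
from dataclasses import dataclass

# Field names taken from the profile format shown above.
REQUIRED_FIELDS = (
    "shareCredentialsVersion",
    "endpoint",
    "apikey",
    "instanceName",
    "tenantName",
)


@dataclass(frozen=True)
class ShareProfile:  # hypothetical helper type, for illustration only
    share_credentials_version: int
    endpoint: str
    api_key: str
    instance_name: str
    tenant_name: str


def parse_profile(text: str) -> ShareProfile:
    """Parse profile JSON and fail early if a required field is missing."""
    data = json.loads(text)
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"profile is missing fields: {', '.join(missing)}")
    endpoint = str(data["endpoint"])
    if not endpoint.startswith(("http://", "https://")):
        raise ValueError("endpoint must be an http(s) URL")
    return ShareProfile(
        share_credentials_version=int(data["shareCredentialsVersion"]),
        endpoint=endpoint.rstrip("/"),
        api_key=str(data["apikey"]),
        instance_name=str(data["instanceName"]),
        tenant_name=str(data["tenantName"]),
    )


# Placeholder values; substitute the details of your own server.
example = """
{
  "shareCredentialsVersion": 1,
  "endpoint": "http://localhost:8080/delta-sharing/",
  "apikey": "not-a-real-key",
  "instanceName": "demo-instance",
  "tenantName": "demo-tenant"
}
"""
profile = parse_profile(example)
print(profile.endpoint)
```

Checking the profile up front turns a typo in a field name into an immediate, readable error instead of a failed request later on.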
In this example, we will name our share profile file `share.json`.
import incorta_delta_sharing

# Path to the share profile file.
profile_file = "share.json"

# Create a sharing client from the profile.
client = incorta_delta_sharing.IncortaSharingClient(profile_file)

# List all shared tables.
print(client.list_all_tables())

# A table is addressed as "<profile-file>#<share>.<schema>.<table>".
table1_url = profile_file + "#share1.schema1.table1"

# Load the table as a pandas DataFrame.
table1_pandas_df = incorta_delta_sharing.load_as_pandas(table1_url)

# Alternatively, load the same table as a Spark DataFrame.
table1_spark_df = incorta_delta_sharing.load_as_spark(table1_url)
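The table URL built above (`profile_file + "#share1.schema1.table1"`) packs four pieces of information into one string: the profile file path and the share, schema, and table names. As a rough sketch of how such a string can be taken apart, assuming the `<profile-file>#<share>.<schema>.<table>` layout implied by the example, consider the following. `TableRef` and `parse_table_url` are hypothetical names and do not come from the connector.

```python
from typing import NamedTuple


class TableRef(NamedTuple):  # hypothetical helper type, for illustration only
    profile_path: str
    share: str
    schema: str
    table: str


def parse_table_url(url: str) -> TableRef:
    """Split '<profile-file>#<share>.<schema>.<table>' into its parts."""
    # Split on the last '#' so that a '#' inside the file path is tolerated.
    profile_path, sep, fragment = url.rpartition("#")
    if not sep or not profile_path:
        raise ValueError("expected '<profile-file>#<share>.<schema>.<table>'")
    parts = fragment.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError("table must be given as '<share>.<schema>.<table>'")
    return TableRef(profile_path, *parts)


ref = parse_table_url("share.json#share1.schema1.table1")
print(ref.share, ref.schema, ref.table)
```

Validating the string before calling a loader gives a clearer error for a malformed table name than a failed server request would.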
Zaharia, Matei, et al. "Introducing Delta Sharing: an Open Protocol for Secure Data Sharing." Databricks, 26 May 2021, https://www.databricks.com/blog/2021/05/26/introducing-delta-sharing-an-open-protocol-for-secure-dat.... Accessed 10 January 2023.