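Baoyue Reading Guide: Summary
1. Keywords: Delta Sharing, data sharing, cross-platform, open source, collaboration
2. Summary: Delta Sharing, developed by Databricks and the Linux Foundation, is an open source approach to cross-platform data sharing. It removes platform differences so data can be shared seamlessly, and many enterprises have adopted it. The article also covers its connectors, use cases, and advantages.
3. Main points:
– Introduction to Delta Sharing:
  – Developed by Databricks and the Linux Foundation, it is the first open source approach to data sharing across platforms, clouds, and regions.
  – Customers are no longer limited to their own platform and can share data with all of their customers and partners.
– Adoption:
  – Many enterprises have adopted it since general availability was announced in 2022.
  – Atlassian and Nasdaq, for example, use it to deliver data to partners and customers.
– Delta Sharing features:
  – Supports Databricks-to-Databricks (D2D) and Databricks-to-Open (D2O) sharing.
  – D2O is popular: 40% of active shares use open connectors.
  – Data can be shared seamlessly with any user on any computing platform.
– Open source connectors:
  – Include Python, Apache Spark™, Excel, Tableau, and Power BI.
  – Different connectors offer different features.
– Integration with other systems:
  – Systems that lack a native connector, such as BigQuery and Snowflake, can use the Python connector.
– Delta Sharing API:
  – Customers use its REST API to build custom data sharing applications.
  – Atlassian, for example, uses it to improve flexibility and insight, and Nasdaq uses it to deliver data securely and efficiently.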
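Article URL: https://www.databricks.com/blog/democratizing-data-sharing-platform-agnostic-approach
Source: databricks.com
Author: Databricks
Published: 2024/7/31 16:00
Language: English
Total word count: 1975
Estimated reading time: 8 minutes
Score: 82
Tags: data sharing, Delta Sharing, open source, Databricks, cloud platform
The original article follows.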
Companies across all industries want to share data with each other to enable collaboration and accelerate innovation. However, these organizations often use different data or cloud platforms, which creates friction or blocks collaboration. Databricks and the Linux Foundation developed Delta Sharing, marking a significant milestone in the democratization of data exchange with the first open source approach to data sharing across platforms, clouds, and regions. With Delta Sharing, customers are no longer limited to collaborating within their own platform and customer base but can instead go beyond and share data with all of their customers, partners, and any other collaborators.
Since announcing general availability of Delta Sharing in 2022, we have seen many enterprises adopt it to maximize their reach and collaborate with their customers and partners, regardless of cloud or platform. Databricks customers use the managed Delta Sharing service offered natively, which supports both Databricks-to-Databricks (D2D) sharing and Databricks-to-Open (D2O) sharing for non-Databricks customers. Thanks to its open reach, D2O is very popular with customers, with 40% of active shares using open connectors. Databricks customers Atlassian and Nasdaq use Databricks D2O to deliver data to all their partners and customers on any computing platform, anywhere. Data and software platforms such as Oracle have also adopted Delta Sharing for Oracle-to-Open sharing to help enable their customers.
Databricks-to-Open (D2O) Delta Sharing revolutionizes how organizations share data, enabling seamless sharing of data managed in a Unity Catalog-enabled workspace with any user on any computing platform, anywhere. This approach enables Databricks customers to collaborate with all of their partners, customers, and suppliers, regardless of which data or cloud platform they use.
This blog will showcase the pivotal role of D2O in modern data sharing strategies with real-world applications. We will explore D2O scenarios that empower organizations to extend their data sharing capabilities, enabling interoperability with external partners’ systems, and reaching customers anywhere.
In addition, we will highlight the most commonly used Delta Sharing open source connectors, such as Python, Apache Spark™, Excel, Tableau, and Power BI, which are part of the growing, open Delta Sharing ecosystem. We will also showcase how Databricks customers leverage D2O combined with the Delta Sharing REST API to build a cohesive data fabric architecture, customizing their data sharing experiences across their entire customer base.
Finally, we will review Databricks Marketplace’s recent support for D2O, which now enables recipient access to Marketplace listings via the Delta Sharing open connectors. For example, we will explain how a Python connector or Spark connector can be used to consume a Delta Sharing listing in systems where there is no native connector, such as Amazon EMR, Google BigQuery, and Snowflake.
Increasingly, enterprises are implementing a D2O workflow to simplify collaboration externally across multiple platforms to unlock the potential of their data to drive innovation, ensure robust governance, and accelerate growth.
Open Ecosystem of Connectors
Consuming data shared using the Delta Sharing open sharing protocol requires an OSS connector, authenticated using a credential file that is typically obtained when a provider shares an activation token with a recipient.
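As a rough illustration of the credential file mentioned above, the sketch below parses a profile and checks whether its token has expired. The field names follow the profile file format in the open Delta Sharing protocol. The endpoint and token values are placeholders, and the helper names `load_profile` and `is_expired` are hypothetical, not part of any connector.

```python
import json
from datetime import datetime, timezone

# Fields a profile file is expected to carry (per the open protocol's profile format).
REQUIRED_FIELDS = ("shareCredentialsVersion", "endpoint", "bearerToken")


def load_profile(text):
    """Parse a Delta Sharing profile and check that the basic fields are present."""
    profile = json.loads(text)
    missing = [field for field in REQUIRED_FIELDS if field not in profile]
    if missing:
        raise ValueError(f"profile is missing fields: {missing}")
    return profile


def is_expired(profile, now=None):
    """Return True if the profile carries an expirationTime that has passed."""
    expires = profile.get("expirationTime")
    if expires is None:
        return False  # no expiry recorded in the file
    now = now or datetime.now(timezone.utc)
    # Normalize a trailing "Z" so older Python versions can parse the timestamp.
    expiry = datetime.fromisoformat(expires.replace("Z", "+00:00"))
    if expiry.tzinfo is None:
        expiry = expiry.replace(tzinfo=timezone.utc)  # assume UTC when unspecified
    return expiry <= now


# Placeholder values only; a real file is downloaded via the provider's activation link.
EXAMPLE_PROFILE = """{
  "shareCredentialsVersion": 1,
  "endpoint": "https://sharing.example.com/delta-sharing/",
  "bearerToken": "<token>",
  "expirationTime": "2030-01-01T00:00:00.000Z"
}"""

profile = load_profile(EXAMPLE_PROFILE)
print(profile["endpoint"], "expired:", is_expired(profile))
```

A check like this can run before handing the file to a connector, so an expired token surfaces as a clear message rather than an authentication error.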
The table below summarizes the OSS connectors that Delta Sharing currently supports, with links for download and major features for each. For example, the Python Connector offers robust capabilities for querying metadata, accessing snapshots, supporting Change Data Feed (CDF), and supporting Pandas. Another one is the Apache Spark Connector which provides similar capabilities to the Python connector, ensuring seamless integration into Spark users’ workflows. These connectors are part of the broader OSS Delta Sharing project, aimed at simplifying data sharing and consumption through familiar APIs and promoting open and accessible data sharing. All of these connectors also help read data from the Unity Catalog (UC) for recipients not yet on UC.
Connector | Description | Download | Major Features |
---|---|---|---|
Python | Python / PySpark sharing client | GitHub | |
Apache Spark | Apache Spark sharing client | GitHub | |
Microsoft Power BI | Power BI uses Power Query to connect to data sources. Read documentation. | Power BI Delta Sharing Connector | |
Microsoft Excel | Excel add-in for Delta Sharing and writing Delta tables | Exponam Excel Add-in | |
Tableau | Tableau Delta Sharing connector provides joint integration. Read the blog post. | Tableau Delta Sharing Connector | |
Earlier this year, a new Tableau Delta Sharing connector was announced to support seamless data sharing between Tableau and Databricks.
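The Python connector capabilities described above (metadata queries, snapshots, Change Data Feed, and pandas support) might be exercised along the following lines. This is a minimal sketch: the share, schema, and table names are supplied by the caller, the helper names `table_url` and `explore_share` are hypothetical, and reading the change feed assumes the provider has shared the table with its change history.

```python
def table_url(profile_path, share, schema, table):
    """Build the '<profile>#<share>.<schema>.<table>' address the connector expects."""
    return f"{profile_path}#{share}.{schema}.{table}"


def explore_share(profile_path, share, schema, table, since_version=0):
    """List shared tables, read a small snapshot, then read the change feed."""
    import delta_sharing  # imported lazily; requires `pip install delta-sharing`

    # Metadata: enumerate every table the recipient can see.
    client = delta_sharing.SharingClient(profile_path)
    for shared_table in client.list_all_tables():
        print(shared_table.share, shared_table.schema, shared_table.name)

    url = table_url(profile_path, share, schema, table)

    # Snapshot: load the first rows of the current table version into pandas.
    snapshot = delta_sharing.load_as_pandas(url, limit=10)

    # Change Data Feed: load row-level changes from `since_version` onward.
    changes = delta_sharing.load_table_changes_as_pandas(
        url, starting_version=since_version
    )
    return snapshot, changes
```

Calling `explore_share("config.share", "my_share", "my_schema", "my_table")` would need a valid profile file and network access to the provider's sharing server.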
Meet Your Customers Wherever They Are: BigQuery and Snowflake Examples
When integrating Delta Sharing with systems that lack native connectors, such as BigQuery and Snowflake, the Python Delta Sharing connector provides a versatile way to bridge the gap. BigQuery users can use PySpark to authenticate and access shared data via the ‘delta_sharing’ library, load the data into a DataFrame, and write it directly to BigQuery. This process uses Google Cloud Dataproc for scalable data processing, ensuring that data handling is both efficient and secure. To learn more about how to use Delta Sharing with BigQuery, read the Medium blog post from Databricks experts.
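The BigQuery flow described above could look roughly like this on a Dataproc cluster. It is a sketch under several assumptions: the Delta Sharing Spark package and the Spark BigQuery connector are already available on the cluster, the staging bucket exists, and the function name `copy_shared_table_to_bigquery` and its arguments are hypothetical.

```python
def copy_shared_table_to_bigquery(table_url, bq_table, temp_gcs_bucket):
    """Read a shared table with PySpark and write it to a BigQuery table.

    table_url        -- '<profile-path>#<share>.<schema>.<table>'
    bq_table         -- target in 'dataset.table' form (assumed to be writable)
    temp_gcs_bucket  -- bucket the BigQuery connector uses to stage the load
    """
    # Imported lazily so the module loads even where Spark is not installed.
    import delta_sharing
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-sharing-to-bigquery").getOrCreate()

    # Load the shared table as a Spark DataFrame using the active session.
    shared_df = delta_sharing.load_as_spark(table_url)

    # Write through the Spark BigQuery connector, staging files in GCS.
    (
        shared_df.write.format("bigquery")
        .option("temporaryGcsBucket", temp_gcs_bucket)
        .mode("overwrite")
        .save(bq_table)
    )
    spark.stop()
```

Overwrite mode is used here only to keep the sketch repeatable; an append or merge strategy may suit a recurring load better.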
Similarly, for Snowflake integration, recipients can utilize the Python connector with the Pandas library to import data into a DataFrame. Following the data import, Snowflake’s Snowpark Python API facilitates the connection to Snowflake databases, allowing for seamless data writing from the Pandas DataFrame into Snowflake tables.
Code example:
    # Install first: pip install delta-sharing "snowflake-snowpark-python[pandas]" pandas
    import delta_sharing
    import pandas as pd
    from snowflake.snowpark import Session

    # Path to the Delta Sharing profile JSON file
    profile_file = "path/to/your/profile.delta-sharing.json"

    # Load the profile (the client can also list the shares and tables you can access)
    client = delta_sharing.SharingClient(profile_file)

    # Load a specific table into a pandas DataFrame.
    # The address has the form <profile-path>#<share>.<schema>.<table>
    table_url = f"{profile_file}#share_name.schema_name.table_name"
    df = delta_sharing.load_as_pandas(table_url)

    # Snowflake Snowpark session setup
    connection_parameters = {
        # … fill in your Snowflake connection settings (elided in the original)
    }

    # Create a Snowflake session
    session = Session.builder.configs(connection_parameters).create()

    # Write the pandas DataFrame directly to a Snowflake table
    session.write_pandas(df, "your_snowflake_table_name", auto_create_table=True)
This method offers significant advantages because it eliminates the need for providers to replicate data in a separate system simply for sharing purposes, which would otherwise require additional computing, storage, and technical effort. By using Delta Sharing, data providers can directly share from their Databricks environment, enabling recipients to access the live data across various platforms, without the need for replication. This approach not only demonstrates the flexibility and cost-effectiveness of Delta Sharing but also enhances efficiency by consolidating data in a single system.
Enhance Your Data Services with the Delta Sharing API
Many customers build their own products and interfaces on top of Databricks. These customers use Databricks Delta Sharing’s REST API to create tailored data sharing applications for their customers. Such applications are designed not only to enhance user experience but also to fit seamlessly into a comprehensive data fabric strategy.
Clients are leveraging these custom-built applications to control their data exchange environments, enabling them to share data hosted on Databricks with their customers who may not be using the same platform.
By customizing user interfaces to external partners’ needs, organizations enhance collaboration and drive innovation, transforming data exchange into a strategic asset that improves business relationships and customer engagement. This approach strengthens their competitive edge in a data-driven market. The emphasis on flexibility and adaptability in these customized interfaces marks a new era of strategic data exchange.
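To make the REST API idea concrete, a custom application might start from something like the sketch below, which builds authenticated requests for two endpoints of the open Delta Sharing protocol: listing shares and querying a table. It assumes a profile dictionary with `endpoint` and `bearerToken` keys, ignores pagination, and uses hypothetical helper names. It is not how any particular customer built their application.

```python
import json
import urllib.request


def build_request(profile, path, body=None):
    """Create an authenticated request against a Delta Sharing server."""
    url = profile["endpoint"].rstrip("/") + path
    data = json.dumps(body).encode("utf-8") if body is not None else None
    method = "POST" if body is not None else "GET"
    request = urllib.request.Request(url, data=data, method=method)
    request.add_header("Authorization", f"Bearer {profile['bearerToken']}")
    if data is not None:
        request.add_header("Content-Type", "application/json; charset=utf-8")
    return request


def list_shares(profile):
    """Return the share names visible to this recipient (performs a network call)."""
    with urllib.request.urlopen(build_request(profile, "/shares")) as response:
        payload = json.load(response)
    # Only the first page is read; a full client would follow nextPageToken.
    return [item["name"] for item in payload.get("items", [])]


def query_table_request(profile, share, schema, table, limit_hint=None):
    """Build the request that asks the server for a table's data files."""
    body = {} if limit_hint is None else {"limitHint": limit_hint}
    path = f"/shares/{share}/schemas/{schema}/tables/{table}/query"
    return build_request(profile, path, body)
```

The query response is newline-delimited JSON describing the table's protocol, metadata, and data files, so an application would parse it line by line before fetching the files.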
For example, Atlassian integrates with Delta Sharing to help their customers drive insights with a flexible, open ecosystem. Atlassian Analytics’ latest feature, Data Shares, is powered by Databricks Delta Sharing’s open source protocol. Data Shares lets you access Atlassian data in your own environments and in any BI tool. Watch Atlassian’s 2024 Data + AI Summit session, “Empowering Enterprise Grade Customers with Delta Sharing – an Atlassian Analytics Story.”
“Atlassian Analytics recently launched Data Shares, leveraging Delta Sharing from Databricks, to boost flexibility and accelerate customers’ time-to-insight. Whether users choose to work within Atlassian Analytics or continue using dashboards they’re already familiar with, Delta Sharing’s open ecosystem of connectors, including Tableau, PowerBI, and Spark, enables customers to easily power their environments with data directly from the Atlassian Data Lake.”
— Ben Jackson, Senior Group Product Manager, Data & Analytics, Atlassian
Another Databricks customer, Nasdaq, has been using Delta Sharing for their Data Link platform, which delivers market data, alternative data, and partner data to its users. As their data sets grew, they needed a scalable solution to deliver terabytes of data securely and efficiently while reducing egress costs. Nasdaq uses Delta Sharing, customized for their specific needs, together with built-in governance from Databricks. To learn more about how Nasdaq uses D2O sharing, hear from them in the 2024 Data + AI Summit session, “Delta Sharing unlocks the value of your data to partners and customers.”
Oracle announced Delta Sharing integration for their Oracle Autonomous Database users last year to connect with Databricks across clouds. Customers no longer have to deal with data locked in one platform or copy their data to share it with another platform. Now, with Delta Sharing, these platforms can read each other’s data without copying it. This helps avoid stale data, unnecessary compute usage, and extra work. Read Oracle’s blog post to learn more about this integration. You can also learn more from Oracle in the 2024 Data + AI Summit session “Delta Sharing: Open Protocol for Secure Data Sharing (OSS).”
Databricks Marketplace D2O
Databricks Marketplace is an open marketplace for all your data and AI assets, such as AI models, tabular data, file-based data, as well as industry-based Solution Accelerators.
The Databricks Marketplace D2O (Databricks-to-Open) feature extends the capabilities of Marketplace to support recipients across non-Databricks platforms, leveraging the power of Delta Sharing. This extension enables a broader range of data sharing possibilities beyond the conventional Databricks-to-Databricks (D2D) interactions, by implementing a unique credential system for recipient identification. Unlike the standard procedure that relies on mutual authentication between Databricks account metastores, D2O facilitates the sharing of data through an open protocol, allowing recipients to access shared assets without the necessity of a Databricks account. Furthermore, after the listing is installed, the feature offers the functionality for users to download and renew the credential token needed to access the shared data. This enhances the Databricks Marketplace’s utility by enabling integration with external tools such as Spark, PowerBI, Excel, and non-UC Databricks accounts, thus broadening the scope of data accessibility and collaboration.
Advancing Data Collaboration through D2O
Our exploration of D2O Delta Sharing highlights its pivotal role in facilitating data exchange across Databricks and non-Databricks platforms. By deploying connectors, D2O enhances data accessibility and ensures seamless integration with various platforms, including Spark, PowerBI, Tableau, and Excel. This strategic interoperability fosters a more inclusive data ecosystem, improving the utility and applicability of data in diverse analytical and operational scenarios.
D2O’s approach to data sharing marks a significant advancement in data democratization, empowering organizations to spread insights and foster collaboration beyond traditional boundaries. The impact of this feature is substantial, simplifying data operations, sparking innovation, and opening new avenues for growth and efficiency.
Reflecting on the capabilities and potential of D2O Delta Sharing, it is clear that this innovation is more than just technological progress; it is a commitment to open, accessible, and collaborative data exchange. With the advancements made by D2O, the future of data sharing looks promising, cementing data’s role as a crucial element in decision-making and innovation in today’s digital world.
Getting Started with Delta Sharing
To learn more about how to implement Delta Sharing within your organization, check out the latest resources including new eBooks and related blogs below, or deep dive into the Delta Sharing technical documentation.
If you are already a Delta Sharing customer, you can also reach out to the team with questions or to provide feedback at datasharing[at]databricks.com.