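Reading guide summary
Keywords:
Google Cloud Dataplex, Box.Inc, Data Platform, data governance, addressing challenges
Summary:
This article describes how Box.Inc, facing data management challenges in a data-driven world, adopted Google Cloud Dataplex to improve its Data Platform. Dataplex strengthened data governance, discovery, and observability, and addressed problems in data discovery, observability, lineage, governance, and security.
Main points:
– Background
– In a data-driven world, Box.Inc faces the challenge of managing vast amounts of data while keeping it secure, accessible, and compliant.
– Data Platform
– Based on a multi-tenant model and data mesh architecture, built on Google Cloud's BigQuery.
– Handles billions of files and serves millions of users, relying on BigQuery to process high event volumes and meet storage demands.
– Challenges
– Data discovery: time-consuming; teams did not know where data came from or how to get access to it.
– Data observability: data engineers found it hard to monitor data pipelines, which hurt productivity.
– Data lineage: no end-to-end visibility or traceability, so analyzing issues took a long time.
– Data governance and security: access to sensitive data was hard to control, making compliance difficult.
– Response
– Adopted Dataplex as a central data catalog that provides multiple capabilities to address these challenges.
– Used Dataplex to streamline data discovery, among other functions.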
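Article URL: https://cloud.google.com/blog/products/data-analytics/dataplex-at-box-inc/
Source: cloud.google.com
Authors: Yeshvant Kumar Bhavnasi Venkat Satya, Asmita Kulkarni
Published: 2024/8/22 0:00
Language: English
Word count: 1021
Estimated reading time: 5 minutes
Score: 82
Tags: data management, data discovery, data observability, data lineage, data security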
In today’s data-driven world, organizations face the challenge of managing vast amounts of data while ensuring its security, accessibility, and compliance. For Box.Inc, a global leader in Cloud Content Management, implementing an advanced data catalog solution became crucial to streamlining our Data Platform operations. By adopting Google Cloud Dataplex, which is also available within BigQuery, as our go-to tool for enhanced data governance, discovery, and observability, we transformed our approach to managing complex analytics use cases that drive product innovation and growth.
Our Data Platform is built on a multi-tenant model grounded in data mesh architecture. Data mesh decentralizes data ownership to teams with the greatest contextual understanding and gives business domains self-serve data platforms and federated governance, allowing them to model, develop, deploy, and operate data services independently. This ensures agile decision-making and efficient data utilization.
Our Data Platform handles billions of files and serves millions of users globally. It is built on Google Cloud’s BigQuery as our data lake solution, which lets us process hundreds of thousands of events per second and meet petabyte-scale storage demands with a serverless architecture. It also runs thousands of query jobs daily, using BigQuery’s massively parallel processing capabilities to process large datasets for teams across the organization. Balancing this scale with effective infrastructure management is essential to sustaining our profitable growth. Looking ahead, we aim to unlock the potential of predictive and prescriptive analytics. By augmenting our Data Platform’s capabilities, teams can tackle complex analytical challenges and build innovative products that serve our internal and external analytical needs.
Challenges faced by growing business operations
As Box.Inc continued to expand globally, with millions of users relying on our data platform services every day, we encountered several challenges.
- Data discovery: Product analysts, data scientists, and ML engineers struggled with time-consuming processes (multiple days to weeks) for discovering, retrieving, and understanding relevant datasets. Teams didn’t always know where to find data sourced from a particular product or service, who could give them access to it, or how existing data was structured.
- Data observability: Data engineers found it challenging to monitor data pipelines for debugging purposes, resulting in prolonged data downtime and resolution times (up to a few weeks) that significantly hurt productivity.
- Data lineage: Data engineers and software developers lacked end-to-end visibility and traceability of data pipelines, which prevented them from proactively detecting and resolving data issues. Impact and root cause analysis took days to weeks.
- Data governance and security: It was hard to enforce fine-grained access control over sensitive data to comply with regulations like GDPR, especially amid extreme growth in volume, variety, and velocity. Lacking appropriate tools, the Box.Inc Security team had difficulty identifying, classifying, and protecting sensitive customer information, and it was not easy to find out who could approve data access for production systems.
To address these challenges, we turned to Dataplex, a powerful data governance solution that offers robust capabilities for metadata management and data discovery.
Leveraging Dataplex, we embarked on a transformative journey to improve our Data Platform by increasing developer efficiency while tightening security policies across all regions. Dataplex serves as our central data catalog, providing data discovery, lineage tracking, and governance capabilities.
Leveraging Dataplex
Dataplex brings a wide range of capabilities to our practice: data discovery, data lineage, data observability, data governance, and security for compliance. Let’s take a look at each of these.
1. Streamlined data discovery using metadata tags
Dataplex metadata tags, alongside tag templates, empowered product and business analysts, data scientists, and other stakeholders to discover and use specific data more easily by reading the operational and business metadata tags associated with each dataset. These standardized metadata frameworks and tag templates sped up insight generation, dashboard creation, and report development, enabling quicker decision-making throughout Box.Inc.
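To make the idea of tag templates concrete, here is a minimal in-memory sketch in Python. It does not use the Dataplex API. The classes `TagTemplate` and `Catalog`, the field names (`owner_team`, `source_product`, `contains_pii`), and the dataset names are all hypothetical, chosen only to illustrate how a template can enforce a standard set of tag fields and how those tags then support discovery.

```python
from dataclasses import dataclass, field


@dataclass
class TagTemplate:
    """Hypothetical template: maps each tag field name to its expected type."""
    name: str
    fields: dict

    def validate(self, values: dict) -> None:
        """Raise ValueError unless values match the template exactly."""
        unknown = set(values) - set(self.fields)
        missing = set(self.fields) - set(values)
        if unknown or missing:
            raise ValueError(f"unknown={sorted(unknown)} missing={sorted(missing)}")
        for key, expected in self.fields.items():
            if not isinstance(values[key], expected):
                raise ValueError(f"{key} must be of type {expected.__name__}")


@dataclass
class Catalog:
    """Hypothetical catalog: stores one validated tag per dataset name."""
    template: TagTemplate
    tags: dict = field(default_factory=dict)

    def tag(self, dataset: str, **values) -> None:
        # Reject tags that do not follow the shared template.
        self.template.validate(values)
        self.tags[dataset] = values

    def search(self, **criteria) -> list:
        # Return dataset names whose tag matches every given criterion.
        return sorted(
            name
            for name, values in self.tags.items()
            if all(values.get(k) == v for k, v in criteria.items())
        )


# Illustrative template and datasets (all names are assumptions).
template = TagTemplate(
    "dataset_info",
    {"owner_team": str, "source_product": str, "contains_pii": bool},
)
catalog = Catalog(template)
catalog.tag("events.uploads", owner_team="content-platform",
            source_product="uploads", contains_pii=False)
catalog.tag("crm.accounts", owner_team="sales-analytics",
            source_product="crm", contains_pii=True)

print(catalog.search(contains_pii=True))  # → ['crm.accounts']
```

Because every tag must pass through the same template, questions like "which datasets contain PII?" or "which team owns this data?" become simple lookups rather than multi-day investigations. A real deployment would rely on the catalog service's own tag templates and search rather than a structure like this.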
Here is the high-level architecture that automates updating custom tag values in Dataplex.