
How Discord Uses Open-Source Tools for Scalable Data Orchestration and Transformation (AI Reading Summary, 包阅AI)

包阅 Reading Guide Summary

1. `Discord`、`Data Orchestration`、`Transformation`、`Open-Source Tools`、`Data Quality`

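2. This article describes how Discord uses open-source tools for scalable data orchestration and transformation. The new system benefits its data teams through automation status indicators, asset lifecycle management, data quality controls, standardized metric calculations, and custom developer tooling.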

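3. Key points:

– Once the core of the new system was in place, the data teams benefited quickly

  – Simple but important questions such as "Why isn't my data asset updating?" are now easy to answer

  – With Dagster's asset definition pages, asset owners can manage asset lifecycles seamlessly

  – The lineage view makes it quick to determine table landing times and identify blockages

– Data quality is at the core of the new system

  – Engineers can write point-in-time quality checks that can be set to "warn" or "block"

  – Table owners are alerted through the DAN notification system

– On the dbt side

  – Complex metric calculations are standardized, using macros to remove discrepancies across the business

  – An internal suite of custom dbt CLI commands was created, for example to auto-generate the YAML files a model requires

  – A robust CI/CD process was implemented

  – dbt packages were used and contributed to in order to enable new features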


Article URL: https://discord.com/blog/how-discord-uses-open-source-tools-for-scalable-data-orchestration-transformation

Source: discord.com

Author: Zach Bluhm

Published: 2024/7/12 0:00

Language: English

Word count: 1951

Estimated reading time: 8 minutes

Score: 85

Tags: data orchestration, open source, Discord, Dagster, dbt


Original article content follows


How users are benefiting from the new system

Once the core pieces were in place, it didn’t take long for our data teams to benefit from the new tools at their disposal. For one, answering the very simple, yet important, question “Why isn’t my data asset updating?” is now a self-serve, at-a-glance feature:

The automation tab indicates exactly why or why not a given asset is being queued up for materialization

Empowered by Dagster’s asset definition pages, asset owners now seamlessly manage the lifecycles of their data assets, from backfilling to incremental updates. This interface provides comprehensive, real-time insights into every activity related to an asset. One asset owner has been quoted as enjoying the UI so much that they feel confident doing things like “launching backfills from my phone.” (We don’t recommend trying this at home.)

The lineage view allows anyone to quickly determine table landing times and identify blockages that prevent downstream execution.
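To make the idea of "identifying blockages" concrete, here is a minimal sketch of the kind of graph walk a lineage view supports. It does not use Dagster's API; the `Asset` class, the `materialized` flag, and `find_blockers` are hypothetical names chosen for illustration, and the asset graph is invented.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """Hypothetical stand-in for a data asset in a lineage graph."""
    name: str
    upstream: list[str] = field(default_factory=list)
    materialized: bool = False  # True once the current partition has landed


def find_blockers(assets: dict[str, Asset], target: str) -> list[str]:
    """Return the upstream assets where the chain feeding `target` is stuck.

    A blocker is an unmaterialized ancestor whose own upstream assets have
    all landed, so it is the first point that needs attention.
    """
    blockers: list[str] = []
    seen: set[str] = set()
    stack = list(assets[target].upstream)
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        asset = assets[name]
        if asset.materialized:
            continue  # already landed, nothing to trace further
        pending = [u for u in asset.upstream if not assets[u].materialized]
        if pending:
            stack.extend(pending)  # the problem lies further upstream
        else:
            blockers.append(name)  # inputs are ready but this asset is not
    return sorted(blockers)


# Invented example graph: raw_users has not landed, which holds up user_dim
# and, through it, daily_dashboard.
example_assets = {
    "raw_events": Asset("raw_events", [], True),
    "raw_users": Asset("raw_users", [], False),
    "sessions": Asset("sessions", ["raw_events"], True),
    "user_dim": Asset("user_dim", ["raw_users"], False),
    "daily_dashboard": Asset("daily_dashboard", ["sessions", "user_dim"], False),
}
```

Calling `find_blockers(example_assets, "daily_dashboard")` points at `raw_users` rather than `user_dim`, because `user_dim` is only waiting on its input.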

Data quality is at the heart of this new system — our engineers can now write point-in-time quality checks that can be tuned to “warn,” or even “block,” downstream runs on failure. This allows us to quickly catch, alert, and fix issues before impacting critical downstream use cases like company dashboards. We alert our table owners by utilizing a notification system called DAN (Data Asset Notifications) that informs users of table failures via a Discord app. (Who uses email nowadays?)
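The "warn" versus "block" distinction can be sketched in a few lines of plain Python. This is an illustrative model only, not Discord's implementation or Dagster's check API: `Severity`, `QualityCheck`, `run_checks`, and the `notify` callback (standing in for something like DAN) are all assumed names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Severity(Enum):
    WARN = "warn"    # notify the owner, let downstream runs continue
    BLOCK = "block"  # notify the owner and hold downstream runs


@dataclass
class QualityCheck:
    """Hypothetical point-in-time check over the rows of one table."""
    name: str
    severity: Severity
    predicate: Callable[[list[dict]], bool]  # returns True when the check passes


def run_checks(
    table: str,
    owner: str,
    rows: list[dict],
    checks: list[QualityCheck],
    notify: Callable[[str, str], None],
) -> bool:
    """Run every check and return True if downstream runs may proceed."""
    may_proceed = True
    for check in checks:
        if check.predicate(rows):
            continue
        # Every failure alerts the owner; only BLOCK failures stop downstream.
        notify(owner, f"[{check.severity.value}] {table}: {check.name} failed")
        if check.severity is Severity.BLOCK:
            may_proceed = False
    return may_proceed


# Invented example checks for a hypothetical "payments" table.
example_checks = [
    QualityCheck("non_empty", Severity.BLOCK, lambda rows: len(rows) > 0),
    QualityCheck(
        "user_id_not_null",
        Severity.WARN,
        lambda rows: all(r["user_id"] is not None for r in rows),
    ),
]
```

With rows that contain a null `user_id`, `run_checks` sends one warning and still returns `True`; with an empty table, the blocking check fails and it returns `False`, which an orchestrator could use to skip downstream assets.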

On the dbt side, we’ve been able to standardize complex metric calculations using macros, which has played a key role in removing discrepancies across the business and has streamlined how data practitioners transform and consume data.

To boost developer productivity, we created an internal suite of custom dbt CLI commands. One of these, dubbed autogen-schema, automatically generates the boilerplate dbt YAML files a new model requires, which are often verbose to write from scratch.
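The article does not show how autogen-schema works, so the following is only a rough sketch of the general idea: given a model name and its column names, emit the standard dbt properties YAML skeleton. The function name `render_model_yaml` and the example model are hypothetical, and a real command would presumably discover the columns from the warehouse or the compiled model rather than take them as arguments.

```python
def render_model_yaml(model_name: str, columns: list[str]) -> str:
    """Build a minimal dbt properties file for one model as a string.

    Descriptions are left empty on purpose so the model owner fills them in.
    """
    lines = [
        "version: 2",
        "",
        "models:",
        f"  - name: {model_name}",
        '    description: ""',
        "    columns:",
    ]
    for column in columns:
        lines.append(f"      - name: {column}")
        lines.append('        description: ""')
    return "\n".join(lines) + "\n"


# Invented example model with two columns.
example_yaml = render_model_yaml("daily_active_users", ["day", "user_count"])
```

Writing `example_yaml` to a `.yml` file next to the model's SQL gives the author a starting point with one entry per column, which is the boilerplate that is tedious to type by hand.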

We also implemented a robust CI/CD process to prevent disruptive changes across table logic, macros, dbt tests, and more. Our advanced dbt table configurations and custom materializations are tailored to meet business demands while effortlessly integrating with our Dagster orchestration system and maintaining parity with the previous Derived system.

Last but not least, we were able to leverage and contribute to the wide range of dbt packages, such as great-expectations and elementary, to quickly enable new features and functionality in our project.