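Reading guide summary
1. Keywords: Datastream, append-only mode, CDC, BigQuery, Operational Databases
2. Summary: This article introduces Datastream's append-only mode, which addresses the difficulty of tracking the history of data changes under traditional CDC replication. It suits a range of scenarios, offers cost efficiency, data accuracy, and real-time insights, and is easy to enable.
3. Main points:
- Introduction to Datastream append-only mode
  - Organizations need an up-to-date "source of truth" as well as a history of data changes, and commonly use CDC to replicate data to a cloud data warehouse
  - Datastream has introduced append-only mode, which simplifies replicating changes from operational databases to BigQuery
- Understanding append-only mode
  - Traditional CDC replication overwrites the destination record; append-only mode keeps every change as a new row that includes metadata such as the change type
- Use cases and benefits
  - Suited to auditing and compliance, trend analysis, Customer 360, and similar scenarios
  - Offers cost efficiency, data accuracy, and real-time insights
- Example
  - Uses a table of customer information to show how append-only mode records changes
- How to use append-only mode
  - Can be enabled when creating a stream in the user interface or via the API; BigQuery tables with the metadata columns are generated automatically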
Article URL: https://cloud.google.com/blog/products/data-analytics/understanding-datastream-append-only-mode/
Source: cloud.google.com
Authors: Etai Margolin, Yaara Gazit
Published: 2024/6/21 0:00
Tags: data analytics, stream processing, Datastream, BigQuery, change data capture
The original article follows.
Organizations often need both an up-to-date “source of truth” and the ability to track the complete history of changes to their data. When this data lives in operational databases such as MySQL or PostgreSQL, a common approach is to use change data capture (CDC) to replicate the changes to a cloud data warehouse such as BigQuery.
Datastream, Google Cloud’s serverless CDC service, recently introduced a new feature called append-only mode that streamlines the process of replicating changes from your operational databases to BigQuery. This feature offers an efficient and cost-effective way to maintain historical records and track changes to operational data over time.
Understanding append-only mode
In traditional CDC-based replication, when a record in your source database is updated or deleted, the corresponding record in the destination is overwritten, making it difficult to track the history of changes. Append-only mode addresses this challenge by preserving every change as a new row in your target BigQuery table. Each row includes metadata that captures the type of change (insert, update, or delete), a unique identifier, timestamp, and other relevant information, which can be used to order and filter the data as needed.
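The difference between the two behaviors can be sketched in a few lines of Python. This is a minimal in-memory illustration, not Datastream's implementation: the `Change` class, the two `apply_*` functions, and the metadata field names (`change_type`, `source_timestamp`, `uuid`) are hypothetical names chosen to mirror the metadata described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Change:
    change_type: str        # "INSERT", "UPDATE", or "DELETE"
    key: int                # primary key of the source row
    value: Optional[str]    # new value; None for deletes
    source_timestamp: int   # illustrative logical timestamp


def apply_merge(changes):
    """Overwrite-style replication: keep only the latest state per key."""
    table = {}
    for c in changes:
        if c.change_type == "DELETE":
            table.pop(c.key, None)
        else:
            table[c.key] = c.value
    return table


def apply_append_only(changes):
    """Append-only replication: every change becomes its own row."""
    return [
        {
            "key": c.key,
            "value": c.value,
            "change_type": c.change_type,
            "source_timestamp": c.source_timestamp,
            "uuid": f"chg-{i}",  # stand-in for a unique change identifier
        }
        for i, c in enumerate(changes)
    ]


changes = [
    Change("INSERT", 1, "a", 100),
    Change("UPDATE", 1, "b", 200),
    Change("INSERT", 2, "x", 250),
    Change("DELETE", 1, None, 300),
]

merged = apply_merge(changes)
appended = apply_append_only(changes)
print(merged)          # the history of key 1 is gone
print(len(appended))   # one row per change
```

In the merged result only the current state survives, while the append-only result still holds the insert, update, and delete for key 1.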
Use cases and benefits
Append-only mode is particularly beneficial in scenarios where you need to maintain a historical record of changes. Some common use cases include:
- Auditing and compliance: Track every modification to data for regulatory compliance or internal audits.
- Trend analysis: Analyze historical data to identify patterns, trends, and anomalies over time.
- Customer 360: Maintain a comprehensive view of customer interactions and preferences by tracking changes in customer data.
- Analyzing embedding drift: With a historical record of embeddings, you can analyze how embeddings have drifted and assess the impact on your model’s performance.
- Time travel: Query your data warehouse as it was at a specific point in time, enabling historical analysis and comparisons.
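The time travel use case can be illustrated with a small sketch that rebuilds the state of a table at a given moment from append-only rows. It assumes each row carries a change type and a timestamp; the row layout, the field names, and the `state_as_of` helper are hypothetical, and a real table would likely need a tie-breaker for changes that share a timestamp.

```python
def state_as_of(rows, as_of):
    """Rebuild {id: email} as it looked at time `as_of` from append-only rows."""
    latest = {}
    for row in sorted(rows, key=lambda r: r["source_timestamp"]):
        if row["source_timestamp"] > as_of:
            break  # later changes are not visible yet
        latest[row["id"]] = row  # the newest change per id wins
    # A row whose latest visible change is a delete did not exist at `as_of`.
    return {
        row_id: row["email"]
        for row_id, row in latest.items()
        if row["change_type"] != "DELETE"
    }


rows = [
    {"id": 1, "email": "a@example.com", "change_type": "INSERT", "source_timestamp": 100},
    {"id": 2, "email": "b@example.com", "change_type": "INSERT", "source_timestamp": 150},
    {"id": 1, "email": "a.new@example.com", "change_type": "UPDATE", "source_timestamp": 200},
    {"id": 2, "email": None, "change_type": "DELETE", "source_timestamp": 300},
]

print(state_as_of(rows, 120))
print(state_as_of(rows, 250))
print(state_as_of(rows, 300))
```

The same idea applies to auditing: because no change is overwritten, any past state can be recomputed by replaying rows up to the chosen timestamp.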
Example
Suppose you store customer information in a MySQL table, and MySQL must remain your primary source of truth. Your analytics team needs to track changes to customer records to analyze behavior and preferences. With append-only mode enabled, every change to this table (insert, update, or delete) is recorded as a new row in the corresponding BigQuery table, which makes it simpler for the analytics team to retrieve the data it needs for analysis.
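As a rough illustration of what the analytics team could do with such rows, the sketch below works on a handful of made-up customer changes. The `plan` column, the sample data, and both helper functions are assumptions for illustration only; in practice this analysis would more likely be a query against the BigQuery table.

```python
from collections import Counter, defaultdict

# Hypothetical append-only rows for a customers table.
change_rows = [
    {"customer_id": 7, "plan": "free", "change_type": "INSERT",
     "source_timestamp": "2024-01-05T09:00:00"},
    {"customer_id": 7, "plan": "pro", "change_type": "UPDATE",
     "source_timestamp": "2024-03-01T12:30:00"},
    {"customer_id": 9, "plan": "pro", "change_type": "INSERT",
     "source_timestamp": "2024-03-02T08:00:00"},
    {"customer_id": 7, "plan": "free", "change_type": "UPDATE",
     "source_timestamp": "2024-05-20T17:45:00"},
    {"customer_id": 9, "plan": None, "change_type": "DELETE",
     "source_timestamp": "2024-06-01T10:00:00"},
]


def plan_history(rows):
    """Return the ordered list of plans each customer has held."""
    history = defaultdict(list)
    # ISO 8601 strings in the same format sort chronologically.
    for row in sorted(rows, key=lambda r: r["source_timestamp"]):
        if row["change_type"] != "DELETE":
            history[row["customer_id"]].append(row["plan"])
    return dict(history)


def change_counts(rows):
    """Count how many changes of each type were recorded."""
    return Counter(row["change_type"] for row in rows)


print(plan_history(change_rows))
print(change_counts(change_rows))
```

With overwrite-style replication, customer 7 would simply appear as a free-plan customer and customer 9 would be absent; the upgrade, the later downgrade, and the deleted account would not be visible.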
Benefits of append-only mode
- Cost efficiency: Reduces processing costs by only appending new rows instead of applying complex merge operations with existing data.
- Improved data accuracy: Ensures a complete and accurate history of changes, minimizing the risk of data loss.
- Real-time insights: Enables real-time analysis of changes as they occur, facilitating faster decision-making.
How to use append-only mode
You can easily enable append-only mode while creating the stream in the user interface or via the API. Datastream automatically generates BigQuery tables with the required metadata columns to allow you to monitor modifications.
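For the API route, the write mode is chosen in the stream's BigQuery destination settings. The sketch below only builds that fragment of a request body as a Python dictionary. The field names (`bigqueryDestinationConfig`, `singleTargetDataset`, `datasetId`, `appendOnly`, `merge`) are assumptions based on the Datastream REST API and should be verified against the current API reference; the helper function and the dataset value are hypothetical.

```python
import json


def bigquery_destination_config(dataset_id: str, append_only: bool = True) -> dict:
    """Build the BigQuery destination fragment of a stream definition.

    Field names are assumptions about the Datastream REST API; check the
    official reference before sending this to the service.
    """
    # Exactly one write mode is set, as an empty object.
    write_mode = "appendOnly" if append_only else "merge"
    return {
        "bigqueryDestinationConfig": {
            "singleTargetDataset": {"datasetId": dataset_id},
            write_mode: {},
        }
    }


# Placeholder dataset identifier, for illustration only.
config = bigquery_destination_config("my-project:my_dataset")
print(json.dumps(config, indent=2))
```

A complete stream definition also needs a source configuration and connection profiles, which are omitted here.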