包阅导读总结
1. 关键词:Photon Engine、Feature Engineering、Databricks、Machine Learning、Speed Up
2. 总结:Photon Engine 能在 Databricks Machine Learning Runtime 中启用,大幅提升 Spark 作业和特征工程的速度。通过更快的 ETL 和特征工程,为机器学习模型准备数据,新的点即时连接在不同规模表的处理上均有显著加速。
3. 主要内容:
– Photon Engine 简介
– 是高性能查询引擎,用 C++实现,能让 Spark SQL 和 Spark DataFrame 运行更快,降低工作负载成本。
– 对机器学习工作负载的帮助
– 加速 ETL,早期客户的 SQL 查询平均提速 2 – 4 倍。
– 加速特征工程,点即时连接在启用 Photon 时更快。
– 不同规模特征表的点即时连接加速效果不同。
– 在 Databricks Machine Learning Runtime 中的使用
– 从 15.2 及以上版本可创建启用 Photon 的 ML Runtime 集群,15.4 LTS 及以上版本有原生 Spark 版本的点即时连接。
思维导图:
文章地址:https://www.databricks.com/blog/accelerate-feature-engineering-photon
文章来源:databricks.com
作者:Databricks
发布时间:2024/8/2 19:09
语言:英文
总字数:569字
预计阅读时间:3分钟
评分:91分
标签:Photon 引擎,特征工程,Databricks,机器学习,ETL
以下为原文内容
本内容来源于用户推荐转载,旨在分享知识与观点,如有侵权请联系删除 联系邮箱 media@ilingban.com
Training a high-quality machine learning model requires careful data and feature preparation. To fully utilize raw data stored as tables in Databricks, running ETL pipelines and feature engineering may be required to transform the raw data into helpful feature tables. If your table is large, this step could be very time-consuming. We are excited to announce that the Photon Engine can now be enabled in Databricks Machine Learning Runtime, capable of speeding up spark jobs and feature engineering workloads by 2x or more.
“By enabling Photon and using a new PIT join, the time required to generate the training dataset using our Feature Store was reduced by more than 20 times.” – Sem Sinchenko, Advanced Analytics Expert Data Engineer, Raiffeisen Bank International AG
What is Photon?
The Photon Engine is a high-performance query engine that can run Spark SQL and Spark DataFrame faster, reducing the total cost per workload. Under the hood, Photon is implemented with C++, and specific Spark execution units are replaced with Photon’s native engine implementation.
How does Photon help machine learning workloads?
Now that Photon can be enabled in Databricks Machine Learning Runtime, when does it make sense to integrate a Photon-enabled cluster for machine learning development workflows? Here are some of the main considerations:
- Faster ETL: Photon speeds up Spark SQL and Spark DataFrame workloads for data preparation. Early customers of Photon have observed an averagespeedup of2x-4xfor their SQL queries.
- Faster feature engineering: When using the Databricks Feature Engineering Python API for time series feature tables, point-in-time join becomes faster when Photon is enabled.
Faster feature engineering with Photon
The Databricks Feature Engineering library has implemented a new version of point-in-time join for time series data. The new implementation, which was inspired by a suggestion from Semyon Sinchenko of Databricks customer Raiffeisen Bank International, uses native Spark instead of the Tempo library, making it more scalable and robust than the previous version. Moreover, the native Spark implementation hugely benefits from the Photon Engine. The larger the tables, the more improvements Photon can bring.
- When joining a feature table of 10M rows (10k unique IDs, with 1000 timestamps per ID) with a label table (100k unique IDs, with 100 timestamps per ID), Photon speeds up the point-in-time join by 2.0x
- When joining a feature table of 100M rows (100k unique IDs), Photon speeds up the point-in-time join by 2.1x
- When joining a feature table of 1B rows (1M unique IDs), Photon speeds up the point-in-time join by 2.4x
The figure above compares the run time of joining feature tables of 3 different sizes with the same label table. Each experiment was performed on a Databricks AWS cluster with an r6id.xlarge instance type and one worker node. The setup was repeated five times to calculate the average run time.
Select Photon in Databricks Machine Learning Runtime cluster
The query performance of Photon and the pre-built AI infrastructure of Databricks ML Runtime make it faster and easier to build machine learning models. Starting from Databricks Machine Learning Runtime 15.2 and above, users can create an ML Runtime cluster with Photon by selecting “Use Photon Acceleration”. Meanwhile, the native Spark version of point-in-time join comes with ML Runtime 15.4 LTS and above.
To learn more about Photon and feature engineering with Databricks, consult the following documentation pages for more information.