Google introduces the BigQuery TreeAH index, optimized for large query batches: AI reading summary

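包阅 reading guide summary

1. Keywords: BigQuery, TreeAH index, vector search, performance optimization, approximate nearest neighbor algorithms

2. Summary: BigQuery keeps adding capabilities to become the AI-ready data platform for the Gemini era. Vector search and related features launched earlier this year, and Google has now announced the preview of the TreeAH vector index. In certain situations it can significantly reduce latency and cost compared with the IVF index. The article compares the two indexes' architecture and performance.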

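3. Main points:

– BigQuery aims to be the AI-ready data platform and keeps adding new capabilities

– Vector search launched earlier this year, followed by features such as stored columns and pre-filtering

– The TreeAH vector index is announced in preview

– It brings core pieces of Google's approximate nearest neighbor algorithms to BigQuery

– It has advantages over the IVF index in certain situations

– IVF vs. TreeAH index comparison

– IVF uses a scalable k-means clustering algorithm to partition the vector data

– TreeAH is based on the ScaNN algorithm and uses asymmetric hashing with compressed embeddings for faster, more efficient search

– TreeAH performance

– The engineering team benchmarked different table configurations and query batch sizes, and the results favor TreeAH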

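Article URL: https://cloud.google.com/blog/products/data-analytics/introducing-scann-in-bigquery-vector-search-for-large-query-batches/

Source: cloud.google.com

Author: Francis Lan

Published: 2024/8/20 0:00

Language: English

Total word count: 647 words

Estimated reading time: 3 minutes

Rating: 90 points

Tags: BigQuery, vector index, TreeAH, Google ScaNN, approximate nearest neighbor (ANN)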

Original article

We continue to add more capabilities to BigQuery to make it the AI-ready data platform for the Gemini era. Earlier this year, we introduced vector search, which enables vector similarity search on BigQuery data. Since then, we have also added several features such as stored columns and pre-filtering. Already, the scale, performance, and ease of use offered through BigQuery’s vector search and AI capabilities are empowering customers to build pipelines and applications, ranging from semantic search to LLM-based retrieval-augmented generation (RAG).

Today, we are announcing the preview of the TreeAH vector index, which brings core pieces from Google’s research and innovation in approximate nearest neighbor algorithms to BigQuery. This new index type uses the same underlying technology that powers some of Google’s most popular services and delivers significant latency and cost reductions in certain situations compared to the first index we implemented in BigQuery, the inverted file index (IVF). How significant? Read on to learn about architectural differences between the two, performance results, and when and how to use TreeAH rather than IVF.

IVF vs. TreeAH indexes: a comparison

Using a vector index allows BigQuery to optimize the lookups and distance computations required to identify closely matching embeddings. Both IVF and TreeAH indexes allow BigQuery to perform approximate nearest neighbor (ANN) search instead of exact nearest neighbor search, which trades some accuracy for lower query latency and cost.
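As a rough illustration of that trade-off, the following sketch in plain Python (not BigQuery code) shows exact nearest neighbor search by brute force, plus a recall measure for judging how much of the exact answer an approximate search recovers. The names `exact_top_k` and `recall_at_k` and the toy vectors are assumptions made for this example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_top_k(query, vectors, k):
    """Exact search: measure the distance to every vector, keep the k closest ids."""
    ranked = sorted(range(len(vectors)), key=lambda i: euclidean(query, vectors[i]))
    return ranked[:k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that an approximate result also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Toy data (illustrative only).
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
exact = exact_top_k([0.9, 0.1], vectors, k=2)
```

Here `exact` is `[1, 0]`, and a hypothetical approximate result of `[1, 3]` would score a recall of 0.5 against it. Exact search touches every row, which is the cost an ANN index tries to avoid.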

BigQuery’s first vector index, IVF, uses a scalable k-means clustering algorithm to partition the vector data into clusters. When you use the VECTOR_SEARCH function to search the vector data, it finds the clusters that are closest to the query vector and only ranks the vector data from those clusters; this reduces the number of distance calculations by a large factor.
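The partition-then-probe idea can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not BigQuery's implementation: the helper names (`build_ivf`, `ivf_search`), the `n_probe` parameter, the deterministic centroid initialization, and the sample data are all hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda c: sq_dist(point, centroids[c]))

def kmeans(vectors, k, iters=10):
    """Plain Lloyd's k-means. Evenly spaced picks stand in for random
    initialization so the example is deterministic (an assumption)."""
    centroids = [list(vectors[i * len(vectors) // k]) for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(v, centroids)].append(v)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def build_ivf(vectors, k):
    """Build the inverted lists: cluster index -> ids of the vectors in it."""
    centroids = kmeans(vectors, k)
    lists = [[] for _ in range(k)]
    for i, v in enumerate(vectors):
        lists[nearest(v, centroids)].append(i)
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, top_k, n_probe=1):
    """Rank only the vectors stored in the n_probe clusters closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in order[:n_probe] for i in lists[c]]
    return sorted(candidates, key=lambda i: sq_dist(query, vectors[i]))[:top_k]

# Toy data: two well-separated groups of 2-D vectors.
vectors = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [9.0, 9.0], [9.1, 8.9], [8.8, 9.2]]
centroids, lists = build_ivf(vectors, k=2)
result = ivf_search([9.02, 8.98], vectors, centroids, lists, top_k=2)
```

With this data the query is compared against two centroids and then only the three vectors in the nearer cluster, so `result` is `[3, 4]` without ever scoring ids 0 to 2. Raising `n_probe` scans more clusters, which costs more distance calculations but lowers the chance of missing a true neighbor.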

The new TreeAH index is based on Google’s ScaNN algorithm, which is used in a multitude of Google services for similarity search. The main difference from the IVF index is the use of asymmetric hashing (the “AH” in TreeAH), which uses product quantization to compress embeddings. Coupled with a CPU-optimized distance computation algorithm, vector search using TreeAH can be orders of magnitude faster and more cost-efficient than IVF. Index generation can also be 10x faster and cheaper and have a smaller memory footprint, as only the compressed embeddings are stored.
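To show what product quantization with asymmetric distance looks like in principle, here is a minimal plain-Python sketch. It is not ScaNN or BigQuery code: the fixed `CODEBOOKS`, the sub-vector size, and all function names are hypothetical, and a real system would learn its codebooks from the data rather than hard-code them. "Asymmetric" here means that stored vectors are compressed to codeword ids while the query stays uncompressed.

```python
# Hypothetical codebooks: 2 subspaces, each with 4 two-dimensional codewords.
# Real systems learn these from the data (for example with k-means per subspace).
CODEBOOKS = [
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]],
]
SUB_DIM = 2

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split(vector):
    """Cut a vector into consecutive SUB_DIM-sized sub-vectors."""
    return [vector[i:i + SUB_DIM] for i in range(0, len(vector), SUB_DIM)]

def encode(vector):
    """Compress a stored vector to one codeword id per subspace."""
    return [min(range(len(book)), key=lambda c: sq_dist(sub, book[c]))
            for sub, book in zip(split(vector), CODEBOOKS)]

def distance_table(query):
    """Per subspace, squared distance from the raw query to every codeword."""
    return [[sq_dist(sub, word) for word in book]
            for sub, book in zip(split(query), CODEBOOKS)]

def asymmetric_dist(table, code):
    """Approximate squared distance: one table lookup per subspace, then a sum."""
    return sum(table[m][c] for m, c in enumerate(code))

database = [[0.1, 0.9, 1.0, 0.1], [0.9, 0.1, 0.0, 0.2], [0.2, 0.8, 0.9, 0.9]]
codes = [encode(v) for v in database]          # 4 floats -> 2 small integers each
table = distance_table([0.0, 1.0, 1.0, 0.0])   # the query is not quantized
ranked = sorted(range(len(codes)), key=lambda i: asymmetric_dist(table, codes[i]))
```

In this sketch `codes` is `[[1, 2], [2, 0], [1, 3]]` and `ranked` is `[0, 2, 1]`. The distance table is built once per query, after which scoring each stored vector takes only a few lookups and additions, and only the compact codes need to be kept in the index.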

TreeAH performance

Our engineering team conducted benchmarks across various table configurations and query batch sizes to compare TreeAH with IVF. Here are the results:
