Column Granularity Indexing in BigQuery Alters Query Speed

BigQuery Introduces Column-Granularity Indexing to Transform Search Query Efficiency and Cost

Google Cloud has announced indexing with column granularity for BigQuery, now in public preview. The feature is a major enhancement to BigQuery's search indexing and is intended to improve both query performance and cost efficiency.

BigQuery organizes table data into physical files and stores it in a columnar format, with each column held in its own file block. The default search index operates at the file level: it maps each data token to the files that contain it. At query time, BigQuery scans only the files the index marks as relevant, which narrows the search space efficiently and works especially well when search tokens are selective and appear in only a small number of files.

Problems arise, however, when search tokens are selective within a particular column but common in other columns, so that they appear in most files and the file-level index loses its usefulness. Consider, for example, searching a table with Title and Content columns for articles about “Google Cloud Logging”. The tokens “google,” “cloud,” and “logging” may each appear in every file, even though the combination of all three is rare, and rarer still in the Title column. Because each individual token is present in every file, a query that searches only the Title column must still scan every file, even with the default file-level index on both columns.
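
The limitation can be illustrated with a small simulation. The following is a minimal Python sketch, not BigQuery's actual implementation: the four files, their contents, and the helper names (tokenize, build_file_level_index, candidate_files) are all hypothetical, and tokenization is simplified to lowercasing and splitting on whitespace.

```python
from collections import defaultdict

# Hypothetical table layout: four physical files, each with Title and Content text.
FILES = {
    "file1": {"Title": "Google Cloud Logging basics", "Content": "How to route logs"},
    "file2": {"Title": "Cloud storage tips", "Content": "Google tools for logging and cloud backups"},
    "file3": {"Title": "Logging in Python", "Content": "Shipping logs to Google Cloud"},
    "file4": {"Title": "Google search trends", "Content": "Cloud logging costs explained"},
}


def tokenize(text):
    """Simplified tokenizer: lowercase, split on whitespace."""
    return text.lower().split()


def build_file_level_index(files):
    """Map each token to the set of files containing it in ANY indexed column."""
    index = defaultdict(set)
    for file_name, columns in files.items():
        for text in columns.values():
            for token in tokenize(text):
                index[token].add(file_name)
    return index


def candidate_files(index, query):
    """Return the files that contain every query token somewhere in the file."""
    token_sets = [index.get(token, set()) for token in tokenize(query)]
    if not token_sets:
        return []
    return sorted(set.intersection(*token_sets))


file_index = build_file_level_index(FILES)
to_scan = candidate_files(file_index, "Google Cloud Logging")
print(to_scan)  # all four files remain candidates
```

In this made-up data only file1 has a matching Title, yet every file stays a candidate because each token occurs somewhere in each file. The file-level index has no way to tell a Title hit from a Content hit.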

This is where indexing with column granularity helps. The feature enriches the index with column-specific information, so BigQuery can pinpoint relevant data within columns even when the search tokens occur frequently across the table's files.

When a search index is created with column granularity by specifying OPTIONS (default_index_column_granularity = 'COLUMN'), as in the TechArticles example, the index also stores the column associated with each token. When a query searches for “Google Cloud Logging” in the Title column only, BigQuery can use that column information to identify which files contain the tokens “google”, “cloud”, and “logging” in the Title column. In this example, the lookup shows that only file1 has all three tokens in the Title column, so BigQuery scans one file rather than all four.
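
The lookup described above can be sketched in Python. This is an illustration only, not BigQuery's internal index format: the file contents and the helper names (tokenize, build_column_index, candidate_files) are hypothetical, and the index is modeled simply as a mapping from a (column, token) pair to the files that contain that token in that column.

```python
from collections import defaultdict

# Hypothetical table layout: four physical files, each with Title and Content text.
FILES = {
    "file1": {"Title": "Google Cloud Logging basics", "Content": "How to route logs"},
    "file2": {"Title": "Cloud storage tips", "Content": "Google tools for logging and cloud backups"},
    "file3": {"Title": "Logging in Python", "Content": "Shipping logs to Google Cloud"},
    "file4": {"Title": "Google search trends", "Content": "Cloud logging costs explained"},
}


def tokenize(text):
    """Simplified tokenizer: lowercase, split on whitespace."""
    return text.lower().split()


def build_column_index(files):
    """Map each (column, token) pair to the files holding that token in that column."""
    index = defaultdict(set)
    for file_name, columns in files.items():
        for column, text in columns.items():
            for token in tokenize(text):
                index[(column, token)].add(file_name)
    return index


def candidate_files(index, column, query):
    """Return the files that contain every query token in the given column."""
    token_sets = [index.get((column, token), set()) for token in tokenize(query)]
    if not token_sets:
        return []
    return sorted(set.intersection(*token_sets))


column_index = build_column_index(FILES)
to_scan = candidate_files(column_index, "Title", "Google Cloud Logging")
print(to_scan)  # ['file1']
```

Because the key includes the column, tokens that appear only in Content no longer keep a file in the candidate set for a Title search. The price is a larger index, since the same token is recorded once per column in which it occurs.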

Two major advantages result immediately from this capability:

Improved Query Performance: Locating relevant data precisely within columns shortens execution times considerably, especially for queries whose search tokens are selective in particular columns.
Increased Cost Efficiency: More precise index pruning reduces bytes processed and/or slot time, which directly lowers costs.

These advantages are most pronounced when queries frequently filter or aggregate on particular columns, or when search tokens are common overall but selective within particular columns. In benchmarks on a 1 TB table of logging data, the default index already reduced the search space, but column-granularity indexing delivered further notable gains in execution time, bytes processed, and slot usage.

Indexing with column granularity is a key advance for query performance and cost efficiency. To get the best results, users are advised to:

Identify high-impact columns by analyzing query patterns to determine which columns would benefit most from column-granularity indexing.
Monitor performance closely and adjust the indexing strategy as needed.
Bear in mind that indexing and storage costs may increase.

To start using this capability, users only need to enable column granularity when creating a search index; the CREATE SEARCH INDEX DDL documentation has the details. The feature is a valuable tool for optimizing search queries in BigQuery, especially for very large datasets and complex data structures where specific information must be retrieved quickly from particular columns.
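
As a rough illustration of enabling the feature, the Python sketch below assembles the DDL statement and a matching search query as strings. The OPTIONS clause is the one quoted earlier in this article, and SEARCH is BigQuery's search function. The dataset, table, and index names and the helper build_search_index_ddl are hypothetical, and the helper does no identifier validation or quoting. The CREATE SEARCH INDEX DDL documentation remains the authoritative source for the syntax.

```python
def build_search_index_ddl(index_name, table, columns, column_granularity=True):
    """Return a CREATE SEARCH INDEX statement as a string (illustrative helper)."""
    ddl = f"CREATE SEARCH INDEX {index_name}\nON {table}({', '.join(columns)})"
    if column_granularity:
        # Option named in the article; asks the index to keep per-column token info.
        ddl += "\nOPTIONS (default_index_column_granularity = 'COLUMN')"
    return ddl + ";"


# Hypothetical names, used for illustration only.
ddl = build_search_index_ddl(
    "tech_articles_index", "my_dataset.TechArticles", ["Title", "Content"]
)

# A search restricted to the Title column, which is where column granularity helps.
search_sql = (
    "SELECT * FROM my_dataset.TechArticles "
    "WHERE SEARCH(Title, 'Google Cloud Logging');"
)

print(ddl)
print(search_sql)
```

The resulting statements could be run in the BigQuery console or submitted through a client library. Building them as plain strings here keeps the example runnable without credentials.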
