Lightning Engine: A New Era for Apache Spark Speed

Apache Spark helps organizations analyze massive data sets for ETL, data science, machine learning, and more. However, performance and cost efficiency at scale can be a problem. Users frequently run into bottlenecks in resource use, data input/output (I/O), and query execution speed, which lengthen processing times and raise infrastructure costs.

Google Cloud understands these difficulties well. It is launching Lightning Engine (in preview), its newest and most powerful Spark engine to date, designed to unleash the full potential of your lakehouse and provide best-in-class performance for Spark.

What is Lightning Engine?

Lightning Engine is a multi-layer optimization engine. In addition to conventional techniques such as query and execution optimizations, it applies targeted optimizations in the file-system layer and the data-access connectors.

For example, compared with open source Spark running on comparable infrastructure, Lightning Engine improves Spark query performance by 3.6x on TPC-H-like workloads at a dataset size of 10 TB.
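To put a speedup factor in terms of runtime, the short Python sketch below converts it into time saved. The 60-minute baseline is a hypothetical figure chosen for illustration and does not come from any benchmark; only the 3.6x factor is taken from the text.

```python
def time_saved_fraction(speedup: float) -> float:
    """Fraction of the original runtime saved for a given speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


baseline_minutes = 60.0  # hypothetical baseline runtime, for illustration only
speedup = 3.6            # factor quoted for TPC-H-like workloads at 10 TB

new_minutes = baseline_minutes / speedup
print(f"{new_minutes:.1f} min")                # 16.7 min
print(f"{time_saved_fraction(speedup):.0%}")   # 72%
```

Under these assumptions, a job that took an hour would finish in under 17 minutes, which is where any infrastructure savings would come from.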

Figure: Lightning Engine’s key enhancements (image credit: Google Cloud)

The image above shows some of Lightning Engine’s main improvements, which include:

Query optimizer: Drawing on Google’s experience with engines such as F1 and Procella, Lightning Engine includes a substantially improved Spark optimizer. Its features include enhanced adaptive query execution for join removal and exchange reuse, subquery fusion to consolidate scans, advanced inferred filters for semi-join pushdowns, dynamic in-filter generation for effective row-group pruning in Iceberg and Delta tables, Bloom filter optimization based on listing call statistics, and more. Combined, these result in significant scan and shuffle savings.
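To make the idea of dynamic in-filter generation for row-group pruning concrete, here is a minimal Python sketch. It is not Lightning Engine’s implementation: it simply assumes that each row group carries min/max statistics for a join key and that the small side of a join has produced a set of keys at run time. `RowGroup` and `prune_row_groups` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class RowGroup:
    """Hypothetical stand-in for a row group and its key statistics."""
    name: str
    min_key: int
    max_key: int


def prune_row_groups(row_groups: Iterable[RowGroup],
                     in_values: Iterable[int]) -> List[RowGroup]:
    """Keep only row groups whose [min_key, max_key] range could contain
    at least one value from the IN-list built at run time."""
    values = set(in_values)
    return [
        rg for rg in row_groups
        if any(rg.min_key <= v <= rg.max_key for v in values)
    ]


# Keys collected at run time from the small (dimension) side of a join.
dimension_keys = [105, 110, 980]

# Min/max statistics for four row groups of the large (fact) table.
fact_row_groups = [
    RowGroup("rg0", 0, 99),
    RowGroup("rg1", 100, 199),
    RowGroup("rg2", 200, 899),
    RowGroup("rg3", 900, 999),
]

to_scan = prune_row_groups(fact_row_groups, dimension_keys)
print([rg.name for rg in to_scan])  # ['rg1', 'rg3']
```

In this toy case, two of the four row groups are skipped without being read, which is the kind of scan saving the optimizer aims for.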

Execution engine: Lightning Engine’s execution engine improves performance through a native implementation based on Apache Gluten and Velox that is designed to take advantage of Google’s hardware. It includes unified memory management that switches dynamically between off-heap and on-heap memory without requiring changes to existing Spark settings. Along with broader support for operators, functions, and Spark data types, Lightning Engine can automatically identify when to use the native engine for the best pushdown results.

Shuffle: To reduce shuffle data, Lightning Engine uses columnar shuffle in conjunction with an optimized serializer-deserializer.

File parsers: To minimize data scans and metadata operations, Lightning Engine has a specialized Parquet parser with prefetching, intelligent caching, and in-filtering.

Connectors: To get the most out of its native engine, Lightning Engine improves the connectors to BigQuery and Google Cloud Storage. An optimized file output committer improves performance and reliability for Spark applications, while the enhanced Cloud Storage connector reduces metadata operations to cut costs. In addition, the new native BigQuery connector simplifies data delivery by passing data to the engine directly in Apache Arrow format, which removes the need for row-to-columnar conversions.
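The row-to-columnar conversion that a columnar transfer format avoids can be pictured with a small Python sketch. It uses plain lists and dictionaries as a stand-in for real row and Arrow buffers, and it assumes every row has the same fields; `rows_to_columns` is a hypothetical helper, not part of any connector API.

```python
from typing import Any, Dict, List


def rows_to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot row-oriented records into one list per column.

    Every value of every row is touched once; this is the per-record
    work a columnar engine skips when data already arrives in columns.
    """
    columns: Dict[str, List[Any]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Row-oriented records, as a row-based reader would hand them over.
rows = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 2, "amount": 35.5},
    {"order_id": 3, "amount": 12.25},
]

columns = rows_to_columns(rows)
print(columns["order_id"])     # [1, 2, 3]
print(sum(columns["amount"]))  # 67.75
```

Once the data is in columns, an aggregate such as the sum above reads a single contiguous list, which is the layout a vectorized engine works on.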

Because Lightning Engine is compatible with the Apache Spark DataFrame and SQL APIs, workloads can run without changes to existing code.

Why Lightning Engine?

According to Google Cloud, Lightning Engine is more cost-effective and performs better than competing cloud Spark alternatives. With support for open formats like Apache Iceberg and Delta Lake, it can help you increase business efficiency when paired with BigQuery and Google Cloud’s AI/ML services.

Lightning Engine also outperforms do-it-yourself Spark deployments, which can result in significant cost savings and free up your time to concentrate on your core business problems rather than platform upkeep.

Advantages

Key advantages of Lightning Engine:

Faster performance: Delivers much faster queries through a new Spark processing engine with vectorized execution, integrated intelligent caching, and optimized storage I/O.

Industry-leading price-performance: Lets customers process more data for less money by combining high performance with cost efficiency.

Intuitive lakehouse integration: Offers a single platform for data analytics and AI by integrating with Google Cloud services such as BigQuery and Vertex AI and with open formats such as Apache Iceberg and Delta Lake.

Optimized data access: Optimized connectors for BigQuery and Cloud Storage increase throughput, reduce metadata operations, and lower data access latency.

Flexible deployments: Available in both cluster-based and serverless architectures.

Although Lightning Engine provides significant performance improvements, the precise effect varies depending on the workload. It works best for compute-intensive workloads that use the Spark DataFrame APIs and Spark SQL queries, rather than for I/O-bound tasks.
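Why the benefit depends on the workload can be illustrated with Amdahl’s law. The Python sketch below is a back-of-the-envelope model, not a benchmark: it assumes a job’s runtime splits into a compute part that gets faster and an I/O part that does not, and both the 3.6x factor and the compute fractions are illustrative inputs.

```python
def overall_speedup(compute_fraction: float, compute_speedup: float) -> float:
    """Amdahl's law: only the compute fraction of the runtime gets faster."""
    if not 0.0 <= compute_fraction <= 1.0:
        raise ValueError("compute_fraction must be between 0 and 1")
    if compute_speedup <= 0:
        raise ValueError("compute_speedup must be positive")
    remaining = (1.0 - compute_fraction) + compute_fraction / compute_speedup
    return 1.0 / remaining


# Illustrative inputs: the same 3.6x compute speedup applied to a
# compute-heavy job (90% compute) and an I/O-heavy job (20% compute).
for label, fraction in [("compute-heavy", 0.9), ("I/O-heavy", 0.2)]:
    print(f"{label}: {overall_speedup(fraction, 3.6):.2f}x")
```

Under these assumptions the compute-heavy job speeds up by about 2.86x overall, while the I/O-heavy job gains only about 1.17x, which matches the guidance that compute-intensive workloads benefit most.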

Spark’s future on Google Cloud

Google Cloud is bringing Google’s scale, performance, and technical expertise to Apache Spark workloads with the new Lightning Engine, a high-performance query engine intended to spur innovation and support developers everywhere. It already plans to make the engine even faster in the coming months, so this is only the beginning!

Lightning Engine is available in preview in the premium tiers of both Google Cloud Serverless for Apache Spark and Dataproc on Google Compute Engine. Both services also provide task monitoring features for operational efficiency and GPU support for faster machine learning workloads.