BigLake Storage: An Open Data Lakehouse on Google Cloud


Build open, high-performance, enterprise-grade, Iceberg-native lakehouses with BigLake storage

Google Cloud has announced significant updates to its BigLake storage engine that let businesses use Apache Iceberg to build open, high-performance, enterprise-grade data lakehouses on Google Cloud. With these enhancements, customers no longer have to choose between fully managed, enterprise-grade storage management and the flexibility of open formats like Apache Iceberg.

As data management changes rapidly, businesses are looking for adaptable, open, and interoperable architectures that let multiple engines run on a single copy of data. In this context, Apache Iceberg has become a prominent open table format. The latest BigLake storage release brings Google's infrastructure to Apache Iceberg, laying the groundwork for open data lakehouses.


The major advancements announced include:

  • BigLake Metastore is generally available: The BigLake Metastore, formerly the BigQuery metastore, is now generally available (GA). This fully managed, serverless, and scalable service simplifies runtime metadata management and operations across BigQuery and other Iceberg-compatible engines. Because it builds on Google's global metadata management infrastructure, customers no longer need to deploy and operate their own metastores. The BigLake Metastore is the foundation for open interoperability.
  • Iceberg REST Catalog API introduced (Preview): For further interoperability, the Iceberg REST Catalog (Preview) offers a standard REST interface that complements the GA Custom Iceberg Catalog. It lets users, including those running Spark, use the BigLake metastore as a serverless Iceberg catalog. With the Custom Iceberg Catalog (GA), Spark and other open-source engines can work with BigLake tables for Apache Iceberg and interoperate with BigQuery.
  • New, high-performance, Iceberg-native Cloud Storage: By combining Apache Iceberg with Cloud Storage management capabilities, Google Cloud is making lakehouse maintenance easier. This includes automatic table maintenance such as compaction and garbage collection, as well as support for Cloud Storage features like Autoclass tiering and encryption. These additions extend Cloud Storage management specifically to Iceberg data.
  • BigLake tables for Apache Iceberg are generally available in BigQuery: These tables, now generally available, combine BigQuery's highly scalable, real-time metadata with the openness of the Iceberg format. The combination enables advanced features such as high-throughput streaming ingestion through BigQuery's Write API, which scales to tens of GiB/second with zero-latency reads. The tables also offer automatic table management (compaction and garbage collection), native integration with Vertex AI, and performance enhancements including auto-reclustering. Fine-grained DML and multi-table transactions are coming soon in preview. These tables preserve Iceberg's openness while offering a fully managed, enterprise-ready experience. BigLake automatically creates a metadata snapshot that complies with the Apache Iceberg V2 specification, registers it in the BigLake metastore, and refreshes it whenever the table changes.
  • AI-powered governance with Dataplex integration: The BigLake releases natively support Dataplex Universal Catalog. This integration provides unified, fine-grained access controls, so governance policies defined centrally in Dataplex are applied consistently across engines. Direct access to Cloud Storage supports table-level access control, while finer-grained access control is available in BigQuery and, for open-source engines, through Storage API connectors. The Dataplex integration also strengthens governance for BigQuery and BigLake Iceberg tables with search, discovery, profiling, data quality checks, and end-to-end data lineage, and it makes data discovery easier with AI-generated insights and semantic search. These end-to-end governance benefits apply automatically and require no additional registration.
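To make the BigQuery side of the list above more concrete, here is a minimal sketch of the kind of DDL that could define a BigLake table for Apache Iceberg. It assumes the BigQuery `CREATE TABLE ... WITH CONNECTION ... OPTIONS` form with `file_format`, `table_format`, and `storage_uri` options. The helper name `biglake_iceberg_ddl` and every project, dataset, connection, and bucket name are hypothetical placeholders, not values from the announcement.

```python
def biglake_iceberg_ddl(table, columns, connection, storage_uri):
    """Build a CREATE TABLE statement for a BigLake Iceberg table.

    All identifiers passed in are illustrative placeholders.
    """
    # Render "name TYPE" pairs, one column per line.
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE TABLE `{table}` (\n  {cols}\n)\n"
        f"WITH CONNECTION `{connection}`\n"
        "OPTIONS (\n"
        "  file_format = 'PARQUET',\n"
        "  table_format = 'ICEBERG',\n"
        f"  storage_uri = '{storage_uri}'\n"
        ");"
    )


ddl = biglake_iceberg_ddl(
    table="example-project.analytics.orders",            # hypothetical table
    columns=[("order_id", "INT64"), ("created_at", "TIMESTAMP")],
    connection="example-project.us.lake-connection",     # hypothetical connection
    storage_uri="gs://example-bucket/iceberg/orders",    # hypothetical bucket path
)
print(ddl)
```

The statement would then be submitted through any BigQuery client or the console. The sketch only assembles the SQL text and does not contact Google Cloud.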


The BigLake metastore is the basis for interoperability: it gives multiple runtimes access to the same data, including BigQuery, AlloyDB (Preview), and open-source engines like Spark and Flink. This is especially valuable for AlloyDB users, who can consume analytical BigLake tables for Apache Iceberg directly within AlloyDB (Preview). PostgreSQL users can therefore combine real-time transactional data from AlloyDB with rich analytical data, which supports operational and AI-driven use cases.
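As a rough illustration of how an open-source engine such as Spark might be pointed at an Iceberg REST catalog, the sketch below assembles the standard Apache Iceberg Spark catalog properties (`type=rest`, `uri`, `warehouse`). The catalog name, endpoint, and warehouse path are placeholders, the helper `iceberg_rest_catalog_conf` is hypothetical, and authentication settings are deliberately left out because they depend on the environment.

```python
def iceberg_rest_catalog_conf(catalog_name, rest_uri, warehouse):
    """Return Spark properties that register an Iceberg REST catalog."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Enables Iceberg's SQL extensions in the Spark session.
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Register the catalog and select the REST implementation.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": rest_uri,
        f"{prefix}.warehouse": warehouse,
    }


conf = iceberg_rest_catalog_conf(
    catalog_name="lakehouse",                    # hypothetical catalog name
    rest_uri="https://REST_CATALOG_ENDPOINT",    # placeholder, not a real URL
    warehouse="gs://example-bucket/warehouse",   # hypothetical bucket path
)

# Print the properties in the form accepted by spark-submit / spark-sql.
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

The same dictionary could be passed key by key to `SparkSession.builder.config(...)` in a PySpark job. The Iceberg Spark runtime JAR and valid credentials would still be required for an actual connection.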

Regarding the company's use of BigLake, CME Group Executive Director Zenul Pomal said: “We needed teams throughout the company to access data in a consistent and secure way, regardless of where it is stored or what technologies they were using. BigLake from Google was an obvious pick. Without requiring data to be moved or duplicated, it offers a uniform layer for accessing data and a fully managed experience with enterprise capabilities via BigQuery, whether the data is in standard tables or open table formats like Apache Iceberg. As we continue to investigate possible use cases for gen AI, metadata quality is crucial. To assist in maintaining high-quality metadata, we are making use of BigLake Metastore and Data Catalog.”

As previewed at Google Cloud Next ’25, Google Cloud plans to introduce further features in the coming months, including change data capture, multi-statement transactions, and fine-grained DML.

By removing the trade-off between open and managed data solutions, Google Cloud is turning BigLake into a comprehensive storage engine that works with open-source, third-party, and Google Cloud services, with the aim of accelerating data and AI innovation.
