Apache Iceberg is a powerful, open-source table format that is gaining increasing importance in the data engineering and analytics communities.
In this blog, we will discuss Apache Iceberg and the benefits it provides in managing big data. We will then demonstrate how to implement an Iceberg table using a low-code approach that leverages AWS cloud-native solutions, including AWS Glue, Amazon Athena and Amazon S3.
Apache Iceberg Overview
With Apache Iceberg, there’s no need to choose between a data lake or a data warehouse. Instead, you can have the best capabilities of both.
One objective of a data lake is to store vast amounts of data from multiple sources. However, that can come at a cost when it comes to maintaining, updating and querying the data. On the other hand, a data warehouse provides a single, reliable source of truth, where queries are isolated from concurrent changes because the query layer handles all access to data.
Without a table format like Apache Iceberg, you would have to analyse the trade-offs of each technology and decide which is more appropriate for your workload. The Apache Iceberg table format addresses those limitations, giving you data warehouse features on top of a data lake. Apache Iceberg also allows for:
- ACID transactions: ACID (Atomicity, Consistency, Isolation, and Durability) transactions are provided as a built-in feature, allowing for reliable, transactional updates to large-scale datasets. Datasets remain consistent and accurate even in the face of concurrent read and write operations.
- Time-travel and version-travel: Each Apache Iceberg table maintains a versioned manifest of the objects that it contains. Previous versions of the manifest can be used for time travel and version travel queries. Users can also employ version rollback to quickly correct problems by resetting tables to a good state.
- Scalability: Apache Iceberg is designed to handle large-scale data sets that span multiple data lakes, data warehouses, or other data storage systems. It provides a scalable architecture that can handle data sets of any size and can easily be extended to support new storage systems.
- Flexibility: Apache Iceberg is designed to work with a variety of data storage systems, including Amazon S3 and HDFS. This makes it easy to integrate with existing data management systems and eliminates the need for costly data migrations.
- Performance: Apache Iceberg is built with performance in mind, providing features such as efficient data filtering and partitioning, column pruning, and optimised metadata management. This results in faster query execution and reduced query latency, allowing for real-time data analysis.
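To illustrate the transactional updates described above, the sketch below shows row-level DML against a hypothetical Iceberg table named `taxi_trips` (table and column names are assumptions, not from the dataset used later). Athena engine version 3 supports `UPDATE` and `DELETE` on Iceberg tables, and each statement is an atomic transaction:

```sql
-- Hypothetical Iceberg table: each statement commits as an atomic snapshot
UPDATE taxi_trips
SET fare_amount = 0
WHERE trip_id = 'abc-123';

DELETE FROM taxi_trips
WHERE pickup_date < DATE '2022-09-01';
```

Concurrent readers continue to see the previous snapshot until each statement commits.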
Apache Iceberg table format
Apache Iceberg is a table format only, therefore a catalog is required to manage the tables. You can choose from a range of catalogs including AWS Glue Data Catalog, Hive metastore, Hadoop warehouse, Spark’s built-in catalog, or even your own custom catalog. It’s important to note that not all Iceberg features are available to all catalogs, so we need to consider the features we need when selecting the right one. As we are focussing on AWS native solutions in this blog, we have selected AWS Glue as the Iceberg catalog.
The Apache Iceberg table format with an overview of its components is shown in Figure 1. There are three main components:
- Iceberg Catalog
- Metadata layer
- Data layer
The catalog stores the current metadata pointer and must support atomic operations for updating it (all of the catalogs mentioned previously can do that).
In the metadata layer, the metadata file links to one or many manifest lists (one per snapshot). Manifests are used to track changes to the data over time and to provide efficient mechanisms for reading and writing data. A manifest list contains a list of manifest files, and the manifest files contain metadata about the data files in a table, including their locations and versions.
We then have the data files, each of which contains a subset of the data in a table. Data files are organised into directories based on their partitioning scheme and are tracked by manifests.
The addition of new information creates new data files, which is the same behaviour as in any data lake. The interesting aspect, and one of the selling points of adopting Apache Iceberg as a table format, is how updates and deletes of records are managed. Because data files are immutable, there are no changes to the original data: when a record is modified (updated or deleted), a new data file is created and tracked in the metadata. There are two distinct mechanisms controlling changes, namely copy-on-write and merge-on-read, and the decision on which mechanism to adopt will depend on your workload (e.g. write- vs read-intensive).
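The choice between the two mechanisms is controlled per table through Iceberg table properties. The property names below come from the Iceberg specification, but support varies by engine, so this is a Spark SQL sketch with hypothetical catalog and table names rather than something guaranteed to work in every query engine:

```sql
-- Switch row-level changes to merge-on-read (better for write-heavy workloads;
-- copy-on-write, the alternative, favours read-heavy workloads)
ALTER TABLE glue_catalog.my_db.taxi_trips SET TBLPROPERTIES (
  'write.update.mode' = 'merge-on-read',
  'write.delete.mode' = 'merge-on-read',
  'write.merge.mode'  = 'merge-on-read'
);
```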
There are currently two versions (v1 and v2) of Apache Iceberg tables. There are several key differences between them, and the general recommendation is to use v2 tables due to their many benefits compared with v1. Table 1 summarises the key differences between v1 and v2 Apache Iceberg tables:
Table 1. Comparison table between v1 and v2 Apache Iceberg tables
| v1 | v2 |
| --- | --- |
| Partitioning by one or more columns | More flexible partitioning options using expressions to derive partition values |
| Metadata stored in the table's metadata directory | Metadata stored in separate metadata files, reducing metadata I/O overhead |
| Lower write performance due to no support for concurrent writes | Better write performance due to support for concurrent writes |
| No transactional updates | Transactional updates ensure ACID properties when making updates to a table |
As you can see, v2 tables in Apache Iceberg provide better performance, flexibility, and support for transactional updates compared to v1 tables. The automatic schema evolution and improved partitioning options in v2 tables make it easier to manage large datasets and evolve schemas without manual intervention.
As Apache Iceberg is an open source project, many features developed by the community or commercially might not be available or fully compatible with the services you are currently using, forcing you to increase the complexity of your workload and potentially increasing the technical debt of your infrastructure. For example, we can't trigger a stored procedure to roll back transactions in Iceberg tables from Amazon Athena, because this feature is not available there. If you require this functionality, you may need to bring in other services to accomplish the rollback task.
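As a sketch of what that looks like from another service, Iceberg's stored procedures can be invoked from an engine that supports them, such as Spark SQL. The catalog, table and snapshot id below are hypothetical placeholders:

```sql
-- Spark SQL: roll the table back to a known-good snapshot
-- (valid snapshot ids can be found in the table's $history metadata table)
CALL glue_catalog.system.rollback_to_snapshot('my_db.taxi_trips', 6172519596316926816);
```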
There is always a risk in adopting a rapidly evolving technology, as new features more often than not introduce breaking changes, so your team may spend a considerable amount of time dealing with version compatibility and configuration issues. Those issues do tend to reduce as the product matures.
Apache Iceberg provides great functionality out of the box to speed up query execution from your data lake, but you still need to perform data management, including using appropriate file formats, partitioning your data effectively and applying appropriate compression.
The source data for this blog comes from the TLC Trip Record dataset. This dataset records the yellow and green taxi rides in New York City. Three months of data (September, October and November 2022) are used to demonstrate the Iceberg table structure and the time travel capabilities. The raw data files are stored by month in Parquet format.
Creating a new Apache Iceberg table in Athena
The snippet below shows the code used to create the Iceberg table in Athena. One of the drawcards of Apache Iceberg is its SQL syntax, so if you are comfortable writing SQL code, using Apache Iceberg will be straightforward. The create table query contains the fields of the table, the S3 location where the table is stored, the partition and table properties. There are many more table properties than what is shown here. The only required table property is `table_type=iceberg`, so Athena knows this is an Iceberg table and can create the required metadata. After table creation, a JSON file containing the metadata of the Iceberg table is created in the S3 location.
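A sketch of such a statement is shown below. The database name, table name and column list are hypothetical, loosely based on the TLC Trip Record schema, and the partition and location values are placeholders:

```sql
-- Hypothetical Athena DDL for an Iceberg table over the green taxi data
CREATE TABLE taxi_db.green_taxi_trips (
  vendorid              INT,
  lpep_pickup_datetime  TIMESTAMP,
  lpep_dropoff_datetime TIMESTAMP,
  passenger_count       BIGINT,
  trip_distance         DOUBLE,
  fare_amount           DOUBLE
)
PARTITIONED BY (day(lpep_pickup_datetime))          -- hidden partitioning by day
LOCATION 's3://my-bucket/iceberg/green_taxi_trips/' -- placeholder bucket
TBLPROPERTIES ('table_type' = 'iceberg');           -- the only required property
```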
There are many AWS native and third-party solutions to load data into an Apache Iceberg table. The most appropriate solution will depend on your workload (e.g. streaming vs batch processing). The complexity may also vary depending on how much transformation needs to happen to your data as Apache Iceberg works better with data in a parquet format.
AWS EMR and AWS Glue Jobs are two common AWS solutions to extract the raw data, transform it and load it into Iceberg tables. These approaches allow you to stream or batch more complex workloads, and one of them might be the preferred option for your workload. For simplicity, and to demonstrate some of the capabilities of Apache Iceberg, this blog focuses on a simpler yet valid approach. A follow-up blog will cover a more complex workload that uses a Glue Job to load data into an Iceberg table.
As the approach for this blog is to use low-code tooling, we use batch processing with a Glue Crawler for table discovery and cataloguing. The Glue Crawler allows us to quickly discover data, infer the schema and catalog new data stored in S3 with very few lines of code, or none at all if you create the Glue Crawler via the AWS Console. All of the tables created are managed by the Glue Catalog, another AWS managed service, which can catalog Iceberg and non-Iceberg tables and interacts with many other AWS services.
To load the Iceberg tables, we run the following commands on the Athena console to copy the raw data tables into the Iceberg tables:
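A sketch of those commands is shown below; the database and table names are hypothetical, assuming the crawled raw tables and the Iceberg tables live in the same database:

```sql
-- Copy the crawled raw tables into the Iceberg tables
INSERT INTO taxi_db.green_taxi_trips
SELECT * FROM taxi_db.raw_green_taxi_trips;

INSERT INTO taxi_db.yellow_taxi_trips
SELECT * FROM taxi_db.raw_yellow_taxi_trips;
```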
The above commands copy all of the contents from the raw data into the Iceberg tables. Athena will look at the Iceberg table configuration specification and copy the data accordingly (i.e. partition the data per day and store it as one file per day). Figure 2 shows the metadata files after creating the table and running the `INSERT INTO` command. As mentioned previously, a JSON file containing metadata is created during table creation, and a new JSON file and two AVRO files are created every time new data comes in (ADD, DELETE, UPDATE actions).
Let’s now query the Iceberg table using the `$history` metadata table, which returns all of the valid snapshots of the table.
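In Athena, the `$history` metadata table is queried by appending the suffix to the table name inside double quotes (database and table names below are the hypothetical ones used throughout):

```sql
-- List the table's snapshots: made_current_at, snapshot_id,
-- parent_id and is_current_ancestor
SELECT * FROM "taxi_db"."green_taxi_trips$history";
```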
Figure 3 shows the valid snapshots of this table, meaning we can query the data using the timestamp on the `made_current_at` field and it will allow us to visualise the data at that particular point in time. In case of any issues with the latest updates, we can also rollback to any of the previous snapshots. This is a neat feature!
In Figure 4, a quick count of all rows illustrates this point. The top of Figure 4 shows the number of green taxi trips (69,031) when only the September data was available, whereas the bottom shows the number of trips across all three months (200,666).
You can write SQL statements to select and join as many tables as you like, just as in any transactional database, and add the timestamp clause at the end so you can compare your data at different points in time without creating extra temporary tables or views.
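For example, the count comparison above can be reproduced with Athena's time travel clause (table name hypothetical; the timestamp assumes the September load happened before 1 October):

```sql
-- Count trips as the table looked before the October/November loads
SELECT COUNT(*)
FROM taxi_db.green_taxi_trips
FOR TIMESTAMP AS OF TIMESTAMP '2022-10-01 00:00:00 UTC';
```

Omitting the `FOR TIMESTAMP AS OF` clause queries the current snapshot, and `FOR VERSION AS OF <snapshot_id>` can be used instead to pin a query to a specific snapshot from `$history`.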
In this blog, we demonstrated how Apache Iceberg can be part of your modern data architecture by bringing a range of capabilities to your data lake, including schema evolution, time travel, and incremental processing. By adopting Apache Iceberg, your data lake gains a range of features usually only found in a data warehouse. These include transactional writes, giving you the ability to make concurrent modifications without losing data integrity. The expressive SQL commands supported by Apache Iceberg should also make implementation and maintenance easier if your team already supports a transactional database, as the syntax and logic are similar.
As Apache Iceberg is an open table format with a growing number of community users and developers, current limitations of the application may be addressed sooner and new features released faster.
As the practice of data governance, a subset of data management, becomes more mature, the importance of being able to update and delete records cannot be overstated. As we collect a continuous and significant amount of data, we need mechanisms to safely remove transactions, especially when we can no longer legally store them in our systems. In addition, the concept of the `right to be forgotten`, which gives individuals the ability to request that a company delete any personal records belonging to them, is already law in some jurisdictions. In light of this, we need to build data lakes that can perform the required actions without compromising data integrity, business intelligence or reporting. As we have demonstrated in this blog, Apache Iceberg can perform those tasks, and it should be considered in your data architecture as it will help your company achieve good data governance practices.