Apache Kafka Tiered Storage
Apache Kafka is a distributed, event-streaming platform optimized to handle gigabytes of data per second. Due to the incremental nature of events and their typically high production rate, storing data for long periods of time can be challenging.
Currently, Apache Kafka only supports local storage, which could mean limited storage for on-premises clusters or high costs on public clouds. Huge amounts of local storage also make the Apache Kafka cluster maintenance more complicated and risky. For example, operations such as recovering from a broker failure or adding or demoting a broker requires considerable time for copying the data to a new broker.
Considering that in most of the scenarios messages are consumed within seconds after being published, and only in exceptional cases older messages are requested, using the local storage for unlimited message retention is not viable as local storage is an expensive resource.
An alternative and more cost-effective solution is to set up a new application to ingest data from Kafka and store it in cheaper data stores such as AWS S3. However, the ingestion application adds one more point of failure and complexity because the client applications will consume data from different places and sources, which have different interfaces to access the data.
For a couple of years, the Kafka community has discussed under the Kafka Improvement Proposal 405 a new Kafka storage layer that is able to offload data from the local storage transparently to remote storage. It has been available since 2020 in beta status by Confluent.
Below is a simplified architecture overview of the tiered storage where the colored items and arrows are introduced as part of the new feature to Apache Kafka 3.0.
The RemoteLogManager (RLM) is the internal component, which holds the logic of transferring data from the local storage to the remote storage using the Remote Storage Manager public interface.
Tiered storage overview
A LogSegment is an internal broker concept, which defines how a partition is actually split into multiple files and written to disk. The partition contains an active segment and rolled over segments. New records are appended to the active segment. Rolled over segments are already complete (greater detail here). Once a segment is rolled over, the segment is eligible for transfer by the RLM.
The RemoteLogMetadataManager component maps and caches which data is available locally and which data is available remotely. The RemoteLogMetadataManager also provides the information necessary to retrieve data from the remote storage.
When the application requests data that is only available on the remote storage, the broker will copy the remote data locally and transparently provide it to the application.
Enabling the tiered storage can be done system wide, and it can also be overwritten per topic. New properties for local log retention (in millisecond and bytes) were added in order to determine how long the data will remain on the local storage; similar to the existing local log retention properties. For maximum reliability, local logs are not cleaned up until those segments are copied successfully to the remote storage, even though their retention time/size is reached. The image below exemplifies the data copy and retention between the two different storages.
Tiered storage in practice
The tiered storage will impact multiple areas from Kafka maintenance to how applications are designed. The following section will focus on the impact on application design.
Apache Kafka as a primary data store
Apache Kafka introduced a disruptive architecture by challenging the premise of IO/disk writes being slow and relying on sequential writes and OS page cache. This architecture introduced the append-only log structure as a central piece, and removed several limitations present on traditional message systems.
Apache Kafka 0.8 introduced the concept of data replication back in 2013, whereby the writes are replicated over the network, providing resilience to the messages still on the OS page cache in case of node failure before flushing to disk. Here again, the append-only log structure simplifies the replication process compared to in-memory mutable message queues.
Combining all these characteristics, the durability provided by append-only storage, resilience and scalability provided by the network replication, Apache Kafka can be considered a strong candidate to be the primary data store in a system architecture as previously described by Jay Kreps.
However, there are several concerns to be raised here. Not only technical, but also related to the way we design and understand systems. Let me take a step back to explain it.
Storage has always been a limitation from databases to streaming applications. These limitations have driven databases and applications designs for more than 30 years in the direction of highly normalized and storage efficient data structures. In this scenario, databases have only the current state, a reduced write ahead transaction log and specific features such as database indices.
These optimized structures drove how we software developers perceive problems, and how we created the abstractions to represent the real world in software. From universities, where relational model lessons are taught prior to or at the same time as other programming lessons, to our day to day work, where we many times start with the relational models. It is a concretization of the old saying “We shape our tools and thereafter our tools shape us”.
One alternative approach that is increasing in popularity - driven especially by event-driven architectures and event sourcing - is to use domain events to represent the business as a series of immutable facts as they happen in the real world. For instance, imagine representing a driver position as a continuous unbounded series of immutable driver locations that happened during the day, instead of focusing on the state where the driver is now. Another common example is to represent a chess match as a list of movements performed from the beginning of the game until the end.
This approach is especially useful in distributed and complex systems, in which, for instance, an order checkout has different representations and perspectives when seen from a driver willing to deliver the package containing the order than from a back office agent replying to an angry customer.
The ability of listening to domain events as immutable facts that have happened in the real world allows each component to build their own perspective autonomously. This approach is incredibly scalable and flexible as it tends to result in more decoupled solutions when compared to traditional and canonical modes. This is exactly what tools such as Apache Kafka are for; and limitations such as the local storage are reduced with the recently introduced Kafka feature of remote tiered storage.
It is understandable that there is a big gap in technological solutions when it comes to storing immutable events efficiently and fully featured; especially considering that the idea is relatively new, compared to the maturity of databases and traditional applications.
Apache Kafka tiered storage is a big step in the direction of long retention of events, allowing it to be used as a primary data storage. Once Apache Kafka 3.0 is released, more cloud providers will probably support tiered storage out of the box, and it will also be available for managed on-premises services as most of the on premise services have a S3 compatible storage.