Compression

Many topics carry messages with repeated structures. To save disk space and bandwidth, producers are asked to use compression where appropriate.

Enable compression by setting the compression.type configuration of your producer to the desired compression type (see the table below). The snappy compression type typically provides a decent balance between CPU usage, speed and compression ratio. The compression type is recorded in the metadata of each record batch, so your consumers do not need to know beforehand which compression type (if any) is in use.

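| Compression Type | Compression Ratio | CPU Usage | Compression Speed | Network Bandwidth Usage |
| --- | --- | --- | --- | --- |
| Gzip | Highest | Highest | Slowest | Lowest |
| Snappy | Medium | Moderate | Moderate | Medium |
| Lz4 | Low | Lowest | Fastest | Highest |
| Zstd | Medium | Moderate | Moderate | Medium |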
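
As a minimal sketch of the setting described above, the snippet below builds a producer configuration with compression.type set. The helper name build_producer_config and the broker address are hypothetical and only serve the illustration; the resulting dictionary uses the standard bootstrap.servers and compression.type keys, so it should be usable with a Python client that accepts such a dictionary, but check your client's documentation before relying on it.

```python
# Compression types listed in the table above, plus "none" to disable compression.
VALID_COMPRESSION_TYPES = {"none", "gzip", "snappy", "lz4", "zstd"}


def build_producer_config(bootstrap_servers, compression_type="snappy"):
    """Return a producer configuration dict with compression enabled.

    Hypothetical helper for illustration; it only assembles and validates
    the configuration and does not connect to Kafka.
    """
    if compression_type not in VALID_COMPRESSION_TYPES:
        raise ValueError(f"unsupported compression.type: {compression_type!r}")
    return {
        "bootstrap.servers": bootstrap_servers,
        "compression.type": compression_type,
    }


# The broker address below is a placeholder, not a real endpoint.
config = build_producer_config("kafka.example.internal:9092")
print(config)
```

Validating the value up front is a design choice for the sketch: a typo in the compression type then fails at startup rather than at the first send.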

How do I use compression effectively?

Compression works best for topics whose data contains repeated structures. You will get better results if you also batch messages by setting the linger.ms property. This property tells the producer to wait up to the given number of milliseconds before sending a message, batching any subsequent messages along with it. The producer compresses each batch as a whole, and since compression works better with repeating patterns, compressing several messages together helps improve your compression ratio.
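
The effect of batching on compression can be illustrated without a Kafka cluster. The sketch below uses zlib from the Python standard library as a stand-in for a Kafka compression codec (an assumption; the codecs Kafka uses will give different numbers) and compares compressing messages one by one against compressing them together. The message schema and function names are hypothetical.

```python
import json
import zlib


def make_messages(count):
    """Build JSON messages that share the same keys (hypothetical schema)."""
    return [
        json.dumps(
            {"vehicleId": i, "line": "EXAMPLE:Line:1", "delaySeconds": i % 7}
        ).encode("utf-8")
        for i in range(count)
    ]


def compressed_size_individually(messages):
    """Total size when every message is compressed on its own (no batching)."""
    return sum(len(zlib.compress(m)) for m in messages)


def compressed_size_batched(messages):
    """Size when all messages are compressed together, as in one batch."""
    return len(zlib.compress(b"".join(messages)))


messages = make_messages(100)
raw_size = sum(len(m) for m in messages)
print("raw bytes:                ", raw_size)
print("compressed one by one:    ", compressed_size_individually(messages))
print("compressed as one batch:  ", compressed_size_batched(messages))
```

Compressed together, the repeated keys are encoded once and then referenced, which is why the batched size comes out well below the one-by-one total.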

When should I not use compression?

Compression comes at a cost in message latency and CPU usage. The CPU cost on the brokers is typically minimal, and Entur’s Kafka clusters consistently have low CPU utilization. CPU usage also increases for producers and consumers, although the increase should not be noticeable in most cases. The main drawback is therefore message latency. If your application relies on very low latencies, compression may not be right for your use case. Compression also relies on repeated structures being present in your data. If your data is already compressed, encrypted, or otherwise very high in entropy, compression may not be appropriate.
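
To check whether a payload is worth compressing, you can compare its compressed size with its original size. The sketch below again uses zlib as a stand-in codec (an assumption, not what Kafka uses) and contrasts pseudo-random bytes, which mimic encrypted or already compressed data, with a payload of repeated JSON. The sample payloads and the function name are hypothetical.

```python
import random
import zlib


def compression_ratio(payload):
    """Compressed size divided by original size; values near 1.0 mean no saving."""
    return len(zlib.compress(payload)) / len(payload)


# Seeded pseudo-random bytes stand in for high-entropy data.
rng = random.Random(0)
high_entropy = bytes(rng.getrandbits(8) for _ in range(4096))

# Repeated JSON stands in for messages with a shared structure.
structured = b'{"line": "EXAMPLE:Line:1", "status": "onTime"}' * 90

print("high-entropy ratio:", round(compression_ratio(high_entropy), 3))
print("structured ratio:  ", round(compression_ratio(structured), 3))
```

Running the same measurement on a sample of your own messages gives a rough indication of whether compression will pay off for a given topic.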

The exact latency and CPU costs depend on factors such as the size and structure of your messages, as well as whether you use batching.