Best Practices for Data Ingestion with Snowflake Part 2

Welcome to the second blog post in our series highlighting Snowflake’s data ingestion capabilities. In Part 1 we discussed usage and best practices for file-based data ingestion with COPY and Snowpipe. In this post we will cover ingestion with the Snowflake Connector for Kafka.

Figure 1. Data ingestion and transformation with Snowflake.

Ingesting Kafka topics into Snowflake tables

Enterprise data estates are growing exponentially and the frequency of data generation is rapidly increasing, resulting in the need for lower-latency ingestion to enable faster analysis.

Customers who need to stream data and use popular Kafka-based applications can use the Snowflake Kafka connector to ingest Kafka topics into Snowflake tables via a managed Snowpipe. The Kafka connector is a client that communicates with Snowflake servers, creates files in an internal stage, ingests those files using Snowpipe, and then deletes the files upon successful load. 

Data ingestion with the Kafka connector is an efficient and scalable serverless process on the Snowflake side, but you still need to manage your Kafka cluster, the connector installation, and various configurations for optimal performance, latency, and price.  
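
To make these configurations concrete, here is a minimal sketch of how a connector instance might be registered through the Kafka Connect REST API, written in Python with the requests package. Every account name, topic, credential, and URL below is a placeholder, and the exact property keys (including the buffer size property) should be verified against the documentation for the connector version you run.

```python
import requests  # assumes the requests package is available

# Illustrative sink connector definition; all values are placeholders.
connector = {
    "name": "snowflake_sink_example",
    "config": {
        "connector.class": "com.snowflake.kafka.connector.SnowflakeSinkConnector",
        "topics": "orders,clickstream",
        "snowflake.url.name": "myaccount.snowflakecomputing.com:443",
        "snowflake.user.name": "KAFKA_CONNECTOR_USER",
        "snowflake.private.key": "<private key>",
        "snowflake.database.name": "RAW",
        "snowflake.schema.name": "KAFKA",
        # Buffer thresholds discussed in the best practices below
        # (the size limit is expressed in bytes here).
        "buffer.count.records": "10000",
        "buffer.flush.time": "120",
        "buffer.size.bytes": "5000000",
        # Avro messages with a schema registry (see best practice #5).
        "value.converter": "com.snowflake.kafka.connector.records.SnowflakeAvroConverter",
        "value.converter.schema.registry.url": "http://localhost:8081",
    },
}

# Register the connector with a Kafka Connect worker's REST API.
resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
print(resp.json())
```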

Ingest via files or Kafka

Files are a common denominator across processes that produce data—whether they’re on-premises or in the cloud. Most ingestion happens in batches, where a file forms a physical and sometimes logical batch. Today, file-based ingestion using COPY or auto-ingest Snowpipe is the primary path for data ingested into Snowflake.

Kafka (or its cloud-specific equivalents) provides an additional data collection and distribution infrastructure for writing and reading streams of records. If event records need to be distributed to multiple sinks—mostly as streams—then such an arrangement makes sense. Stream processing (in contrast to batch processing) typically moves smaller amounts of data at more frequent intervals, enabling near real-time latency.

Although Snowflake currently supports Kafka as a source of data, there is no additional benefit to using Kafka to load data into Snowflake. This is specifically true for the current Kafka connector implementation, which uses Snowpipe’s REST API behind the scenes for buffered record-to-file ingestion. If Kafka is already part of your architecture, Snowflake provides support for it; if Kafka is not part of your architecture, there is no need to create additional complexity by adding it. For most simple scenarios, ingesting files using COPY or Snowpipe provides an easier and less expensive mechanism for moving data to and from Snowflake.

Recommended file size and cost considerations

In the case of the Snowflake Connector for Kafka, the file size considerations covered in our first ingestion best practices post still apply, because the connector uses Snowpipe for data ingestion. However, there may be a trade-off between the desired maximum latency and a larger file size for cost optimization. The right file size for your application may not match that guidance, and that is acceptable as long as the cost implications are measured and considered.

In addition, the amount of memory available in a Kafka Connect cluster node may limit the buffer size, and therefore the file size. In that case, it is still a good idea to configure the timer value (buffer.flush.time) so that the buffer is likely to fill before the timer fires, making files smaller than the buffer size less likely.
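
As a rough way to reason about this trade-off, the short sketch below estimates the file size and per-partition file rate produced by a given set of buffer settings. The message rate and record size are made-up placeholders; substitute your own measurements.

```python
# Rough estimate of the file size and file rate the Kafka connector produces
# for a single partition. Traffic numbers are illustrative placeholders.
avg_record_bytes = 1_000        # assumed average serialized record size
records_per_second = 50         # assumed per-partition message rate

flush_time_seconds = 120        # buffer.flush.time
flush_size_bytes = 5_000_000    # buffer size limit (5 MB default)
flush_count_records = 10_000    # buffer.count.records

bytes_per_second = avg_record_bytes * records_per_second

# Seconds until each limit would trigger a flush; the earliest limit wins.
time_to_size = flush_size_bytes / bytes_per_second
time_to_count = flush_count_records / records_per_second
seconds_per_flush = min(flush_time_seconds, time_to_size, time_to_count)

file_size_mb = bytes_per_second * seconds_per_flush / 1_000_000
files_per_minute = 60 / seconds_per_flush

print(f"~{file_size_mb:.1f} MB per file, ~{files_per_minute:.2f} files/minute per partition")
```

With these assumed numbers, the 5 MB size limit is reached before the 120-second timer, so increasing buffer.flush.time alone would not change the result; raising the size limit (and provisioning the memory to back it) would.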

5 best practices for ingesting with Snowflake Connector for Kafka

1. The Kafka connector creates files based on configuration properties, which you can control on your end. When any of the buffer limits is reached, the buffered records are flushed to a file and sent for ingestion through Snowpipe, and subsequent offsets are buffered in memory. Tuning these configurations gives you the most control over your data ingestion price and performance. Just be aware of your Kafka Connect cluster’s memory settings when changing the default buffer values.
Our current defaults are:

  • buffer.count.records = 10000
  • buffer.flush.time = 120 seconds
  • buffer.flush.size = 5 MB

2. The Kafka connector creates a file per partition per topic, so the number of files is a multiple of the total number of partitions from which the connector is loading data. This is an architectural aspect of your Kafka configuration that you may be able to revisit later with Snowpipe Streaming. However, if Snowflake is your only sink for Kafka topics, we urge you to reconsider the value of having many partitions if there is not much data per minute in each partition. Because buffer.count.records and buffer.flush.size are configured per partition, they affect both the file size and the number of files per minute.

3. Two settings—buffer.flush.time and buffer.flush.size—determine the total number of files per minute that you send to Snowflake via the Kafka connector, so tuning these parameters is one of the most effective ways to control performance and cost. Here’s a look at two examples:

  • If you set buffer.flush.time to 240 seconds instead of 120 seconds, without changing anything else, the base files/minute rate drops by a factor of 2 (if the buffer size limit is reached before the timer, these calculations will change).
  • If you increase buffer.flush.size to 100 MB, without changing anything else, the base files/minute rate drops by a factor of 20 (if the flush timer fires before the buffer fills, these calculations will change).

4. You can leverage Java Management Extensions (JMX) to monitor the Snowflake Connector for Kafka, and Snowflake’s resource monitors to track credit consumption, when optimizing your Kafka connector configuration (see the monitoring sketch after this list). Just note that of the three items below, you can control two; the third will be determined by the two you choose:

  • Latency (buffer.flush.time): Low latency typically means smaller files and higher costs.
  • Total number of partitions (topics * avg partitions/topic): This may depend on other sinks and your existing Kafka topic configuration, but typically more partitions result in more small files. Minimize your partitions per topic unless the message flow rate is large enough to justify more partitions (in which case, they can still be cost-efficient).
  • Cost of ingestion: Larger files lower the cost by reducing the total number of files per TB (the Kafka connector uses Snowpipe, whose cost combines a per-file charge and compute time). Snowpipe guidance for cost efficiency is to aim for files of 10 MB or more; costs decline further up to around 100 MB files and change little beyond that.

5. Using the Avro format for Kafka messages gives you the flexibility to leverage a schema registry and take advantage of native Snowflake functionality, such as future support for schematization.
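
As referenced in best practice #4, the sketch below shows one way to watch ingestion volume and cost on the Snowflake side. It assumes the snowflake-connector-python package, placeholder connection details, and a role with access to the SNOWFLAKE.ACCOUNT_USAGE schema; it summarizes files, average file size, and credits per pipe over the last day so you can see how your buffer settings translate into cost.

```python
import snowflake.connector  # assumes the snowflake-connector-python package

# Placeholder connection details; substitute your own account and auth method.
conn = snowflake.connector.connect(
    account="myaccount",
    user="MONITORING_USER",
    password="<password or key-pair auth>",
    warehouse="MONITORING_WH",
)

# Files ingested and credits consumed per pipe over the last 24 hours.
# Note: ACCOUNT_USAGE views can lag behind real time by a few hours.
query = """
    select pipe_name,
           sum(files_inserted) as files,
           sum(bytes_inserted) as bytes,
           sum(credits_used)   as credits
    from snowflake.account_usage.pipe_usage_history
    where start_time >= dateadd('day', -1, current_timestamp())
    group by pipe_name
    order by credits desc
"""

for pipe, files, total_bytes, credits in conn.cursor().execute(query):
    avg_mb = total_bytes / files / 1_000_000 if files else 0
    print(f"{pipe}: {files} files, avg {avg_mb:.1f} MB/file, {credits} credits")

conn.close()
```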

To conclude, Kafka is a great tool to use if it is already part of your architecture for high-volume, distributed message processing and streaming. Since the Snowflake Connector for Kafka supports self-hosted Kafka as well as managed Kafka on Amazon MSK and Confluent, Snowflake is a great platform for many Kafka streaming use cases.

We continue to make improvements to our Kafka support to enhance manageability, latency, and cost, and those benefits can be realized today with the Snowflake Connector for Kafka’s support for Snowpipe Streaming (currently in private preview). Stay tuned for Part 3 of our blog series, which will cover Snowpipe Streaming in depth.
