Sunday 17 September 2017

Adapting Hive partitions to different data sizes

Summary

Hive's table partitioning gives the user better query performance by avoiding costly scans over data that is not relevant to the query. The partitioning is based on the directory structure in HDFS, as each partition maps to a directory.

One quite common use case for partitions is time series data, which can be modeled like this:

Our TimeSeries entity
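For reference, the entity could be sketched in Scala like this (the column names and types are an assumption that matches the note below):

// Hypothetical sketch of the entity: a decomposed timestamp plus the measured value.
case class TimeSeries(
  id: String,     // identifier of the series (e.g. a sensor)
  year: Int,
  month: Byte,    // Hive TinyInt
  day: Byte,      // Hive TinyInt
  value: Double   // the measurement itself
)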

Note: the timestamp has been decomposed into several columns that will form the partitions. Month and day are TinyInt in order to save space (we might end up having millions of records).

This can be easily partitioned by using a matching folder structure in HDFS. If we take a look, we will see these directories (data shown for a table partitioned by year and month):
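Assuming the default Hive warehouse location and a hypothetical test.db/timeseries table, the layout would look roughly like this (paths are illustrative):

/user/hive/warehouse/test.db/timeseries/year=2017/month=1/
/user/hive/warehouse/test.db/timeseries/year=2017/month=2/
/user/hive/warehouse/test.db/timeseries/year=2017/month=3/
...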


As per the Hive documentation, the optimal partition size is around 2 GB. What happens if you have defined a partitioning scheme that ends up in production with partitions that are too big or too small? Here are some examples of how you can adjust your partitions using Spark.

Where can I test? Get a Cloudera Quickstart!

To begin with, you will need a Hadoop distribution running somewhere to test the code shown in this post. If you don't have a proper cluster at hand, you can always create a test cluster by using the Cloudera QuickStart Docker image. It is quite handy, as you just have to pull it and fire it up:
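For example (the image name and run flags below follow Cloudera's QuickStart documentation at the time of writing, so double-check them against the current docs):

docker pull cloudera/quickstart:latest
docker run --hostname=quickstart.cloudera --privileged=true -t -i \
    -p 8888:8888 cloudera/quickstart /usr/bin/docker-quickstart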


Of course, it will not be very powerful or fast, but it will allow you to perform some tests using the different components in the stack (HDFS, Spark, Hive, etc.).

Create a test table and fill it with some data

Let's start by creating the database and table in Hive:
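A minimal version, run from spark-shell with Hive support enabled (the database, table, and column names are assumptions for this post; the same statements can be run from the Hive shell):

// Create a small Parquet-backed table, initially partitioned by year only.
spark.sql("CREATE DATABASE IF NOT EXISTS test")
spark.sql("""
  CREATE TABLE IF NOT EXISTS test.timeseries (
    id STRING,
    month TINYINT,
    day TINYINT,
    value DOUBLE)
  PARTITIONED BY (year INT)
  STORED AS PARQUET
""")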


Now, we will generate some dummy time-series to have some data to play with:
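A possible generator, again from spark-shell (one record per day of 2017 with a random value; the 28-days-per-month shortcut just keeps the sketch small):

import spark.implicits._

// Build one record per day of 2017 with a random measurement.
val dummy = (1 to 12).flatMap { month =>
  (1 to 28).map { day =>
    ("sensor-1", month.toByte, day.toByte, scala.util.Random.nextDouble(), 2017)
  }
}.toDF("id", "month", "day", "value", "year")

// Allow dynamic partitioning so Hive can derive the year partition, then append.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
dummy.write.mode("append").insertInto("test.timeseries")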


How does this data look partition-wise? Let's ask Hive about it:
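For example (this could also be run from the Hive shell; the comment shows what we expect with the single-level partitioning defined above):

spark.sql("SHOW PARTITIONS test.timeseries").show(false)
// With the table partitioned by year only, we expect a single entry: year=2017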


OK, we are dealing with a really small amount of data in this example, but I am sure you get the picture. All the data is in the same folder, and you need to scan the table completely (the whole year of data) each time you want to get the data for a particular month.

Increase the partition granularity

With the previous setup, all 2017 data is stored in the same folder. If the user just wants to load the data from February, the whole table must be scanned. That's fine as long as the data remains small, but with larger datasets the performance will suffer heavily.

The solution? Add an additional partition level (month). The problem, apart from the modification of the table DDL, is that we have to reorganize the Parquet files into sub-folders in HDFS.
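One way to do it from spark-shell, with assumed table names and locations: create a new table with the extra partition level, rewrite the Parquet files into year/month sub-folders, and register the new partitions.

// New two-level table (year, month), stored as external Parquet.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS test.timeseries_monthly (
    id STRING,
    day TINYINT,
    value DOUBLE)
  PARTITIONED BY (year INT, month TINYINT)
  STORED AS PARQUET
  LOCATION '/user/hive/warehouse/test.db/timeseries_monthly'
""")

// Reorganize the existing data into year=.../month=... sub-folders.
spark.table("test.timeseries")
  .select("id", "day", "value", "year", "month")
  .write.mode("overwrite")
  .partitionBy("year", "month")
  .parquet("/user/hive/warehouse/test.db/timeseries_monthly")

// Let Hive discover all the new partition directories in one shot.
spark.sql("MSCK REPAIR TABLE test.timeseries_monthly")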


Note: The "msck repair" command makes really easy to add all new missing partitions. Otherwise you will have to add them manually, one by one, using "alter table add partition..."

The resulting HDFS files and directories are these:
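With the assumed names above, there is now one sub-folder per month, each containing its Parquet files (listing is illustrative):

/user/hive/warehouse/test.db/timeseries_monthly/year=2017/month=1/
/user/hive/warehouse/test.db/timeseries_monthly/year=2017/month=2/
...
/user/hive/warehouse/test.db/timeseries_monthly/year=2017/month=12/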


Reduce the partition granularity

Of course, if the partition size is too small, we end up spending as much time recursing through partitions as scanning the whole table. In those cases, it is much better to reduce the number of partitions and increase their size.

In this case we can use Spark's coalesce method, as the final number of partitions is smaller than the original one. In our example we will reduce the partitioning from (year, month) to (year) only.
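A sketch of that reduction with the same assumed names (coalesce(1) is enough for this toy dataset; with real volumes you would pick a number of partitions that yields files close to the recommended size):

// Target table with a single partition level (year).
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS test.timeseries_yearly (
    id STRING,
    month TINYINT,
    day TINYINT,
    value DOUBLE)
  PARTITIONED BY (year INT)
  STORED AS PARQUET
  LOCATION '/user/hive/warehouse/test.db/timeseries_yearly'
""")

// Collapse the data into fewer, larger files and rewrite it partitioned by year only.
spark.table("test.timeseries_monthly")
  .select("id", "month", "day", "value", "year")
  .coalesce(1)
  .write.mode("overwrite")
  .partitionBy("year")
  .parquet("/user/hive/warehouse/test.db/timeseries_yearly")

spark.sql("MSCK REPAIR TABLE test.timeseries_yearly")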


In all the examples shown we are not compressing the data, which is a handy technique to save space and reduce I/O times. Still, I hope you got a broad picture of how you can get the right partitioning for the job at hand.
