

– Our presentation is on fine tuning and enhancing performance of our Spark jobs.

We're all data engineers from IBM, and in this presentation we're going to touch on the wide range of topics that we tuned to get our application running as quickly as possible. A little bit of background: all of this work was done for a data validation tool for ETL. It runs over two billion checks across 300 different data sets; those checks cover completeness, data quality, and a few others. Initially, our job was taking over four hours to run for nine months of data, and that was if it ran to completion; quite often we'd see out-of-memory errors or other performance issues. Our challenge was to get the runtime down to an acceptable level and to get stable runs. With that challenge in front of us, we did months of research and got the same application running in around 35 minutes, for a full year of data.

Some of the tricks we used: we moved all of the processing into memory, which greatly improved our runtimes, and we're going to talk about some of the other tricks as well. One thing to note is that we developed all of our applications in Scala, so all of the examples will be in Scala. In tuning, we dealt with many different facets, including CPU, parallelization, garbage collection, and so on.
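
To make the in-memory point a bit more concrete, here is a minimal sketch, not our actual code, of caching an intermediate DataFrame so that several checks can reuse it; the input path, filter, column names, and object name are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("cache-sketch").getOrCreate()

    // Hypothetical input path and filter, for illustration only.
    val records = spark.read.parquet("/data/ingested")
      .filter("event_date >= '2020-01-01'")

    // Keep the intermediate result in memory (spilling to disk if it doesn't fit)
    // so the checks below reuse it instead of re-reading and recomputing it.
    records.persist(StorageLevel.MEMORY_AND_DISK)

    val total = records.count()
    val nonNullIds = records.filter("id IS NOT NULL").count() // hypothetical completeness check
    println(s"total=$total, nonNullIds=$nonNullIds")

    records.unpersist()
    spark.stop()
  }
}
```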

So these are all different configurations you can tune and tweak. One thing that you want to make a note of is that when you tweak one, you're going to affect others. For example, if you increase the amount of memory per executor, you will see increased garbage collection times. If you give additional CPU, you'll increase your parallelism, but sometimes you'll see scheduling issues and additional shuffles. What you want to do, any time you're changing one, is see how it affects the others, and use the monitoring you have available to verify all of your changes. We used Spark UI, Sysdig, and Kubernetes metrics.
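
As a rough sketch of what a few of those knobs look like, you can set them when building the SparkSession; the values below are illustrative only, not the settings we landed on, and in many deployments the executor memory and cores are passed to spark-submit rather than set in code.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only; verify every change against Spark UI and system metrics.
val spark = SparkSession.builder
  .appName("tuning-sketch")
  .config("spark.executor.memory", "8g")         // more memory per executor: watch GC times
  .config("spark.executor.cores", "4")           // more cores: more parallelism, watch scheduling and shuffles
  .config("spark.sql.shuffle.partitions", "400") // parallelism of shuffle stages
  .getOrCreate()
```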

Skew is one of the easiest places where you can find room for performance improvements. Skew just means an uneven distribution of data across your partitions, which results in your work also being distributed unevenly. One thing to note is that your application can have skew issues right from the start: if your data ingestion has skew, then the rest of the application will as well. One extreme example: if you do a filter on some data and one partition has two records while another has a million, the partition with a million records is going to take significantly longer to process than the one with only two. The first example is us looking at the Spark UI. Notice that the vast majority of stages completed quickly, while the remaining stages took significantly longer, in some cases six times longer. That's a great way to identify skew: some stages completing quickly while other stages take far more time. The second screen capture is looking at system metrics; notice that some of the executors are taking way more memory and CPU than others for the same stages in the job. If you think you have skew, it's really easy to confirm. If you look at your RDD partition sizes in the logs, you'll notice that on the left side, for the same RDD, some partitions only have 16 bytes of data while others have hundreds of megabytes; on the right side, they look much more evenly distributed. Another thing you can do, using the DataFrame and RDD APIs, is print out the number of records per partition to confirm that there is skew.
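
A minimal sketch of that per-partition count, assuming df is the DataFrame you suspect is skewed (an illustration rather than our exact code):

```scala
import org.apache.spark.sql.DataFrame

// Print the number of records in each partition of a DataFrame to confirm skew.
def printRecordsPerPartition(df: DataFrame): Unit = {
  val counts = df.rdd
    .mapPartitionsWithIndex((index, rows) => Iterator((index, rows.size)))
    .collect()

  counts.sortBy(_._1).foreach { case (index, count) =>
    println(s"partition $index: $count records")
  }
}

// Usage: printRecordsPerPartition(suspectDf)
```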
