Here at Remind, Amazon Redshift is a centerpiece of our data infrastructure. Event data is continuously ingested into Redshift. Team members across the company, in a variety of roles, query Redshift. True, recent efforts have shifted some of our data processing onto other tools like Amazon Athena and EMR/Spark, but Redshift remains critical to data at Remind, and the performance of our Redshift cluster is therefore one of the primary concerns of the DataEng team.

Late last year we came across Intermix, a company that specializes in helping folks like us optimize their Redshift performance. We decided to do a two-week trial with them in January 2018.

Some months before, we had scaled our Redshift cluster from 12 to 16 nodes (ds2.xlarge) as a quick-and-dirty way of improving performance, as well as to give us more headroom on disk space (every year at back-to-school time, our traffic – and thus our data volume – increases markedly). We'd since solved the disk space issue by removing one of our largest tables, and we'd been working continuously on performance, so it was on our list of Q1 goals to scale the cluster back down to 12 nodes, and we were hopeful that we could do so while maintaining an acceptable level of performance.

We decided that the Intermix trial would be a perfect time to try scaling down the cluster. The Intermix UI would give us easy visibility into our cluster's performance before and after the downscaling. Also, one of Intermix's key features is their "Throughput Analysis", which shows a timeseries graph of queue and execution times for each Workload Management (WLM) queue. Intermix's co-founders, Lars Kamp and Paul Lappas, said they could help us tune our WLM configuration to eliminate most queue wait times.

In consultation with Lars and Paul at Intermix, we decided to first scale the cluster down, watch how performance was impacted, and then try to tune the WLM config. We scaled Redshift from 16 to 12 nodes over the weekend, and watched performance over the next few days. Intermix's UI made it easy to see that queue times had increased markedly. Before the node reduction, queue wait times during peak periods averaged around 5 seconds; after it, we saw queue waits of 20-35 seconds on average during those peak periods. Total query times (queue + exec) at peak hours increased from an average of 10-15 seconds before the downsize to 35-50 seconds after.
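Intermix surfaced these numbers for us in its Throughput Analysis, but the raw ingredients also live in Redshift's own STL_WLM_QUERY system table. As a rough illustration (a minimal sketch, not Intermix's implementation), something like the following pulls average queue and execution times per WLM queue; the connection details are placeholders and the one-hour window is an arbitrary assumption:

```python
import psycopg2

# STL_WLM_QUERY records per-query WLM timings in microseconds.
# User-defined WLM queues typically map to service classes 6 and up.
QUEUE_TIMES_SQL = """
    SELECT service_class,
           AVG(total_queue_time) / 1000000.0 AS avg_queue_secs,
           AVG(total_exec_time)  / 1000000.0 AS avg_exec_secs,
           COUNT(*) AS queries
    FROM stl_wlm_query
    WHERE service_class >= 6
      AND queue_start_time >= DATEADD(hour, -1, GETDATE())
    GROUP BY service_class
    ORDER BY service_class;
"""

conn = psycopg2.connect(
    host="my-cluster.example.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="analytics", user="dataeng", password="...",
)
with conn, conn.cursor() as cur:
    cur.execute(QUEUE_TIMES_SQL)
    for service_class, queue_secs, exec_secs, n in cur.fetchall():
        print(f"queue {service_class}: avg queue {queue_secs:.1f}s, "
              f"avg exec {exec_secs:.1f}s over {n} queries")
```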
Our WLM config: before

Before the Intermix trial, we had 7 custom WLM queues. This included two different queues for adhoc queries (the default one, with a 15-minute timeout, and a special queue with no timeout that could be used for legitimately time-consuming queries), and two different queues for queries from our dashboard system (a high-concurrency, low-memory queue for most queries, and a low-concurrency, high-memory queue for queries that timed out of the first queue). We also had a dedicated queue for fast read-only queries used to check on ETL task completion, separated out from the main queue used for executing actual ETL tasks. This setup stemmed from the principle that it's best to separate out different workloads into different queues.

The folks at Intermix believed that we had taken this principle overboard, and encouraged us to reduce the number of queues we used. Lars and Paul suggested that we aim to have just three custom WLM queues: one for event ingestion, one for ETL processes, and one for adhoc and dashboard queries. They also suggested that we take one of those categories at a time and try to reduce queue times for that queue to zero, while keeping an eye on disk-based queries (another metric that Intermix makes it easy to track) to make sure we weren't starving queries for memory. This was done in an iterative fashion, tracking changes to the WLM config in a spreadsheet which Intermix provided. Lars made suggestions to us at each step of the process, and his suggestions were generally spot-on. We started with our event ingestion queue.
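For concreteness, here's roughly what a three-queue layout of the shape Lars and Paul suggested looks like as Redshift's wlm_json_configuration parameter, applied via boto3. This is a hypothetical sketch: the query group names, concurrency slots, memory splits, and parameter group name are illustrative assumptions, not our actual production settings.

```python
import json
import boto3

# A hypothetical three-queue WLM layout (plus the mandatory default queue).
# All numbers below are illustrative, not our production values.
wlm_config = [
    {   # event ingestion: a steady stream of load traffic
        "query_group": ["ingest"],
        "query_concurrency": 2,
        "memory_percent_to_use": 20,
    },
    {   # ETL: fewer, memory-hungry transforms
        "query_group": ["etl"],
        "query_concurrency": 3,
        "memory_percent_to_use": 45,
    },
    {   # adhoc + dashboard queries, with a 15-minute timeout (in ms)
        "query_group": ["adhoc", "dashboard"],
        "query_concurrency": 8,
        "memory_percent_to_use": 30,
        "max_execution_time": 900000,
    },
    {   # the last queue is the default; it catches everything else
        "query_concurrency": 2,
        "memory_percent_to_use": 5,
    },
]

redshift = boto3.client("redshift")
redshift.modify_cluster_parameter_group(
    ParameterGroupName="my-wlm-params",  # placeholder name
    Parameters=[{
        "ParameterName": "wlm_json_configuration",
        "ParameterValue": json.dumps(wlm_config),
    }],
)
```

Clients then route queries to a queue by setting the query group for their session (e.g. SET query_group TO 'etl';). Depending on which properties changed, applying a new WLM configuration may require a cluster reboot before it takes effect.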