Note: This article assumes some basic familiarity with the big data landscape and associated technologies such as Hadoop, YARN, Spark, Presto etc.
The internet is filled with articles about the usage of data processing and analytics frameworks. But there is a shortage of commentary about the auxiliary systems that augment the power of big data technologies. These systems are force multipliers for a better user experience in the big data ecosystem. They provide infrastructure cost attribution, data set discovery, data quality guarantees, user access auditing and insights (aka observability) into the analytical applications that run on the big data stack (henceforth called big data applications).
This article covers some key ideas for gaining observability into big data applications. We will define the scope of requirements, present some known challenges and discuss high level solutions.

(Figure: Layers in the full stack for a big data application)

Data application observability is the ability to extract insights into the characteristics of big data applications. These insights lead to better troubleshooting, capacity planning and cost efficiency, as the user personas below illustrate.
End User
(Figure: User persona "End user" writing Hive queries to extract insights from data)

An end user is a person who runs queries on the data platform. They could be running an ad hoc query using technology such as Presto, Hive, Impala or Spark SQL, or could be building a Business Intelligence (BI) dashboard. The job family for this user could be data scientist, software engineer or analyst. This category of user requires metrics and data (henceforth referred to as observability statistics) for troubleshooting failures or optimizing the performance of their jobs. Some examples of questions that this user persona asks are:
What is the execution status of my query?
How can I make my query run faster?
What is the most ideal time of the day to run my query based on the traffic in my team’s YARN queue?
How much does running this query cost my organization?
Is the behavior of this query different from the past? Why?
What is the internal execution plan for this query? (See the sketch after this list.)
How can I debug and fix my failed query?
Is the capacity allocated to this query sufficient to satisfy the query SLA?
Where can I find the results of a query that I ran in the past?
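For the execution plan question above, here is a minimal, illustrative PySpark sketch; the table name and filter are assumptions, not taken from any real workload:

```python
# Minimal PySpark sketch: inspecting the internal execution plan of a query.
# The table name "sales" and the filter are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("plan-inspection").getOrCreate()

df = (spark.table("sales")
      .filter("region = 'EMEA'")
      .groupBy("region")
      .count())

# Prints the parsed/analyzed/optimized logical plans and the physical plan.
df.explain(mode="extended")

# Equivalent SQL form of the same inspection.
spark.sql(
    "EXPLAIN EXTENDED "
    "SELECT region, count(*) FROM sales WHERE region = 'EMEA' GROUP BY region"
).show(truncate=False)
```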
Local Resource Administrator
A Local Resource Administrator administers infrastructure resources at the org/team/project level. In a large organization, it is important to attribute costs to the various projects and teams so that the return on investment (ROI) can be analyzed. Local resource administrators ensure that the ROI is favorable for their org/team/project.
This user persona requires observability statistics for planning infrastructure capacity and maintaining cost efficiency for their organization. They are interested in finding answers to the following types of questions:
What is the utilization of the allocated YARN queue for my organization? (See the sketch after this list.)
Who (user) is running the most expensive queries on a YARN queue?
How much is the total cost to my org for usage of the big data infrastructure?
Which app is creating a large number of small files in HDFS?
How much capacity should we reserve for this new project?
What is the right amount of capacity reservation to meet the SLA?
How many applications have failed in the past month because of resource limitations?
I am running out of space in HDFS. Do I have any unused data that I can delete?
What are the top 10 applications consuming the most CPU/memory in my YARN queue?
What are all the applications running in the YARN queue?
How many queries did this user run in the last 24 hours?
Which are the most accessed tables from this YARN queue?
What are the applications waiting for resource allocation in this YARN queue?
What are the queries running in this queue sorted by longest time?
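Several of these questions, such as queue utilization and the most expensive applications in a queue, can be answered from the YARN ResourceManager REST API. The sketch below is a hedged example; the ResourceManager address and queue name are assumptions:

```python
# Sketch: answering queue level questions via the YARN ResourceManager REST API.
# The ResourceManager address and queue name below are assumptions.
import requests

RM = "http://resourcemanager.example.com:8088"  # assumed RM web address
QUEUE = "root.analytics"                        # assumed queue name

# Per-queue capacity and usage.
scheduler_info = requests.get(f"{RM}/ws/v1/cluster/scheduler").json()

# Running applications in the queue, sorted by allocated memory to surface
# the most expensive ones and the users running them.
data = requests.get(
    f"{RM}/ws/v1/cluster/apps",
    params={"queue": QUEUE, "states": "RUNNING"},
).json()
apps = (data.get("apps") or {}).get("app") or []

top_by_memory = sorted(apps, key=lambda a: a.get("allocatedMB", 0), reverse=True)[:10]
for app in top_by_memory:
    print(app["id"], app["user"], app.get("allocatedMB", 0), "MB",
          app.get("elapsedTime", 0), "ms elapsed")
```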
Oncall Operator
(Figure: User persona "Oncall operator" fixing high priority critical incidents)

An oncall operator is responsible for responding to critical incidents which might impact the availability, performance or reliability of their service. Their objective is to mitigate the problem as early as possible (reduce MTTR) with minimal impact to the end users. To troubleshoot and debug an incident effectively, the oncall operator requires visibility into the current state of the systems at service level and cluster level granularity, and the ability to drill down to query level granularity. They are interested in anomaly metrics to detect the potential cause of inconsistency in the system. Some examples of questions that they ask are:
The system is running out of memory. What is the rogue query that is consuming the majority of memory?
This application didn’t make any progress in the last one hour. Where is it stuck?
What is the SLA miss rate per Hadoop cluster?
How many applications are failing per Hadoop cluster?
What are all the queries running in this cluster sorted by longest time?
What are all the queries running in this cluster?
Users rely on the observability insights to dive deeper into the characteristics of their applications and analyze the causes of failures. These insights help them understand the complete picture of what is happening with a workload from a top level viewpoint.
Users rely on the observability insights to dive deeper into the characteristics of their applications to find suboptimal configurations or performance bottlenecks. Removing these bottlenecks results in faster applications, which in turn improves the cost efficiency of the underlying hardware resources.
Other auxiliary big data systems, such as user access auditing or data lineage, rely on the observability insights to determine the application lineage across the depth of the big data stack. As an example, to find the users who have accessed a particular Hive table, a mapping is required between the HDFS NameNode API calls and the Spark application ID.
Observability into the utilization of YARN queues and Hadoop clusters is essential to ensure a fair distribution of resources across users. It can prevent scenarios such as critical jobs suffering from resource starvation, resource hogging by a single user, or a runaway query consuming the majority of resources.
Collection and publishing of metrics is not a standard feature in big data systems. The degree of support for collecting and emitting metrics varies across a spectrum: there are systems which collect metrics and expose them through an API, such as YARN; systems which provide hooks into the execution lifecycle of a query, such as Hive and Spark; and systems like HDFS that do not provide any significant monitoring feature out of the box. Depending on the set of systems used in your organization, getting end to end, holistic observability into an application might require devising custom solutions for collecting and publishing the metrics. Writing these custom solutions requires an in-depth understanding of the otherwise opaque (it-just-works) open source system, since inefficient metric collection code can significantly hamper the execution performance of the system.
Processing the metrics collected from these heterogeneous systems becomes complex due to the lack of a common vocabulary for defining the metrics. Each system across the depth of the stack emits its metrics as loosely typed strings or integers. Extensive raw data processing and normalization is required to provide holistic, cross stack visibility to the users.
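One way to establish such a common vocabulary is to have every collector translate its raw output into a single normalized event schema. The schema below is purely an illustrative assumption of what such a shape could look like, not a standard:

```python
# Illustrative (assumed) normalized event schema that heterogeneous collectors
# could map their raw metrics into.
from dataclasses import dataclass, field
from typing import Dict, Union

@dataclass
class ObservabilityEvent:
    source_system: str                   # e.g. "hive", "spark", "yarn", "hdfs"
    application_id: str                  # the ID as known to the source system
    metric_name: str                     # normalized name, e.g. "executor.memory.used.bytes"
    metric_value: Union[int, float, str]
    event_time_ms: int                   # epoch millis at which the metric was observed
    dimensions: Dict[str, str] = field(default_factory=dict)  # queue, user, cluster, ...

# A raw YARN metric and a raw Spark metric both normalize into the same shape.
yarn_event = ObservabilityEvent(
    source_system="yarn",
    application_id="application_234",
    metric_name="queue.memory.allocated.mb",
    metric_value=4096,
    event_time_ms=1700000000000,
    dimensions={"queue": "root.analytics", "user": "analyst_1"},
)
```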
The growth in metrics volume is not linear in the number of jobs. Each job emits many kinds of metrics, which can fan the volume out onto an exponential growth trajectory. The observability system should be able to handle this scale with realtime latencies.
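A back of the envelope illustration makes the fan-out concrete; every number below is an assumption chosen only to show the multiplication effect:

```python
# Back-of-the-envelope fan-out illustration; every number is an assumption.
jobs_per_day = 50_000
containers_per_job = 200        # executors/containers emitting metrics
metrics_per_container = 100     # cpu, memory, gc, shuffle, io, ...
samples_per_hour = 120          # one sample every 30 seconds
avg_job_runtime_hours = 1

events_per_day = (jobs_per_day * containers_per_job * metrics_per_container
                  * samples_per_hour * avg_job_runtime_hours)
print(f"{events_per_day:,} metric events per day")  # 120,000,000,000
```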
Depending on the open source platform, the metrics retention on the server varies from a few minutes to a few hours. If the metrics are not collected within this duration, they are lost forever. The observability system should be highly available and robust in its collection mechanism as an availability drop of a few minutes could lead to permanent data loss.
Traditional observability systems designed for web services do not satisfy the requirements of providing observability for big data applications. The cardinality of metrics, the kinds of metrics and the types of signals other than metrics vary as a big data application's execution moves from one sub-system to another. Traditional observability systems support limited cardinality (distinct set of values) and a limited maximum number of dimensions per metric. In the big data case, Spark alone, as an example, has more than 350 configuration parameters which could be used as dimensions. An ideal big data application observability system would be able to correlate the metrics emitted across multiple stacks to provide a unified view of the execution model for the application.
Lastly, tracing the application across the stack layers is a challenge due to the lack of universal tracing amongst open source applications. Each layer in the stack creates its own unique ID and it is left to an external system to map the IDs to each other. As an example, a Hive query might have an ID such as hive_123 while the YARN application ID would be application_234 and the Livy batch ID might be batch_789. Yet all these IDs refer to the same Hive query, which uses Apache Livy and Spark behind the scenes for execution.
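A hypothetical correlation step could stitch those per-layer IDs into a single trace record. In the sketch below, the IDs mirror the example above, and the lookup functions are assumptions standing in for calls to each layer's metadata store or API:

```python
# Hypothetical sketch of stitching per-layer IDs into one trace record.
# The lookup functions are stubs/assumptions standing in for calls to each
# layer's metadata store or API; the IDs mirror the example in the text.
from typing import Dict, Optional

def lookup_livy_batch(hive_query_id: str) -> Optional[str]:
    # assumed: query the submission layer's metadata for the Livy batch ID
    ...

def lookup_yarn_app(livy_batch_id: str) -> Optional[str]:
    # assumed: query Livy/YARN for the backing application ID
    ...

def build_trace(hive_query_id: str) -> Dict[str, Optional[str]]:
    """Map one logical Hive query to the IDs used by the layers beneath it."""
    livy_batch_id = lookup_livy_batch(hive_query_id)      # e.g. "batch_789"
    yarn_app_id = (lookup_yarn_app(livy_batch_id)         # e.g. "application_234"
                   if livy_batch_id else None)
    return {
        "hive_query_id": hive_query_id,                   # e.g. "hive_123"
        "livy_batch_id": livy_batch_id,
        "yarn_application_id": yarn_app_id,
    }
```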
There are some recent efforts to integrate universal tracing with Hadoop in the open source world, but it has not yet covered the entire depth of the stack.

The user facing component should allow users to consume information both visually (UI) and programmatically (APIs). An intuitive user interface ties together the different metrics collected from across the stack to provide holistic observability into the application. A user of the interface might not necessarily know the under the hood details of the application, and thus the user experience should cater to users of different technical acumen. As an example, a marketing executive running a Hive SQL query to get results for a campaign does not need to know that the Hive query was executed on a YARN cluster using the Spark processing engine. They are only interested in checking from the interface whether their query has completed.
On the other end of the spectrum, an engineer from the Hive team would be interested in exploring the Hive query execution plan or looking at the metrics of the YARN containers while the query was being executed. Some open source systems ship with their own in-built UIs. These interfaces cater to users who have in-depth knowledge of those systems. I do not recommend taking user experience inspiration from these UIs for observability systems which cater to the user personas discussed earlier in the article.

(Figure: Open source Spark web UI)
Collectors perform realtime collection of metrics across the different layers of the stack. They guarantee ordering of lifecycle events and at-least-once collection of events. We discussed the challenges associated with collectors earlier in the article. Custom implementation is required: you could leverage a Hive hook or a Spark hook to capture basic lifecycle metadata for an application, write custom periodic collectors to pull information from cluster APIs, devise custom agents to parse and publish logs, and write custom agents to extract the CPU/memory utilization of Spark executors.
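As one concrete, hedged example of such a custom collector, the sketch below periodically polls Spark's monitoring REST API for executor memory and core usage. The driver UI address and the polling interval are assumptions:

```python
# Sketch of a custom collector polling Spark's monitoring REST API for
# executor resource usage. The UI address and poll interval are assumptions.
import time
import requests

SPARK_UI = "http://driver-host.example.com:4040"  # assumed live application UI

def collect_executor_metrics():
    apps = requests.get(f"{SPARK_UI}/api/v1/applications").json()
    for app in apps:
        executors = requests.get(
            f"{SPARK_UI}/api/v1/applications/{app['id']}/executors").json()
        for ex in executors:
            yield {
                "app_id": app["id"],
                "executor_id": ex["id"],
                "memory_used": ex.get("memoryUsed"),
                "max_memory": ex.get("maxMemory"),
                "total_cores": ex.get("totalCores"),
            }

if __name__ == "__main__":
    while True:
        for event in collect_executor_metrics():
            print(event)      # in practice: publish to the metrics pipeline
        time.sleep(30)        # assumed collection interval
```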
Although multiple software as a service (SaaS) providers offer the functionality described here, very limited support is available in the open source domain; only a few projects could be leveraged as example implementations for Hive hooks, for basic cluster monitoring, or for resource consumption by Spark executors.

(Figure: Milestone 1: Components of a Data Application Observability system. Image created by author)
Pipelines for metrics enrichment bridge the gap between the user interface and the metric collectors. This component ingests the metrics published by the collectors, processes the metrics and stores them so that the user interfaces could read them. This component should be able to handle the scale of metrics being ingested into the system and should have the ability to process each metric without impacting the metric propagation time to the end user.
The requirement for metric propagation time to the end user, aka metric freshness, varies from use case to use case. For some use cases, such as alerting when an application is consuming 90% of a YARN queue's resources, the metric freshness should be in the order of seconds, while use cases such as performance optimization could be served by a metric freshness in the order of minutes. The processing phase involves cleaning the metrics, normalizing them into a standard vocabulary and enriching the data by gathering auxiliary information from alternate sources; examples of such processing and enrichment tasks include mapping the per-layer IDs of an application to each other, converting system specific metric names into the common vocabulary, and joining in historical usage information.

A new "insight extraction" component is added to the metric processing pipelines and is triggered on receiving critical events such as a lifecycle state change. This pipeline runs a set of rules to determine whether a metric is anomalous. These rules could either be static or generated using a Machine Learning model.
The full implementation of such a complex component is out of scope for this article, but to demonstrate a proof of concept, let us revisit the example we discussed earlier in the article (Use Cases section) where an application might run slower than usual or time out due to YARN queue overload. On receiving periodic events with the queued time for an application, this new component would compare it against the average queue time for an application in that YARN queue, derived from historical information. If the current queue time is higher, the system could use the queue utilization data to derive the top 5 longest running applications on the queue and the top 5 applications consuming the most resources. Based on this data, the user could be entered into a workflow in the interface where they would be given the option to either kill one of the top 5 apps in the queue or schedule their job to run at a different time, suggested by the component based on the historical queue usage pattern.
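A much simplified version of that rule could look like the sketch below. The helper functions and the threshold are assumptions standing in for the historical metrics store and the live YARN queue statistics described above:

```python
# Simplified sketch of the queue overload rule described above. The helpers
# and the 1.5x threshold are assumptions; they stand in for the historical
# metrics store and live YARN queue statistics.
from typing import Dict, List

def historical_avg_queue_time_ms(queue: str) -> float:
    # assumed lookup against the enriched historical metrics store;
    # returns an illustrative constant so the sketch is self-contained
    return 60_000.0

def running_apps(queue: str) -> List[Dict]:
    # assumed lookup of live applications with elapsed time and resource usage
    return []

def queue_overload_insight(queue: str, current_queue_time_ms: float) -> Dict:
    baseline = historical_avg_queue_time_ms(queue)
    if current_queue_time_ms <= baseline * 1.5:   # illustrative threshold
        return {"anomalous": False}
    apps = running_apps(queue)
    longest = sorted(apps, key=lambda a: a["elapsed_ms"], reverse=True)[:5]
    hungriest = sorted(apps, key=lambda a: a["allocated_mb"], reverse=True)[:5]
    return {
        "anomalous": True,
        "top_longest_running": longest,        # candidates the user may kill
        "top_resource_consumers": hungriest,   # or reschedule around
    }
```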
If you like reading about tech musings, follow me on Twitter.