Using OpenTelemetry has become a popular way to track, collect, and analyze telemetry data for applications built on a microservice architecture. It helps us understand how software performs and behaves. One of the key aspects we focus on during product development at Oxeye is distributed tracing.
Tracing follows the journey of a process, such as an API request or system activity, from start to finish, and shows how different services are connected. As we trace, we gather span data: unique trace and span IDs, operation names, timestamps, attributes, logs, and events. Span data lets us assemble a trace, which gives us valuable insight into the behavior of our environment.
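As a rough illustration, span data can be sketched as a simple record. The field names below loosely follow OpenTelemetry conventions, but this is a simplified assumption for illustration, not the actual SDK types:

```python
from dataclasses import dataclass, field
from typing import Optional

# Simplified sketch of span data; not the real OpenTelemetry SDK classes.
@dataclass
class Span:
    trace_id: str                    # shared by every span in one trace
    span_id: str                     # unique ID of this span
    parent_span_id: Optional[str]    # None for the root span
    name: str                        # operation name, e.g. "GET /orders"
    start_time_ns: int
    end_time_ns: int
    kind: str = "INTERNAL"           # SERVER, CLIENT, INTERNAL, ...
    attributes: dict = field(default_factory=dict)  # tags, e.g. net.peer.ip
    events: list = field(default_factory=list)      # timestamped logs/events

# Spans that share a trace_id and are linked via parent_span_id form a trace.
root = Span("t1", "s1", None, "GET /orders", 0, 5_000_000, kind="SERVER")
child = Span("t1", "s2", "s1", "SELECT orders", 1_000_000, 3_000_000,
             kind="CLIENT", attributes={"net.peer.ip": "10.0.0.7"})
```

Linking spans through their parent IDs is what lets a backend reconstruct the call tree; a "broken" trace is one where some of those links point to spans that were never recorded.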
On certain occasions, traces that have been instrumented and processed by OpenTelemetry may not be complete. There are various reasons for this, such as services that are not instrumented at all, spans dropped by sampling or collection failures, or trace context that is not propagated between services.
These factors can contribute to traces becoming "broken" or partial, where the entire sequence of activities or connections within a system is not fully captured.
In this blog post, I will present an approach that leverages Kubernetes configuration data to repair "broken" traces. We will explore how Kubernetes data can be used to complete partial traces and turn them into full traces.
Using this approach, we can detect broken traces and complete them, giving us a more accurate understanding of the application's behavior.
While analyzing the broken traces, we identified two main use cases: traces that are missing a client span, and traces that are missing a server span.
It is important to distinguish between these two primary use cases because each case requires a distinct approach for filling in the missing information and completing a full trace.
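The distinction can be sketched as a small classifier over a trace's spans. The field names and result labels below are illustrative assumptions, not a fixed schema:

```python
# Classify a broken trace into one of the two cases described above.
# `spans` is a list of dicts with span_id, parent_span_id, and kind.
def classify_broken_trace(spans):
    span_ids = {s["span_id"] for s in spans}
    # A root is a span whose parent is absent from this trace.
    roots = [s for s in spans if s.get("parent_span_id") not in span_ids]
    for root in roots:
        if root["kind"] == "SERVER":
            return "missing-client-span"   # case 1: trace starts at a server
    # A leaf is a span that no other span in the trace points to as parent.
    parent_ids = {s.get("parent_span_id") for s in spans}
    leaves = [s for s in spans if s["span_id"] not in parent_ids]
    if any(s["kind"] == "CLIENT" for s in leaves):
        return "missing-server-span"       # case 2: trace ends at a client
    return "complete"
```

A trace whose root is a server span falls into the first case; a trace whose deepest span is an unanswered client request falls into the second.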
Whenever we encounter a trace with a missing client span, it means the trace's root span is a server span.
When there is no instrumentation on the client-side, the trace starts at the server span because it is the first point where the distributed system receives an incoming request. The server span will capture the duration of the entire request processing and response generation.
In such cases, although the client-side actions are not instrumented as spans, the trace can still provide valuable information about the end-to-end flow and performance of the request by focusing on the server span as the root of the trace.
To handle this use case and reconstruct the trace, we can use the available spans and their attributes, analyzing the server span's tags and combining them with Kubernetes data, to gather the relevant information and piece the trace together.
In such situations, we can extract valuable information from details within the span itself.

Examining the "net.peer.ip" attribute within the server span's tags gives us the IP address from which the request originated, i.e. the sender's IP. Using that IP, we can search for a matching IP among the existing Kubernetes workloads' IPs. This allows us to identify the specific Kubernetes workload acting as the client and complete the trace by connecting the missing client span to the corresponding service.
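A minimal sketch of this lookup, assuming we maintain a map from pod IPs to workload names built from the Kubernetes API (the map contents and field names here are illustrative, not real cluster state):

```python
# Resolve the missing client workload from the server span's net.peer.ip tag.
def find_client_workload(server_span_tags, pod_ip_to_workload):
    peer_ip = server_span_tags.get("net.peer.ip")
    if peer_ip is None:
        return None
    return pod_ip_to_workload.get(peer_ip)

# Example data; in practice this map is refreshed from the Kubernetes API.
pod_ip_to_workload = {
    "10.244.1.12": "checkout-service",
    "10.244.2.33": "payment-service",
}
tags = {"net.peer.ip": "10.244.1.12", "http.method": "GET"}
client = find_client_workload(tags, pod_ip_to_workload)  # "checkout-service"
```

Once the workload is identified, a synthetic client span attributed to it can be placed above the server span as the trace's new root.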
By combining the span's tags and Kubernetes data, we can successfully rebuild the trace.
In the second scenario, we have a trace that is missing a server span: it ends with a client request span but has no corresponding server span.
In this scenario, we can utilize the data within the client's request span, specifically the attribute "http.url" which contains the URL of the HTTP request. By leveraging Kubernetes data once again, we can search for Kubernetes services whose URL hostname matches the value extracted from the span's "http.url" attribute.
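This matching can be sketched by parsing the URL's hostname and comparing it against the DNS names a Kubernetes service answers to. The service list and naming scheme below are illustrative assumptions:

```python
from urllib.parse import urlparse

# Map the client span's http.url to a Kubernetes service by comparing the
# URL hostname with in-cluster service DNS names.
def find_server_service(client_span_tags, services):
    url = client_span_tags.get("http.url")
    if not url:
        return None
    hostname = urlparse(url).hostname
    if hostname is None:
        return None
    for name, namespace in services:
        candidates = {
            name,                                     # same-namespace short name
            f"{name}.{namespace}",                    # cross-namespace name
            f"{name}.{namespace}.svc.cluster.local",  # fully qualified name
        }
        if hostname in candidates:
            return (name, namespace)
    return None

# Example data; in practice the list comes from the Kubernetes API.
services = [("orders", "shop"), ("users", "auth")]
tags = {"http.url": "http://orders.shop.svc.cluster.local:8080/api/v1/orders"}
match = find_server_service(tags, services)  # ("orders", "shop")
```

A matched service tells us which workload should have produced the missing server span, so a synthetic span for it can be appended below the client span.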
By making this association between the HTTP request's URL and the matching Kubernetes service, we can gain insights into the missing server span and reconstruct the trace, enabling a comprehensive view of the entire transaction within the distributed system.
Despite encountering broken traces, we can still extract valuable insights into the behavior of Kubernetes-based applications. By piecing fragmented traces back together, we gain a comprehensive picture of the interactions within the system. This lets us uncover valuable information about application behavior and make informed decisions for optimization and improvement.