Using Kubernetes Data to Enhance Incomplete HTTP OpenTelemetry Traces

OpenTelemetry has become a popular way to collect, track, and analyze telemetry data for applications built on a microservice architecture. It helps us understand how software performs and behaves. One of the key aspects we focus on during our product development at Oxeye is distributed tracing.

Tracing follows a process, such as an API request or system activity, from start to finish. It shows how different services are connected. When we trace, we gather important information called span data. This includes things like unique IDs, operation names, timestamps, logs, events, and indexes. Span data lets us build a trace, which gives us valuable insights into the behavior of our environment.

On certain occasions, traces collected and processed by OpenTelemetry may not be complete. There are various reasons for this, such as:

Network issues: The trace data may not be fully transmitted or received due to network interruptions or packet loss.
Resource limitations: In resource-constrained environments, such as high-load systems, the collection and processing of traces may be restricted, leading to partial or fragmented traces.
Instrumentation gaps: If certain components or services in the system are not properly instrumented with OpenTelemetry, the trace may be incomplete or missing information.
Error conditions: Errors or exceptions occurring during the trace collection process can disrupt the continuity of the trace, resulting in a broken or partial trace.
3rd party components: Some of the services in the chain may not be built within the organization, so they cannot be modified to install OpenTelemetry in them.

These factors can contribute to traces becoming "broken" or partial, where the entire sequence of activities or connections within a system is not fully captured.

In this blog post, I will present an approach that leverages Kubernetes configuration data to address "broken" traces. We will explore how Kubernetes data can be used to complete partial traces and turn them into full traces.

Brief Overview of Solution Steps:

Collect spans: Gather all spans into a trace once a specific timer has elapsed. This is usually done by the OpenTelemetry Collector.
Detect the broken trace: Identify traces that are incomplete or broken within the application, meaning that data is missing because we have only one side of a connection.
Classify the broken trace type: Determine whether the broken trace is missing the server span or the client span. This distinction identifies which part of the application requires attention for data completion.
Retrieve data from Kubernetes: Use Kubernetes data sources to gather the information that is missing from the broken trace.
Complete the trace: Incorporate the retrieved data into the broken trace to restore its integrity and obtain a comprehensive trace from the initial request to the final response.

By following these steps, we can effectively detect broken traces, and ultimately complete the trace, providing a more accurate understanding of the application's behavior.

While analyzing the broken traces, we have identified two main use-cases:

Missing client span - We only have the HTTP receive (server) span
Missing server span - We only have the HTTP send (client) span

It is important to distinguish between these two primary use cases because each case requires a distinct approach for filling in the missing information and completing a full trace.
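The detection and classification steps can be sketched as follows. This is a minimal illustration, not the implementation used in our product: the `Span` record, the plain-string span kinds, and the `classify_broken_traces` function are hypothetical names introduced here, and the rule assumes a trace is "broken" when a server span has no parent or a client span has no server span as a child.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Hypothetical, simplified span record (not the OpenTelemetry SDK type).
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None for a root span
    kind: str                 # "SERVER", "CLIENT", or "INTERNAL"
    name: str


def classify_broken_traces(spans):
    """Group spans by trace ID and report which side of a connection is missing."""
    traces = defaultdict(list)
    for span in spans:
        traces[span.trace_id].append(span)

    findings = {}
    for trace_id, trace_spans in traces.items():
        # IDs of spans that have at least one SERVER span as a direct child.
        parents_with_server_child = {
            s.parent_id for s in trace_spans
            if s.kind == "SERVER" and s.parent_id is not None
        }
        problems = []
        for s in trace_spans:
            if s.kind == "SERVER" and s.parent_id is None:
                # The trace starts at a server span: no client span was collected.
                problems.append(("missing_client_span", s.span_id))
            elif s.kind == "CLIENT" and s.span_id not in parents_with_server_child:
                # The request was sent, but no matching server span was collected.
                problems.append(("missing_server_span", s.span_id))
        if problems:
            findings[trace_id] = problems
    return findings
```

A root server span can also be legitimate, for example when the caller sits outside the cluster, so a check like this flags candidates for completion rather than confirmed breaks.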


Whenever we encounter a trace with a missing client span, it means that the root span of the trace we received is a server span.

When there is no instrumentation on the client-side, the trace starts at the server span because it is the first point where the distributed system receives an incoming request. The server span will capture the duration of the entire request processing and response generation.

In such cases, although the client-side actions are not instrumented as spans, the trace can still provide valuable information about the end-to-end flow and performance of the request by focusing on the server span as the root of the trace.

To handle this use-case and reconstruct the trace, we can extract valuable information from details within the server span itself, such as its tags, and combine it with Kubernetes data to piece the trace together.

Examining the "net.peer.ip" attribute within the span's tags gives us the IP address from which the request originated, that is, the sender’s IP. Using this IP address, we can search for a matching IP among the IPs of the existing Kubernetes workloads. A match identifies the specific Kubernetes service associated with the client, which lets us complete the trace by connecting the missing client span to the corresponding server span.
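As a rough sketch of this lookup, the snippet below builds an index from pod IP to workload and resolves the sender from a server span's attributes. The pod records are plain dictionaries with hypothetical keys (`namespace`, `workload`, `ip`), and `build_ip_index` and `resolve_client` are illustrative names. In a real cluster the IPs could be read from the `status.podIP` field of each Pod object; how they are fetched is left out here.

```python
def build_ip_index(pods):
    """Map pod IP -> (namespace, workload name)."""
    index = {}
    for pod in pods:
        ip = pod.get("ip")
        if ip:  # pods that are not scheduled yet may have no IP
            index[ip] = (pod["namespace"], pod["workload"])
    return index


def resolve_client(server_span_attributes, ip_index):
    """Return the (namespace, workload) that sent the request, or None."""
    peer_ip = server_span_attributes.get("net.peer.ip")
    if peer_ip is None:
        return None
    return ip_index.get(peer_ip)


# Example with made-up data.
pods = [
    {"namespace": "shop", "workload": "frontend", "ip": "10.0.1.7"},
    {"namespace": "shop", "workload": "cart", "ip": None},
]
ip_index = build_ip_index(pods)
client = resolve_client({"net.peer.ip": "10.0.1.7"}, ip_index)
```

Note that the peer IP seen by the server is only useful when it is the client pod's own address. If the request passed through a proxy, a load balancer, or NAT, the lookup may return nothing or point at the intermediary instead.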

By combining the span's tags and Kubernetes data, we can successfully rebuild the trace.
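One possible way to "rebuild" the trace is to insert a synthetic client span above the root server span. The sketch below assumes that approach; the `Span` record and `add_synthetic_client` are hypothetical, and because the real client-side timings are unknown, the server's timestamps are reused as a stand-in.

```python
import secrets
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Span:
    # Hypothetical, simplified span record.
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    kind: str
    service: str
    start_ns: int
    end_ns: int


def add_synthetic_client(root_server, client_service):
    """Return (synthetic client span, server span re-parented under it)."""
    client = Span(
        trace_id=root_server.trace_id,
        span_id=secrets.token_hex(8),  # 8 random bytes as 16 hex characters
        parent_id=None,                # the synthetic span becomes the new root
        kind="CLIENT",
        service=client_service,
        # Assumption: the true client timings are unknown, so reuse the server's.
        start_ns=root_server.start_ns,
        end_ns=root_server.end_ns,
    )
    return client, replace(root_server, parent_id=client.span_id)
```

Marking such spans as synthetic (for instance with an extra attribute) would help keep inferred data distinguishable from data that was actually collected.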

In the second scenario, we have a trace that is missing a server span. It concludes with a client's request span, but without the corresponding server span.

In this scenario, we can use the data within the client's request span, specifically the "http.url" attribute, which contains the URL of the HTTP request. By leveraging Kubernetes data once again, we can search for the Kubernetes service whose DNS name (for example, a name of the form service.namespace.svc.cluster.local) matches the hostname extracted from the span's "http.url" attribute.

By making this association between the HTTP request's URL and the matching Kubernetes service, we can gain insights into the missing server span and reconstruct the trace, enabling a comprehensive view of the entire transaction within the distributed system.
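The hostname matching can be sketched as below. This assumes the default cluster domain `cluster.local` and represents services as hypothetical `(namespace, name)` pairs; `qualified_names` and `resolve_server` are illustrative names. A bare service name such as `cart` only resolves within the caller's own namespace, so the sketch takes the caller's namespace as an input (it might come from a resource attribute such as `k8s.namespace.name`, if one is present).

```python
from urllib.parse import urlsplit


def qualified_names(name, namespace, cluster_domain="cluster.local"):
    """DNS names under which a Service is reachable from any namespace."""
    return {
        f"{name}.{namespace}",
        f"{name}.{namespace}.svc",
        f"{name}.{namespace}.svc.{cluster_domain}",
    }


def resolve_server(client_span_attributes, services, caller_namespace):
    """Return the (namespace, name) of the Service the request targeted, or None."""
    url = client_span_attributes.get("http.url")
    if not url:
        return None
    host = urlsplit(url).hostname  # lowercased, without the port
    if not host:
        return None
    host = host.rstrip(".")  # tolerate a trailing dot on fully qualified names
    for namespace, name in services:
        candidates = qualified_names(name, namespace)
        if namespace == caller_namespace:
            candidates.add(name)  # bare name is valid only in the same namespace
        if host in candidates:
            return (namespace, name)
    return None


# Example with made-up data.
services = [("shop", "cart"), ("billing", "cart")]
target = resolve_server(
    {"http.url": "http://cart.billing.svc.cluster.local:8080/pay"},
    services,
    caller_namespace="shop",
)
```

Hostnames that point outside the cluster, such as a public API, match no Service and return `None`, which is a useful signal that the missing server span belongs to an external dependency.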

Despite encountering broken traces, we can still extract valuable insights into the behavior of Kubernetes-based applications. By piecing together fragmented traces, we gain a more complete picture of how requests flow through the system. This lets us uncover valuable information about application behavior and make informed decisions for optimization and improvement.