Judging Telemetry Quality using Weaver
"Is my telemetry good?" In this post I'll show you how to answer that question using the Weaver tool from the OpenTelemetry project.
Watch
If you'd prefer to watch rather than read, I have the following content as a YouTube video:
Overview
I regularly get asked a variation of: "is my telemetry any good?" and "what can we do to improve our telemetry data to better assist us during incidents?"
Assuming the asker actually has the right telemetry to assist them, these questions usually boil down to the fact that their telemetry does not carry the correct metadata.
When telemetry signals (metrics, logs, events, traces, profiles, etc.) are used during an incident, you need to quickly identify the issue, who is affected and who those affected entities belong to, so that you can notify the right people and begin remediation.
So What is "the right metadata"?
Easy to say, harder to achieve, right? Defining standards across an organisation is hard - doing so across multiple organisations is even harder.
Well, a great place to start is semantic conventions. These are agreed-upon standards that define what metadata is included on the signals. Semantic conventions usually use the words MUST, SHOULD and MAY to indicate how important each piece of metadata is, and thus how strongly you're expected to include it.
Let's make this more concrete: the OpenTelemetry semantic conventions define attributes for all kinds of things - from how to provide a host name (Recommended, with host.name) to how to indicate the type of database, e.g. redis (Required, with db.system.name).
The OpenTelemetry semantic conventions even tell us how to report certain metrics. For example, if you and I both wanted to report how long our HTTP servers take to handle requests, we MUST name the metric http.server.request.duration and report it as a histogram.
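To make that tangible, here's a minimal sketch using the OpenTelemetry Python SDK. The service name, host name and values are made up for illustration, and nothing is actually exported anywhere - the point is simply where the conventional names end up:
from opentelemetry import metrics, trace
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Resource attributes following the conventions (host.name is "Recommended").
resource = Resource.create({"service.name": "checkout", "host.name": "web-01"})

trace.set_tracer_provider(TracerProvider(resource=resource))
tracer = trace.get_tracer("semconv-example")

# db.system.name is "Required" on database client spans.
with tracer.start_as_current_span("GET basket") as span:
    span.set_attribute("db.system.name", "redis")

metrics.set_meter_provider(MeterProvider(resource=resource))
meter = metrics.get_meter("semconv-example")

# The conventions also fix the metric name, unit and instrument type.
duration = meter.create_histogram(
    "http.server.request.duration",
    unit="s",
    description="Duration of HTTP server requests.",
)
duration.record(0.042, {"http.request.method": "GET", "http.response.status_code": 200})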
At first glance this kind of standardisation seems like unnecessary control and overhead, but it pays dividends at scale, because the backend system (wherever we send telemetry data) can now understand our signals as like-for-like and thus knows it can treat them equivalently.
This standardisation is also the key to answering our question: How good is our telemetry?
Roll Your Own Semantic Convention?
Your first instinct might be to develop your own in-house convention. However I would caution against that, for a few reasons:
- It is not usually required - a lot of thought (and expertise from across the industry) has gone into the OpenTelemetry SemConv.
- Third party suppliers, SaaS vendors you rely upon, and others won't be following your in-house semantic conventions, meaning you're creating downstream work to "map" one metric to another.
- Following the standard can be a sales tool. If you supply metrics for another entity to consume, you're creating work for them as they need to map your custom "stuff" to their "stuff". If I was evaluating vendors and was given the choice between two equal products - one which followed a widely adopted standard and one which "rolled their own" - I'd pick the standard-following one every time.
Defining an in-house standard can be appropriate - if you're doing something so incredibly specialised that no other company is doing it. Let's be honest though, you're probably not (and even if you were, you should submit it to the OpenTelemetry spec, make a lot of noise about it and use it as a marketing / PR activity that you're a thought leader).
All of that said, the Weaver tool we'll use below can work with your custom telemetry spec, if you still choose to go down this path.
Semantic Conventions Aid Quality Judgments
Now that there's a way of defining things against a standard (OpenTelemetry, an in-house one, or both), the question can legitimately be asked: is my telemetry any good?
Tip
All you need to do is find out "how far away" from fully meeting the specifications you are.
- If the telemetry is a 100% match, then yes, your telemetry is perfect
- If you are only barely matching the agreed-upon rules, your telemetry has room for improvement
Weaver
As you've probably guessed by now, Weaver, from the OpenTelemetry project, is a tool that allows you to evaluate your live, real-time telemetry against the semantic conventions you choose.
Weaver acts as an endpoint. Telemetry is sent to Weaver and Weaver produces an output report.
Step 1: Start Weaver
Start Weaver, informing it to wait for 60 seconds without receiving telemetry before shutting down. The command below also sets the output format to JSON, creates a new folder called weaver-output and saves a new file in there called live_check.json when Weaver closes.
- Port 4317 is used to receive telemetry data (the gRPC port).
- Port 4320 is optional and should be included if you wish to stop Weaver via the curl command (see later).
Weaver Binary
There is also a standalone Weaver binary available on GitHub
docker run --rm \
-p 4317:4317 \
-p 4320:4320 \
-u $(id -u ${USER}):$(id -g ${USER}) \
--env HOME=/tmp/weaver \
otel/weaver:v0.15.2 registry live-check \
--inactivity-timeout=60 \
--output=weaver-output \
--format=json
or (for standalone binary):
./weaver registry live-check \
--inactivity-timeout=60 \
--output=weaver-output \
--format=json
Info
Notice that when Weaver starts, it defaults to reading the OpenTelemetry semantic conventions:
Weaver Registry Live Check
Resolving registry `https://github.com/open-telemetry/semantic-conventions.git[model]`
You'll know Weaver is ready when you see:
The OTLP receiver will stop after 60 seconds of inactivity.
Step 2: Send a Trace
Use docker / podman and the OpenTelemetry telemetrygen tool to send a single trace to Weaver.
The command sends the trace to the special host.docker.internal address (which resolves to the host machine where docker is running, i.e. where Weaver is listening). The --otlp-insecure flag is also set because Weaver is listening insecurely (that's OK because this is test data and we're on localhost).
docker run --rm \
ghcr.io/open-telemetry/opentelemetry-collector-contrib/telemetrygen:v0.128.0 \
traces --traces=1 \
--otlp-insecure=true \
--otlp-endpoint=host.docker.internal:4317
The container will run and exit. If it works, the last message should be:
INFO traces/traces.go:64 stopping the exporter
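telemetrygen is handy for a quick smoke test, but you can just as easily point your own application at Weaver and have your real telemetry checked. Here's a minimal Python sketch, assuming the opentelemetry-sdk and opentelemetry-exporter-otlp-proto-grpc packages are installed and Weaver is listening on localhost:4317:
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Export straight to Weaver's OTLP gRPC port (insecure is fine on localhost).
exporter = OTLPSpanExporter(endpoint="localhost:4317", insecure=True)

provider = TracerProvider(resource=Resource.create({"service.name": "my-app"}))
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("weaver-live-check-demo")
with tracer.start_as_current_span("demo-span") as span:
    # Deliberately use a deprecated attribute so Weaver has something to flag.
    span.set_attribute("net.sock.peer.addr", "1.2.3.4")

# Flush so the span is exported before the script exits.
provider.shutdown()
Anywhere you would normally set an OTLP endpoint (an SDK, a Collector exporter), you can temporarily point it at Weaver to get the same report for your real workloads.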
Step 3: Stop Weaver
Stop Weaver either by pressing Ctrl + C or by sending a curl command to Weaver's /stop endpoint:
curl -X POST http://localhost:4320/stop
Step 4: Inspect the Weaver Output Report
Open weaver-output/live_check.json and inspect the output report.
In this report Weaver has checked 8 entities: 1 resource, 2 spans and 5 attributes. It has found 6 advisories.
...
"total_advisories": 6,
"total_entities": 8,
"total_entities_by_type": {
"attribute": 5,
"resource": 1,
"span": 2
}
...
Scrolling back up, each "thing" in the report gets an array of all_advice given for that "thing". For example, here's the output for one of the span attributes.
Weaver has identified one violation and one improvement, and thus, for this span attribute, the highest_advice_level (ie. the worst thing about it) is a violation.
{
"live_check_result": {
"all_advice": [
{
"advice_level": "violation",
"advice_type": "deprecated",
"message": "Replaced by `network.peer.address`.",
"value": "renamed"
},
{
"advice_level": "improvement",
"advice_type": "stability",
"message": "Is not stable",
"value": "development"
}
],
"highest_advice_level": "violation"
},
"name": "net.sock.peer.addr",
"type": "string",
"value": "1.2.3.4"
}
Weaver produces many more statistics than shown here, and the report is a gold-mine for things like preventing CI/CD pipelines from progressing if telemetry is not "good enough".
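For example, a CI step could parse live_check.json and fail the build as soon as any violation appears. Here's a minimal sketch - I'm not assuming the exact key layout of the report, so the script simply walks the whole JSON document and tallies every advice_level it finds:
import json
import sys
from collections import Counter

def tally_advice_levels(node, counts):
    # Recursively walk the report and count every "advice_level" value,
    # so we don't depend on the exact nesting of the JSON.
    if isinstance(node, dict):
        level = node.get("advice_level")
        if isinstance(level, str):
            counts[level] += 1
        for value in node.values():
            tally_advice_levels(value, counts)
    elif isinstance(node, list):
        for item in node:
            tally_advice_levels(item, counts)

with open("weaver-output/live_check.json") as report_file:
    report = json.load(report_file)

counts = Counter()
tally_advice_levels(report, counts)
print(dict(counts))  # a tally such as counts of 'violation' and 'improvement'

# Block the pipeline if any violations were found.
if counts["violation"] > 0:
    sys.exit(1)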
Watch this in action
If you want to see this in action, check out this YouTube video.
Summary
In this post you've discovered how Weaver, a tool from the OpenTelemetry project, can help you understand the real-time quality of your telemetry data against semantic conventions.
Weaver is a new tool and there is a lot more it can do. I'm only just starting to learn it now, so stay tuned and I'll bring you more when I understand more about it. It certainly seems very useful for DevOps / SRE folk!