Today I released clj-async-profiler 0.3.0. This version contains a few new features and possibilities that warrant a dedicated blog post clarifying them. I will try to give you the taste of what you are able to do with this profiler and what mistakes should be avoided.
If this is the first time you hear about clj-async-profiler check this introduction. It explains the motivation behind this profiler and how to start using it.
Version 0.3.0 boasts a new color scheme specifically tailored to Clojure programs. It paints Java frames in green (like in the default color scheme) and Clojure frames in blue.
Another visual improvement is that both Java and Clojure frames are now
demunged. What looked like clojure/core$filter.invoke
now correctly displays
as clojure.core/filter
. Furthermore, duplicate frames are removed (Clojure
function call usually involves two method invocations, .invoke
and
.invokeStatic
).
Here's a side-by-side comparison of the same program profiled with 0.2.0 and
0.3.0. The burn-cpu
code is taken from the introduction
post, profiled as (prof/profile (burn-cpu 10))
.
You'll see (probably not on this page but if you open the images in a separate tab) that the new version displays fewer stackframes due to deduplication, has nicer-looking method and function names, and more eye-friendly coloring.
clj-async-profiler profiles CPU usage by default, but it can do other things.
You may call list-event-types
function to know which events are supported on
your machine. On MacOS, you'll probably see something like this (Linux users
will get some extra native events):
user=> (prof/list-event-types)
Basic events:
cpu
alloc
lock
wall
itimer
For example, to figure out which code does the most allocations in your program,
you can pass :event :alloc
to any profiling function (profile
,
profile-for
, start
). Let's see that in action. I downloaded this big
twitter.json
file and will try to parse it with Cheshire and understand where the allocations
happen.
user=> (require '[cheshire.core :as json])
user=> (prof/profile
{:event :alloc}
(dotimes [_ 500]
(json/decode (slurp "twitter.json"))))
This is the profile I got:
This flame graph looks similar to a regular CPU graph, except the upper frame
is always the class of the objects that are allocated in that stack most often.
You will notice in the profile above that Cheshire is responsible only for 36%
of the allocation volume; meanwhile, slurp
makes 63% of the allocations. Let's
modify the code so that it doesn't slurp in the loop and just profile the actual
parsing:
user=> (let [s (slurp "twitter.json")]
(prof/profile
{:event :alloc}
(dotimes [_ 500]
(json/decode s))))
Now Cheshire is a sole allocator as expected, and its allocations consist of creating internal Jackson objects and modifying transient maps.
You can learn how allocation profiling works in async-profiler here.
A prominent new feature in 0.3.0 is the ability to transform and filter stacks using regular Clojure code. This allows to post-process the profiling results and make their interpretation easier for your exact usecase.
First, let's take a look at the stacks file. You can open any file
profile-***.txt
in clj-async-profiler's results/
directory and see that each
line has the following format:
frame1;frame2;frame3;frame4 <number-of-samples>
Since 0.3.0, clj-async-profiler provides an API to transform this data before
the flamegraph is generated. You can pass :transform <my-fn>
to profile
or
start
, where <my-fn>
is a function which accepts a string that represents
the full stacktrace without the number of samples. Your function then can return
a modified string (to alter the stack on the flamegraph) or nil
(to remove
this stack from the results).
There is also a separate function generate-flamegraph
that accepts a
profile-***.txt
file and an options map and produces a flamegraph. It is
convenient when you just want to experiment with post-processing and don't want
to re-run the profiling code each time.
Let's look back at our results for profiling allocations in Cheshire. Because
cheshire.core/parse*
is recursive, it is hard to observe the actual
distribution of objects allocated by this function. See how there are multiple
assoc!
calls in the flamegraph, all originating from parse*
but separate
because they appeared on a different stack level. We can solve this by
collapsing consecutive parse*
calls into just one. Another thing you might
want is to collapse all frames preceding clojure.core/eval
since they are
non-informative.
(require '[clojure.string :as str])
(prof/generate-flamegraph
"/tmp/clj-async-profiler/results/profile-2019-02-01-13-10-16.txt"
{:transform (fn [s]
(-> s
(str/replace #"(cheshire\.parse/parse\*;)+" "cheshire.core/parse*...;")
(str/replace #"^.+clojure\.core/eval;" "START;")))})
We used a little bit of regexp-fu to massage the stacks into what we wanted. Now, it's much easier to understand that the transients-related code does 60% of allocations, and Jackson objects make up the remaining 40%. The flamegraph itself has become much lower because we collapsed uninteresting frames.
For another, more convoluted example, we'll launch an Aleph server:
(require '[aleph.http :as http]
'[manifold.deferred :as md]
'[clojure.string :as str]
'[clj-async-profiler.core :as prof])
(defn handler [req]
;; Our handler will wait 10 milliseconds and return tangent of the URI.
(let [num (Double/parseDouble (subs (:uri req) 1))]
(-> (md/deferred)
(md/timeout! 10 {:body (format "The tangent of %s is %s"
num (Math/tan num))}))))
(def srv (http/start-server handler {:port 8080}))
A quick curl to sanity-check:
$ curl 'localhost:8080/12345'
The tangent of 12345.0 is -8.917885942518717
Then, we'll start wrk to bombard our small server at 10000 RPS:
$ wrk -c 500 -d 60 -R 10000 'http://localhost:8080/12345'
Finally, meanwhile wrk is working, we'll start the profiler in the same REPL where out HTTP server is running:
user=> (prof/profile-for 10)
We got a flamegraph, but it's not very convenient to read it. Many samples that were collected are spent in parking threads and waiting for I/O events. We can't do much about them right now, but they make other stacks smaller. Another thing that can be improved is that Netty has quite many frames on its stacks, and we also don't care much about them. So, to address these points, we can profile with the following transform function:
user=> (prof/profile-for
10 {:transform (fn [s]
;; This transform removes stacks that contain `Parker::` or
;; `kevent`, and collapses consecutive Netty frames.
(when-not (re-find #"(Parker::|kevent)" s)
(str/replace s #"(io\.netty\..+?;)+" "io.netty...;")))})
Again, the flame graph is now much shorter and more balanced. It probably still won't tell you much if you aren't familiar with Aleph implementation, but it's a useful aid in learning Aleph internals.
By default, clj-async-profiler lumps stacks of all threads together into a
single flamegraph. In some cases, you might want to see a separate profile for
every thread. For that, add :threads true
to the options map. If you want to
look at the profile of just one thread, use the following trick:
;; Suppose, something you want to profile is already running in one of the
;; background threads.
user=> (prof/profile-for
10 {:threads true
:transform #(when (> (.indexOf ^String % "my-thread-name") -1) %)})
If you accumulate a massive profile with per-thread profiling enabled, the
resulting SVG file might be too hard for your browser to render. If you notice
that the flame graph takes many seconds to render in the browser, try passing
:min-width 1
(or a larger number) in the options. This will drop the stacks
that have too few samples to occupy even 1 pixel in the SVG.
The actual profiling library under the hood of clj-async-profiler, async-profiler, is a sampling profiler. It means that it is not 100% accurate; instead it derives the profile picture statistically. Whenever it comes to statistics, you need enough data samples to be confident about the results. So, make sure that when you profile anything with clj-async-profiler, you get at least 1000 samples. The empirically optimal number of samples to have is 5000-10000. This roughly corresponds to 5-10 seconds of running time; so when you want to measure something that finishes faster, then repeat it in a loop that runs for a few seconds at least.
Note that these exact numbers are given only for CPU profiling. When profiling other events (e.g., alloc, lock), you might not be able to get that many samples.
You can learn how many samples the profiler collected by hovering over the bottom stackframe in the flamegraph.
Efficient and thoughtful profiling is not the most obvious and straightforward process. And we didn't even mention the correct interpretation of the results! Learning to do all of this properly takes some practice. But I hope that the right tools can make it easier for you and, most importantly, fun. Please, try out the new clj-async-profiler in your projects and let me know how it went at @ClojureGoesFast. Thanks for reading!