Automatic memory management is probably JVM's biggest selling point. You don't need to remember to clean up the stuff you've allocated — the garbage collector will take care of it for you. You can "leak" memory if you leave live references to allocated objects (e.g. store objects in static fields), making them unreclaimable by the GC. Thus, as long as you don't do that, you can treat memory as infinite and never run out of it. Right?
Definitely not. Memory is a limited resource, and you can't fill it with more data than its capacity allows if all of that data is needed at the same time. So, it becomes crucial not only to know how much memory your data structures occupy but also the access patterns to those data structures. Does the algorithm process one item at a time or expect the whole list to be in memory at once? Is the result gradually written to disk, or is it accumulated in memory first? These questions may not matter when the data size is small and doesn't ever get close to memory limits. But when they do...
Running out of memory is a nasty failure mode on JVM because it's often not immediately apparent that it happens. Remember how the GC exists to clean up unused objects? GC is a relatively expensive process, so the JVM usually runs it when the remaining memory space is low. As we occupy more and more heap with live (unreclaimable) objects, JVM starts to run the GC much more often. But those GC cycles cannot free much memory if most of the allocated stuff is "needed". So, at a certain point, JVM will just run the GC back-to-back, each cycle freeing some measly bytes of heap. As the runtime is occupied with constantly running GC, our application makes almost no progress. From the outside, the program looks very busy (100% CPU usage), but no useful work is being done. This may even last for minutes until the JVM finally graces us with an OutOfMemoryError. Oh, the humanity!
Clojure programmers have different memory-debugging tools at their disposal. Let's briefly mention these tools:
:alloc
mode to understand and reduce how much garbage the program
allocates.Despite such a variety, none of these tools answer the question: "at most, how much memory does the program require to run"?[1] That's why I came up with yet another instrument that I want to present to you today.
Version 0.4.0 of
clj-memory-meter
introduces a new namespace,
clj-memory-meter.trace,
to facilitate figuring out the memory requirements of an algorithm. It allows
you to instrument functions in your program to report the memory usage before
and after the invocation. There are no hard rules or suggestions on which
functions you should trace, but it's a good idea to go from top to bottom. Note
that this tracing significantly slows down your program as every invocation of
the traced function will, by default, force the GC twice. It will also measure
arguments and the returned value with clj-memory-meter.core/measure
.
We'll learn how it works by solving a toy problem. The task is the following:
given a set of integers, count the total number of subsets such that
∏(subset) mod 7 = 1
. Let's get cracking:
(ns example
(:require [clj-memory-meter.trace :as cmm.trace]))
(defn matches? [s]
(= (mod (reduce * s) 7) 1))
;; Attach 10 megabytes of useless metadata to a subset to make it more
;; pronounced when it is kept in memory.
(defn add-deadweight [s]
(with-meta s {:deadweight (byte-array (* 10 1024 1024))}))
We'll call add-deadweight
on generated subsets to inflate them for didactic
purposes. Now, our first attempt will be a naive eager solution:
(defn subsets-eager [[x & rst :as coll]]
(if (empty? coll)
[#{}]
(let [rest-subsets (subsets-eager rst)]
(into rest-subsets
(map #(add-deadweight (conj % x)) rest-subsets)))))
(count (filter matches? (subsets-eager (range 2 7))))
=> 6
It finishes in a reasonable time. You can repeatedly run this till the cows come home, and it will work perfectly fine. So, is the eager version good enough? Try running it with a slightly larger input set:
(count (filter matches? (subsets-eager (range 2 11))))
=> java.lang.OutOfMemoryError
Nah, not good. Let's trace a few functions and see how much heap is used at different steps of the algorithm:
(cmm.trace/trace-var #'matches?)
(cmm.trace/trace-var #'subsets-eager)
(cmm.trace/with-relative-usage
(count (filter matches? (subsets-eager (range 2 7)))))
Initial used heap: 444.3 MiB (10.8%)
│ (example/subsets-eager <56 B>) | Heap: -1.8 KiB (-0.0%)
│ │ (example/subsets-eager <56 B>) | Heap: -13.0 KiB (-0.0%)
│ │ │ (example/subsets-eager <56 B>) | Heap: -11.9 KiB (-0.0%)
│ │ │ │ (example/subsets-eager <56 B>) | Heap: -11.5 KiB (-0.0%)
│ │ │ │ │ (example/subsets-eager <56 B>) | Heap: -11.0 KiB (-0.0%)
│ │ │ │ │ │ (example/subsets-eager <0 B>) | Heap: -9.7 KiB (-0.0%)
│ │ │ │ │ │ └─→ <320 B> | Heap: -9.7 KiB (-0.0%)
│ │ │ │ │ └─→ <10.0 MiB> | Heap: +12.0 MiB (+0.3%)
│ │ │ │ └─→ <30.0 MiB> | Heap: +36.0 MiB (+0.9%)
│ │ │ └─→ <70.0 MiB> | Heap: +84.0 MiB (+2.1%)
│ │ └─→ <150.0 MiB> | Heap: +180.0 MiB (+4.4%)
│ └─→ <310.0 MiB> | Heap: +372.0 MiB (+9.1%)
│
│ (example/matches? <10.0 MiB>) | Heap: +372.0 MiB (+9.1%)
│ └─→ <16 B> | Heap: +372.0 MiB (+9.1%)
│
│ (example/matches? <10.0 MiB>) | Heap: +372.0 MiB (+9.1%)
│ └─→ <16 B> | Heap: +372.0 MiB (+9.1%)
│
│ ...
│
│ (example/matches? <10.0 MiB>) | Heap: +372.1 MiB (+9.1%)
│ └─→ <16 B> | Heap: +372.1 MiB (+9.1%)
Final used heap: +75.6 KiB (+0.0%)
You can see how each returned recursive call to subsets-eager
increases heap
usage (12MB, then 36MB, 84MB, and so on). Once all subsets are constructed
(occupying 372MB), all that memory is retained to the very end of the algorithm.
All those lines saying +372.0 MiB
mean the heap difference compared to the
beginning of the algorithm (established by with-relative-usage
) macro. It
doesn't mean that those 372 megabytes are added between each call to matches?
.
The final line says that the used heap did not increase after the algorithm
had run — which is to be expected because we didn't store any data anywhere.
It is easy to make this naive implementation lazy — just by swapping into
for
concat
. Let's do it and run it with the tracer:
(defn subsets-lazy [[x & rst :as coll]]
(if (empty? coll)
[#{}]
(let [rest-subsets (subsets-lazy rst)]
(concat rest-subsets
(map #(add-deadweight (conj % x)) rest-subsets)))))
(cmm.trace/trace-var #'subsets-lazy)
(cmm.trace/with-relative-usage
(count (filter matches? (subsets-lazy (range 2 7)))))
Initial used heap: 444.5 MiB (10.9%)
│ (example/subsets-lazy <56 B>) | Heap: +544 B (+0.0%)
│ │ (example/subsets-lazy <56 B>) | Heap: +1.8 KiB (+0.0%)
│ │ │ (example/subsets-lazy <56 B>) | Heap: +3.2 KiB (+0.0%)
│ │ │ │ (example/subsets-lazy <56 B>) | Heap: +4.5 KiB (+0.0%)
│ │ │ │ │ (example/subsets-lazy <56 B>) | Heap: +5.4 KiB (+0.0%)
│ │ │ │ │ │ (example/subsets-lazy <0 B>) | Heap: +6.7 KiB (+0.0%)
│ │ │ │ │ │ └─→ <320 B> | Heap: +6.7 KiB (+0.0%)
│ │ │ │ │ └─→ <576 B> | Heap: +8.1 KiB (+0.0%)
│ │ │ │ └─→ <832 B> | Heap: +15.2 KiB (+0.0%)
│ │ │ └─→ <1.1 KiB> | Heap: +2.9 KiB (+0.0%)
│ │ └─→ <1.3 KiB> | Heap: +9.9 KiB (+0.0%)
│ └─→ <1.6 KiB> | Heap: -16.7 KiB (-0.0%)
│
│ (example/matches? <10.0 MiB>) | Heap: +12.0 MiB (+0.3%)
│ └─→ <16 B> | Heap: +12.0 MiB (+0.3%)
│
│ (example/matches? <10.0 MiB>) | Heap: +24.0 MiB (+0.6%)
│ └─→ <16 B> | Heap: +24.0 MiB (+0.6%)
│
│ (example/matches? <10.0 MiB>) | Heap: +36.0 MiB (+1.2%)
│ └─→ <16 B> | Heap: +48.0 MiB (+1.2%)
│
│ ...
│
│ (example/matches? <10.0 MiB>) | Heap: +180.0 MiB (+4.4%)
│ └─→ <16 B> | Heap: +180.0 MiB (+4.4%)
│
│ (example/matches? <10.0 MiB>) | Heap: +192.0 MiB (+4.7%)
│ └─→ <16 B> | Heap: +192.0 MiB (+4.7%)
│
│ (example/matches? <10.0 MiB>) | Heap: +180.0 MiB (+4.4%)
│ └─→ <16 B> | Heap: +180.0 MiB (+4.4%)
│
│ (example/matches? <10.0 MiB>) | Heap: +168.0 MiB (+4.1%)
│ └─→ <16 B> | Heap: +168.0 MiB (+4.1%)
│
│ ...
│
│ (example/matches? <10.0 MiB>) | Heap: +36.0 MiB (+0.9%)
│ └─→ <16 B> | Heap: +36.0 MiB (+0.9%)
│
│ (example/matches? <10.0 MiB>) | Heap: +24.0 MiB (+0.6%)
│ └─→ <16 B> | Heap: +24.1 MiB (+0.6%)
Final used heap: +26.4 KiB (+0.0%)
This implementation exhibits peculiar behavior. First of all, subsets-lazy
alone didn't produce anything that takes memory — because it is lazy. As
matches?
gets called, the memory usage starts to grow upwards to the maximum
of 192 megabytes, and then it linearly drops all the way to zero. Still, it uses
only ~50% memory of the eager solution at its maximum which is an improvement.
How about a linear lazy implementation that doesn't use recursion? The next algorithm generate the subsets according to bit patterns. We'll also use a larger input set to demonstrate some laziness effects.
(defn subsets-lazy2 [coll]
(let [v (vec coll)
n (count v)]
(map (fn [i]
(->> v
(keep-indexed (fn [idx item]
(when (bit-test i idx)
item)))
set
add-deadweight))
(range (bit-shift-left 1 n)))))
(cmm.trace/trace-var #'subsets-lazy2)
(cmm.trace/with-relative-usage
(count (filter matches? (subsets-lazy2 (range 2 9)))))
Initial used heap: 432.5 MiB (10.6%)
│ (example/subsets-lazy2 <56 B>) | Heap: +544 B (+0.0%)
│ └─→ <624 B> | Heap: +42.2 KiB (+0.0%)
│
│ (example/matches? <10.0 MiB>) | Heap: +384.0 MiB (+9.4%)
│ └─→ <16 B> | Heap: +384.0 MiB (+9.4%)
│
│ (example/matches? <10.0 MiB>) | Heap: +384.0 MiB (+9.4%)
│ └─→ <16 B> | Heap: +384.0 MiB (+9.4%)
│
│ (example/matches? <10.0 MiB>) | Heap: +384.0 MiB (+9.4%)
│ └─→ <16 B> | Heap: +384.0 MiB (+9.4%)
│
│ ...
│
│ (example/matches? <10.0 MiB>) | Heap: +456.0 MiB (+11.1%)
│ └─→ <16 B> | Heap: +456.0 MiB (+11.1%)
│
│ (example/matches? <10.0 MiB>) | Heap: +456.0 MiB (+11.1%)
│ └─→ <16 B> | Heap: +456.0 MiB (+11.1%)
Final used heap: +8.5 KiB (+0.0%)
As expected, subsets-lazy2
doesn't take any memory by itself, but what about
those whopping 384 megabytes that appear after the very first call to
matches?
? You observe the effect of
chunking
— Clojure realizes chunked lazy sequences not one-by-one, but 32 elements at a
time. 32*10MB = 320MB; I'm not sure why it is 384MB and even grows to 456MB
sometime later, but chunking certainly has to do with it.
Nowadays, Clojure provides a more predictable deferred evaluation mechanism — transducers and reducible contexts. Let's have one final implementation that returns an eduction instead of a lazy sequence:
(defn subsets-eduction [coll]
(let [v (vec coll)
n (count v)]
(eduction
(map (fn [i]
(->> v
(keep-indexed (fn [idx item]
(when (bit-test i idx)
item)))
set
add-deadweight)))
(range (bit-shift-left 1 n)))))
(cmm.trace/trace-var #'subsets-eduction)
(cmm.trace/with-relative-usage
(count (filter matches? (subsets-eduction (range 2 9)))))
Initial used heap: 432.5 MiB (10.6%)
│ (example/subsets-eduction <56 B>) | Heap: -4.5 KiB (-0.0%)
│ └─→ <568 B> | Heap: -3.3 KiB (-0.0%)
│
│ (example/matches? <10.0 MiB>) | Heap: +396.0 MiB (+9.7%)
│ └─→ <16 B> | Heap: +396.0 MiB (+9.7%)
│
│ (example/matches? <10.0 MiB>) | Heap: +396.0 MiB (+9.7%)
│ └─→ <16 B> | Heap: +396.0 MiB (+9.7%)
│
│ ...
│
│ (example/matches? <10.0 MiB>) | Heap: +456.0 MiB (+11.1%)
│ └─→ <16 B> | Heap: +456.0 MiB (+11.1%)
│
│ (example/matches? <10.0 MiB>) | Heap: +456.0 MiB (+11.1%)
│ └─→ <16 B> | Heap: +456.0 MiB (+11.1%)
Final used heap: +20.2 KiB (+0.0%)
Wait, what? Why does it use even more memory? Turns out, we we've fallen into
a trap. filter
doesn't know how to consume an eduction efficiently, so it
calls seq
on it, turning it into a regular lazy sequence and losing all the
benefits. To properly consume an eduction, we need to use reduce
or
transduce
:
(cmm.trace/with-relative-usage
(reduce #(if (matches? %2) (inc %1) %1) 0 (subsets-eduction (range 2 9))))
Initial used heap: 432.5 MiB (10.6%)
│ (example/subsets-eduction <56 B>) | Heap: +568 B (+0.0%)
│ └─→ <568 B> | Heap: +1.7 KiB (+0.0%)
│
│ (example/matches? <10.0 MiB>) | Heap: +12.0 MiB (+0.3%)
│ └─→ <16 B> | Heap: +5.9 KiB (+0.0%)
│
│ (example/matches? <10.0 MiB>) | Heap: +12.0 MiB (+0.3%)
│ └─→ <16 B> | Heap: +2.6 KiB (+0.0%)
│
│ (example/matches? <10.0 MiB>) | Heap: +12.0 MiB (+0.3%)
│ └─→ <16 B> | Heap: +2.8 KiB (+0.0%)
│
│ ...
│
│ (example/matches? <10.0 MiB>) | Heap: +12.0 MiB (+0.3%)
│ └─→ <16 B> | Heap: +5.3 KiB (+0.0%)
│
│ (example/matches? <10.0 MiB>) | Heap: +12.0 MiB (+0.3%)
│ └─→ <16 B> | Heap: +5.4 KiB (+0.0%)
│
│ (example/matches? <10.0 MiB>) | Heap: +12.0 MiB (+0.3%)
│ └─→ <16 B> | Heap: +4.5 KiB (+0.0%)
Final used heap: +4.1 KiB (+0.0%)
At last, we got what we wanted. Calling matches?
on the eduction realizes one
element at a time, so our high-memory mark constantly sits at 12 megabytes. I
can't tell why it's 12MB and not 10, but otherwise, the algorithm behaves as
expected.
Working with memory ain't easy, even in a runtime that promises to take care of it for you. We saw it first-hand in this post by running into surprises with recursive vs. linear laziness, chunking, interaction between lazy functions and eduction. It's always good to have an extra instrument in your toolbelt that helps you study and understand those behaviors, look inside the black box, and become wiser from that. I hope this new tool makes at least one day of your life happier. Till next time!
I thank Clojurists Together for sponsoring my open-source work in 2025 and making this release possible.
-XX:+HeapDumpOnOutOfMemoryError
creates a heap
dump on OOM which you can feed to heap analyzer. Studying the dominator object
trees may then reveal the largest data structures that filled up the memory. But
it is often hard to associate those data structures with the exact functions and
lines of code that produced it.↑