(clj 9) How to figure out what a function does
Chapter 4 of “Clojure for the Brave and True” concludes with a Vampire Data Analysis Program, serving as a summary of the chapter. The book shows some code, explains it, moves on to the next bit of code, etc. I’m not sure why, but I decided I wanted to figure out the code on my own and then read the explanation to see if I got it right. Afterwards I realized it might make a good blog post: both explaining what the program does and what techniques (for lack of a better word) I used to figure it out.
With most of the work in the program being done by a function called mapify, this blog post will focus on that single function. As it turns out, it took me more than 2000 words (footnotes not included) to describe what this function of only 9 lines does. So feel free to read all of it, skim through it, or skip straight ahead to the techniques and some reflections.
The function
This is mapify, the main function of the Vampire Data Analysis Program:
(defn mapify
  "Return a seq of maps like {:name \"Edward Cullen\" :glitter-index 10}"
  [rows]
  (map (fn [unmapped-row]
         (reduce (fn [row-map [vamp-key value]]
                   (assoc row-map vamp-key (convert vamp-key value)))
                 {}
                 (map vector vamp-keys unmapped-row)))
       rows))
Breaking up the function into its parts
To figure out what this mapify function does, we need to break it up into parts. For that we need to know how you write a function in Clojure, i.e. the syntax of a function in Clojure. So let’s go through the function step-by-step and identify its parts.
Clojure function syntax
The function definition starts and ends with a parenthesis, making it a list. Clojure is part of the Lisp family of languages, which means almost everything is a list and thus you get lots of parentheses. Next we have defn, which tells Clojure we want to define a function. Then there’s mapify, the name of the function. The part between double quotes is a docstring, describing what the function does. The line after that, [rows], is the input parameter of the function. So if we want to call the mapify function, we need to provide it with an argument for this rows parameter [1]. And everything after that is the function body, the part that defines what the function actually does.
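To make those parts concrete, here is a minimal sketch of a named function with the same anatomy. The function greet and its parameter vampire-name are made up for illustration; they are not from the book.

```clojure
;; Hypothetical example, not part of the Vampire Data Analysis Program.
(defn greet                              ; defn, followed by the function name
  "Return a greeting for vampire-name."  ; docstring
  [vampire-name]                         ; parameter vector
  (str "Hello, " vampire-name "!"))      ; function body

;; Calling the function with an argument for the vampire-name parameter:
(greet "Edward Cullen")
;; => "Hello, Edward Cullen!"
```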
Breaking up the function body
To figure out how to break up the function body, we need to do two things. First, we need to match opening parentheses with closing parentheses, so we know the overall structure. Second, we need to look up the syntax of the functions being used in the different parts, so we can make sense of them, similar to what we did with mapify and the function syntax.
The function body starts with a map. On map’s ClojureDocs page we can see that map takes two ‘groups’ of parameters: a function and one or more collections. Collections are things like lists and maps, so things that collect multiple values. To apply this to our code snippet, we need to take a good look at which opening parentheses match which closing ones. The opening parenthesis after map has its matching closing parenthesis at the end of the second-to-last line of the snippet. So (fn ... unmapped-row))) is the first argument of map. The rows on the last line is the second argument, followed by two closing parentheses: one for the function body and one for the whole function definition.
In summary, our mapify function is a function that performs map on rows using a function that spans most of the lines in the code snippet. So now we have to figure out what happens inside this function:
(fn [unmapped-row]
  (reduce (fn [row-map [vamp-key value]]
            (assoc row-map vamp-key (convert vamp-key value)))
          {}
          (map vector vamp-keys unmapped-row)))
You might have noticed that this function looks a bit different from the mapify function. That’s because this is an anonymous function, while mapify is a named function. An anonymous function does not have a name [2], so it’s used in place instead of being called from somewhere else in the code. Most of the syntax is the same though. [unmapped-row] is the function parameter and all that follows is the function body.
The function body starts with a reduce. By matching brackets, we can also see it takes three parameters: an anonymous function (first two lines), an empty map (the {}), and whatever is returned by the last line (map vector ...). reduce’s ClojureDocs page tells us that reduce either takes the form of (reduce f coll) or (reduce f val coll), so that last line must be returning a collection. And map’s ClojureDocs page confirms this: map returns a lazy sequence, which is a collection [3].
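A small sketch to check both claims at the REPL: the two forms of reduce, and whether the result of map counts as a collection. The numbers and the name pairs are only illustrative.

```clojure
;; (reduce f coll): no initial value, so it starts with the first two items.
(def sum-without-val (reduce + [1 2 3 4]))
;; sum-without-val is 10

;; (reduce f val coll): starts with val and the first item.
(def sum-with-val (reduce + 100 [1 2 3 4]))
;; sum-with-val is 110

;; map returns a lazy sequence; seq? and coll? both accept it.
(def pairs (map vector [:a :b] [1 2]))
(seq? pairs)   ; => true
(coll? pairs)  ; => true
```

So, at least as far as coll? is concerned, a lazy sequence can be passed wherever reduce expects a coll.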
So what have we figured out so far? We have the mapify function, which does a map, which does a reduce. And this reduce does something with a function and two pieces of data. Let’s figure out what exactly that is.
Understanding what the reduce does
The reduce entry on ClojureDocs almost reads like a piece of code itself:
(reduce f coll)
(reduce f val coll)
f should be a function of 2 arguments. If val is not supplied, returns the result of applying f to the first 2 items in coll, then applying f to that result and the 3rd item, etc. If coll contains no items, f must accept no arguments as well, and reduce returns the result of calling f with no arguments. If coll has only 1 item, it is returned and f is not called. If val is supplied, returns the result of applying f to val and the first item in coll, then applying f to that result and the 2nd item, etc. If coll contains no items, returns val and f is not called.
We can ignore most of it, though. We do supply a val and our coll should not be empty. That leaves us with a single relevant sentence: “If val is supplied, returns the result of applying f to val and the first item in coll, then applying f to that result and the 2nd item, etc.” Moreover, the val is an empty map and the book’s explanation of reduce consists of two examples with an empty map as val. Both examples take an existing map and create a new one, either with updated values or with some entries filtered out. So this reduce must be doing something similar.
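Here is a sketch of that pattern, in the spirit of those two examples. The maps limits and scores and their keys are illustrative data, not part of the program discussed in this post.

```clojure
;; Illustrative data, not taken from the Vampire Data Analysis Program.
(def limits {:max 30 :min 10})

;; Transform: start from an empty map and add every key with an updated value.
(def incremented
  (reduce (fn [new-map [k v]]
            (assoc new-map k (inc v)))
          {}
          limits))
;; incremented is {:max 31, :min 11}

(def scores {:human 4.1 :critter 3.9})

;; Filter: only add the entries whose value is greater than 4.
(def kept
  (reduce (fn [new-map [k v]]
            (if (> v 4)
              (assoc new-map k v)
              new-map))
          {}
          scores))
;; kept is {:human 4.1}
```

In both cases the empty map is the starting point and the function decides what ends up in the new map.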
That leaves us with two questions:
- What does the function of the reduce do?
- What data does the reduce operate on?
Let’s start with the second one, figuring out what data the reduce is operating on and then see how the function transforms that data.
What data does the reduce operate on?
The data provided to the reduce (the coll in the paragraph above) is whatever is returned by:
(map vector vamp-keys unmapped-row)
Ideally we’d break this up into parts as we have been doing. That’s not what I did, however. I first looked into the last two things, vamp-keys and unmapped-row, and then moved to the map and vector.
The vamp-keys and the unmapped-row
vamp-keys was defined earlier in the example code as:
(def vamp-keys [:name :glitter-index])
For unmapped-row we need to track back a bit to the map in the mapify function:
(map (fn [unmapped-row]
       (reduce (fn ...)
               {}
               (map vector vamp-keys unmapped-row)))
     rows)
As mentioned above, this map has two parameters: an anonymous function and a collection called rows. What map will do in the case of a single collection argument is apply the function to the first item in the collection, then to the second, etc. So the anonymous function will get one row from rows at a time from map, i.e. the unmapped-row parameter on the first line in the snippet above.
That still leaves the question of what these rows are. The answer to that is vampire names and their glitter index:
Edward Cullen,10
Bella Swan,0
Charlie Swan,0
Jacob Black,3
Carlisle Cullen,6
Or rather, that’s what’s in the .csv file, which after parsing [4] is provided to the mapify function like this:
(["Edward Cullen" "10"] ["Bella Swan" "0"] ["Charlie Swan" "0"] ["Jacob Black" "3"] ["Carlisle Cullen" "6"])
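To see that map with a single collection really hands the function one row at a time, here is a minimal sketch. The anonymous function is a stand-in that only returns the first element of each row; it is not the function used in mapify.

```clojure
;; A few of the parsed rows from above.
(def rows '(["Edward Cullen" "10"] ["Bella Swan" "0"] ["Charlie Swan" "0"]))

;; Stand-in function: each call receives one row as unmapped-row.
(def first-items
  (map (fn [unmapped-row]
         (first unmapped-row))
       rows))
;; first-items is ("Edward Cullen" "Bella Swan" "Charlie Swan")
```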
The vector and the map
vector takes whatever you give it and puts it in a vector (a vector is basically a list or array):
(def unmapped-row ["Edward Cullen" "10"])
(vector vamp-keys unmapped-row)
; => [[:name :glitter-index] ["Edward Cullen" "10"]]
However, that’s not what (map vector vamp-keys unmapped-row) does. Hence my comment earlier that I should have broken down this function based on its syntax and not simply started at the end and worked my way to the front. What’s not happening here is that map is applied to what’s returned by vector. In that case the code would have been (map (vector vamp-keys unmapped-row)). What’s happening instead is that vector, vamp-keys, and unmapped-row are arguments for the map function.
Earlier we had a map with a function and one collection as parameters. Now we have a function (vector) and two collections (vamp-keys and unmapped-row). How map works in this case is that it will take the first item in vamp-keys and the first item in unmapped-row and apply vector to them. Then it’ll do the same for the second items, etc. until one of the two collections is exhausted:
(map vector vamp-keys unmapped-row)
; => ([:name "Edward Cullen"] [:glitter-index "10"])
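The “until one of the two collections is exhausted” part can be checked with a row that has one item too many. A minimal sketch; the extra "extra" value and the names long-row and paired are made up for this illustration.

```clojure
(def vamp-keys [:name :glitter-index])

;; Hypothetical row with a third item that has no matching key.
(def long-row ["Edward Cullen" "10" "extra"])

;; map stops as soon as vamp-keys runs out, so "extra" is dropped.
(def paired (map vector vamp-keys long-row))
;; paired is ([:name "Edward Cullen"] [:glitter-index "10"])

;; The same pairing idea with a different function:
(def sums (map + [1 2 3] [10 20 30]))
;; sums is (11 22 33)
```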
Recap
The mapify function is using a reduce to create a new map based on some data it’s getting and a function. We now know what that data looks like. It’s a set of rows with a single row looking like this: ([:name "Edward Cullen"] [:glitter-index "10"]). The next step is figuring out what the function provided to the reduce does with these rows.
What does the function of the reduce do?
The function of our reduce is:
(fn [row-map [vamp-key value]]
  (assoc row-map vamp-key (convert vamp-key value)))
To understand what this function does, it’s helpful to know what arguments will be provided to it, i.e. what will be used for [row-map [vamp-key value]]. With this function being used in the reduce, its arguments are provided by that reduce. So we’re a bit stuck in a loop: to understand this function we need to understand the reduce and to understand the reduce we need to understand this function.
To escape that loop, we’re not going into how reduce works just yet. Instead, I’m going to say (simplified so not entirely correct) that row-map is the second argument provided to the reduce, i.e. the empty map {}. And that [vamp-key value] is the first item in our collection (see above), i.e. [:name "Edward Cullen"].
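The parameter vector [row-map [vamp-key value]] uses destructuring: the second argument is a two-item vector that gets unpacked into vamp-key and value. A minimal sketch to make that visible; show-args is a hypothetical helper that just reports what each name is bound to.

```clojure
;; Hypothetical helper with the same parameter vector as the reduce function.
(def show-args
  (fn [row-map [vamp-key value]]
    {:row-map row-map :vamp-key vamp-key :value value}))

;; Called the way reduce would call it the first time:
(show-args {} [:name "Edward Cullen"])
;; => {:row-map {}, :vamp-key :name, :value "Edward Cullen"}
```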
That leaves us with two words to explain in this anonymous function: assoc and convert.
The convert function was defined earlier in the code example [5]. If vamp-key is :glitter-index, it converts the value from a string to an integer, e.g. from "10" to 10.
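For readers who don’t want to look it up, here is one way a convert with that behaviour could be written. This is a sketch that matches the description above, not necessarily the book’s exact code; the conversions map is an assumption of this sketch.

```clojure
;; Assumed lookup table: which conversion function belongs to which key.
(def conversions {:name          identity
                  :glitter-index (fn [s] (Integer/parseInt s))})

(defn convert
  "Convert value using the conversion function registered for vamp-key."
  [vamp-key value]
  ((get conversions vamp-key) value))

(convert :glitter-index "10")      ; => 10
(convert :name "Edward Cullen")    ; => "Edward Cullen"
```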
assoc is a way to add a key-value pair to a map or update an existing one [6]. For example:
(assoc {} :name "Edward Cullen")
; => {:name "Edward Cullen"}
(assoc {:name "Edward Cullen"} :glitter-index 10)
; => {:name "Edward Cullen" :glitter-index 10}
Now we have all the elements to figure out what this function used by the reduce does. It takes an existing map (row-map) and a key-value pair (e.g. :name "Edward Cullen"). It converts the value of the key-value pair to an integer if the key equals :glitter-index. Then it creates a new map that contains both the contents of the existing map and the post-conversion key-value pair.
Putting the pieces of the reduce back together
Now that we understand the parts of the reduce, we should be able to put them together again to understand the whole:
(fn [unmapped-row]
  (reduce (fn [row-map [vamp-key value]]
            (assoc row-map vamp-key (convert vamp-key value)))
          {}
          (map vector vamp-keys unmapped-row)))
To summarize, those parts are:
- reduce takes an f, a val, and a coll. It returns the result of applying f to val and the first item in coll, then applying f to that result and the 2nd item, etc.
- the f adds a key-value pair to the map and converts the :glitter-index value from a string to an integer
- the val is an empty map, {}, at least initially
- the coll is a parsed row from the .csv file paired with its keys, e.g. ([:name "Edward Cullen"] [:glitter-index "10"])
This means reduce will execute the function twice. The first time it adds the name key-value pair to the empty map [7]. The second time it adds the glitter-index key-value pair (including converting the value to an integer) to that same map. So we end up with both the name and glitter-index key-value pairs in the map.
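These two steps can be made visible with reductions, which works like reduce but returns every intermediate result. A minimal sketch; the convert defined here is a simplified stand-in based on the description above, not the book’s definition.

```clojure
(def vamp-keys [:name :glitter-index])

;; Simplified stand-in for the book's convert (an assumption of this sketch).
(defn convert [vamp-key value]
  (if (= vamp-key :glitter-index)
    (Integer/parseInt value)
    value))

;; Same f, val, and coll as the reduce, but keeping the intermediate maps.
(def steps
  (reductions (fn [row-map [vamp-key value]]
                (assoc row-map vamp-key (convert vamp-key value)))
              {}
              (map vector vamp-keys ["Edward Cullen" "10"])))
;; steps is ({}
;;           {:name "Edward Cullen"}
;;           {:name "Edward Cullen", :glitter-index 10})
```

The first item is the initial val, followed by the result of each of the two calls to the function.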
We can check our understanding by isolating the reduce and providing it with an unmapped-row:
(def unmapped-row ["Edward Cullen" "10"])
(reduce (fn [row-map [vamp-key value]]
          (assoc row-map vamp-key (convert vamp-key value)))
        {}
        (map vector vamp-keys unmapped-row))
; => {:name "Edward Cullen", :glitter-index 10}
The mapify function, mapping the reduce
Now that we understand the reduce, we can actually guess what the map that uses this reduce does. The reduce works on a single row, but our data consists of several rows. So the map must be the thing that loops through the rows, providing them one-by-one to the reduce.
Of course, we want to do better than guessing. So let’s revisit map. It takes a function and at least one collection. It will apply the function to the set of first items in the collections, then to the second, etc. The function is the reduce we just figured out. And as we saw when breaking up the function body, the map in mapify has only one collection parameter, called rows. So that confirms our guess: map loops through the rows [8] and transforms them via the reduce.
In conclusion, all this figuring out lets us look at this:
(defn mapify
  "Return a seq of maps like {:name \"Edward Cullen\" :glitter-index 10}"
  [rows]
  (map (fn [unmapped-row]
         (reduce (fn [row-map [vamp-key value]]
                   (assoc row-map vamp-key (convert vamp-key value)))
                 {}
                 (map vector vamp-keys unmapped-row)))
       rows))
and think of it like this:
(defn mapify
  "Return a seq of maps like {:name \"Edward Cullen\" :glitter-index 10}"
  [rows]
  (map (do-stuff) rows))
Where we know that do-stuff changes the rows into maps with keys :name and :glitter-index and that it converts the value of the :glitter-index from a string to an integer. Looking at the function fully written out, we’ll probably remember that the (map vector vamp-keys unmapped-row) does the former and that the function of the reduce does the latter. And if we want to know how exactly it does these things, we’ll have to re-figure it out. Or scroll back up a bit and read it there.
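That simplified view can also be written as runnable code by giving the anonymous function a name. This is a sketch of a possible refactoring, not the book’s code: row->map is a made-up name, and convert is a simplified stand-in based on the description earlier in this post.

```clojure
(def vamp-keys [:name :glitter-index])

;; Simplified stand-in for the book's convert (an assumption of this sketch).
(defn convert [vamp-key value]
  (if (= vamp-key :glitter-index)
    (Integer/parseInt value)
    value))

;; The "do-stuff" part: turn one unmapped row into one map.
(defn row->map [unmapped-row]
  (reduce (fn [row-map [vamp-key value]]
            (assoc row-map vamp-key (convert vamp-key value)))
          {}
          (map vector vamp-keys unmapped-row)))

;; mapify then only has to map that function over the rows.
(defn mapify [rows]
  (map row->map rows))

(mapify '(["Edward Cullen" "10"] ["Bella Swan" "0"]))
;; => ({:name "Edward Cullen", :glitter-index 10}
;;     {:name "Bella Swan", :glitter-index 0})
```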
Having done all this work (and writing) I can’t help but feel a bit disappointed. In the end, the function does not seem to do a lot. To illustrate, in Python you’d write this to do basically [9] the same thing:
def mapify(rows):
    return [
        {
            vamp_keys[0]: row[0],
            vamp_keys[1]: int(row[1])
        }
        for row in rows
    ]

rows = [["Edward Cullen", "10"], ["Bella Swan", "0"], ["Charlie Swan", "0"],
        ["Jacob Black", "3"], ["Carlisle Cullen", "6"]]
vamp_keys = ["name", "glitter-index"]

mapify(rows)
# => [{'name': 'Edward Cullen', 'glitter-index': 10}, {'name': 'Bella Swan', 'glitter-index': 0},
#     {'name': 'Charlie Swan', 'glitter-index': 0}, {'name': 'Jacob Black', 'glitter-index': 3},
#     {'name': 'Carlisle Cullen', 'glitter-index': 6}]
edit 13 July 2022
The Clojure and Python versions are not as equivalent as I thought, see my follow-up post (clj 10) The mapify function of (clj 9) revisited.
Techniques I used in figuring all of this out
- break things down in parts, understand the parts, put it all back together again
- focus on the current level and ignore the specifics of the levels above and below
- read ClojureDocs
- decide what the argument provided to a function is, because I know what the function expects as parameter
- make an educated guess to fill in a gap (this map must be how the code loops through the rows)
- understand the code in context, i.e. with the data provided in the example, instead of in a more general way
- not worry too much about abstract data structures (collections and sequences)
- figure out the first iteration of a loop (for the reduce) before looking into the looping
- simplify what a function does based on the number of parameters, e.g. think of map with one coll as a loop
- recognize patterns, e.g. a reduce with an empty map as val is a way to transform an existing map
- take part of the function and run it to see what it does
- rewrite the code in a different language
- write this blog post (arguably also a rewrite in a different language)
Reflections
Writing this blog post was harder and took a lot more time (and words) than I thought. Luckily, I also learned a lot.
The Clojure code is very nested, with functions inside of functions inside of functions. It looks and feels different from the Python code I’m used to. The explanation must be the difference in programming paradigms: functional (Clojure) versus procedural (Python). In procedural programming you define a sequence of steps; in functional programming you have functions nested in other functions. (Note: Python supports more paradigms than procedural, but I’ve mostly used it in a procedural way.)
Another related difference between the two versions is that to me the Clojure version looks more complicated than the Python version [10]. Part of the explanation is the difference in paradigm I mentioned above. Another part is a difference in how the syntactical work happens. Not just, for example, parentheses (Clojure) and whitespace (Python), but also filter and map (Clojure) and list comprehensions (Python). A great illustration of this is Guido van Rossum arguing to remove reduce(), filter(), and map() from Python’s built-in functions.
A third part of the explanation, and I think the most interesting part, is that I read them in a different way. For example, when I came across (map vector vamp-keys unmapped-row) in Clojure, I had to figure it out step-by-step. In Python on the other hand, I can read and understand e.g. a list comprehension (the [ { ... } for row in rows]) without much effort. I immediately see the whole, I recognize the pattern of a list comprehension. So the main difference might be that I need more practice reading Clojure, until there too I can see the patterns.
[1] It probably does not aid readability, but I’ve decided to maintain the distinction between parameters and arguments. A parameter is a variable that’s part of the function definition. An argument is a variable or expression that’s used when calling a function. Bonus points to you if you find a place where I messed this up in this blog post.
[2] Although it turns out you can name an anonymous function. This can be useful for readability and in stack traces.
[3] Writing all of this out makes me appreciate how complicated and multi-layered all of this is. As “Clojure for the Brave and True” says in chapter 4:
“The collection abstraction is closely related to the sequence abstraction. All of Clojure’s core data structures - vectors, maps, lists, and sets - take part in both abstractions. The sequence abstraction is about operating on members individually, whereas the collection abstraction is about the data structure as a whole.”
So we’re dealing with an abstract data structure, the collection, which tells us something about what we can do with it as a whole. We’re also dealing with a different abstract data structure, the lazy sequence, which tells us something about what we can do with its individual members, i.e. sequencey things, while it also tells us something about how this is implemented, i.e. lazily. So now I’m wondering if I am correct in calling a lazy sequence a collection…
I wrote about the sequence abstraction in (clj 7) and about lazy sequences in (clj 8).
[4] If you want to know how the parsing works, you can read it in the online version of the book.
[5] See the online version of the book if you’re curious about convert.
[6] For completeness’ sake: data in Clojure is immutable, so assoc will not change the existing map, it returns a new one.
[7] Data is immutable in Clojure, so to be correct I should say that Clojure creates a new map. I don’t think this matters here and conceptually it’s easier to think of things being added to the map.
[8] In this sense map with one coll parameter feels different to me than with multiple coll parameters. With a single one, it acts like a for-loop. Take the first thing and apply the function, take the second thing and apply the function, etc. With multiple colls it takes all the first things and applies the function to them, then all the second things, etc. until one of the colls is exhausted. Described like this, these are two different things. However, from a more abstract viewpoint, I also see that the single coll is not fundamentally different, it’s a kind of special case of the multiple coll situation.
[9] The Python code does not do the convert in a separate function. It would also return a different result than the Clojure code if the data had more than two items per row.
[10] It makes me wonder if the example is fully idiomatic Clojure. It probably is, though, because it’s from a book teaching Clojure.