(by w01fe)
In Clojure, there is a potentially daunting number of ways to represent a slice of data that would have been an Object in OO-land. For example, we could represent a Person with a `firstName` and `lastName` as any of:
- tuple: `["joe" "schmoe"]`
- plain old map: `{:first-name "joe" :last-name "schmoe"}`
- struct-map: (Clojure 1.0, basically subsumed by records, forget I mentioned them.)
- defrecord: `(defrecord Person [first-name last-name])`
- deftype: `(deftype Person [first-name last-name])`
- reify: `(defn person [first last] (reify Human (first-name [this] first) (last-name [this] last)))`
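To make the list concrete, here is a minimal sketch that builds the same person each way and reads the first name back. The `Human` protocol is assumed from the reify example; the names `tuple-person`, `map-person`, `PersonRecord`, and `PersonType` are made up here so the definitions can coexist in one namespace.

```clojure
;; The reify variant needs a protocol (or interface) to implement.
(defprotocol Human
  (first-name [this])
  (last-name [this]))

(def tuple-person ["joe" "schmoe"])
(def map-person {:first-name "joe" :last-name "schmoe"})

;; Hypothetical names, chosen so the record and the type do not clash.
(defrecord PersonRecord [first-name last-name])
(deftype PersonType [first-name last-name])

(defn person [first last]
  (reify Human
    (first-name [this] first)
    (last-name [this] last)))

(first tuple-person)                           ; positional access
(:first-name map-person)                       ; keyword lookup
(:first-name (->PersonRecord "joe" "schmoe"))  ; records support map-style lookup
(.-first-name (PersonType. "joe" "schmoe"))    ; deftype: Java field access only
(first-name (person "joe" "schmoe"))           ; reify: protocol methods only
```

Each of the last five expressions evaluates to `"joe"`; what differs is how much the caller has to know about the representation.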
At first, I thought this session would just cover these data representations. But the whole reason we care about data representation is that we want to make it easy to do the operations we want on our data -- thus, it makes no sense to think about data in the absence of functions. A complicating factor is that we sometimes want these functions to be polymorphic -- that is, to work (differently) across a variety of data types. Again, we are provided with a family of options:
- plain old functions (with `instance?` and explicit conditional logic for polymorphism)
- multimethods
- protocols
- raw Java interfaces
We'll start by surveying these ingredients and their pros and cons independently, and then discuss some "recipes" for combining them in ways I've found fruitful in various circumstances.
While data abstractions are arguably more fundamental, we'll start by reviewing the options for operations on data. This way, we'll have full context when we get to the data types. Here's a table describing major features of the four options mentioned above.
| | defn | defmulti | defprotocol | definterface |
|---|---|---|---|---|
| JVM representation | field lookup + virtual method call | hand-rolled hierarchical dispatch | instanceof check; if it passes, virtual method call; otherwise, polymorphic inline cached call | virtual method call |
| Dispatch on? | static (not polymorphic) | any function of arguments | class of first argument | class of first argument |
| Open (extensible to existing data types)? | N/A | yes | yes | no |
| Bundled (by data type)? | N/A | no | yes | yes |
| Efficient? | very good | okay | good | best possible |
| Primitives? | yes, but up to 4 args, only long/double | no | no | full support |
| Repl redefinition | great | meh | meh [2] | meh [2] |
| Documentation | docstrings [1] | docstrings | docstrings | static types |
| Ease of use | great | good | good | okay [2] |
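As a rough illustration of the first three columns, here is the same operation written as a plain function, a multimethod, and a protocol. The `Circle` and `Square` records and all function names are invented for this sketch.

```clojure
(defrecord Circle [r])
(defrecord Square [side])

;; 1. Plain function: polymorphism via explicit conditional logic.
(defn area-fn [shape]
  (cond
    (instance? Circle shape) (* Math/PI (:r shape) (:r shape))
    (instance? Square shape) (* (:side shape) (:side shape))
    :else (throw (IllegalArgumentException. "unknown shape"))))

;; 2. Multimethod: dispatch on any function of the arguments (here, class).
(defmulti area-multi class)
(defmethod area-multi Circle [{:keys [r]}] (* Math/PI r r))
(defmethod area-multi Square [{:keys [side]}] (* side side))

;; 3. Protocol: dispatch on the class of the first argument,
;;    extended to existing types after the fact.
(defprotocol Shape
  (area [this]))

(extend-protocol Shape
  Circle (area [c] (* Math/PI (:r c) (:r c)))
  Square (area [s] (* (:side s) (:side s))))

(area-fn (->Square 2))
(area-multi (->Square 2))
(area (->Square 2))
```

All three calls return `4`. The multimethod and the protocol can both be extended to new types from other namespaces, while `area-fn` has to be edited in place.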
Moving on to data types, we'll cover the features of the four main contenders above. We intentionally omit:
- tuples, which in our experience are rarely the right choice, and certainly aren't well-suited for polymorphism
- `struct-map`s, which are deprecated
- `proxy` and `gen-class`, which exist primarily for Java interop.
| | hash-map | defrecord | deftype | reify |
|---|---|---|---|---|
| JVM representation | hash array mapped trie | named class with public fields + hash-map for extra keys | named class with public fields | anonymous class with private fields |
| Memory usage | up to ~10x contents | compact | compact | compact |
| Field access | map get | static (.field), fast (:key), slower but dynamic (get); direct access in protocol/interface implementations | static (.field); direct access in protocol/interface implementations | no |
| Lookup performance | hash lookups | field access (or slightly slower, but optimized keyword lookup) | field access | N/A (protocol/interface methods only) |
| Extensibility (can add arbitrary mappings) | a map | behaves like a map | no | N/A |
| Primitive support | no | yes (on base members) | yes | yes (as supported by interfaces) |
| Typed Object subclass members | no | no(t yet) | no(t yet) | N/A |
| Equality | value | value+type (just value for Java .equals/.hashCode) | identity (overridable) | identity (overridable) |
| Mutable fields | no | no | yes (private only) | no |
| Serialization | great | good (pr-str works, but json encoding loses types, etc.) | ok (custom) | no |
| Type for dispatch | not really (ad-hoc :type field) | yes (generates named class) | yes (generates named class) | sort of (generates anonymous class) |
| Works with protocols/interfaces? | no | yes | yes | yes |
| Repl redefinition | great | meh [2] | meh [2] | good |
| Documentation | meh [1] | ok [1] | ok | ok |
| Ease of use | great | good (supports map lookups, but no map-vals, etc) | ok (Java field access only, you bring the sugar) | good (no ceremony around names) |
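A few of these rows are easy to check at the REPL. The sketch below pokes at the equality and extensibility rows, using a hypothetical `Point` record and `OpaquePoint` type.

```clojure
(defrecord Point [x y])
(deftype OpaquePoint [x y])

(def p (->Point 1 2))

;; Equality: records compare by value + type, deftypes by identity.
(= p (->Point 1 2))                        ; true
(= p {:x 1 :y 2})                          ; false: same contents, different type
(= (OpaquePoint. 1 2) (OpaquePoint. 1 2))  ; false: two distinct objects

;; Extensibility: a record accepts arbitrary extra keys and stays a record...
(def p3 (assoc p :z 3))
(:z p3)                                    ; 3
(instance? Point p3)                       ; true

;; ...but removing a base field degrades it to a plain map.
(instance? Point (dissoc p :x))            ; false

;; deftype offers field access and nothing else.
(.-x (OpaquePoint. 1 2))                   ; 1
```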
Now I understand how all the pieces work. But how should I represent my Widget?
This section will aim to provide some rules and heuristics for selecting appropriate data and operation types. Of course, please keep in mind that the answer is not always clear-cut, since it involves the interactions of the above features, which can be even more complex and nuanced than the individual differences in data types and operations.
You have a single data type, and one or more operations you want to perform on it.
This is a relatively easy case, since you can mostly refer to the 'data types' table to figure out your best course of action. If you need primitive support (etc.), you should probably use a defrecord. If you (really) need mutable fields, you're stuck with deftypes. If efficiency is not a concern, maps are the simplest and easiest option.
Regardless, if your data is shared across many namespaces, or serialized and stored or shared across process boundaries, you should probably have a concrete schema to refer back to. deftypes and defrecords give you some of this, but don't capture field constraints beyond primitive values, so it's prudent to use a schema library [1] to precisely describe your data for documentation and safety, which can work just as well with plain maps as more structured types.
For operating on your data, plain old functions are the simplest and typically best option. We recommend that you use `safe-get` [3] to access fields, use docstrings and schemas to document your code, and organize your namespace into clear public and private sections.
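For readers without Plumbing on the classpath, here is a minimal stand-in for the idea: a lookup that throws instead of silently returning `nil` when a key is missing. `safe-get*` and `display-name` are hypothetical names for this sketch, and this is not Plumbing's actual implementation.

```clojure
(defn safe-get*
  "Like get, but throws if key k is absent from map m.
  A hypothetical stand-in for the safe-get mentioned above."
  [m k]
  (if-let [entry (find m k)]
    (val entry)   ; the key is present, even if its value is nil
    (throw (ex-info (str "Key " k " not found") {:key k :map m}))))

(defn display-name
  "Returns the person's full name. Expects :first-name and :last-name keys."
  [person]
  (str (safe-get* person :first-name) " " (safe-get* person :last-name)))

(display-name {:first-name "joe" :last-name "schmoe"})
```

With a plain `(:last-name person)`, a misspelled or missing key would quietly produce `"joe "`; here it fails loudly at the point of the mistake.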
With this discipline under your belt, the only real benefits of using records and interfaces or protocols are primitive support, and the appearance of lexical scope for your data members. The price you pay for these features is the extra ceremony around declaring interfaces and data classes, plus (in my opinion) slightly decreased ease of use.
If you do need polymorphism, however, ordinary functions are usually not a great choice. A single `instance?` check or `case` on `:type` isn't the end of the world, and sometimes is the simplest and cleanest solution -- but once these conditionals start appearing in multiple places, there's a good chance you're doing it wrong.
If you need polymorphism of an exotic form, where you're not just conditioning on the class of the first argument, then you need multimethods. In our experience, this is a pretty rare occurrence. Multimethods are so general that you can pick the data format best suited to your application on its own merits. If you want to use maps for your data representation, a simple `:type` field mapping to a keyword can be used for dispatch.
Similarly, if you want an extensible method without a corresponding concrete data type, multimethods give you a way to declare open dispatch without tying you down to a concrete data representation. For example, you can make a function that dispatches on its first argument value (not class), which anyone can extend.
Beyond these cases, you should probably think hard before using a multimethod. This is especially true if you have multiple polymorphic methods, since you'll need to repeat your dispatch logic in each multimethod if you choose this option.
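Here is a small sketch of both multimethod uses described above: dispatch on a `:type` keyword stored in a plain map, and an "exotic" dispatch on the types of two arguments at once. The `:employee`/`:customer` types and the function names are illustrative only.

```clojure
;; Dispatch on the value of the :type key of a plain map.
(defmulti greeting-for :type)
(defmethod greeting-for :employee [p] (str "Welcome back, " (:first-name p)))
(defmethod greeting-for :customer [p] (str "How can we help, " (:first-name p) "?"))
(defmethod greeting-for :default [p] "Hello")

;; Dispatch on both arguments, which protocols cannot express.
(defmulti encounter (fn [a b] [(:type a) (:type b)]))
(defmethod encounter [:employee :customer] [a b] :serve)
(defmethod encounter [:customer :employee] [a b] :ask-for-help)
(defmethod encounter :default [a b] :ignore)

(def joe {:type :employee :first-name "joe"})
(def ann {:type :customer :first-name "ann"})

(greeting-for joe)
(encounter joe ann)
(encounter ann ann)
```

Note that the dispatch function is repeated for every multimethod that needs it, which is the duplication warned about above.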
We've almost reached the end of the road for plain old maps as well. But before we get there, it's probably worth mentioning one more way to achieve polymorphism: storing functions as fields, à la JavaScript (or many languages that came before it).
(def my-obj
{:foo (fn [this y] (bar (:baz this) y))
:baz 12})
;; caller
((:foo my-obj) my-obj 12)
This gives you flexibility to do crazy things that are difficult to achieve with more rigid interfaces and records; you can `merge` "objects", `assoc` new "methods", and bring all the other tools you usually use to manipulate data to bear on constructing your polymorphic objects.
That said, this method is rather clunky and hard to understand. (Where the hell is the :foo function of my-other-obj defined?) We've only had one or two cases where we felt that we needed this power, and they've all been replaced with Graph [3], which has a similar model but abstracts away some of the complexity of this approach when your object is really trying to represent a flexible computation process with many steps.
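To show what that flexibility looks like in practice, here is a runnable variation on the map-of-functions idea. `call-method` is a hypothetical helper that hides the awkward "pass the object to its own function" step, and `+` and `*` stand in for real method bodies.

```clojure
(def base-obj
  {:baz 12
   :foo (fn [this y] (+ (:baz this) y))})

(defn call-method
  "Looks up method-key in obj and calls it with obj as the first argument."
  [obj method-key & args]
  (apply (get obj method-key) obj args))

;; 'Override' a method with assoc, or change 'fields' with merge.
(def multiplying-obj (assoc base-obj :foo (fn [this y] (* (:baz this) y))))
(def bigger-obj (merge base-obj {:baz 100}))

(call-method base-obj :foo 1)         ; 13
(call-method multiplying-obj :foo 2)  ; 24
(call-method bigger-obj :foo 1)       ; 101
```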
And now, we've hit the end of the road for plain old maps.
If you need maximal memory efficiency and/or unboxed primitives, you must use defrecord, deftype, or reify. If you need mutable members, you must use deftype. If you need custom equality semantics or map semantics, you must use deftype or reify. If you want efficiency and (less efficient) extensibility, you probably want defrecord.
If you need complex logic that returns primitives, you need to use Java interfaces to work with these objects.
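As a sketch of the primitive and mutable cases, the following pairs `definterface` (for primitive argument and return types) with `deftype` (for a primitive field and a mutable field). The interface and type names are made up for illustration.

```clojure
;; An interface whose method takes a primitive long and returns a primitive double.
(definterface IScorer
  (^double score [^long n]))

;; The field is stored as an unboxed double.
(deftype LinearScorer [^double weight]
  IScorer
  (score [this n] (* weight n)))

;; A mutable primitive field; only deftype allows this.
(definterface ICounter
  (^long increment []))

(deftype Counter [^:unsynchronized-mutable ^long n]
  ICounter
  (increment [this]
    (set! n (inc n))   ; mutable fields are private, so this only works inside the type
    n))

(.score (LinearScorer. 0.5) 4)   ; 2.0

(let [c (Counter. 0)]
  (.increment c)
  (.increment c))                ; 2
```

`^:unsynchronized-mutable` does no locking, so a counter like this is only safe when confined to a single thread.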
If you need none of these things, you don't want `deftype` or `definterface`.
We're left with a common case, where you do need polymorphism but don't require extreme performance or complex dispatch.
We've already covered the cases where you want to use plain old functions, multimethods, and interfaces; and plain old maps and `deftype`s on the data type side. This leaves us with a single operation type, protocols (or `definterface+` [2], if you're concerned about memory/perf implications of protocol dispatch), and two data types, `defrecord`s and `reify`.
At this point, the decision is pretty simple, based on your use case.
`reify` is simpler -- you don't have to explicitly name your fields since Clojure automatically captures things referenced in lexical scope, and you don't need to define a separate constructor function if you want constructor-like logic. The price you pay for this simplicity is that the objects you create are opaque: if you want access to any 'fields' you will have to create protocol methods for this purpose.
On the other hand, `defrecord` is transparent -- people can examine your object, pull out fields, reason about the `Class` of your data, and so on.
Thus, the choice of `reify` or `defrecord` primarily comes down to what you are trying to represent; if you're primarily concerned with data, you probably want `defrecord`, whereas if you only care about behavior then `reify` may be a simpler choice.
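The trade-off can be seen side by side in a small sketch, where a hypothetical `Greeter` protocol is implemented once with `reify` and once with `defrecord`.

```clojure
(defprotocol Greeter
  (greet [this other-name]))

;; reify: `greeting` is simply captured from lexical scope, and the
;; enclosing function doubles as the constructor.
(defn make-greeter [greeting]
  (reify Greeter
    (greet [_ other-name] (str greeting ", " other-name "!"))))

;; defrecord: `greeting` is a declared field that callers can inspect.
(defrecord RecordGreeter [greeting]
  Greeter
  (greet [_ other-name] (str greeting ", " other-name "!")))

(greet (make-greeter "Hello") "joe")          ; "Hello, joe!"
(greet (->RecordGreeter "Hello") "joe")       ; "Hello, joe!"

(:greeting (->RecordGreeter "Hello"))         ; "Hello": the record is transparent
(:greeting (make-greeter "Hello"))            ; nil: the reified object is opaque
```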
We're basically done with our tour, but there's one issue we haven't touched on yet: abstract data members. What if you have multiple data types (posts and URL documents, employees and customers, etc.) and want a data-centric interface (all documents have titles, all people have names, etc.)? None of Clojure's data types allow for implementation inheritance, so if your employees and customers are separate records, you're out of luck for getting the static checking of Java-style field access `(.first-name r)`.
In this case, there are three options at your disposal, none of which is really ideal:
- Use an informal interface (a.k.a. docstring): "All people have :first-name and :last-name keys". This should probably be backed by schemas [1] and liberal use of `safe-get` [3] to ensure your data measures up.
- Be oh-so-formal: declare a protocol full of 'getter' functions, and fill each of your records with methods like `(first-name [this] first-name)`. This pain can sometimes be alleviated by defining a single record or reify that goes through this ceremony, and letting each of your 'objects' share this single constructor -- they can even pass in functions that the single implementation delegates to, if you need limited polymorphism.
- If you're not concerned with polymorphism but just a hierarchy of data types, you may be able to flip things around so that all data types are represented with a single `defrecord` that has `:type` and `:type-info` fields to allow extensibility.
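The second and third options might look roughly like the sketch below, which uses the "all documents have titles" example. The `Titled` protocol and the `Post`, `UrlDoc`, and `Doc` records are hypothetical names.

```clojure
;; Option 2: a protocol of 'getter' functions, implemented by each record.
(defprotocol Titled
  (title [this]))

(defrecord Post [title body]
  Titled
  (title [this] title))   ; inside the method, `title` refers to the field

(defrecord UrlDoc [url title]
  Titled
  (title [this] title))

(title (->Post "Hello" "First post"))
(title (->UrlDoc "https://clojure.org" "Clojure"))

;; Option 3: one record for the whole hierarchy; type-specific data
;; lives under :type-info.
(defrecord Doc [type title type-info])

(def post-doc (->Doc :post "Hello" {:body "First post"}))
(def url-doc (->Doc :url "Clojure" {:url "https://clojure.org"}))

(:title post-doc)                 ; shared field, same access for every type
(get-in url-doc [:type-info :url])
```

Option 2 buys a formal contract at the cost of boilerplate in every record. Option 3 keeps a single concrete class, but anything under `:type-info` is back to being an informal map.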
1. Schema is a library for declaring data shapes, and annotating functions with input and output schemas. Besides their other benefits, records have the documentation advantage of having a concrete description that is type-hintable; among other things, Schema brings these same benefits to ordinary Clojure maps.
2. Potemkin provides some great tools for dealing with interfaces, protocols, records, and so on. In particular, it provides variants of `defprotocol` and `defrecord` that are more repl-friendly, and an implementation of `definterface` that's a drop-in replacement for `defprotocol`, allowing full primitive support with automatic wrapper functions (but without the open-ness of protocols, of course).
3. Plumbing is a library of Clojure utility functions, including Graph, a tool for declarative description of functional processes.