20130926-data-representation.md

"Objects" in Clojure: design choices for data and (polymorphic) operations

(by w01fe)

In Clojure, there is a potentially daunting number of ways to represent a slice of data that would have been an Object in OO-land. For example, we could represent a Person with a firstName and lastName as any of:

  • tuple: ["joe" "schmoe"]
  • plain old map: {:first-name "joe" :last-name "schmoe"}
  • struct-map: (Clojure 1.0, basically subsumed by records, forget I mentioned them.)
  • defrecord: (defrecord Person [first-name last-name])
  • deftype: (deftype Person [first-name last-name])
  • reify: (defn person [first last] (reify Human (first-name [this] first) (last-name [this] last)))
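To make the list concrete, here is a minimal sketch that builds the same person in each representation and reads the first name back out. The names `PersonRecord`, `PersonType`, and the `Human` protocol definition are assumptions made for illustration (the record and the type need distinct names to coexist in one namespace); struct-maps are left out.

```clojure
;; Illustrative names only; each definition mirrors one bullet above.
(def tuple-person ["joe" "schmoe"])
(def map-person {:first-name "joe" :last-name "schmoe"})

(defrecord PersonRecord [first-name last-name])
(deftype PersonType [first-name last-name])

;; reify needs a protocol (or interface) to implement; this one is assumed.
(defprotocol Human
  (first-name [this])
  (last-name [this]))

(defn person [first last]
  (reify Human
    (first-name [this] first)
    (last-name [this] last)))

;; Reading the first name back out of each representation.
(def first-names
  [(nth tuple-person 0)                           ; tuple: positional access
   (:first-name map-person)                       ; map: keyword lookup
   (:first-name (->PersonRecord "joe" "schmoe"))  ; record: keyword lookup works too
   (.-first-name (PersonType. "joe" "schmoe"))    ; deftype: field access only
   (first-name (person "joe" "schmoe"))])         ; reify: protocol method only
```

Every element of `first-names` is "joe"; what differs is how much each representation tells the reader (and the compiler) about the shape of the data.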

At first, I thought this session would just cover these data representations. But, the whole reason we care about data representation is because we want to make it easy to do the operations we want on our data -- thus, it makes no sense to think about data in the absence of functions. A complicating factor is that we sometimes want these functions to be polymorphic -- that is, work (differently) across a variety of different data types. Again, we are provided with a family of options:

  • plain old functions (with instance? and explicit conditional logic for polymorphism)
  • multimethods
  • protocols
  • raw Java interfaces
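As a rough sketch of what these four options look like side by side, the snippet below computes an area in each style. The `Circle`, `Square`, and `Rect` types and all function names are hypothetical, chosen only for illustration.

```clojure
(defrecord Circle [r])
(defrecord Square [side])

;; 1. Plain function: polymorphism via explicit instance? checks.
(defn area-fn [shape]
  (cond
    (instance? Circle shape) (* Math/PI (:r shape) (:r shape))
    (instance? Square shape) (* (:side shape) (:side shape))
    :else (throw (IllegalArgumentException. "unknown shape"))))

;; 2. Multimethod: dispatch on any function of the arguments (here, class).
(defmulti area-multi class)
(defmethod area-multi Circle [c] (* Math/PI (:r c) (:r c)))
(defmethod area-multi Square [s] (* (:side s) (:side s)))

;; 3. Protocol: dispatch on the class of the first argument. It is open,
;;    so it can be extended to types after they have been defined.
(defprotocol HasArea
  (area [this]))

(extend-protocol HasArea
  Circle (area [c] (* Math/PI (:r c) (:r c)))
  Square (area [s] (* (:side s) (:side s))))

;; 4. Java interface: closed (implemented where the type is defined),
;;    but with support for primitive arguments and return values.
(definterface IArea
  (^double areaOf []))

(deftype Rect [^double w ^double h]
  IArea
  (areaOf [_] (* w h)))
```

Calling `(area-fn (->Square 2))`, `(area-multi (->Square 2))`, and `(area (->Square 2))` all return 4, while `(.areaOf (Rect. 2.0 3.0))` returns 6.0 through an ordinary Java method call.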

We'll start by surveying these ingredients and their pros and cons independently, and then discuss some "recipes" for combining them in ways I've found fruitful in various circumstances.

Operations

While data abstractions are arguably more fundamental, we'll start by reviewing the options for operations on data. This way, we'll have full context when we get to the data types. Here's a table describing major features of the four options mentioned above.

| Feature | defn | defmulti | defprotocol | definterface |
| --- | --- | --- | --- | --- |
| JVM representation | field lookup + virtual method call | hand-rolled hierarchical dispatch | instanceof check, then: if so, virtual method call; otherwise, polymorphic inline cached call | virtual method call |
| Dispatch on? | static (not polymorphic) | any function of arguments | class of first argument | class of first argument |
| Open (extensible to existing data types)? | N/A | yes | yes | no |
| Bundled (by data type)? | N/A | no | yes | yes |
| Efficient? | very good | okay | good | best possible |
| Primitives? | yes, but up to 4 args, only long/double | no | no | full support |
| Repl redefinition | great | meh | meh [2] | meh [2] |
| Documentation | docstrings [1] | docstrings | docstrings | static types |
| Ease of use | great | good | good | okay [2] |

Data types

Moving on to data types, we'll cover the features of the four main contenders above. We intentionally omit:

  • tuples, which in our experience are rarely the right choice, and certainly aren't well-suited for polymorphism
  • struct-maps, which are deprecated
  • proxy and gen-class, which exist primarily for Java interop.

| Feature | hash-map | defrecord | deftype | reify |
| --- | --- | --- | --- | --- |
| JVM representation | hash array mapped trie | named class with public fields + hash-map for extra keys | named class with public fields | anonymous class with private fields |
| Memory usage | up to ~10x contents | compact | compact | compact |
| Field access | map get | static (.field), fast (:key), slower but dynamic (get); direct access in protocol/interface implementations | static (.field); direct access in protocol/interface implementations | no |
| Lookup performance | hash lookups | field access (or slightly slower, but optimized keyword lookup) | field access | N/A (protocol/interface methods only) |
| Extensibility (can add arbitrary mappings) | a map | behaves like a map | no | N/A |
| Primitive support | no | yes (on base members) | yes | yes (as supported by interfaces) |
| Typed Object subclass members | no | no(t yet) | no(t yet) | N/A |
| Equality | value | value+type (just value for Java .equals/.hashCode) | identity (overridable) | identity (overridable) |
| Mutable fields | no | no | yes (private only) | no |
| Serialization | great | good (pr-str works, but json encoding loses types, etc.) | ok (custom) | no |
| Type for dispatch | not really (ad-hoc :type field) | yes (generates named class) | yes (generates named class) | sort of (generates anonymous class) |
| Works with protocols/interfaces? | no | yes | yes | yes |
| Repl redefinition | great | meh [2] | meh [2] | good |
| Documentation | meh [1] | ok [1] | ok | ok |
| Ease of use | great | good (supports map lookups, but no map-vals, etc) | ok (Java field access only, you bring the sugar) | good (no ceremony around names) |
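
A few of these rows are easy to check at the REPL. The sketch below exercises the equality and extensibility rows; `PointR` and `PointT` are hypothetical types introduced only for this check.

```clojure
(defrecord PointR [x y])
(deftype PointT [x y])

(def checks
  {;; Maps compare by value.
   :map-equality (= {:x 1 :y 2} {:x 1 :y 2})
   ;; Records compare by value and type...
   :record-equality (= (->PointR 1 2) (->PointR 1 2))
   ;; ...so a record is not = to a plain map with the same entries.
   :record-vs-map (= (->PointR 1 2) {:x 1 :y 2})
   ;; deftypes fall back to identity unless equality is overridden.
   :deftype-equality (= (PointT. 1 2) (PointT. 1 2))
   ;; Records accept arbitrary extra keys, like a map.
   :record-extra-key (:z (assoc (->PointR 1 2) :z 3))
   ;; Removing a declared field quietly degrades the record to a plain map.
   :dissoc-keeps-record? (record? (dissoc (->PointR 1 2) :x))})
```

Here `:map-equality` and `:record-equality` are true, `:record-vs-map`, `:deftype-equality`, and `:dissoc-keeps-record?` are false, and `:record-extra-key` is 3.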

Recipes

Now I understand how all the pieces work. But how should I represent my Widget?

This section aims to provide some rules and heuristics for selecting appropriate data and operation types. Please keep in mind that the answer is not always clear-cut, since it involves the interactions of the above features, which can be even more complex and nuanced than the individual differences in data types and operations.

No polymorphism

You have a single data type, and one or more operations you want to perform on it.

This is a relatively easy case, since you can mostly refer to the 'data types' table to figure out your best course of action. If you need primitive support (etc.), you should probably use a defrecord. If you (really) need mutable fields, you're stuck with deftypes. If efficiency is not a concern, maps are the simplest and easiest option.
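As an illustration of that trade-off, here is the same operation over a plain map and over a record with primitive fields. `Vec2`, `norm-map`, and `norm` are hypothetical names; treat this as a sketch rather than a benchmark.

```clojure
;; Simplest option: a plain map and a plain function.
(defn norm-map [{:keys [x y]}]
  (Math/sqrt (+ (* x x) (* y y))))

;; When primitives matter: a record with unboxed double fields, read via
;; static field access on a type-hinted argument.
(defrecord Vec2 [^double x ^double y])

(defn norm ^double [^Vec2 v]
  (Math/sqrt (+ (* (.x v) (.x v))
                (* (.y v) (.y v)))))
```

Both `(norm-map {:x 3.0 :y 4.0})` and `(norm (->Vec2 3.0 4.0))` return 5.0; the record version avoids boxing and hash lookups at the cost of declaring a class up front.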

Regardless, if your data is shared across many namespaces, or serialized and stored or shared across process boundaries, you should probably have a concrete schema to refer back to. deftypes and defrecords give you some of this, but don't capture field constraints beyond primitive values, so it's prudent to use a schema library [1] to precisely describe your data for documentation and safety, which can work just as well with plain maps as more structured types.

For operating on your data, plain old functions are the simplest and typically best option. We recommend that you use safe-get [3] to access fields, use docstrings and schemas to document your code, and organize your namespace into clear public and private sections.
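To show the idea behind safe-get without pulling in a dependency, here is a hand-rolled stand-in. The name `safe-get*` and the error message format are assumptions of this sketch, not Plumbing's implementation; the only behavior it aims to capture is failing loudly on a missing key instead of returning nil.

```clojure
(defn safe-get*
  "Like get, but throws if key k is absent from m.
  Hypothetical stand-in written for this example."
  [m k]
  (if-let [entry (find m k)]
    (val entry) ; the key is present, even if its value is nil
    (throw (ex-info (str "Key " k " not found")
                    {:key k :available (keys m)}))))

;; A private helper and a public function that relies on it.
(defn- display-name [person]
  (str (safe-get* person :first-name) " " (safe-get* person :last-name)))

(defn greeting
  "Returns a greeting for a person map with :first-name and :last-name."
  [person]
  (str "Hello, " (display-name person)))
```

With this in place, a misspelled key such as `:firstname` surfaces as an exception at the point of access rather than as a nil that travels through the program.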

With this discipline under your belt, the only real benefits of using records and interfaces or protocols are primitive support, and the appearance of lexical scope for your data members. The price you pay for these features is the extra ceremony around declaring interfaces and data classes, plus (in my opinion) slightly decreased ease of use.

If you do need polymorphism, however, ordinary functions are usually not a great choice. A single instance? check or case on :type isn't the end of the world, and sometimes is the simplest and cleanest solution -- but once these conditionals start appearing in multiple places, there's a good chance you're doing it wrong.

Extreme polymorphism, or polymorphism without data types

If you need polymorphism of an exotic form, where you're not just conditioning on the class of the first argument, then you need multimethods. In our experience, this is a pretty rare occurrence. Multimethods are so general that you can pick the data format best suited to your application on its own merits. If you want to use maps for your data representation, a simple :type field mapping to a keyword can be used for dispatch.
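For instance, a dispatch function can look at the :type of every argument, something neither protocols nor interfaces can express. The `collide` multimethod and its keywords below are hypothetical, a minimal sketch of the idea with plain maps as data.

```clojure
;; Dispatch on the :type of both arguments.
(defmulti collide
  (fn [a b] [(:type a) (:type b)]))

(defmethod collide [:ship :asteroid] [a b] :ship-destroyed)
(defmethod collide [:asteroid :ship] [a b] :ship-destroyed)
(defmethod collide [:asteroid :asteroid] [a b] :bounce)

;; Fallback for any pair without a dedicated method.
(defmethod collide :default [a b] :no-op)
```

`(collide {:type :ship} {:type :asteroid})` returns `:ship-destroyed`, and a pair with no registered method, such as two ships, falls through to `:no-op`.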

Similarly, if you want an extensible method without a corresponding concrete data type, multimethods give you a way to declare open dispatch without tying you down to a concrete data representation. For example, you can make a function that dispatches on its first argument value (not class), which anyone can extend.
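A sketch of that last case: a multimethod that dispatches on the value of its first argument, which any namespace can extend with a new `defmethod`. The `parse-config` name and the `:edn` and `:lines` formats are made up for this example.

```clojure
(require 'clojure.edn 'clojure.string)

;; Dispatch on the value of the first argument, not on its class.
(defmulti parse-config
  (fn [format text] format))

(defmethod parse-config :edn [_ text]
  (clojure.edn/read-string text))

;; Any other namespace could add a format later without touching the
;; code above.
(defmethod parse-config :lines [_ text]
  (clojure.string/split-lines text))
```

`(parse-config :edn "{:a 1}")` returns `{:a 1}` and `(parse-config :lines "a\nb")` returns `["a" "b"]`, with no concrete data type involved on either side.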

Beyond these cases, you should probably think hard before using a multimethod. This is especially true if you have multiple polymorphic methods, since you'll need to repeat your dispatch logic in each multimethod if you choose this option.

Maximum mungeability

We've almost reached the end of the road for plain old maps as well. But before we get there, it's probably worth mentioning one more way to achieve polymorphism: storing functions as fields, à la JavaScript (or many languages that came before it).

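;; `bar` stands for any two-argument function defined elsewhere.
(def my-obj
  {:foo (fn [this y] (bar (:baz this) y))
   :baz 12})

;; The caller looks up the "method" and passes the object back in as `this`.
((:foo my-obj) my-obj 12)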

This gives you flexibility to do crazy things that are difficult to achieve with more rigid interfaces and records; you can merge "objects", assoc new "methods", and bring all the other tools you usually use to manipulate data to bear on constructing your polymorphic objects.
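A small sketch of that flexibility, using `+` where the snippet above uses an unspecified `bar`. The `call` helper and the `:double-foo` "method" are hypothetical additions for illustration.

```clojure
(def base-obj
  {:baz 12
   :foo (fn [this y] (+ (:baz this) y))})

;; Hypothetical helper: look up a "method" and pass the object as `this`.
(defn call [obj method & args]
  (apply (get obj method) obj args))

;; Ordinary map operations build a new "object": override a field with
;; merge, then bolt on a new method with assoc.
(def extended-obj
  (-> base-obj
      (merge {:baz 100})
      (assoc :double-foo (fn [this y] (* 2 (call this :foo y))))))
```

`(call base-obj :foo 1)` returns 13, `(call extended-obj :foo 1)` returns 101, and `(call extended-obj :double-foo 1)` returns 202, since the new method sees the overridden `:baz` through `this`.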

That said, this method is rather clunky and hard to understand. (Where the hell is the :foo function of my-other-obj defined?) We've only had one or two cases where we felt that we needed this power, and they've all been replaced with Graph [3], which has a similar model but abstracts away some of the complexity of this approach when your object is really trying to represent a flexible computation process with many steps.

And now, we've hit the end of the road for plain old maps.

Maximum efficiency

If you need maximal memory efficiency and/or unboxed primitives, you must use defrecord, deftype, or reify. If you need mutable members, you must use deftype. If you need custom equality semantics or map semantics, you must use deftype or reify. If you want efficiency and (less efficient) extensibility, you probably want defrecord.

If you need complex logic that returns primitives, you need to use Java interfaces to work with these objects.
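A minimal sketch combining the two requirements that force your hand here, a mutable field and primitive arguments and return values. `ICounter`, `Counter`, and `count-up` are hypothetical names.

```clojure
;; The interface declares primitive long arguments and return values.
(definterface ICounter
  (^long increment [^long by])
  (^long current []))

;; A deftype with a single unboxed, mutable field. Mutable fields are
;; private, so they can only be read or set! inside method bodies.
(deftype Counter [^:unsynchronized-mutable ^long n]
  ICounter
  (increment [_ by]
    (set! n (+ n by))
    n)
  (current [_] n))

(defn count-up []
  (let [c (Counter. 0)]
    (.increment c 5)
    (.increment c 2)
    (.current c)))
```

`(count-up)` returns 7. Note that `:unsynchronized-mutable` makes no thread-safety promises; that is part of the price of admission.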

If you need none of these things, you don't want deftype or definterface.

Simple polymorphism

We're left with a common case, where you do need polymorphism but don't require extreme performance or complex dispatch.

We've already covered the cases where you want to use plain old functions, multimethods, and interfaces; and plain old maps and deftypes on the data type side. This leaves us with a single operation type, protocols (or definterface+ [2], if you're concerned about memory/perf implications of protocol dispatch), and two data types, defrecords and reify.

At this point, the decision is pretty simple, based on your use case.

reify is simpler -- you don't have to explicitly name your fields since Clojure automatically captures things referenced in lexical scope, and you don't need to define a separate constructor function if you want constructor-like logic. The price you pay for this simplicity is that the objects you create are opaque: if you want access to any 'fields' you will have to create protocol methods for this purpose.
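A sketch of the reify side of this trade-off, with a hypothetical `Greeter` protocol: the enclosing function plays the role of the constructor, and the "fields" are just closed-over locals.

```clojure
(defprotocol Greeter
  (greet [this other]))

(defn make-greeter [greeting]
  ;; Constructor-like logic lives in the function body...
  (let [prefix (str greeting ", ")]
    ;; ...and reify captures `prefix` from lexical scope; no field
    ;; declaration is needed.
    (reify Greeter
      (greet [_ other] (str prefix other "!")))))
```

`(greet (make-greeter "Hello") "world")` returns "Hello, world!". There is no way to ask the resulting object for its prefix, though, short of adding a protocol method for it.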

On the other hand, defrecord is transparent -- people can examine your object, pull out fields, reason about the Class of your data, and so on.
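By contrast, here is a record implementing a (hypothetical) `Describable` protocol. Everything about it can be inspected from the outside with ordinary tools.

```clojure
(defprotocol Describable
  (describe [this]))

(defrecord Document [title url]
  Describable
  (describe [_] (str title " <" url ">")))

(def example-doc (->Document "Clojure" "https://clojure.org"))

(def inspection
  {:description (describe example-doc)        ; behavior via the protocol
   :title       (:title example-doc)          ; fields via keyword lookup
   :fields      (keys example-doc)            ; field names are discoverable
   :class?      (instance? Document example-doc)
   :from-map?   (= example-doc
                   (map->Document {:title "Clojure"
                                   :url "https://clojure.org"}))})
```

Here `:description` is "Clojure <https://clojure.org>", `:fields` is `(:title :url)`, and both `:class?` and `:from-map?` are true.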

Thus, the choice of reify or defrecord primarily comes down to what you are trying to represent; if you're primarily concerned with data, you probably want defrecord, whereas if you only care about behavior then reify may be a simpler choice.

Abstract data members

We're basically done with our tour, but there's one issue we haven't touched on yet: abstract data members. What if you have multiple data types (posts and URL documents, employees and customers, etc.) and want a data-centric interface (all documents have titles, all people have names, etc.)? None of Clojure's data types allow for implementation inheritance, so if your employees and customers are separate records, you're out of luck for getting the static checking of Java-style field access (.first-name r).

In this case, there are three options at your disposal, none of which is really ideal:

  • Use an informal interface (a.k.a. docstring): "All people have :first-name and :last-name keys". This should probably be backed by schemas [1] and liberal use of safe-get [3] to ensure your data measures up.
  • Be oh-so-formal: declare a protocol full of 'getter' functions, and fill each of your records with methods like (first-name [this] first-name). This pain can sometimes be alleviated by defining a single record or reify that goes through this ceremony, and letting each of your 'objects' share this single constructor -- they can even pass in functions that the single implementation delegates to, if you need limited polymorphism.
  • If you're not concerned with polymorphism but just a hierarchy of data types, you may be able to flip things around so that all data types are represented with a single defrecord that has :type and :type-info fields to allow extensibility.
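
The three options above can be sketched as follows. All names (`Named`, `Employee`, `Customer`, `Person`, and the helper functions) are hypothetical, and the precondition in option 1 is only a lightweight stand-in for real schema checks.

```clojure
;; Option 1: informal interface over plain maps, checked at the point of use.
(defn full-name-informal
  "All people have :first-name and :last-name keys."
  [person]
  {:pre [(contains? person :first-name) (contains? person :last-name)]}
  (str (:first-name person) " " (:last-name person)))

;; Option 2: a protocol of getters, implemented by every record.
(defprotocol Named
  (first-name [this])
  (last-name [this]))

(defrecord Employee [first-name last-name salary]
  Named
  (first-name [_] first-name) ; inside the body, first-name is the field
  (last-name [_] last-name))

(defrecord Customer [first-name last-name orders]
  Named
  (first-name [_] first-name)
  (last-name [_] last-name))

(defn full-name [named]
  (str (first-name named) " " (last-name named)))

;; Option 3: a single record for the whole hierarchy, with :type naming the
;; variant and :type-info holding whatever is specific to it.
(defrecord Person [type first-name last-name type-info])

(def joe (->Person :employee "joe" "schmoe" {:salary 100}))
```

`(full-name (->Employee "joe" "schmoe" 100))` and `(full-name-informal joe)` both return "joe schmoe"; the difference lies in where the guarantee comes from: a protocol in one case, a convention in the other.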

  1. Schema is a library for declaring data shapes, and annotating functions with input and output schemas. Besides their other benefits, records have the documentation advantage of having a concrete description that is type-hintable; among other things, Schema brings these same benefits to ordinary Clojure maps.
  2. Potemkin provides some great tools for dealing with interfaces, protocols, records, and so on. In particular, it provides variants of defprotocol and defrecord that are more repl-friendly, and an implementation of definterface that's a drop-in replacement for defprotocol, allowing full primitive support with automatic wrapper functions (but without the open-ness of protocols, of course).
  3. Plumbing is a library of Clojure utility functions, including Graph, a tool for declarative description of functional processes.