Algebraic Datatypes (ADTs)

We’ve seen how lists are built with [] and (:), and deconstructing with patterns using [] and (:). Now we’ll see a general way for users to define new types that have one or more kinds of values.

Example

As an example, currency converter. For simplicity, only two kinds of currency.

data Currency
  = USD Float  -- US Dollars
  | JPY Float  -- Japanese Yen
      deriving (Eq, Show)

The data keyword defines a new algebraic data type (ADT), or simply datatype, in this case called Currency. Values of type Currency are either the data constructor called USD and a single Float or the data constructor JPY and a single Float. The smallest denomination of US currency, one cent, is one-hundredth of a dollar; the smallest denomination of Japanese currency is one yen. We choose to track JPY with Floats to provide finer-grained precision; these will be rounded to Integer values later.

The deriving (Eq, Show) clause instructs Haskell to automatically implement a function called show :: Currency -> String which is described by the Show type class, and an equality function (==) :: Currency -> Currency -> Bool which described by the Eq type class. We’ll talk more about these later. Try the interactions below with and without the deriving clause.

> USD 11.99
> JPY 1353
> USD 11.99 == USD 11.99
> USD 11.99 == JPY 1353

To extract data wrapped by data constructors, we use patterns involving the data constructors and variables. For example, the following function allows adding two values in either currency.

yenPerDollar  = 112.86         -- as of 9/27/17
dollarsPerYen = 1 / yenPerDollar

add                    :: Currency -> Currency -> Currency
add (USD d1) (USD d2)  =  USD (d1 + d2)
add (JPY y1) (JPY y2)  =  JPY (y1 + y2)
add (USD d)  (JPY y)   =  USD (d + y * dollarsPerYen)
add (JPY y)  (USD d)   =  JPY (y + d * yenPerDollar)

Note that any amount has two valid representations; the last two equations choose to retain the base currency of the first argument.

Data Constructors

In general, a datatype definition contains one or more data constructors C, each beginning with a capital letter (e.g. A, B, C, D below). The name of the new type T (e.g. T below) must also start with a capital letter.

data T
  = A
  | B Int
  | C Bool String
  | D (Char, String)
      deriving (Eq, Show)

Each data constructor wraps zero or more values:

A does not carry any additional data,
B carries a single Int,
C carries one Bool and one String, and
D carries one pair containing a Char and a String.

Data constructors are functions that, when applied to arguments of the specified types, produce values of the datatype (in this case, T).

> :t A
> :t B
> :t C
> :t D

Pattern Matching

Three kinds of patterns p can be used to match and destruct values of a datatype:

a variable x (starting with a lowercase letter);
the wildcard pattern _; and
the data constructor of a datatype applied to an appropriate number of (sub)patterns for the values it carries C p_1 … p_n

For example:

tToInt             :: T -> Int
tToInt A           =  0
tToInt (B i)       =  1
tToInt (C b s)     =  2
tToInt (D (c, s))  =  3

At run-time, a variable pattern x (of some type S) successfully matches any value v (of type S); v is then bound to x in the right-hand side of the equation.

The wildcard pattern _, like a variable, matches any value but does not introduce a name for it.

A data constructor pattern C p_1 … p_n (of some type S) matches only those values v (of type S) that are constructed with C and whose n values match the patterns p_1 through p_n, respectively. If C carries no data, then there are no subsequent patterns (i.e. n= 0).

Even though a datatype constructor may take multiple arguments (e.g. C in our example) and, thus, may be partially applied like other functions, all data constructors must be “fully applied” to enough patterns according to its definition.

Case Expressions

The definition of tToInt above defines multiple equations — one for each data constructor — to handle all possible argument values of type T. Another way to destruct values of type T is to use a case expression.

tToInt :: T -> Int
tToInt t =
  case t of
    A        -> 0
    B i      -> 1
    C b s    -> 2
    D (c, s) -> 3

Each branch of case expression must return the same type of value (in this case, Int). The equations in the original version of tToInt above are syntactic sugar for this version using case.

Polymorphic ADTs

A data definition can refer to refer to one or more (lowercase) type variables (a, b, etc. below).

data T a b c ... = ...

The type variables can be referred to in the types of the data constructors.

Handling Errors

As an example to motivate polymorphic ADTs, recall our head and tail functions from before.

head :: [a] -> a
head (x:xs) = x

tail :: [a] -> [a]
tail (x:xs) = xs

By omitting equations for the empty list, these functions implicitly fail: they crash with inexhaustive pattern match errors at run-time.

Versions with more explicit failure:

head :: [a] -> a
head [] = undefined
head (x:xs) = x

tail :: [a] -> [a]
tail [] = undefined
tail (x:xs) = xs

> :t undefined

Whoa! undefined has every type? This is only okay because undefined throws an exception at run-time, halting program execution at that point.

This isn’t much better than before. Using undefined is, however, a cool trick in the type system that allows exceptions to be raised anywhere in the code, while still accurately typechecking other expressions. Using undefined is often helpful during program development, allowing you to put “placeholders” for expressions that you haven’t finished yet while still allowing the rest of the program to type check and run.

Now, a better way than crashing to represent failure.

data Maybe a
  = Nothing
  | Just a
      deriving (Eq, Show)

Notice that Maybe is polymorphic; it takes one type variable a. A value of type Maybe a is either Nothing with no additional data, or Just with a single value of type a.

> :t Just True
> :t Just 1
> :t Just (+)

> :t Nothing
> :t Nothing :: Maybe Bool
> :t Nothing :: Maybe (Maybe (Maybe Bool))

We can use Maybe values to deal with “errors” much more effectively.

maybeHead :: [a] -> Maybe a
maybeHead []     = Nothing
maybeHead (x:xs) = Just x

maybeTail :: [a] -> Maybe [a]
maybeTail []     = Nothing
maybeTail (x:xs) = Just xs

This allows (requires) callers of these functions to handle each of the two cases explicitly.

The Maybe type is extremely useful and is defined in the standard Prelude. Its cousin Either is also quite useful.

Recursive ADTs

Datatypes can also be recursive.

data List a
  = Nil
  | Cons a (List a)
      deriving (Eq, Show)

What does this type remind you of?

Built-In ADTs

Lists

The built-in lists are just like the List type above, but with special syntax:

[] and (:), rather than names starting with capital latters, are data constructors,
the expression [1,2,3,4] is syntactic sugar for nested calls to (:),
the type is called [], rather than a name starting with a capital letter, and
the type [a] is syntactic sugar for [] a.

For example:

> :t []
> :t [] :: [a]
> :t [] :: [] a

> :t "abc"
> :t "abc" :: [Char]
> :t "abc" :: [] Char

> let (x:xs) = [1, 2, 3]
> let ((:) x xs) = [1, 2, 3]

Tuples

data Pair a b = Pair a b deriving (Eq, Show)
data Triple a b c = Triple a b c deriving (Eq, Show)
data Tuple4 a b c d = Tuple4 a b c d deriving (Eq, Show)

Built-in pairs are just like these datatypes, again with special syntactic support.

> :t ('a', 2 :: Int)
> :t (,) 'a' 2 :: (,) Char Int
> :t (,)
> :t (,,)
> :t (,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,)

> let (a, b, c) = (1, 2, 3)
> let ((,,) a b c) = (1, 2, 3)

Unit

Sometimes it is useful to have a dummy value.

data Unit = Unit deriving (Eq, Show)

The built-in unit value and type are just like this datatype, again with special syntactic support.

> :t ()
> :t () :: ()

Booleans

Even booleans can be (and are) defined as a datatype.

data Bool
  = False
  | True
      deriving (Eq, Show)

If-expressions, multi-way if-expressions, and guards are syntactic sugar for pattern matching on Bools.

absoluteValue n
  | n >= 0    = n
  | otherwise = -n

-- source file                {-# LANGUAGE MultiWayIf #-}
-- ghc command-line option    -XMultiWayIf
-- ghci                       :set -XMultiWayIf
absoluteValue n =
  if | n >= 0    -> n
     | otherwise -> -n

absoluteValue n =
  if n >= 0 then n
  else if otherwise then -n
  else undefined

absoluteValue n =
  case n >= 0 of
    True  -> n
    False ->
      case otherwise of
        True  -> -n
        False -> undefined

Type Aliases vs. Wrapper Types

Recall the currency example above. We were careful to use ds for dollar Floats and ys for yen Floats, but it’s easy to accidentally, for example, add these two kinds of Floats, leading to a non-sensical, and probably hard-to-detect-and-debug, result.

One approach might be to define type aliases.

type Dollars = Float
type Yen = Float

But remember, type aliases do not create new types, just synonyms. So, Dollars and Yen, which are synonymous with Float and with each other, can still be misused.

Another option is to define wrapper (or “box”) types with single data constructors, for the purpose of forcing the programmer to construct and destruct the different Float types, with the usual support from the typechecker to prevent errors.

data Currency
  = USD Dollars -- US Dollars
  | JPY Yen     -- Japanese Yen
      deriving (Eq, Show)

data Dollars = DollarsFloat Float deriving (Eq, Show)
data Yen     = YenFloat Float deriving (Eq, Show)

Note, it is common to see definitions like the following:

data Dollars = Dollars Float deriving (Eq, Show)
data Yen     = Yen Float deriving (Eq, Show)

Notice how the type and data constructor names are the same; this is not a problem, because types and expressions “live” in different places in the language syntax, so there’s never any ambiguity.

Now, we redefine add as follows, in terms of several helper functions that significantly reduce the chances that we mistreat the actual Float values.

add                    :: Currency -> Currency -> Currency
add (USD d1) (USD d2)  =  USD (addDollars d1 d2)
add (JPY y1) (JPY y2)  =  JPY (addYen y1 y2)
add (USD d)  (JPY y)   =  USD (addDollars d (convertYenToDollars y))
add (JPY y)  (USD d)   =  JPY (addYen y (convertDollarsToYen d))

addDollars :: Dollars -> Dollars -> Dollars
addDollars (Dollars d1) (Dollars d2) = Dollars (d1 + d2)

addYen :: Yen -> Yen -> Yen
addYen (Yen y1) (Yen y2) = Yen (y1 + y2)

convertYenToDollars :: Yen -> Dollars
convertYenToDollars (Yen y) = Dollars (y * dollarsPerYen)

convertDollarsToYen :: Dollars -> Yen
convertDollarsToYen (Dollars d) = Yen (d * yenPerDollar)

Record Types

When data constructors carry multiple values, it can be hard to remember which components are meant to represent what (especially when multiple components have the same types). Furthermore, it can be tedious to write accessor functions that retrieve particular components carried by a data constructor.

Haskell offers an embellishment of data constructors that addresses these two issues. Consider the following definition where the data constructors A and B use record syntax to give names to its data values.

data T'
  = A' 
  | B' Int
  | C' { foo :: Bool, baz :: String }
  | D' { bar :: Char, baz :: String }
      deriving (Show)

All data constructors can be used to construct values as before.

> :t A'
> :t B'
> :t C'
> :t D'

But now, record construction can also be used to create A and B values.

> C' { foo = True, baz = "hello" }
> D' { bar = 'a', baz = "hello" }
> D' { baz = "hello", bar = 'a' }

In addition, accessor functions have been automatically generated.

> :t foo
> :t bar
> :t baz

Furthermore, there is a way to create a new record by copying all of its fields except for some.

> let c' = C' { foo = True, baz = "hello" }
> c' { baz = "goodbye" }

When multiple data constructors, update-copy expressions may crash:

> c' { bar = 'b' }

Should the following be an acceptable data definition?

data Data = One { data :: Int } | Two { data :: Bool }

Language Extensions

NamedFieldPuns and RecordWildCards can be handy when programming with record syntax. And punny code can spark joy.

`newtype`

We will start seeing the keyword newtype popping up in code examples later on. This is a detail we don’t need to worry about for now.

Data constructors tag, or label, the values they carry in order to distinguish them from values created with different data constructors for the same type. As we have seen, it is sometimes useful to define a new datatype even with only one data constructor.

In such cases, tagging and untagging (or constructing and destructing, or boxing and unboxing) values is useful for enforcing invariants while programming, but these operations add unnecessary run-time overhead: there is only one kind of value, so they ought to be free of labels at run-time.

Haskell allows datatypes with exactly one constructor (such as Dollars and Yen), and which carries exactly one value, to be defined with the keyword newtype in place of data, such as

newtype Box a = Box a

For the purposes of programming, newtype is almost exactly the same as data. But it tells the compiler to optimize the generated code by not including explicit Box labels at run-time. We will get into the habit of using newtype whenever we define a datatype with one constructor, without delving into the subtle differences between using newtype and data.

Furthermore, we will often write expressions of the form

Box . doSomething . unbox

to unwrap (unbox), transform (doSomething), and rewrap (Box) values. So, we will use records so that unwrapping functions are generated automatically.

newtype Box a = Box { unbox :: a }

Source Files

Datatypes.hs