What makes implementing efficient data structures harder in a purely functional setting compared to an imperative one?
The lack of imperative, or destructive, updates leads to copying possibly large parts of the data structure, and all versions of a data structure are persistent and can be used even after updates.
Any practical compiler for a purely functional language aggressively looks for opportunities to share the representation of data that is common to multiple data structures. Nevertheless, space overhead is unavoidable when compared to similar data structures in imperative settings. However, as we will see later in the course, with clever design and a dash of lazy evaluation, many efficient data structures can be designed for purely functional settings. But first, some basics...
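As a first taste of sharing, consider consing onto a list. In this minimal sketch (the values are made up for illustration), only one new cell is allocated:

xs = [2, 3]

ys = 1 :: xs
-- ys is [1,2,3]: one new cell is allocated (for 1), while the cells for
-- 2 and 3 are shared with xs, and the old version xs remains fully usable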
The following function updates the element in xs at index i, if one exists, with the value y:
update : List a -> Int -> a -> List a
update xs i y =
  case (xs, i) of
    ([], _) ->
      []
    (x::rest, 0) ->
      y :: rest
    (x::rest, _) ->
      x :: update rest (i-1) y
Like the find function, which walks through a list one element at a time, update traverses its input, using i to track how many more elements to pass before reaching the target index. The worst-case running time of update is O(n), where n is the size of xs. (Why isn't the worst-case running time infinite when i is negative?)
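To make these behaviors concrete, here is a small sketch with example values of our own (not taken from the course file):

examples : List (List Int)
examples =
  [ update [1, 2, 3] 1 9     -- [1,9,3]
  , update [1, 2, 3] 5 9     -- [1,2,3] : index past the end, nothing updated
  , update [1, 2, 3] (-1) 9  -- [1,2,3] : i never reaches 0, but xs shrinks
  ]                          --          to [] after n recursive calls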
In addition to running times, we also need to pay attention to the space overheads incurred. That is, how much additional data needs to be represented during and after a call to the function.
In the second case, notice that the output list y :: rest refers to rest, which is identical to the tail of the input list. This is an opportunity for the language implementation, which will of course use pointers and pointer manipulation, to share the underlying representation of rest among the representations of the input and output lists.
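For instance (a sketch with made-up values), updating the head allocates just one new cell:

updated = update [1, 2, 3] 0 9
-- updated is [9,2,3]: one new cell holds 9, while the tail [2,3] can be
-- represented by the very same cells in both the input and the output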
In the third case, on the other hand, there is no such opportunity for sharing; the resulting list is built from x and some unknown list that results from the recursive call to update. As a result, a new memory cell must be allocated for x in the output list and cons-ed onto the list that results from the recursive call. This case embodies the worst case with respect to space (in addition to time). If update needs to update the last element in the list, then the first n-1 elements of xs are copied and cons-ed onto the singleton list y :: []. Thus, update allocates O(n) space in the worst case. (Why don't the sizes of the input elements play a role in this space bound?)
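One way to make the allocation count concrete is an instrumented variant that returns the number of freshly allocated cons cells alongside the result. The following helper is our own sketch, not part of Persistence.elm:

updateCount : List a -> Int -> a -> ( List a, Int )
updateCount xs i y =
  case (xs, i) of
    ([], _) ->
      ([], 0)
    (x::rest, 0) ->
      (y :: rest, 1)
    (x::rest, _) ->
      let
        (newRest, cells) = updateCount rest (i-1) y
      in
      (x :: newRest, cells + 1)

-- updateCount [1,2,3] 2 9 == ([1,2,9], 3): updating the last element
-- allocates one new cell per element, matching the O(n) space bound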
Another list operation to consider is append:
append : List a -> List a -> List a
append xs ys =
  case xs of
    [] ->
      ys
    x::rest ->
      x :: append rest ys
Let n be the size of xs and m be the size of ys. Notice that the elements of ys are not traversed at all (so its number of elements cannot contribute to execution time) and that ys is referred to by name in the first case, where the language implementation can share its representation in the input and output lists. The second case allocates one list cell for every non-empty list cell in xs. Thus, the worst-case time and space cost of append is O(n).
In practice, it is common to use a single variable n to describe the combined size of the lists, or the size of the larger list. In either case, the worst-case time and space costs for append are O(n).
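As a quick sketch (example values ours), only the first list is copied:

appended = append [1, 2] [3, 4]
-- appended is [1,2,3,4]: two new cells are allocated for 1 and 2, while
-- the suffix [3,4] can be the very same representation as the second list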
Recall the findBST function that searches for an element in a binary-search-ordered tree, making a recursive call only in the subtree where the element can possibly appear. The insertion algorithm that preserves the binary search property is similar:
insert : comparable -> Tree comparable -> Tree comparable
insert x t =
  case t of
    Empty ->
      Node x Empty Empty
    Node y left right ->
      if x == y then
        t
      else if x < y then
        Node y (insert x left) right
      else {- x > y -}
        Node y left (insert x right)
As with findBST, insert runs in O(h) time, where h is the height of the tree. The height h of a tree t is equal to its size n in the worst case, where every Node has at most one non-Empty child (that is, t is a crooked list). If t is (roughly) balanced, then h is O(log(n)). (Note: Even if the input tree is balanced, the tree produced by insert is not necessarily balanced. Insertion will have to be more clever to preserve balance.)
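For example, assuming the course's Tree type (type Tree a = Empty | Node a (Tree a) (Tree a)), inserting keys in sorted order produces a crooked, height-n tree:

crooked = List.foldl insert Empty [1, 2, 3]
-- == Node 1 Empty (Node 2 Empty (Node 3 Empty Empty))
-- every node has only one non-Empty child, so h = n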
Regarding space, the insert function copies into the output tree all of the node values y that are observed until an Empty is reached. But the right subtree in the second case and the left subtree in the third case are common to both the input and output trees, so the implementation has the opportunity to share those representations. Because at most h nodes are traversed, insert allocates O(h) space in the worst case.
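Here is a small sketch of this sharing, with example trees of our own:

t0 = Node 2 (Node 1 Empty Empty) (Node 3 Empty Empty)

t1 = insert 4 t0
-- t1 == Node 2 (Node 1 Empty Empty) (Node 3 Empty (Node 4 Empty Empty));
-- t0 is unchanged, the subtree Node 1 Empty Empty is shared between t0
-- and t1, and only the nodes 2 and 3 along the search path are copied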
Persistence.elm contains the List and Tree examples above.