Thursday, June 16, 2011

A Generic Quicksort in Scala

So, I decided to create a quicksort algorithm in Scala that showcases how to write 'generic' collection methods. That is, how can we write an external method that works across many types of collections *and* preserves the final type.

Well, here's how you do it:

import scala.collection.SeqLike
import scala.collection.generic.CanBuildFrom
import scala.math.Ordering

object QuickSort {
  def sort[T, Coll](a: Coll)(implicit ev0 : Coll <:< SeqLike[T, Coll],
                             cbf : CanBuildFrom[Coll, T, Coll],
                             n : Ordering[T]) : Coll = {
    import n._
    if (a.length < 2)
      a
    else {
      // We pick the first value for the pivot.
      val pivot = a.head
      val (lower : Coll, tmp : Coll) = a.partition(_ < pivot)
      val (upper : Coll, same : Coll) = tmp.partition(_ > pivot)
      val b = cbf()
      b.sizeHint(a.length)
      b ++= sort[T,Coll](lower)
      b ++= same
      b ++= sort[T,Coll](upper)
      b.result
    }
  }
}

I've chosen a somewhat imperative approach to the problem. The quick sort algorithm is split into two parts: The first checks for small collections and returns them, the second picks a pivot and decomposes the collection into three pieces. Theses pieces are sorted (if necessary) and pushed into a builder, cleverly named "b". This "b" is given a hint to expect the entire collection to eventually wind up in the built collection (hopefully this helps performance). Finally, after passing the three partitions to the builder, the result is returned.

The magic here is in the rather confusing type signature:
def sort[T, Coll](a: Coll)(implicit ev0 : Coll <:< SeqLike[T, Coll],
                             cbf : CanBuildFrom[Coll, T, Coll],
                             n : Ordering[T]) : Coll

Let's decompose this a bit. T is the type parameter representing elements of the collection. T is required to have an Ordering in this method (the implicit n : Ordering[T] parameter in the second parameter list). The ordering members are imported on the first line of the method. This allows the < and > operations to be 'pimped' onto the type T for convenience.

The second Type parameters is Coll. This is the concrete Collection type. Notice that *no type bounds are defined*. It's a common habit for folks new to Scala to define generic collection parameters as follows: Col[T] <: Seq[T]. Don't. This type does not quite mean what you want. Instead of allowing any subtype of sequence, it only allows subtypes of sequence that *also* have type parameters (which of course, is most collections). Where you can run into issues is if your collection has (a) no type parameters or (b) more than one type parameter. For example:
object Foo extends Seq[Int] {...}
trait DatabaseResultSetWalker[T, DbType] extends Seq[T] {...}

Both of these will fail type checkking when trying to pass them into a method taking Col[T] >: Seq[T].

To get the compiler to infer the type parameter on the lower bound, we have to defer the type inferencer long enough for it to figure this out. To do that, we don't enforce the type constrait until implicit lookup using the >:> class.

The type parameter:
Coll <:< SeqLike[T, Coll] 
Ensures that the type Coll is a valid Seq[T]. You may be asking why this signature uses SeqLike rather than Seq.

GOOD QUESTION!

SeqLike differs from Seq in that it *retains the most specific type of the sequence*. This one of the magic tricks behind Scala's collections always returning the most specific type known. That type is embedded in SeqLike. To ensure that we can return the most sepcific type, we can Capture Coll as a SeqLike with Coll as the specific type. This means that filters, maps, flatMaps, partitions should all try to preserve the type Coll.

The last implicit parameter is the cbf CanBuildFrom. Because we don't know how to construct instances of type Coll (because we don't know the type Coll at all), we need to implicitly receive evidence for how to construct a new Coll with sorted data.

Let's look at the result:
scala> QuickSort.sort(Vector(56,1,1,8,9,10,4,5,6,7,8))
res0: scala.collection.immutable.Vector[Int] = Vector(1, 1, 4, 5, 6, 7, 8, 8, 9, 10, 56)

scala> QuickSort.sort(collection.mutable.ArrayBuffer(56,1,1,8,9,10,4,5,6,7,8))
res1: scala.collection.mutable.ArrayBuffer[Int] = ArrayBuffer(1, 1, 4, 5, 6, 7, 8, 8, 9, 10, 56)

scala> QuickSort.sort(List(56,1,1,8,9,10,4,5,6,7,8))
res18: List[Int] = List(1, 1, 4, 5, 6, 7, 8, 8, 9, 10, 56)

scala> QuickSort.sort(Seq(56,1,1,8,9,10,4,5,6,7,8))
res: Seq[Int] = List(1, 1, 4, 5, 6, 7, 8, 8, 9, 10, 56)

You may be asking why I've chosen Seq instead of GenSeq or GenTraversable or even GenIterable. No particular reason, besides I wanted a reasonable assurance that the collection author expects the .length and indexed access methods to be called.

So, what are the lessons to be learned here?

(1) Use *Like subclasses to preserve the specific collection type
(2) Defer inference using <:< to give the type checker a hope of succeeding
(3) Provide @usecase comments for scaladoc so users won't get distracted by typesafe details.

At lest, IMHO, this is the current way of creating generic collection code.

16 comments:

Daniel said...

You are not picking the first value as the pivot, contrary to what is said in the comment.

And it would be a bit better if you did -- the performance will be completely destroyed on collections with O(n) size. Which are likely to have O(i) apply as well.

Of course, that nitpicking, but I think it is a relevant nitpick when talking about generic methods.

J. Suereth said...

@Daniel -

The real point of this article was one the type signature. The method itself (which I forgot to mention in the article) was stolen from here

The unfortunate bit of copying things from the book Scala In Depth is that certain points get dropped, and my copy-paste error shows up.

In the book, I describe using implicits for poly-morphic dispatch. That is, we can have a sort algorithm optimised for IndexedLinearSeq (where indexing is O(1) or at least 'fast'). And another version for slower collection types. In fact, we can even detect the SortedSeq trait and just *not* sort in that instance if the comparison functions are the same.

Again, the focus of this blog post is the method signature based on the traffic I've seen recently on scala-users and stack overflow. If you'd like to contribute a better general quicksort algorithm please do! I'll revert this back to using head since I don't really reference the Scala examples anyway.

Anonymous said...

>>> I've chosen a somewhat imperative approach to the problem.

Why am I reading this?

J. Suereth said...

@Anonymous - Because this is a blog about the type signatures.

Feel free to provide the functional version. IIRC, it works well with Vector.

Pretty soon Scala In Depth Chapter 8 source will be available. In there, you will see all the different sorting implementations defined. Think of the imperative approach as a by-hand compiler optimisation :)

(The method signature is still has referential transparency).

kbloom said...

One small suggestion: You should probably have 3 parameter lists on this function, putting the Ordering by itself, in the second parameter list, and leaving the <:< and the CanBuildFrom for the third. That way a caller can specify a custom ordering without having to specify the <:< or the CanBuildFrom.

J. Suereth said...

@kbloom Not a bad idea, but you can only have one implicit parameter list per method.

Two options:

scala> val x = new Ordering[Int] {
| def compare(x : Int, y : Int) : Int = if (x > y) 1 else if (y < x) -1 else 0
| }

scala> QuickSort.sort[Int, Seq[Int]](Seq(56,1,1,8,9,10,4,5,6,7,8))(n=x, ev0=implicitly,cbf=implicitly)
res28: Seq[Int] = List(56, 10, 9, 8, 8, 7, 6, 5, 4, 1, 1)


OR You turn sort into a pimped function so you can capture the types *first*.

implicit def addSort[T, Coll](col : Coll)(implicit ev0 : Coll <:< SeqLike[T, Coll]) = new {
def sort()(implicit cbf : CanBuildFrom[Coll,T,Coll], ordering : Ordering[T]) = ...
}


Which isn't quite as nice IMHO. In reality, creating local implicit values to override ordering in a limited scope is not such a bad thing, and my preferred way to change the 'context' of a method.
}

Anonymous said...

Great article! Thank you.

"In the book, I describe using implicits for poly-morphic dispatch."

Is this a simulation of multi-methdods than?

J. Suereth said...

@anonymous -

Yep, the type-class/implicit solution is sort of an encoding of multi-methods in scala.

The resolution of behavior mostly happens in the compiler. It sends a new argument to the function that polymorphically stores the appropriate behavior for the known types of the arguments. There are no runtime-type based checks going on, but otherwise it's the same concept.

marc said...

Could you kindly comment of the reasoning behind the import n._ line?

J. Suereth said...

@marc the import n._ imports, locally, the 'pimp' implicits to add <, > etc. to the type T. If we didn't import this, we'd have to make calls like:

n.gt(left,right) instead of left > right.

See the OrderingOps class in the type trait. It's essentially the boilerplate we use in Scala to encode type-classes. I hope one day there will be a slightly simpler syntax.

marc said...

Thanks! I am still getting my head around implicits and thought that that would be automatically used.

marc said...

Josh,

I thought I would post this here as opposed to the Scala mailing list, but I am happy to move the conversation there.

I am trying to generalize this approach such that the function that we are applying to the sequence can be passed to the function.You can imagine this being useful if you first restructure the data into a nice format and then do a bunch of analytics on it. I know I could write a class and go about it that
way, but I thought this would be more educational.

Basically, I am working on method of the form

processData(data, functionToBeApplied).

Using your notation, I have defined the function as follows (choosing Int just for convenience)

def processData[T,Coll](data:Coll,
function: Coll=>Int)(
implicit ev0: Coll <:< SeqLike[T,Coll], cbf: CanBuildFrom[Coll,T,Coll]) =
{
function(data)
}

However, when we use this,
say calling processData(List(1,2,3),x=>x.length)
we get the dreaded
missing type parameter on x=>x.length.

The trick I have used in the past for this, is to curry
the function; however, this does not play nice with the implicit functionality.

One approach I did get to work was to define processData as follows:

def processData[T,Coll](data:Coll)( implicit ev0: Coll <:< SeqLike[T,Coll], cbf: CanBuildFrom[Coll,T,Coll]) = {
(function:Coll => Int) => function(data)
}

then
val b = processData(List(1,2,3))
b(x=>x.length)

but this feels clunky, as we cannot put everything
into one line as processData(List(1,2,3)(x=>x.length)

due to the implicit not coming last.

I know you are busy with your awesome book (I bought the pdf already and am going through it
very thoroughly), but any help would be greatly appreciated.

marc said...

And of course as soon as I posted this, I solved the issue!!!!

def processData[T,Coll](data:Coll)(function:Coll=>Int)( implicit ev0: Coll <:< SeqLike[T,Coll], cbf: CanBuildFrom[Coll,T,Coll]) = {
function(data)
}

marc said...

The only issue I have now is that if I do want to curry this function,

val b = processData(Vector(1,2,3))_

I have to provide type data otherwise we get an
error: Cannot prove that scala.collection.immutable.Vector[Int] <:< scala.collection.SeqLike[Nothing,scala.collection.immutable.Vector[Int]].

I could just provide this type data via

val b = processData[Int,Vector[Int], ???](Vector(1,2,3)

But I do not know what we should put into the ???, as it will depend on the return type after function application. Is there a way around this?

I have to say, this is fun!

Yang said...

Josh, why's it important to bound by `SeqLike` instead of `Seq`?

J. Suereth said...

@Yang - Because `SeqLike` is the *true* parent for sequences. While it is easier to extend from `Seq` to create a sequence, it's not necessary.