Tuesday, November 29, 2011

Macro vs. Micro Optimisation

So there's recently been a bit of hype about another Colebourne article: http://blog.joda.org/2011/11/real-life-scala-feedback-from-yammer.html

I'd like to respond to a few points he makes.

First - You should evaluate Scala and pay attention to its benefits and flaws before adopting it.  Yes, there are flaws to Scala.   Working at typesafe makes you more aware of some of them.  We're actively working to reduce/minimize/get rid of these.   In my opinion, the negatives of using Scala are peanuts compared to the postives of choosing Scala over Java.  I think everyone should make up their own mind about this.   Not everyone is going to choose Scala.  I feel bad for those who don't, but I make no effort to convince you further than showing you the 40+ Open source Scala projects I have on github.  It's a language with a lot to like and a bit to dislike.

Now, to the meat of what I want to say.   Don't get lost in micro optimization when discussing programming.

The blog article discusses writing high-performance code in the critical path that has crazy performance needs.   This is not your every day development.   Scala loses a lot of benefits in this world, because features like closures have overhead on the JVM.  Hopefully when Java adopts closures, this overhead can be alleviated, but it is there right now.   The set of rules from the email is known to a lot of us Scala devs when writing a performance intensive section of code.

I'll reiterate a few to agree with:

(1) Avoid closure overhead.   If you can't get the JVM to optimize away and need to allocate a closure constantly in a tight loop, this can slow down a critical section.  This is correlated with
(1a) Don't use a for loop.  For expressions (and loops) in Scala are implemented as closures.   There are some efforts underway to *inline* these closures as an optimization.  This performance hit isn't a permanent one.   As the compiler matures, you'll see a lot of optimization work happen.  

(2) Use private[this] to avoid an additional method.   Scala generated methods and fields for val/var members.   Using private[this] informs the compiler that it can optimize away the method.   Again, while hotspot can optimize this away if you're in a very critical section, it may be a good idea to optimize.   In fact, the whole promotion to fields and methods aspect of Scala classes deserves attention from anyone writing performance critical code.   The rules are also there in Java, it's just that you probably have people that already know them.

(3) Avoid invokevirtual calls.  (Coda lists this as avoid Scala's collections).  The true issue here is that invoke-virtual can make a difference in performance critical sections of code.  Again, this is one I think we can improve with a few final declarations and maybe an annotation or two.


Here's the big *missing* point in all that feedback.   This is for performance critical sections of code, not for general purpose applications.   I think when it comes to performance bottlenecks, you need to pull out the stops and optimize the heck out of your apps.   Look at the Akka framework from typesafe. Akka uses a lot of "dirty" scala code to achieve high non-blocking concurrent performance.  It's one of the most amazing libraries, and the code is pretty low level.   It also supports a very high level of abstraction when writing your application.   It uses PartialFunctions (which are sort of a combination of pattern matching and closures) and traits, both of which have some overhead.   However the resulting *application* is fast.  Why?   The inner loops are fast and optimized and the application *architecture* can be optimised.  In a high level language, you can take advantages of designs that you would *never* execute in a lower-level language because the code would be unmaintainable.

You see, most of us like to be able to read and understand what our code does.   Scala has some features where you can write very expressive code in a few lines.   After getting over the initial hump, this code is pretty easy to maintain.   If I were to write that same code in Java, it would look odd, confusing as hell, and no one would want to maintain it.  

I learned this lesson pretty hard core at Google.

Google has a pretty stringent C++/Java style guide.  One that is tuned for high performance servers.   The C++ style guide is public.  The style guide frowns on things like the use of 'smart' pointers that 'garbage collect' because you can't be sure that GC will happen at a critical moment in the code.

Google also has a high performance Map-Reduce library.  This thing is pretty amazing, with all sorts of crazy cool features like joining data on the *map* part of the map-reduce.  The library followed all the google coding conventions and is generally held up as a piece of awesome software, which its.

However, writing applications with the Map-Reduce library was less than idea.   I could write pretty clean code with the MR library.  My Mappers were pretty light weight and minimalistic.  Same with my reducers.   You'd then string together a series of map-reduces in this crazy guitar inspired "patch it together" configuration and the thing would be off to the races.  The downside is that the throughput of this kind of processes was *sub optimal*.

It wasn't anything inherent in the libraries.  The libraries were fast and good.  The APIs were optimised for high speed performance.  However, writing optimal *architectures* in the framework was tough.   If I wanted to do any crazy performance features, like combining map functions and reduce functions in a single map reduce run, the code got ugly *fast*.  Not only that, it was very difficult to maintain because of all the odd bookkeeping.  I have 10 outputs ordered by key, this is the one writing to that file right?  KRAP.

I spent 6 months writing applications of this nature and feeling like there had to be something better.   I started on a venture to write something, and as usual found that someone else already had.  That's when I found out about Flume and met Craig Chambers.

FlumeJava was a map reduce library that was gaining a lot of traction at Google, but I had heard a lot of complaints about its API.  Around the time I was looking at it, Flume C++ was coming into existence.  My team was one of the alpha users of the C++ variant.

The C++ API was a thing of beauty.   Like its Java cousin, it treats data a set of Parallel Distributed Collections.   You can see Daniel and my talk on the Scala equivalent here.  Converting Mappers and Reducers into this API was pretty simple.  You could even do it directly by just annotating your Mappers and Reducers with types.

My team saw a 30-40% reduction in code size for some of our map-reduce pipelines.  We had an 80% reduction in code for unit tests (which was by far the most amazing benefit).   Not only that, the Flume library optimizes the pipeline by performing all sorts of dirty map-reduce tricks for you.  In some cases we dropped a map-reduce call or two.  In the worst case, we had the same number we had started with.

What was wrong with the library?  It violated *almost every* style rule for C++ at google.  That's right, smart pointers, classes with inline member definitions the works.   I loved every second of it.  Why?  Because I was getting stuff done *faster* than before with *less* code and the pipelines ran *faster*.   It was a crazy win.

The startup time might not have been as optimal as it could be.  Flume ran an on-the-fly optimization before running your pipeline.  That was being improved all the time with neat tricks.   Things I didn't have to write to watch my app speed up, both runtime and startup time.

The key here is that the designers of Flume weren't focused on micro optimization but *macro* optimization.  The inner guts of the library used the very fast and efficient Map-Reduce library and the wrappers they had were as efficient as they could make them.   My code did not have to follow these rules, because the core loops were fast.  When it came down to it, my code used high level concepts.   Something akin to a closure and for-expressions in Scala (Note: The Scala equivalent *did* use for-expressions and closures with no noticeable performance hit).

There are times when writing Scala code requires care and optimization in the micro level.  However, don't lose the forest for the trees.  Think about the entire application architecture.  Scala will open up possibilities for writing code that you'd never dream of trying to maintain in C++ or Java.  Take Akka as a shining example.

And when you need the performance, listen to those techniques.   Viktor Klang can probably give you a *ton*more.  I know I learn more every time I talk with him.

12 comments:

Ismael Juma said...

Hi Josh.

I agree with a lot of what you say and would like to add one point.

The problem with mutable.HashMap (mentioned by Coda) is not invokevirtual. After all, the Java collections rely on this a lot (and invokeinterface too). There have been several discussions about improving the performance of that particular class in the mailing list in the past and I posted an implementation that performed much better a few years ago.

I have been using that implementation at work for a long time actually. I was planning to try and contribute to this area once again once the Git move is completed.

Best,
Ismael

Brian Clapper said...

These are all extremely good comments, Josh. But I would hate for Coda's valid observations to become nothing more than fodder for yet another pissing contest between people who unreasonably hate Scala and those of us who really see its benefits.

He makes some good points, ones that those of us who advocate using Scala (and I include myself in that camp) would do well to read and understand fully. Quite aside from the tuning problems, his comments on versioning and SBT strike me as on the mark. I view them as symptoms of growth in those areas, but they're real problems. Every new release of Scala sets off another round of "when will all the libraries I need be rebuilt for Scala 2.x.y?" and "OMG, I have to rebuild and rerelease all my libraries now!"

And, I have personally found the transition from SBT 0.7 to 0.10 to be a lot more effort than I expected. Porting my plugins was a hassle--and, several of them are now broken again, because of 0.11. (You and I have had these discussions offline, in the past.)

These are very real issues. For companies (such as at least one Java-using startup I worked for), the Scala versioning and SBT issues, alone, would've had serious impact. The more open source Scala libraries one incorporates into one's infrastructure, the more likely these issues are to arise.

Again, I see these problems as, largely, growing pains issues. But we cannot simply handwave them away.

J. Suereth said...

@Brian - I don't mean to ignore those points at all. I agree with you wholeheartidly. Those are issues we are actively trying to address. I have a few project (you can see them on github) that should be coming live early next year that will help. We're taking the comments seriously.

@Ismael - Github move should happen shortly, so I very much look forward to your patches! Thanks for the correction.

Anonymous said...

It was informative post, thank you for that. However, sentences like the following take away a lot from what you say:

"Scala will open up possibilities for writing code that you'd never dream of trying to maintain in C++ or Java."

This kind of fuzzy, generalizing piece of text feels very marketing language. As a programmer that does not care what language is better as long as there is good ones, I shudder at these attempts to put one's favorite language above others, and trying to persuade others to be convinced.

Jason Zaugg said...

I'm writing a (medium) Scala application that is performance sensitive. It does take a bit of work with the profiler to carefully avoid boxing on the critical paths, and to use while loops in some places. That said, we've matched the raw performance of the (large) C++ codebase we're replacing.

There are definitely plenty of little things the compiler/library could do to help out here: optimize away implicit classes, use AnyRef specialization so (A => Double) was unboxed, specialize Option, optimize for-loops over arrays to while loops.

On the macro scale, we've had the confidence to implement algorithmic improvements: we found ways to reuse parts of our simulations for many calculations, and we parallelize more of the code that was conceivable before. This means we can do 20x less work for certain problems, and keep 24 core servers busy.

Jesper Nordenberg said...

Good post, but we shouldn't forget that there is a real problem here. The performance hit when writing "functional" code in Scala is real and can be an issue for some types of applications. With the correct compiler optimizations there's no reason why functional code should perform worse than corresponding imperative code. Without these optimizations programmers will resort to good old imperative loops and the benefit of using Scala is nullified.

Unfortunately many of the optimizations needed can't be performed by the Scala compiler, they must be performed by Hotspot or by extending the JVM spec. JVM bytecode is a bad target for an optimizing compiler because it lacks some crucial constructs (like value types) which are found in other intermediate representations like LLVM and even CLR. So, the ball is really in Oracle's hands, and the progress on JVM improvements so far hasn't exactly been stellar.

J. Suereth said...

@Anonymous That fuzzy general piece of advice would require a lot of backing, more than necessary for this article. It's not just a Scala vs. Java thing, although that was the focus for this article.

I feel Clojure offers the same kind of benefit. Lots of new power to express your design at a higher level with the ability to "drill down" into Java when you need higher performing sections.

Note: I even like haskell and ML for native code. If ML was a bit more stable and had an industry focused, I would have heavily pushed it at google as a replacement for C++. It's about powerful high-level langauges replacing low-level languages. Imperative Java/C++ etc. are the Assembly of today (Yes that's an overly strong statement to which I only agree in part).

@Jason + @Jesper - EXACLTY! Those of us who use Scala aren't surprised by some these comments, as we've been living them. We're taking these all very seriously and working in that direction. I can't wait for the future of Scala, since I think optimisations and performance will just improve over time. @Jason you bring up a lot of good points. I hope to see such things in the future. My part in this is getting our test suite to the point where we can check these optimisations on an ongoing basis to ensure that we have both micro-throughput on certain classes of optimisation and *macro* (total-runtime) throughput improvements...

Josh Marcus said...

As someone who has worked on a high performance Scala application for over a year now, and who has brought up performance of this kind as a pain point with Typesafe folks before, I have to say that I've never gotten the impression that generating optimized byte code was a core priority of the core Scala developers. To the contrary, my sense from the mailing list is that the general stance has been that HotSpot will optimize away the pain points, which more often than not turns out to not really be a sufficient answer (and is frustrating).

It's not surprising that optimizing core loops takes low-level magic and hackery, but I think it's fair to say that optimization and performance has not appeared as a significant priority of those moving Scala forward. And that can be very painful to those of us working through optimization pain on our own projects.

Another good question: where are these optimization pointers listed, and are they really consistent (or tested) from version to version?

All of that said, your main point is absolutely true. And while I need to iterate over large collections (which, in the end, have to be arrays) quickly, it's true that most programming doesn't require that.

But, for all of its flaws, micro-optimization could use some love.

Anonymous said...

To be honest, this laundry list of pitfalls and caveats brings bad very ugly memories of similar lists for C++ a decade ago.

Not a good sign...

I root for you, guys, and I hope you can pull Scala out of the hole it's slowly sinking into, but I'm growing more and more disillusioned that Scala can be salvaged at this point.

Anonymous said...

@Josh - The fact that the Typesafe folks reached out to Coda @ Yammer (and are continuing to work with him) shows they have at least some interest in the performance problems.

@Anonymous - "...this laundry list of pitfalls and caveats brings bad very ugly memories of similar lists for C++ a decade ago." Hmmm... people were complaining about Java's performance a decade ago. You'll have to go back 2 decades for C++'s dirty laundry. And three decades for C. See a pattern?

Anonymous said...

While some "grow disillusioned," others produce astonishing results that are simply impossible to do in Java. After programming in C (since 1985), then C++, then Java, and now Scala, and having used Scala as much as possible for the past several years, I have to say that working in Java is like eating at McDonald's instead of Ruth's Chris Steak House. You won't starve at either, but one will certainly be a heck of a lot more enjoyable.

Perspectives on Technology, Software Development and Life said...

I think it feels like history repeating itself. People looked at Java 1 and they said Wow that's nice and then there were detractors who would say, hey but C++ is faster, I can get people to code C++ and so on..and then Java improved on performance after until Java 5. I think the fact that people are now finding problems with performance of some of the Scala constructs is next step in the evolution of language because they are now zooming in on specifics. Once these set of problems are fixed for them..they will zoom in again find more problems. Thats where probably Typesafe would have to work hard day and night to fix all these or come up with Effective Scala(?).