Not fast enough?
Firstly the disclaimer: this is a very geek-heavy post. If this isn't your thing, move along.

Recently there was another outbreak of the "Ruby isn't fast enough" thread on RubyTalk. This topic is a perennial favourite with visiting Java/Perl/Python hackers and usually focuses on the inadequacies of the current C implementation of the language, which is an old-style interpreter as opposed to a shiny modern (well, modern in 1965) virtual machine.

Personally I don't see the point of these "mine is bigger than yours" arguments that go on between zealous proponents of various languages, as they miss the real issue of speed: which language is most semantically efficient for solving a particular problem, and whether or not that semantic efficiency scales to other problem categories. Anything beyond that is a matter of individual implementation and thus tractable to improvement.


One of the many ways of differentiating languages is to consider whether or not they are converted into the machine code that a computer actually understands, as opposed to a form that must be interpreted by another program. The former are referred to as compiled languages whilst the latter are interpreted languages.

For specific categories of application - generally those revolving around fixed-precision mathematical operations or direct manipulation of memory - an efficient compiled language will perform much better than an interpreted one. Hence the dominance of C/C++ amongst game developers. However for more general programming tasks these advantages may be magnified or nullified by real-world limitations: I/O bottlenecks, context switch latencies, cache hit efficiency, predictive branch failures and all the other arcane considerations that most programmers completely ignore because they're beyond their control.

Whilst compiled languages can often perform much faster than interpreted languages under optimal conditions, they do so at a cost: the encumbrance of a pedantic and verbose syntax for memory handling. Pointers, static types, word-oriented mathematical operations and the like deliver amazing performance, but they also greatly increase the complexity of development. In addition a program will have to be compiled for each platform that it targets, and may require platform-specific variations in the source code.

Now let's take a look at the other approach to programming.

Interpreted languages generally fall into two categories: those like Java that are actually compiled languages but generate an interpreted intermediate representation called byte code, and those in which the interpreter tokenises the source into a more efficient internal format expressing its parse tree. There is a certain gradation between these two extremes, and not all of the languages that use byte code are compiled in the strictest sense.

When people complain about the current C implementation of Ruby they are generally upset because the interpreter does not support a byte code format. It is therefore assumed that it will be too slow to do any task with a significant processing component. Now I've spent quite a lot of my spare time over the years grubbing around inside the guts of various interpreters as the idea of language interpretation fascinates me - in fact it's been my single biggest interest in computers since 1985 or thereabouts.

At one level these critics are raising a valid concern. It's nearly a decade since just-in-time compilation of byte code went mainstream in the wake of the rapid uptake of Java, and Smalltalk has been demonstrating the power of this approach since the 1980s. A well-chosen byte code can be transformed into machine code with relative ease, and as a result today we see a crop of Java runtimes that come close to the speed of C++. In the case of Java, though, JIT-compilation is only half the story - it also comes with a highly-optimised native run-time which significantly boosts performance when dealing with IO resources.

If we ignore JIT-compilation, the speed-up for application code in being compiled to a byte code format comes primarily from stripping the syntactic representation of the program out of the interpretation process, focusing purely on a linear stream of simple primitive instructions. This is great if the application needs to perform typical number-crunching tasks with clearly defined immediate data types (data types whose value is identical to their in-memory representation) but becomes less significant when playing with more advanced data types. The richer the semantics of the complex data types provided as standard with a language, the longer the sequence of byte code instructions required to perform these operations and therefore the more rigorous the runtime optimisation of the JIT-compiled code must be.
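To make that concrete, here is a throwaway Ruby sketch of the idea - the run method and the :push, :add and :mul opcodes are pure invention for illustration, and bear no relation to MRI, YARV or any real byte code format. The point is simply that once the syntax of "1 + 2 * 3" has been stripped away, all that remains is a flat stream of primitive instructions for a dispatch loop to step through:

# Toy byte code dispatch loop: each instruction is an [opcode, operand] pair.
def run(bytecode)
  stack = []
  bytecode.each do |op, arg|
    case op
    when :push then stack.push(arg)                    # immediate value straight onto the stack
    when :add  then stack.push(stack.pop + stack.pop)  # primitive arithmetic on the stack
    when :mul  then stack.push(stack.pop * stack.pop)
    else raise "unknown opcode #{op}"
    end
  end
  stack.pop
end

# "1 + 2 * 3" hand-compiled into postfix order
puts run([[:push, 1], [:push, 2], [:push, 3], [:mul], [:add]])   # => 7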

With a classic interpreter design we see the exact opposite trend - in principle manipulating a complex data type requires no more run-time overhead than performing a basic numeric operation. As a result number-crunching becomes a costly operation compared to a byte code implementation, but this is counterbalanced by the ability to express many complex systems in a more natural manner. A simple example of this is a Hash data type, which is a way of storing items in memory based upon indexing an identifying key as opposed to requiring sequential layout. In the majority of byte code formats the instructions to add a data item to a Hash would consist of sequences of byte codes, whereas in a classic interpreter a single token would be required.
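For anyone who hasn't met the type, here's a throwaway Ruby example of a Hash in use - nothing here is specific to any particular interpreter:

# Items are stored and fetched by key, with no assumption about how they
# are laid out relative to one another in memory.
ages = { "alice" => 34, "bob" => 27 }   # a Hash literal
ages["carol"] = 41                      # add an item under a new key
puts ages["bob"]                        # => 27, looked up by key rather than by position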

There is a middle way between the two extremes, as demonstrated by a category of language implementations known as Threaded Interpreters. The classic threaded interpretive language is Forth - although not all Forth implementations follow this approach. A TIL generally relies on a very simple virtual machine which allows two distinct types of code to be stored in memory and executed as control moves through an application: primitive operations are sequences of native machine code instructions that are executed directly, whilst definitions are sequences of tokens referencing either primitive operations or other definitions. Whilst this sounds like a potentially complex and inefficient approach to language implementation, TILs have proven themselves to be highly effective for embedded and real-time applications.
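As a sketch of the shape of a TIL, here's a toy threaded interpreter written in Ruby. The word names and the PRIMITIVES/DEFINITIONS tables are invented for illustration; a real TIL threads through machine addresses rather than Ruby symbols and lambdas:

# Toy threaded interpreter. PRIMITIVES stand in for short runs of native code;
# DEFINITIONS are nothing but sequences of tokens naming primitives or other
# definitions, exactly as described above.
STACK = []

PRIMITIVES = {
  :dup  => lambda { STACK.push(STACK.last) },
  :mul  => lambda { STACK.push(STACK.pop * STACK.pop) },
  :emit => lambda { puts STACK.pop }
}

DEFINITIONS = {
  :square       => [:dup, :mul],       # a definition built from primitives
  :print_square => [:square, :emit]    # a definition built from another definition
}

# The "inner interpreter": run a token either as native code or by threading
# through the tokens of its definition.
def execute(token)
  if (prim = PRIMITIVES[token])
    prim.call
  else
    DEFINITIONS.fetch(token).each { |t| execute(t) }
  end
end

STACK.push(7)
execute(:print_square)   # => 49

The thing to notice is the shape: primitives do the actual work, definitions are bare token lists, and the inner interpreter is a handful of lines.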

In principle any byte code format that allows for indirect code execution (executing instructions based upon a reference to the memory location at which they exist) can be used to implement a TIL efficiently, and likewise the minimal virtual machine required by a TIL can be used to implement any other virtual machine. However a TIL can also be used to implement a classic tokenising interpreter relatively efficiently with the aid of a tokenising loader, making the technique a good bridge between two camps that are often mutually antagonistic.
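Continuing the toy example above, here is one (deliberately simplistic) reading of what a tokenising loader does: it turns textual source - here a naive Forth-style colon-definition syntax, invented purely for illustration - into token sequences once at load time, so the inner interpreter never has to look at program text again:

# Toy tokenising loader: ": name body ;" becomes an entry mapping the name to
# its list of tokens, which could feed the DEFINITIONS table in the sketch above.
def load_definitions(source, definitions = {})
  source.scan(/:\s+(\S+)\s+(.*?)\s+;/m) do |name, body|
    definitions[name.to_sym] = body.split.map { |word| word.to_sym }
  end
  definitions
end

p load_definitions(": square dup mul ; : quad square square ;")
# => {:square=>[:dup, :mul], :quad=>[:square, :square]}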


The state of play with Ruby is becoming increasingly complicated. The baseline implementation is known as MRI (Matz's Ruby Interpreter) and is a pure C tokenising interpreter, which aids portability; then there's JRuby for the Java Virtual Machine, Microsoft has announced IronRuby for the CLR, and Queensland University of Technology have their Gardens Point .NET compiler. Getting a genuine feel for the speed of Ruby code is thus becoming increasingly difficult, but I think MRI is still a good place to focus our attention as it provides support for the widest range of platforms and is by far the most widely used at the time of writing.


Here's a naive benchmark I cooked up to test enumeration and basic (big) integer math as I've recently been doing a lot of both and wanted to see the difference across various legacy machines I have lurking in my house:


require 'benchmark'
include Benchmark

# Build the ranges 1..10, 1..100, ... 1..100000 and benchmark each in turn.
bm(10) do |x|
  (1..5).inject([]) { |r, i| r << (1..(10**i)) }.each do |r|
    $stdout.puts "range 1 to #{r.end}"

    # constant operands: a baseline for block and iterator overhead
    x.report("1 + 1\t") { r.each { |i| 1 + 1 } }
    x.report("1 * 1\t") { r.each { |i| 1 * 1 } }
    x.report("sqrt 1\t") { r.each { |i| Math.sqrt 1 } }
    x.report("sum 1\t") { r.inject { |sum, i| sum + 1 } }
    x.report("collect 1\t") { r.collect { |i| 1 } }
    x.report("array 1") { r.each { |i| Array.new(1, 1) } }
    x.report("product 1\t") { r.inject(1) { |sum, i| sum * 1 } }
    x.report("1!\t") { r.inject { |sum, i| 1 == 1 ? 1 : sum * 1 } }

    # operands drawn from the range: the running sums and products grow into bignums
    x.report("1 + i\t") { r.each { |i| 1 + i } }
    x.report("1 * i\t") { r.each { |i| 1 * i } }
    x.report("sqrt i\t") { r.each { |i| Math.sqrt i } }
    x.report("sum i\t") { r.inject { |sum, i| sum + i } }
    x.report("collect i\t") { r.collect { |i| i } }
    x.report("array i") { r.each { |i| Array.new(i, 1) } }
    x.report("product i\t") { r.inject(1) { |sum, i| sum * i } }
    x.report("i!\t") { r.inject { |sum, i| i == 1 ? 1 : sum * i } }
  end
end


This kicks out the following results for r = 1 to 100000:

# system: 1.66GHz Intel Core Duo, 2GB RAM
# MacOS X 10.4.7, ruby 1.8.5
# approx. 99%+ CPU utilisation on 1 core

test          user (s)    system (s)   total (s)    real (s)     operations per second
1 + 1         0.050000    0.000000     0.050000     0.045118     2,216,410
1 * 1         0.040000    0.000000     0.040000     0.045704     2,187,992
sqrt 1        0.080000    0.000000     0.080000     0.073925     1,352,722
sum 1         0.130000    0.000000     0.130000     0.135566     737,648
collect 1     0.060000    0.000000     0.060000     0.063127     1,584,108
array 1       0.130000    0.000000     0.130000     0.132463     754,927
product 1     0.130000    0.000000     0.130000     0.140114     713,704
1!            0.140000    0.000000     0.140000     0.140045     714,056
1 + i         0.050000    0.000000     0.050000     0.048142     2,077,188
1 * i         0.050000    0.000000     0.050000     0.047560     2,102,607
sqrt i        0.070000    0.000000     0.070000     0.076799     1,302,100
sum i         0.190000    0.000000     0.190000     0.188510     530,475
collect i     0.060000    0.010000     0.070000     0.064141     1,559,065
array i       15.580000   34.030000    49.610000    50.274515    1,989
product i     74.660000   16.290000    90.950000    91.776474    1,089
i!            74.570000   16.340000    90.910000    91.579548    1,091


For comparison this mailing list post uses the same algorithm as my product i test and quotes the following:

# system: processor as below, 1GB RAM
# MacOS X 10.3.8, ruby 1.8.2

processor    approximate time (s)    operations per second
1 GHz G4     540                     185
2 GHz G5     360                     277


Now there are two things I was interested in finding out with these tests: how good is Ruby at the standard idiom of blocks and iterators, and how well does it handle big integer arithmetic? Obviously they're not comprehensive or definitive, but I think they are quite instructive. In particular the array i test is generating 5,000,050,000 array elements in 50 seconds (which is a fraction over 100 million elements per second).

The product i test is performing approximately 1,100 calculations per second - this may not sound particularly impressive, but bear in mind that the majority of these calculations (anywhere either of the operands or the result exceeds 2³¹) are using big integer maths and that the final result contains 456,574 digits. Compare this to 1 * 1, a reasonable benchmark for real-world 31-bit integer multiplication, where 2,187,992 calculations are occurring each second. One obvious anomaly in these results is sqrt i, which converts its argument to a floating-point number, and for any bignum that exceeds this range a runtime error is generated which the benchmark consumes.
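For the curious, those headline figures are easy to sanity-check with a few lines of Ruby. This is a back-of-the-envelope check rather than part of the benchmark, and the factorial line will itself take on the order of the product i timing above to run under MRI 1.8:

# Total elements allocated by the array i test: Array.new(i, 1) for i = 1..100000
n = 100_000
elements = (1..n).inject { |sum, i| sum + i }          # == n * (n + 1) / 2
puts elements                                          # => 5000050000
puts elements / 50                                     # => 100001000, roughly the per-second rate over ~50s

# Number of digits in the final product i result, i.e. in 100000 factorial
puts (1..n).inject(1) { |product, i| product * i }.to_s.length   # => 456574 (slow, but it gets there)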


Of course the performance of mathematical operations is lousy compared to a compiled language like C, but that point has already been covered so I won't belabour it: if you want super-fast math with MRI you should code it in a C extension. The important point is that on modern hardware the number of operations that can be performed per second is more than adequate to build interactive applications of some complexity - for example a spreadsheet package suited to everyday business use is a credible concept. That surely makes MRI fast enough?


Well yes... and no. Yes it's fast enough for desktop applications on modern hardware, and adoption of Rails is proving it's fast enough in the web sphere as well, but two key areas where it falls completely short of the mark are the gaming and systems/embedded environments. This is a real pity, given not only the elegance of Ruby code but also the incredible speed boost it gives to development.

Games developers have not shown the enthusiasm for Ruby that they have for Python or Lua, and a large part of their reluctance comes back to performance issues: not only is MRI slower than the current Python and Lua runtimes, it lacks internal thread safety and this is a big concern for projects that are increasingly dependent on threaded code. Ruby 2.0 will introduce Rite, a byte-code interpreter which promises increased performance, but whilst Rite will feature native threading (something that many Rubyists are clamouring for) it appears that this will be greatly constrained so as to maintain backwards compatibility with existing C extensions. Perhaps multi-core processors will make multi-threading less popular amongst performance-hungry developers but until then interfacing with multi-threaded code will remain a stumbling block and Ruby will continue to be excluded from the game market.

More disturbing to me is the lack of traction in systems coding. Games are cool and would demonstrate Ruby's utility to a broader audience than the web developers and sysadmins who currently use it, but real geek chic is coding embedded systems and core system components. This is territory mostly dominated by C and C++, with the occasional sighting of Forth and real-time dialects of other high-level languages. Ruby should be a real boon to these developers, in the same way that Forth was to a previous generation, but the size of the runtime and the latencies introduced by the MRI design provide a high barrier to entry.

Personally I think both games and systems coding would be greatly facilitated by a threaded interpreter implementation of Ruby. The most obvious benefit of this would be in reducing the memory footprint of deployed applications, which is essential if a Ruby program is ever to make it into an embedded controller, but it would also bring tricks that we're familiar with from Forth-like languages, such as inline assembler and a lightweight thread-safe core. There would also be benefits for execution on PC-class processors: threaded code is traditionally very compact, increasing cache coherence, and various optimisation tricks such as just-in-time compilation can be used to make threaded code very fast indeed.

Now the kind of virtual stack machines that languages such as Java employ offer some of the same advantages, especially with regard to just-in-time compilation, but by modelling a static processor they also impose limitations: changing the instruction set becomes a significant undertaking, and the just-in-time compilation process requires more complex optimisation strategies to produce efficient machine code. Sun have proven with their HotSpot technology that this can produce code that performs in the same league as that generated by modern C++ compilers, by focusing the optimisation process on the most heavily executed portions of an application, at the cost of a slightly higher memory footprint. Similar techniques can be brought to bear with threaded code, and because threaded interpreters are built on a very simple virtual machine core assisted by a rich micro-code layer, it becomes possible to perform both size and performance optimisations.

I'm not saying that a threaded implementation would be a panacea for Ruby's reputation as a slow language, but for those of us interested in using it close to the metal this could be an interesting line of exploration. Because regardless of how it happens, Ruby will be breaking into the gaming and systems programming arenas at some point: the time savings it offers developers will more than offset the investment required to produce suitable runtime implementations, and the relative brevity and elegance of its code compared to C/C++ and assembler will greatly improve the ease of testing and validation.


It's my belief that one day it will be just as plausible to develop an operating system, a device controller or a next-generation 3D game in Ruby as in C/C++/C# or Java. The question is not when that will happen or why, but rather what form appropriate runtimes need to take and how they will be developed. This is a debate that needs to be going on now whilst Ruby is still on the fringe of the mainstream, before the current perceptions of it as a web application and scripting language become firmly entrenched. Such ghettoisation would be a sad fate for a language of such expressive power and beauty, just as previous generations turned away from Smalltalk and Forth in favour of static typing.


Comments
From: vorpal_t  Date: May 21st, 2007 04:43 am (UTC)

Quick scan

has anybody analysed why Lilith never led to anything mainstream?

V
From: feyeleanor  Date: May 21st, 2007 10:59 pm (UTC)

Re: Quick scan

If my own experience of Modula-2 is anything to go by I've a feeling that could have been a large part of the problem. It's a far more verbose language than C and better suited to engineers than hackers - in other words it runs contrary to the culture of the cool kids. It also has a certain ugliness that far exceeds that of Pascal, but that could be a personal bias from the two years when I coded in it!

I also suspect that its European origins weighed against it: Unix, Windows, VMS and Mac OS all have their origins in the USA and it often seems that European efforts in the OS arena come to nothing. Of course Linux runs counter to that trend, but as a Unix derivative it doesn't really count ;p

I suspect the same fate awaits Erlang, which is likewise too European (not to mention functional) for its own good.