Archive

Archive for the ‘IronPython’ Category

IronPython hammers CPython when not mutating class attributes

May 22nd, 2009 Robert Smallshire 6 comments

Earlier today I posted the second article in what is turning out to be a short series in the investigation into why the performance of IronPython is around 100× slower than CPython, when running the front-end of my OWL BASIC compiler.

The most informative comment was from Curt Hagenlocher who works on IronPython in the Visual Studio Managed Languages group at Microsoft.

Curt suggested,

Try storing the counter as a global variable instead of a class-level member of Node — I think you’ll notice a dramatic improvement.

The modified benchmark program looks like this:

counter = 0

class Node(object):

    def __init__(self, children):
        global counter
        counter += 1
        self._children = children

def make_tree(depth):
    if depth > 1:
        return Node ([make_tree(depth - 1), make_tree(depth - 1)])
    else:
        return Node([])

def main(argv=None):
    global counter
    if argv is None:
        argv = sys.argv
    depth = int(argv[1]) if len(argv) > 1 else 10

    root = make_tree(depth)
    print counter
    return 0

if __name__ == '__main__':
    import sys
    sys.exit(main())

A dramatic improvement!

Well, Curt wasn’t wrong. This made a phenomenal difference with IronPython completing in only 12% of the time taken by CPython – over 8× faster with a binary tree depth of 20.

Let’s look in detail at the results. All results are from a dual quad-core 1.86 GHz Xeon with 4 GB RAM, and as before each benchmark was run five times, and the shortest time of the five taken.

The three test environments are:

  1. Python 2.5.2 x86 32-bit
  2. Jython 2.5rc2 on Java Hotspot 1.6 32-bit
  3. IronPython 2.0 on .NET 2.0 x64
The relative performance of the three main Python implementations on a benchmark that uses a global counter, rather than mutating a class attribute.

Figure 1. The relative performance of the three main Python implementations on a benchmark that uses a global counter, rather than mutating a class attribute.

Here we can see how IronPython’s performance has been improved hugely by this simple change. Although startup time dominates for the smaller problem size, now both Jython and IronPython surpass CPython at around half-a-million nodes.

Removing start-up time, which may be irrelevant for long-running processes, gives us the following chart:

The performance of the three main Python implementations excluding start-up time.

Figure 2. The performance of the three main Python implementations excluding start-up time.

Again there is a lot of noise in the data below 1000 nodes, but it is clear that Jython scales better than IronPython, which in turn is scaling better than CPython.

Up until now I’ve been using a log-log scale in the charts because of the wide variation in performance between the different implementations, but now the performance gap is much closer, it’s difficult to get a sense of just how much faster IronPython is on the modified benchmark. Let’s throw in a log-linear plot to help us appreciate what’s going on:

Figure 3. A linear representation of the same data as in Figure 2, to highlight the performance multiple between IronPython and CPython in the larger tests.

Figure 3. A linear representation of the same data as in Figure 2, to highlight the performance multiple between IronPython and CPython in the larger tests.

It’s perhaps easier to see now that IronPython is doing in 14 seconds what takes CPython 114 seconds to achieve!

Finally, let’s plot those results as we did before, as multiples of CPython performance:

Figure 4. Execution time of Jython and IronPython as multiples of CPython performance.

Figure 4. Execution time of Jython and IronPython as multiples of CPython performance.

It is easy to see that in this chart, once we pass half-a-million tree nodes (a tree depth of 19) that both Jython and IronPython are significantly beating CPython.

Explanation

Curt Hagenlocher offers the explanation in a comment in the thread on Reddit.

In this particular case, IronPython is slow because of the update to Node.counter. Currently, any update to a class will increment the version number for the class, which will have the effect of invalidating any rules compiled for that class. Effectively, the same rules are getting compiled over and over again. Moving the counter to a global should result in performance on par with that of CPython.

which is absolutely correct, except that he’s underselling the relative gain. IronPython is not only on a par with CPython, it can outperform it by a factor of eight!

With this knowledge in hand, I can now approach optimization of my OWL BASIC compiler, which lies back at the start of this illuminating tale.

Conclusion

  • Avoiding mutation of Python class attributes can have significant benefits for IronPython performance.
  • Both IronPython and Jython scale better than CPython by this benchmark, and have superior performance for large trees of nodes.
Categories: .NET, computing, IronPython, Jython, Python, software Tags:

IronPython 2.0 and Jython 2.5 performance compared to Python 2.5

My previous post covering the performance problems I’ve been experiencing with IronPython raised some questions about whether the low performance was an effect peculiar to my system, or to my program — the OWL BASIC compiler — where the problem was first noticed. To briefly recap, I’d determined that IronPython was around 100× slower that CPython on the same program.

Since then, I’ve had time to reproduce the results with a small and completely unremarkable Python program, and also to run the tests on a different system. I had suspected that in the OWL BASIC compiler, my Python visitor implementation, which is used in applying transformations to the abstract syntax tree, was to blame. I set about condensing a tree visitor down to a small example, but I never got that far. It is sufficient to simply build a large binary tree to demonstrate the dramatic differences in the performance characteristics of the three main Python implementations.

The benchmark

Here is that test program, which just builds a simple binary tree of objects to the requested depth.

class Node(object):
    counter = 0

    def __init__(self, children):
        Node.counter += 1
        self._children = children

def make_tree(depth):
    if depth > 1:
        return Node ([make_tree(depth - 1), make_tree(depth - 1)])
    else:
        return Node([])

def main(argv=None):
    if argv is None:
        argv = sys.argv
    depth = int(argv[1]) if len(argv) > 1 else 10

    root = make_tree(depth)
    print Node.counter
    return 0

if __name__ == '__main__':
    import sys
    sys.exit(main())

The program builds a binary tree to the depth supplied as the only command line argument, or ten if one is not supplied. It counts the number of nodes as they a built. Remember that the merits or otherwise of this program are not the point! The point is the performance difference between the Python implementations when it is run.

My benchmarking approach has been to run this script five times for each tree depth from a depth of one, upwards to 22, or until my patience was exhausted. I’ve taken the minimum time from each run of five. Since there is a non-linear relationship between the depth of the tree and the number of nodes contained therein, logarithmic axes are used in all the charts that follow.

64 bit Windows Vista x64

Here are the results for the first test machine – with dual quad-core 1.86 GHz Xeons with 4 GB RAM running Vista x64, testing IronPython 2.0.0.0 on .NET 2.0, Jython 2.5rc2 on Java Hotspot 1.6.0 and Python 2.5.2.

Create time for a binary tree including Python virtual machine startup on Windows Vista x64 with 1.86 GHz Xeon processors.

Figure 1. Creation time for a binary tree including Python virtual machine startup on Windows Vista x64 with 1.86 GHz Xeon processors.

In Figure 1 we see that above 1000 nodes or so (tree depth of 10) performance for IronPython begin to degrade rapidly. CPython holds out for another two orders of magnitude before the significant costs begin to be felt . Its interesting to see that although Jython is in the middle of the pack, it scales much better than CPython, surpassing it at around half-a-million nodes (tree depth of 19).

In my application — a compiler — virtual machine (VM) start-up time is important; however, in many long-running applications this is not the case, so it is interesting to subtract VM start-up time from each series, which we see in Figure 2, below.

By subtracting VM start-up time, we get a picture more interesting for long-running processes.

By subtracting VM start-up time, we get a picture more interesting for long-running processes.

Below 100 tree nodes, there is a lot of noise in these measurements. Above 100 nodes its easy to see that the blue IronPython curve is at least two chart divisions above the red CPython curve — that’s two orders of magnitude or 100× slower, and getting relatively worse as the size of the tree increases.

32 bit Windows XP x86

Responses to my earlier article suggested that trying IronPython 2.0.1 with Ngen’ed binaries on x86 may make a difference. Well, to cut a long story short, it doesn’t, but here are the details. These tests were run on a 900 MHz Pentium M Centrino laptop with 768 MB RAM, and so cannot be directly compared with those above, although its notable that a one year old workstation is only twice as fast as a five year old laptop. Moore’s law certainly isn’t delivering here!

The performance profiles are very similar with IronPython 2.0.1 on x86.

The performance profiles are very simular with IronPython 2.0.1 on x86.

On x86, IronPython is still 100× slower than CPython, and Jython still scales better. It seems the essence of this benchmark is not dependent on which hardware or CLR platform it is run.

I’ll close by re-presenting the data in the x86 benchmarks as multiples of CPython performance, because it dramatically demonstrates the different responses to the scale of the problem size for IronPython and Jython. Again we see Jython catching up with CPython at a tree depth of 19, just we saw on x64. and IronPython delivering 6000× worse than CPython at a tree depth depth of 15. A tree of this size with thirty-thousand nodes is very similar in scale to the AST tree sizes found in the OWL BASIC during compilation of large programs.

Performance of IronPython and Jython as multiples of CPython performance.

Performance of IronPython and Jython as multiples of CPython performance.

Conclusions

  • IronPython can be very slow, even on programs in the microbenchmark category, which are doing standard operations such as building trees. Presumably there are still significant optimizations to be made in IronPython to bring its performance closer to that of the other Python implementations. Hopefully, this example and the measurements can contribute to that improvement.
  • Jython may scale better than Python if your application exercises Python in similar ways to this benchmark. Speculatively, that could have implications for projects such as SCons, which build large in-memory graphs.
  • I suppose if nothing else we have demonstrated in passing that Java can be faster than C for some non-trivial programs (like a Python interpreter) running a trivial program, like this benchmark.