Thursday, March 30, 2017

Four new pygame things for slow computers.

There's four things I'd like to work on for faster pygame apps on slower computers (like the Raspberry Pi/Orange Pi).
  • Dirty rect optimizations, to speed up naieve drawing.
  • SDL2, for better hardware support.
  • C/Cython versions of pygame.sprite
  • Surface.blits, a batching API for drawing many images at once.
The general idea with all of these is to take techniques which have proven to improve the performance of real apps on slower computers. Even though, for some of them there are still open questions, I think they are worth pursuing.  Even though more advanced techniques can be used by people to work around these issues, this should be fast even if people do things the easy way on slower computers.

Anyway... this is a summary of the discussions and research on what to do, and a declaration of intent.

Dirty Rect optimizations.

First a couple of definitions. "Dirty Rect" is a technique in graphics where you only update the parts that changed (the dirty parts), and rect is a rectangle encompassing the area that is drawn.

We already have plenty of code for doing this is pretty fantastic ways. LayeredDirty is a particularly good example, which includes overlapping layers and automatically deciding on rendering technique based on what sort of things you're drawing. However, when people just update the whole screen like in pygame zero (the project which has a mission to make things simple for newbies), then there can be performance issues. This change is aimed at making those apps work more quickly without them needing to do anything extra themselves.

So, on to the technique...

Because rectangles can overlap, it's possible to reduce the amount drawing done when using them for dirty rect updating. If there's two rectangles overlapping, then we don't need to over draw the overlapping area twice. Normally the dirty rect update algorithm used just makes a bigger rectangle which contains two of the smaller rectangles. But that's not optimal.

But, as with all over draw algorithms, can it be done fast enough to be worth it?

Here's an article on the topic:

jmm0 made some code here:

DR0ID, also did some with tests and faster code...

So far DR0ID says it's not really fast enough to be worth it in python. However, there are opportunities to improve it. Perhaps a more optimal algorithm, or one which uses C or Cython.

"worst case szenario with 2000 rects it takes ~0.31299996376 seconds"
If it's 20x faster in C, then that gets down to 0.016666. Still too slow perhaps, but maybe not.

For our use case, there might only be 2-10 things drawn. Which seems way under the worst case scenario for performance. Additionally we can use some heuristics to turn it off when it is not worth doing. Like when the whole screen is going to be updated anyway, we don't do it.

Below is one of the cases that benefits the most.
Currently the update method combines two rects like this:
|     |
|   __|____
|___|_|   |
    |     |
    |     |

Into a big rect like this:
|     ####|
|     ####|
|   XX    | # - updated, but no need.
|###      | X - over drawn.
|###      |

Note in these crude diagrams the # area is drawn even though it doesn't need to be, and the XX is over draw. When there are some overlapping rects like this that are large, that can be quite a saving of pixels not needed to be drawn. But what we want is three rects, which do not give us overdraw and do not needlessly update pixels which we do not need to.

|     |
|___|     |
    |     |
    |     |

This can easily save having to draw millions of pixels. But can it be done fast enough, with the right heuristics to be useful?

For a lot of people, and apps, this won't be useful. But I hope it will for those trying to draw things for the first time on slow computers.

There's some more discussion on the pygame mailing list about it, drawbacks, and links to old school Apple region rendering techniques.

SDL 2 hardware support

Version 2 of SDL contains hardware support which makes it faster on some platforms (for some types of apps). This includes the Raspberry PI.

There are actually a few different modules that already have implemented a pygame API subset using SDL2. Which shows me compatibility is important to some.

The approach I want to try going forward is to use a single source SDL1, and SDL2 code base with a compile time option. (like we have single source py2 and py3). There are new SDL2 APIs for hardware acceleration, which can be added later in a way which fits in nicely with our existing API.

Lennard Linstrom has made patches for pygame using SDL2 available here for some years:

The first step is to do an ifdef build for linux, and then do some more testing to confirm areas of compatibility and that the approach is ok.

Additionally we agreed that using Cython is a good idea.

There's quite a long discussion on the pygame mailing list, with more to it than this. But this is the general idea.

C/Cython versions of pygame.sprite

Sprites are high level objects which represent game objects. Graphics with logic.

I've already mentioned things like LayeredDirty, which does all sorts of back ground optimizations and scene management for you.

A lot of this code is in python, and could do with some speed ups. Things like collision detection, and drawing lots of objects are not always bottlenecked by blitting. We've known this ever since the psyco python jit came out. There are other techniques to work around these things, like using something like pymunk for physics, or using a tile based renderer, or using a fast particle renderer. However people still use sprites, for ease of use mainly.

So the plan is to compile them with Pyrex...  err... I mean Cython, and see what sort of improvements we can get that way. I expect naieve collision detection, and drawing will get speed ups.

Surface.blits, a batching API for drawing.

Another area that has been proven to speed up games is by creating a batching API for drawing images. Rather than draw each one at a time, draw them all at once. This way you can avoid the overhead of repeatedly locking the display surface, and you can avoid lots of expensive python function calls and memory allocations.

The proposed names over the years have included blit_list, blits, blit_mult, blit_many call...

I feel we should call it "Surface.blits". To go with drawline/drawlines.
It would take a sequence of tuples which match up with the blit arguments.

The original Surface.blit API.

  blit(source, dest, area=None, special_flags = 0) -> Rect

The new blits API.
    blits(args) -> [rects]
    args = [(source: Surface, dest: Rect, area: Rect = None, special_flags: int = 0), ...]
    Draws a sequence of Surfaces onto this Surface...

    >>> surf.blits([(source, dest),
(source, dest),
(source, dest, area),
(source, dest, area, BLEND_ADD)]
    [Rect(), Rect(), Rect(), Rect()]

One potential option...
  • Have a return_rects=False argument, where if you pass it, then it can return None instead of a list of rects. This way you avoid allocating a list, and all the rects inside it. I'll benchmark this to see if it's worth it -- but I have a feeling all those allocations will be significant. But some people don't track updates, so allocating the rects is not worth it for them. eg. the implementation from Leif doesn't return rects.

It can handle these use cases:
  • blit many different surfaces to one surface (like the screen)
  • blit one surface many times to one surface.
  • when you don't care about rects, it doesn't allocate them.
  • when you do care about update tracking, it can track them.
It can *not* handle (but would anyone care?):
  • blit many surfaces, to many other surfaces.
Areas not included in the scope of this:
  • This could be used by two sprite groups quite easily (Group, RenderUpdates). But I think it's worth trying to compile the sprite groups with Cython instead, as a separate piece of work.
  • Multi processing. It should be possible to use this API to build a multi process blitter. However, this is not addressed in this work. The Surface we are blitting onto could be split up into numberOfCore tiles, and rendered that way. This is classic tile rendering, and nothing in this API stops an implementation of this later.

There's an example implementation by Leif Theden here:

There is always a trade off between making the API fast, and simple enough to be used in lots of different use cases. I feel this API does that. But more benchmarking and research needs to be done first.

There's some discussion of the blits proposal on the pygame mailing list.

Other performance related things.

There was also some discussion of using the quad CPUs on the newer raspberry PI, which are faster than the video hardware on there for some tasks. Unfortunately that won't help the millions of older ones, and even newer ones like the new zero model. I've seen those CPUs outperform the hardware jpeg, and also when used for image classification. However, like taking advantage of METH_FASTCALL in python 3.6, that will have to wait for another time. Separately there's also work going on by other people on other optimization topics. An interesting one is the libjit based graphics blit accelerator.

1 comment:

Luke Codewalker said...

It would be great to see all this happen. Especially the SDL2.