Stage3D stress test (b3d engine) 20,000 primitives ~ 15,000,000 triangles (updated with playable demo)

Click to start.
Mouse move to look
Shift to fly faster
Space to toggle rotation
+ to add 500 doughnuts
– to remove 500 doughnuts
m to change material (4 available)
Once started, double click to toggle fullscreen (NOTE: all keys bar the arrow keys will be disabled)

demo (requires flash player 11 download):

link to play standalone version (better experience):

Let me know how it runs (assuming it does) if you can: frame-rate, number of objects it can handle etcetera.

video 1 (looks like crap so am uploading another one):

video 2 (hopefully looks a bit better... still looks balls, ah well)

Starts to slow down with 2000–3000 and above primitives on screen 🙁 there is still room for some efficiency improvements, but not bad for now

Demonstrates how essential good culling is. Despite what the stats say whilst recording, I am able to cull a full scene of 10,000 objects in under 1 ms on the release player, no problem. Without that it would probably die; with it you can happily navigate a scene with 50,000 objects at 60fps on a good machine, as the majority are culled away, leaving perhaps only 500-1500 visible at any one time when they are spread out like they are in the demo.
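To illustrate why that per-object test is so cheap: each object reduces to a bounding sphere checked against the six frustum planes. A minimal sketch, assuming hypothetical Object3D/Plane types (this is not b3d's actual API):

```actionscript
// Sketch only: cull a scene down to its visible set using bounding spheres.
// Object3D (x, y, z, radius) and Plane (nx, ny, nz, d) are assumed types.
function cullScene(objects:Vector.<Object3D>, planes:Vector.<Plane>, visible:Vector.<Object3D>):void
{
    visible.length = 0;
    for each (var obj:Object3D in objects)
    {
        var inside:Boolean = true;
        for each (var p:Plane in planes)
        {
            // signed distance from the sphere centre to the plane
            var dist:Number = p.nx * obj.x + p.ny * obj.y + p.nz * obj.z + p.d;
            if (dist < -obj.radius) { inside = false; break; } // fully behind one plane: culled
        }
        if (inside) visible.push(obj);
    }
}
```

At 10,000 objects that is at most 60,000 multiply-add plane tests per frame, which is consistent with the sub-millisecond figure quoted above.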

20 replies on “Stage3D stress test (b3d engine) 20,000 primitives ~ 15,000,000 triangles (updated with playable demo)”

Processor: Intel Xeon CPU E5405 @ 2GHz (2 processors)
Ram: 12GB
OS: Windows 7 (64bit)
Browser: Chrome 17.0.9
Flash Player: 11.1 r102 (release)
GFX: NVIDIA GeForce 9800 GTX

It’s a decent(ish) machine on paper, but in practice I think it’s pretty crap. A decent i7 would chew through that demo like a fat chick through a doughnut, I would imagine.

With any luck I will post the demo online tomorrow – get it tested by a few folks on different machines.

I think if a close-up of my blood was that pink and cheerio-esque I would be straight down the doctor’s!

i7 920 – 6GB, ATI 5870 GPU

solid 30 fps with roughly 15000 rendered in view (pulled back using s)

CPU utilization sat around 17%
GPU utilization almost 50%


Thanks for the info, will update the demo soon to allow for the ability to pause the camera movement and swap between a few primitives

iMac i7 ATI Radeon HD 6970M OpenGL
60 fps fullscreen (2560 x 1440)
1050 visible objects (above that it starts to dip below 60fps)

That was about 800,000 triangles.

When I compile a simple program I made with Stage3D, I am only able to display around 4000 triangles before it starts dipping below 60 fps. Do you have any clue why I would get 200x less performance? It runs with these characteristics:

– uses hardware : OpenGL Vendor=ATI Technologies Inc. Version=2.1 ATI-7.18.11 Renderer=ATI Radeon HD 6970M OpenGL Engine GLSL=1.20 (Direct blitting)
– wmode = direct
– antialias = 0
– allowdebug = false
– enableDepthAndStencil = true

The main loop looks something like this:

public function renderParticles(context3D:Context3D, shader:Program3D, particles:Vector.<Particle>, worldviewproj:Matrix3D):void {
    context3D.setVertexBufferAt(0, vertexbuffer, 0, Context3DVertexBufferFormat.FLOAT_3);
    context3D.setVertexBufferAt(1, vertexbuffer, 3, Context3DVertexBufferFormat.FLOAT_2);
    context3D.setTextureAt(1, texture);

    var m:Matrix3D = new Matrix3D();
    for each (var p:Particle in particles) {
        m.identity(); // reset per particle so transforms don't accumulate
        m.appendRotation(p.rot, p.rotAxis);
        m.appendTranslation(p.x, p.y, p.z);
        context3D.setProgramConstantsFromMatrix(Context3DProgramType.VERTEX, 0, m, true);
        context3D.drawTriangles(indexbuffer); // one draw call per particle
    }
}

I have the feeling I missed some essential checkbox somewhere…

Thanks for the info, am writing a small article at the moment actually that will cover some of the things you need to do to get the most out of Stage3D, hopefully ready by tomorrow so do check back*.

How many particles are in your particles list? If it’s more than 1000, that’s a lot of calls to the drawTriangles method, which can be a huge bottleneck.

*Was too busy to finish it, so it’s on the back burner until I get a less hectic period 🙁

@bwhiting : thanks for your reply!

Yeah I think that’s it. Right after writing I went back and figured that might make a difference. So I changed my code to draw patches of quads instead of single quads. Varying the amount of objects and quads-per-object, the sweet spot seems to be around 2000 objects of 500 quads each, reaching over 2M triangles per frame @60fps.
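For anyone hitting the same wall, the batching the commenter describes can be sketched roughly like this (the quad layout and counts are illustrative, not the actual code): pack many quads into one vertex/index buffer pair so a whole patch renders with a single drawTriangles call.

```actionscript
// Sketch: build one batch of quadsPerBatch unit quads sharing a single buffer pair.
// 4 verts x 5 floats (x, y, z, u, v) per quad; 6 indices per quad.
// Note: quadsPerBatch * 4 must stay below the 65535-vertex buffer limit.
function buildQuadBatch(context3D:Context3D, quadsPerBatch:int):IndexBuffer3D
{
    var verts:Vector.<Number> = new Vector.<Number>();
    var indices:Vector.<uint> = new Vector.<uint>();
    for (var i:int = 0; i < quadsPerBatch; i++)
    {
        var base:uint = i * 4;
        verts.push(-0.5, -0.5, 0, 0, 1,   0.5, -0.5, 0, 1, 1,
                    0.5,  0.5, 0, 1, 0,  -0.5,  0.5, 0, 0, 0);
        indices.push(base, base + 1, base + 2, base, base + 2, base + 3);
    }
    var vb:VertexBuffer3D = context3D.createVertexBuffer(quadsPerBatch * 4, 5);
    vb.uploadFromVector(verts, 0, quadsPerBatch * 4);
    var ib:IndexBuffer3D = context3D.createIndexBuffer(quadsPerBatch * 6);
    ib.uploadFromVector(indices, 0, quadsPerBatch * 6);
    context3D.setVertexBufferAt(0, vb, 0, Context3DVertexBufferFormat.FLOAT_3); // position
    context3D.setVertexBufferAt(1, vb, 3, Context3DVertexBufferFormat.FLOAT_2); // uv
    return ib; // per frame: set constants once, then context3D.drawTriangles(ib)
}
```

At 500 quads per batch, 2000 objects cost 2000 draw calls instead of 1,000,000, which is why the sweet spot lands where it does.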

Intel Core i5 650, 3.2 Ghz (dual-core, launched Q1 2010, performance similar to AMD Phenom II X4 810)
Firefox 10, Flash Player 11.1
GeForce GT 430 (in benchmarks the card performs a bit weaker than GeForce 9800 GT)

30 FPS with ~9000 total rendered objects / 6.6 million total rendered triangles
Reaches 60 FPS when it gets below ~600 objects / 400k triangles


Thanks for the tweet (you may regret that in a minute ;-)), I think one of my main questions is around this subject thread. I’ve tried to keep this short and totally failed… sorry!

I’ve just read your post on optimisation and having read around, I’d arrived at the same conclusions about how to make the most of the GPU. Excellent start, we’re on the same page!

What I’m unsure about currently is the best way to handle lots of instances of the same geometry but with different properties and then adding more of them at some future time.

Three examples, all slightly different but homing in on the same problem I think…

Example #1
Rendering a village of 1000 houses that use the same model, never change/animate, and are identical is probably most efficiently done by just pre-compiling all of the triangles into one big fat vertex buffer (assuming all the data would fit) and (excusing the big upload time) getting the GPU to spit out all those triangles in one draw call with no context changes.

However, how would you go about having 1000 zombies running in between the houses? Just thinking about the simplest scenario: all of the basic transformation matrices are going to be different, they might all be at some different animation frame, they’ll all be following their own pathfinding algorithm’s result, and so on. Then what if I arbitrarily decided to add another 100 zombies to the horde?

Example #2
I have a landscape scene and I can just throw all the geometry at the GPU but that’s pretty wasteful in the distance and I’d rather use geometry with a lower level of detail. The advantage so far is that I’ve uploaded one vertex buffer and just left the GPU to it, but if I wanted to start dynamically changing which mountains I’m rendering or replacing them with lower res versions then I’m going to be doing a lot of context switching – right?

Example #3
I decide to write a particle system. I could pre-calculate all the animation, upload the animation frames to the vertex buffer, and that would look good in some scenarios. However, I want to apply collisions to the particles (for instance), and that involves, I guess, either injecting values into the vertex shader or having the CPU handle everything and re-uploading the vertex buffer like Starling does - ouch! No?

I can think of a few ways to achieve these effects but none of them seem to be in the spirit of minimising context changes and draw calls.

Do you know if there’s a preferred/recommended approach to this? Are the context switches just unavoidable? Feel free to point me at a link or something rather than write a big reply if you know of any! 🙂

Many thanks for your time in advance!


Wow monster post!

Will do my best to try and answer your questions, but there will be some speculation involved with a hint of guesswork; fingers crossed it will clarify some of the things you mentioned above.

Right example 1:

1000 houses is pretty steep; with a limit of 65535 verts per buffer, that only leaves you 65 triangles per house... so possible, but not much intricate detail there.

In terms of indices, you are looking at a maximum buffer size of 524287 (about 174762 triangles). But realistically 1000 low-poly houses would probably be something more like 25,000 triangles.

So how best to render them (and this would probably apply even if you had only 100 houses)?
I think you would be right to store them all in one buffer, but that doesn’t mean you have to draw them all in one go; the drawTriangles command can also take a firstIndex and a numTriangles parameter. With that in mind, you could group your houses based on location into, say, 4-10 groups (numbers plucked out of the sky), keeping the houses nearest each other in one group. Then perform some culling on a bounding volume per group and only render the groups on screen. This leads to more draw calls (0-10 based on the numbers above) but no context changes at all. What you gain is the chance to not blindly render the whole lot if they are not even going to be on screen. In some cases (where the triangle count is an issue) this will save you some time, but it would require testing to see if the few added draw calls are worth the saving. Personally, I like the idea of not sending things to be drawn if that’s easily avoidable.
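The grouped partial draws might look something like this (the Group type and its boundsVisible flag are made up for illustration; the frustum test itself happens elsewhere):

```actionscript
// Sketch: all houses share one buffer pair; each group owns a slice of the index
// buffer and is drawn only if its bounding volume passed the visibility test.
function renderHouseGroups(context3D:Context3D, ib:IndexBuffer3D, groups:Vector.<Group>):void
{
    for each (var g:Group in groups)
    {
        if (!g.boundsVisible) continue; // whole group off screen: skip the draw call
        // one extra draw call per visible group, but zero context changes
        context3D.drawTriangles(ib, g.firstIndex, g.numTriangles);
    }
}
```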

1000 moving zombies... OK, this is trickier, but I think the solution would be to first upload your zombie mesh, then make one draw call for each one, updating any constants as you need (matrix information, colours etc.). Trying to modify buffers on the fly is something to avoid like the plague, so let the GPU do the work on this one. It would also allow you to very easily add zombies with almost no overhead, as no buffers need to change - only more draw calls.
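The one-mesh, many-draw-calls idea could be sketched as follows (the Zombie type and its transform field are assumed for illustration); the buffers are set once outside the loop, so each instance only costs a constant upload and a draw:

```actionscript
// Sketch: render many instances of one uploaded mesh, varying only shader constants.
function renderZombies(context3D:Context3D, ib:IndexBuffer3D,
                       zombies:Vector.<Zombie>, viewProj:Matrix3D):void
{
    var mvp:Matrix3D = new Matrix3D();
    for each (var z:Zombie in zombies)
    {
        mvp.copyFrom(z.transform); // this zombie's world matrix
        mvp.append(viewProj);      // world -> clip
        context3D.setProgramConstantsFromMatrix(Context3DProgramType.VERTEX, 0, mvp, true);
        context3D.drawTriangles(ib); // same buffers every time: no context change
    }
}
```

Adding another 100 zombies to the horde is then just 100 more loop iterations; nothing is re-uploaded.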

Example 2:

I’m not going to dwell on this one too much, but in essence yes: you want to avoid rendering faraway stuff in great detail. I suggest looking into geomipmaps/geoclipmaps; there are a lot of references around, but I haven’t tackled it yet.
Proof it is possible in flash though:
Maybe track down this guy?!?

Example 3:

Dynamic particles are harder in Stage3D as we cannot read textures in the vertex shader 🙁 so all the really cool stuff is out of reach. Again, I stress: avoid messing with the buffers on the CPU. Flash is slow at the best of times, so doing everything possible on the GPU is the rule here. With particles you can upload speeds and directions as attributes, but any dynamic behaviour will have to come from nifty tricks and formulas in the shaders. So proper collisions will be almost impossible, but you could emulate simple collisions with your shader code.
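As a concrete (hypothetical) example of the attribute trick just described: give each particle a velocity in a vertex attribute and feed in only a time value as a constant, so the buffers are never touched after upload. In AGAL mini assembler the vertex program could look like:

```actionscript
// Sketch: va0 = start position, va1 = velocity, vc0.x = elapsed time,
// vc1..vc4 = view-projection matrix. Position is animated entirely on the GPU.
var vertexSrc:String =
    "mul vt0, va1, vc0.xxxx \n" + // velocity * time
    "add vt0, vt0, va0      \n" + // start position + displacement
    "m44 op, vt0, vc1       \n";  // project to clip space
// per frame on the CPU, only one constant changes:
// context3D.setProgramConstantsFromVector(Context3DProgramType.VERTEX, 0,
//     Vector.<Number>([time, 0, 0, 0]));
```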
Here’s a link to Simo’s blog; he has a lot of info on particle rendering:

Hope that makes some sort of sense and helps clarify some of your questions; it was written in a mad rush so it might be a bit dodgy, but I’ll have a read over it later and correct anything.


Hey Ben,

Thanks for the good solid reply there!

Sorry, I tend to be overly dramatic; yes, 1000 houses is a lot 🙂

I’d pretty much arrived at the same conclusions, having been reading around for weeks but it’s always better to chat with someone about it I find. I was worried that I was missing some obviously better strategy in my esoteric trawling of the web but it seems not.

Very difficult to find people to discuss this stuff with: Most Stage3D dabblers seem to be either doing small tech demos that don’t scale up, using a third party engine (no use to me at all since I’m building my own engine for my own amusement), or are so far ahead that they don’t have time to talk down to people like me in the early / mid stages of building something substantial and asking meatier questions than “what’s a vertex buffer then?”.

I’ve been really focused on getting the context changes to a bare minimum, but I think I need to start sensibly trading in more draw calls to get the scene flexibility I want/need.

Hmmmm… more food for thought.

Many thanks though, you’ve reassured me that I’m not insane and that’s always nice to hear!

All the best,


No worries, I feel your pain though! I had no background in 3D/GPU programming, so it’s all self-taught (and that was a LOT of trial and error / having to answer most of my own questions... a painful process indeed). So I’m always happy to discuss with folks in similar situations.

Feel free to ask questions regarding engine structure, as I have been working on b3d for about 2 years now with quite a few iterations, and it’s pretty speedy whilst remaining flexible at the same time. (It’s midway through an overhaul at the moment, mind you.)

Shame there isn’t a good forum (not necessarily an actual ‘forum’) of 3d flash devs who one can post questions/ideas to, in order to get some educated feedback/help or just general discussion.

Good luck with your endeavours anyhow!!


I was the same. I wrote a simple fixed-function software 3D engine over the course of a number of years (all the maths, from scratch, took me ages!) and I’m now converting it to a hardware programmable pipeline. So the good news is I’ve already battled through the basics of how fundamentally all this stuff works; now I’m just wrapping my head around how best to use the hardware / programmable pipeline.

How do you feel about dynamically built shaders? I’ve got a few simple shaders on the go now but they’re all hardcoded in AGAL (mini assembler) and I keep thinking: Should I bother to make these dynamically generated?

The pairing between the vertex shaders and the fragment shaders makes that a little scary because you’d have to manage the inputs and outputs. But I keep thinking: if the guy (ha! like anyone will ever use this thing :-P) doesn’t want fog, why am I making the shader do all the fog calculations only to decide it’s a value of 0?

Also, which AGAL assembler do you favour? As mentioned, I’m using the mini version. That’s because I’ve always wanted to dabble in some simple assembly. I’ve done that; it’s initially fun but ultimately painful, and I’m thinking about changing to something else! 🙂

With respect to shaders this is how my engine works:

I have a base material class that contains render state information such as culling, blending etc., as well as the shader program, comprising the fragment and vertex parts.

The material system allows you to write your shaders per material, but also to build them out of various parts, combining them automatically.


var material:Material = new Material("my material");
material.add(new BasePass());
material.add(new TexturePass(texture));
material.add(new DiffuseLightingPass(light), BlendMode.multiply);
material.add(new SpecularLightingPass(light), BlendMode.add);
material.add(new FogPass(0xFFFFFF));

The base pass is the one that transforms the vertex into world space then clip space, ready to be rendered, but it can still be modified by any following pass.

So each pass has an optional ability to modify the positions/normals or whatever and pass them to its relevant fragment shader, and each fragment shader gets the current colour as an input from the one before, where applicable. Not sure if this makes any sense (I think it warrants a blog post to explain it properly), but it allows anyone a really easy way to build up complex materials from simple parts.

Currently I have quite a few effects, from lighting to normal mapping to detail mapping to fog to reflections and refractions to simple colours and texture stuff, plus loads more, so there are already many, many combinations that you can make thousands of materials out of if you wanted. All the fragment parts support blending in the shader automatically, so the same effect can look different depending on the blending you want. It’s a pretty neat system, but I am in the process of making it optional to build them like that; previously it was the only way.
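Purely to illustrate the shape of such a system (all names and the blending scheme here are guessed, not b3d's actual code): each pass emits a fragment of AGAL that writes its colour into a temp register, and the material concatenates the parts, folding each result into a running colour in-shader.

```actionscript
// Sketch of a pass-combining material. Each pass writes its colour to ft1;
// the running colour accumulates in ft0 according to the pass's blend mode.
class Pass
{
    public var blend:String = "add"; // "add" or "multiply" in this toy version
    public function fragmentAGAL():String { return ""; } // subclasses emit AGAL here
}

class ComposedMaterial
{
    private var passes:Vector.<Pass> = new Vector.<Pass>();

    public function add(pass:Pass, blend:String = null):void
    {
        if (blend) pass.blend = blend;
        passes.push(pass);
    }

    public function buildFragmentSource():String
    {
        var src:String = "mov ft0, fc0 \n"; // start from a base colour constant
        for each (var p:Pass in passes)
        {
            src += p.fragmentAGAL(); // pass leaves its colour in ft1
            src += (p.blend == "multiply") ? "mul ft0, ft0, ft1 \n"
                                           : "add ft0, ft0, ft1 \n";
        }
        return src + "mov oc, ft0 \n"; // final colour out
    }
}
```

The appeal of blending in-shader like this is that the whole stack still compiles to one program and one draw call per object.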

So far I am sticking to the mini assembler as well, although I am a big fan of Minko’s AS3 shaders; they look like a pretty neat way to build ’em.

Erm, went on a bit of a tangent there, but my advice is to try to keep your system simple but easy to get hardcore with if the need arises.

If you want more intricate details then I’ll deffo write a post about it, as there are a couple of nifty tricks that could be useful and some that help speed things up. Also, I’m sure someone could point out ways to improve it!

Good stuff! I’ve just discovered blend mode in the context docs this morning; I had been wondering how you’d do passes without rendering to BitmapDatas (yuk!). Passes (as you’re doing) make the whole process simpler, so I think that’s my next trick. I’m also wondering (in a similar line of thought) whether I should look at deferred rendering to sort out my lighting issues, but that again sounds like outputting images and then re-evaluating - sounds expensive... more thought required there I think 🙂



You could go deferred for sure, but because we don’t have proper support for multiple render targets it’s going to require a complete pass over every object for each buffer you want, i.e. one pass of everything for depth, one for normals, one for colour, one for specular... and so on. That said, you can of course combine multiple outputs into one buffer, making use of the available channels, but it’s still going to be very memory-taxing and a problem for slower graphics cards... definitely doable though!
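A sketch of that multi-pass layout, with renderScene/renderLighting and the programs as placeholders: one full scene pass per G-buffer into its own render texture, then a final composite to the back buffer.

```actionscript
// Sketch: emulate MRT-less deferred shading with one scene pass per buffer.
function renderDeferred(context3D:Context3D, depthTex:Texture, normalTex:Texture):void
{
    context3D.setRenderToTexture(depthTex, true);  // pass 1: depth
    context3D.clear();
    renderScene(depthProgram);

    context3D.setRenderToTexture(normalTex, true); // pass 2: normals
    context3D.clear();
    renderScene(normalProgram);

    context3D.setRenderToBackBuffer();             // final pass: light + composite
    context3D.clear();
    renderLighting(depthTex, normalTex);
}
```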

Quite an old machine (about 10x cheaper than any iMac anyway 😉 ); I managed to capture the output when FPS dropped to 30 with about 85% of the scene rendered in my viewport.

AMD Phenom II 955 x4, ATI Radeon HD 5800

cullingTime: 2
occlusionTime: 0
sortingTime: 0
renderTime: 29
totalTime: 32
totalObjects: 14000
totalTriagles: 10500000
totalRenderedObjects: 11891
totalRenderedTriagles: 8918250 (85%)
totalProgramChanges: 0
Direct :: hardware

Still impressive performance, but one question remains. Did you try affiliate transformations on GPU?

Have done further optimizations since that was uploaded so could probably squeeze a tiny bit more out of it.

“affiliate transformations on GPU”

Do you mean recompose the transformation matrix on the GPU? At the moment I don’t do that, although I have thought about it; it would munch up a fair few instructions (which is why I had swerved it before), but I could test the performance comparison quite easily - might give that a bash next week.
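For the record, the GPU-side recompose being weighed up here might look like this in its simplest form (position only; a full rotation would cost more instructions, which is the trade-off mentioned): instead of composing the world matrix on the CPU, upload a per-object position vector and add it in the vertex shader.

```actionscript
// Sketch: vc0..vc3 = view-projection matrix (shared), vc4 = per-object position.
// Saves the CPU matrix work at the cost of an extra instruction per vertex.
var vertexSrc:String =
    "add vt0, va0, vc4 \n" + // translate the model-space vertex by the object position
    "m44 op, vt0, vc0  \n";  // project to clip space
// per object: context3D.setProgramConstantsFromVector(Context3DProgramType.VERTEX, 4,
//     Vector.<Number>([obj.x, obj.y, obj.z, 0]));
```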
