Categories
3d speed

Frustum Culling – on steroids!

Am not going to really explain how frustum culling works here or how to set it up, there are already some good articles out there will link to a couple of Actionscript ones at the end. This article will explain how to use caching to very easily boost the speed of your frustum culling. As far as testing goes I haven’t yet seen anything as fast as this so I thought I would share it so anyone else can benefit from it or maybe some bright spark can find a way to improve it… there is always room for improvement where flash is concerned 🙂

Just a super quick reasoning why frustum culling is a good thing if done properly:

If the sphere’s position is the same (i.e. a pointer to) the object that it describes position then NO transformations are needed to move the sphere into world space. Consider trying to check a bounding box, while it might be a tighter fit (woop), it would require that the frustum be transformed into local space or the box into world space… either way that is gonna be seriously slow for large numbers of objects when compared to the bounding sphere checks. Frustum checking is easy to implement and can save you a vast amount of time, no rendering of off screen objects!

This is what the standard frustum culling code looks like:

//loop thorugh your objects or spheres
for (var i : int = 0; i < length; i++)
{
	var object:Object3D = objects[i];
	object.cullFlag = 1;		//object.visible = true;
 
	var sphere:BoundingSphere = object.sphere;
	var position:Vector3D = sphere.position;
	var radius:Number = sphere.radius;
 
	var plane:Plane3D;
	var distance:Number;
 
	for (var j : int = 0; j < 6; j++)	//6 planes in the frustum (planes.length)
	{
		plane = planes[j];
		distance = plane.a * position.x + plane.b * position.y + plane.c * position.z + plane.d;
		if (distance + radius < 0.0){
			object.cullFlag = -1;		//object.visible = false;
			break;
		}
	}
}

What the above code is doing is looping through all of you objects or bounding spheres and for each one it projects its position onto each plane that makes up the frustum. If any of the tests show the sphere to be outside of the frustum then it can be marked as not visible and then it will not need to be rendered. As soon as it fails on one plane you can move on to test the next object/sphere but if it passes you will need to go on to check against the next plane until you are sure that it passes all test. (You can see why this is inefficient for objects IN the frustum as 6 checks are required just to confirm that where as the best case is when you can discard an object on the first plane check*)

Now this could be sped up a little bit by unrolling the inner loop but it will only make a very small difference in speed, but it will help.

Introducing the “cache” variable into the mix! Okay here’s where the huge speed up can be gained. In the vast majority of cases, if an object is culled by a frustum plane in one frame and is still culled in the next frame the likelihood that it was culled by the same plane is extremely high! With that in mind if we store the plane that the sphere failed the test on and then test against that one first, for a large number of cases we can eradicate a number of unnecessary plane tests.

Bring on the updated code:

//loop thorugh your objects or spheres
for (var i : int = 0; i < length; i++)
{
	var object:Object3D = objects[i];
	object.cullFlag = 1;		//object.visible = true;
 
	var sphere:BoundingSphere = object.sphere;
	var position:Vector3D = sphere.position;
	var radius:Number = sphere.radius;
 
	var plane:Plane3D;
	var distance:Number;
	var cache:int = sphere.cache;		//the id of the last plane failed on (-1 if none)
 
	if(cache != -1)
	{
		plane = planes[cache];
		distance = plane.a * position.x + plane.b * position.y + plane.c * position.z + plane.d;
		if (distance + radius < 0.0)
		{
			object.cullFlag = -1;	//object.visible = false;
			continue;			//objtect still culled so move to next object
		}
	}
	for (var j : int = 0; j < 6; j++)	//6 planes in the frustum (planes.length)
	{
		if(j == cache) continue;
		plane = planes[j];
		distance = plane.a * position.x + plane.b * position.y + plane.c * position.z + plane.d;
		if (distance + radius < 0.0){
			object.cullFlag = -1;	//object.visible = false;
			sphere.cache = j;	//set the cache to match the current plane
			break;
		}
	}
	//if it's got this far then the object is not culled so the cache can be reset
	sphere.cache = -1;
}

This time round the code checks to see if a plane has been cached and if so then it checks that one first. This means in the best case (a stationary camera with a stationary scene) all the objects, once they have been culled once already will be culled again with only 1 plane test instead of a maximum of 6! Muchos muchos faster!

There are still things that can be optimised such as not testing the cached plane again in the second section if it passes the check in the first section. You could also unroll all the loops to remove any array/vector lookups for a speed up too.

The method can also easily be extend to check for intersections as well a 1/0, true/false or hit/not hit result. Allowing for other checks such as a bounding box or something more tightly fitting.

I have found this approach to be the fastest yet and my implementation can chew through 10,000 objects in under 1 ms without a problem and this is without any bounding volume hierarchies. (Most flash games will probably not even have that many objects in them anyway but still good to know).

Any questions at all then do ask! Or if I have made any booboos then do let me know.

couple o links:

http://blogs.aerys.in/jeanmarc-leroux/2009/12/13/frustum-culling-in-flash-10/ <– nice code (lacking optimisations but I am sure it is just to show the implementation)
http://jacksondunstan.com/articles/1811 <– includes some camera code to so you can get up and running
http://www.flipcode.com/archives/Frustum_Culling.shtml
http://www.crownandcutlass.com/features/technicaldetails/frustum.html

 

*With that in mind maybe one should check the plane most likely to fail most objects, probably the near and far planes, but not necessarily. You could easy count the the number of culls against each plane as you navigate around your scene and get an average to work out which one is the best culler and check that one first… it will depend on your needs though and could be total overkill, but hey this is flash right and flash IS slow, so every little helps!

Categories
3d speed

Stage3D optimisation

Having been playing with Stage3D for a while now, I though I would write a small piece on optimisation.

With great power comes great responsibility!

Stage3D give you GPU access, which can expose some serious rendering horsepower, but if you don’t treat it with respect your going to find you run into limitations pretty quick!

So what follows is a rough (very) guide on how to squeeze the most out of the new 3D apis.

Rule 1 (of 1)
CPU’s are fast, GPU’s are faster, communication between the two however is probably the biggest bottleneck you will face.
Therefore: Reduce this wherever possible.
This means minimise calls to the following Context3D functions;
Context3D.drawTriangles();
Context3D.setProgram();
Context3D.setProgramConstants…();
(actually most of Context3D’s functions but the above are the real doozies)

drawTriangles()
GPU’s can draw triangles fast, and lots of them, millions of them every second without breaking into a sweat.
So you might think, “I can call drawTriangles() 50,000* times no sweat as long as I am only drawing a few triangles in each call.”.
WRONG!! This command is a mighty expensive one so use it very wisely!
How: When you call drawTriangles you pass it a vector of ids that, per three, represent one triangle. Given that this call is expensive it then makes sense that you pass it as many triangles as possible in one call. Sadly this doesn’t quite mean you can just group your geometry into big chunks as you cannot change state (alter the material or any parameters) during this call meaning everything that is sent through will be rendered with the same program and set of constants. It does mean however that static (non moving elements) that share the same program/material, can be combined into one list. Things such as trees, grass and any other repeatable geometry are good candidates for this. You can do this for dynamic geometry also but it gets complex as you have to upload transformation data in a separate buffer, this is one way particle systems can be created. The downside is that each time there is a change to any of the objects the whole buffer will need to be re-uploaded. It is also vital that you only try and draw objects that will be seen on screen, so don’t draw that which is out of view of the camera -> frustum culling saves the day.

side note:
Even high end games rarely want to be issuing more than 1000-2000 draw calls, but the likes of battlefield 3 can get up-to the 3000 mark in some of the environments. Newer consoles however, can issue over 10,000 draw calls and do it much faster due to a more direct access to the hardware.

setProgram()
Assuming you have now done everything in your power to reduce the number of draw calls you issue, next thing to look at is state changes (changing the current program).
Changing state on the GPU might seem like a trivial thing but it is actually something you want to keep to a minimum to be able to squeeze the most out of your graphics card.
There are a few things you can do here to reduce this problem.
1. Group the objects that require drawing by their material/program! For example suppose you had 100 cubes, 50 of them with one material and 50 of them with another. Now if you had a list containing all of those cubes and blindly sent them to be rendered you could end up having to change state a large number of times. If it so happened that each cube in the list had a different material to the object before it, then the program will have to be updated for every draw call. Not good. If that list however was sorted so that

– even if you are only drawing one triangle with this call there

*there is an actual limit of 32,768 drawTriangles() calls per present() call.

setProgramConstants…()
This function is what allows us to upload constants to the gpu. It is how we upload our matrices and any float1/2/3/4..s that we want to utilise in our shaders. While it may not be a huge bottleneck it still has a noticeable impact on performance in my experiments.
So how to optimise?
Any constant that is likely to be reused by different materials, then upload only once per frame not once per object rendered. So what are the likely culprits?
The view projection matrix! This is 16 float values that will not change between objects so it makes sense to upload it once! 16 numbers vs 16,000 for a 1000 objects and 999 less calls to setProgramConstants, and that is a good thing!
The same applies to anything else that will not vary between objects, camera and light positions or common numbers used in shaders (0,0.5,1,-1..).
What this shows is that it is important to have some sort of system to manage uploads so you can keep track of what is already uploaded and only upload data that isn’t already there!

side note:
This also translates into how you write shaders, knowing that each constant requires an upload should make you rethink sometimes about how to achieve something whilst using minimal constants, take unpacking a normal from a texture. Usually you would multiple the value from the texture by 2 then subtract one (2 floats required) but the same result can also be achieved with a subtract by 0.5 then a divide 0.5 (1 float required). Perhaps not the best example but I am sure you get the idea. REUSE is your friend!

—————————————–

While I only focused on 3 methods of the context3d, almost all of them will incur some penalty but those highlighted are the ones I have found to be a more serious problem.

Quick additions:
Drawing to a bitmapdata from the gpu is slow, so if you have to do it, ensure it is at a small a resolution as possible, in theory a 1×1 pixel readback should be big enough for picking!
Resizing the back buffer is slow!
Creating textures is slow, don’t do it on the fly. Pre allocate if possible then pick from a pool (more relevant for post processing).

At some point in the future, I hope to write some test examples that highlight the cost of the functions mentioned above (have already done quite a few but they have been tied into other things rather than dedicated standalone tests).

I hope all of that makes some form of sense to someone 🙂 If you have any additions, corrections or questions… fire away. Will probably update this from time to time to add in more that I have missed out, it’s a broad area with many possible optimisations!

 

Super Quick Summary:

REDUCE DRAW CALLS! Group items where possible and use culling to ensure you are only drawing what is neccessary.

REUSE MATERIALS and BATCH RENDERING by material if you can.

UPLOAD THE MINIMUM NUMBER OF CONSTANTS you can get away with 🙂

 

Categories
speed

Object creation

I was busy comparing my custom Vector3D class with the flash.geom.Vector3D one when I noticed something. While my classes seemed to perform almost as well as the flash version there was one method where my implementation was much much slower.

This was the cross product method. The only thing different in this method was that it returns a new instance of the Vector3D class. So how was flash able to instantiate its own class much faster than I could?

How they did it I don’t know, but I did some tests and it would seem that setting variable values in the constructor is slow when you do it in your own classes. i.e.

var point3D:Point3D = new Point3D(x,y,z);

Funnily enough the above is a slower in my tests than:

var point3D:Point3D = new Point3D().init(x,y,z);

The init function is simply a direct replacement of what was in the constructor:

public function init(x:Number, y:Number, z:Number):Point3D
{
	this.x = x;
	this.y = y;
	this.z = z;
	return this;
}

Note that the init method returns the instance too and a speed up is only gained if the constructor is empty.

I think speed up is only noticed in the release version of the player.

Click the button below to try it out: (Point3DX is the one using init)

[swfobj src=”http://blog.bwhiting.co.uk/wp-content/uploads/2010/10/CreationTest1.swf” height=”400″ width=”400″]

I find the init method to be anywhere up to 2 times faster averaging 1.5 times speedup.

Its more marked on larger numbers but I don’t want to explode anyone’s machine now do I.

Also note that it’s a teeeny bit faster again (in some circumstances) if you set the properties directly on the object once it is created as a 3rd alternative but its not always possible (private attributes) and its more lines of code 🙁