Another quick demo of b2dLite, 100,000 animated sprites anyone?

October 21st, 2014

Nothing exciting, but I made this a while back and thought I would share it. It is a simplistic demo, but good at showcasing what Flash is capable of.

On a decent machine I think it would run at 60 fps.
Mine starts to slow around the 75,000 mark.

So the demo will display up to 100,000 animated sprites on screen depending how brave you are.
It is mouse controlled: the further RIGHT you move the mouse the more sprites it will draw, and the further DOWN you move it the bigger the sprites will be rendered.

This allows you to try out a bit of variety.
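The mouse mapping described above could look something like the sketch below. It is written in TypeScript rather than AS3 so it can be run outside the player. The 100,000 cap comes from the demo; the size range and the function name are made up for illustration, as the demo source is not shown here.

```typescript
// Hypothetical mapping from mouse position to demo settings.
// MAX_SPRITES comes from the post; the size range is an assumption.
const MAX_SPRITES = 100000;
const MIN_SIZE = 4;  // assumed smallest sprite size in pixels
const MAX_SIZE = 64; // assumed largest sprite size in pixels

function clamp01(t: number): number {
  return Math.min(1, Math.max(0, t));
}

function spriteSettings(
  mouseX: number,
  mouseY: number,
  stageWidth: number,
  stageHeight: number
): { count: number; size: number } {
  const tx = clamp01(mouseX / stageWidth);  // 0 at the left edge, 1 at the right
  const ty = clamp01(mouseY / stageHeight); // 0 at the top edge, 1 at the bottom
  return {
    count: Math.round(tx * MAX_SPRITES),
    size: MIN_SIZE + ty * (MAX_SIZE - MIN_SIZE),
  };
}
```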

DEMO HERE (flash player 15 needed for this one)

I added the number of sprites drawn to the stats panel (sadly the stats cost a bit of performance and leak memory too, but not everyone will be running Scout, so in it goes).

So see how it runs on your machine and let me know :-)
Do remember that this is not a final product or a complete lib or anything like that, it is just a showcase of what you can achieve with flash, and quite easily at that. (The whole demo could be condensed into 300-400 lines of easy to follow code including all the Stage3D bits).

Sorry if any machines explode while trying to run it!

Also will try to start posting in here more frequently again soon.

(edit: video added, quality isn’t great but hey ho – best in “hd”)

Introducing b2dLite

March 17th, 2014

update 27/03/2014

Latest version is on github, will add a few extra samples to go along with the source code later today.

Still have to move some more math onto the gpu but it’s working well at the moment :). I need to do some more testing, as the more that I shift to the gpu the smaller the batches become and the draw call savings get lessened :(

update 18/03/2014:

Latest version is a bit faster handling 50,000 objects in under 10 ms no sweat (about 6-7 ms on my upper-mid range machine).

Added the use of asc2 fast bytearrays and it speeds it up a little, but not a great deal so will leave that out I think to maximise compatibility.

Just want to move a bit more math onto the GPU (still have a few multiplications and divisions that can be moved over to save a few cycles).

Once that is done I will update on github. Next step is putting rotation back in and maybe an alpha multiplier.

————————-

You can find it on github here:

https://github.com/bwhiting/b2dLite

Current features:

drawing textured quads of a given position and size, with optional texture offsets and scales

Future features:

add rotation – reduces performance as it will add to the number of registers required per quad, 3 rather than 2 (but all math done on gpu so no big deal)

(knowing if there is demand for this will be useful)

add more texture methods, i.e. auto mip map generation and non power of 2 size fixing function
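To put rough numbers on the registers-per-quad trade-off, here is a small TypeScript sketch. It assumes the quads are batched through vertex constant registers (which the "registers per quad" wording suggests) and uses the AGAL 1 limit of 128 vertex constants. The 4 reserved registers (say, one matrix) are a guess and not taken from the b2dLite source.

```typescript
// Rough batch-size arithmetic for constant-register batching.
// 128 is the AGAL 1 vertex constant limit; the 4 reserved registers
// (e.g. one 4x4 matrix) are an assumption, not taken from b2dLite.
const VERTEX_CONSTANT_REGISTERS = 128;

function quadsPerBatch(registersPerQuad: number, reservedRegisters: number = 4): number {
  return Math.floor((VERTEX_CONSTANT_REGISTERS - reservedRegisters) / registersPerQuad);
}

function drawCallsFor(quadCount: number, registersPerQuad: number): number {
  return Math.ceil(quadCount / quadsPerBatch(registersPerQuad));
}
```

Under those assumptions, going from 2 to 3 registers per quad drops a batch from 62 quads to 41, so the same scene needs roughly half as many draw calls again.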

Potential features:

Colourization, and other advanced effects.

 

Please give it a test drive and feedback on anything that works well, or doesn’t, anything missing?

Will keep this updated as I change things. Will also add a few demos when time permits.

First demo:

http://bwhiting.co.uk/b3d/b2dLite/ <– 10,000 objects

Another demo courtesy of Peter Strømberg:

http://www.videometry.net/AGAL/bw/bw.html

 

Vector graphics on the GPU with Stage3D

January 10th, 2013

Hey guys and gals,

Started playing with vectors on the GPU last year but it sort of ground to a halt really. I didn’t like the ugly aliasing and let’s face it, vectors should look smooth as a peach. Anyway, skip forward until flash player 11.6 beta was released supporting new shader op codes (namely ddx and ddy). Woo, now I had the ability to add in that sweet sweet anti aliasing by leveraging the screen space derivatives!
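For anyone wondering how screen space derivatives give you anti aliasing on a curve, here is a CPU sketch in TypeScript of the commonly used Loop-Blinn style test for a quadratic bezier. It is not the actual shader from this project (that is not shown in this post). The four derivative parameters stand in for what ddx and ddy would return for the interpolated u and v on the GPU.

```typescript
// Coverage for one pixel of a quadratic bezier drawn with the implicit
// function f(u, v) = u*u - v, where f < 0 is treated as inside.
// dudx, dvdx, dudy, dvdy stand in for ddx/ddy of u and v on the GPU.
function curveCoverage(
  u: number,
  v: number,
  dudx: number,
  dvdx: number,
  dudy: number,
  dvdy: number
): number {
  const f = u * u - v;
  // chain rule: gradient of f in screen space
  const fx = 2 * u * dudx - dvdx;
  const fy = 2 * u * dudy - dvdy;
  const gradientLength = Math.sqrt(fx * fx + fy * fy);
  if (gradientLength === 0) return f < 0 ? 1 : 0;
  // approximate signed distance to the curve, measured in pixels
  const signedDistance = f / gradientLength;
  // 1 well inside, 0 well outside, a one pixel wide ramp in between
  return Math.min(1, Math.max(0, 0.5 - signedDistance));
}
```

The result can be used directly as the alpha of the fragment, which is what softens the edge.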

Cool so now that I was able to render filled triangles and bezier curves (and soooooo close to cubic curves too – need help on this one though), I thought it should be easy to tie it in to the new readGraphicsData command as seen here:
Query Graphics Data

Sadly not quite as straightforward as you might think!
The readGraphicsData method returns all the information to serialize the graphics of any object with graphics in it in flash! But there are still a few difficulties with parsing this data for use on the GPU.

Here are some of the issues:

  • Determining if a bezier curve is concave or convex (this depends on what side of the closed region the curve is on)


(of the two curves on the left shape one is concave and one is convex, blue and red respectively, and the shader needs tweaking depending on which one it is)

  • Determining if two curves overlap (i.e. in long sweeping thin curves… in this case the curve may need to be broken down into smaller curves)
  • Gradient fills! A couple of problems here: one is reversing the gradient matrix and the other is replicating it on the GPU. Simple gradient fills are possible but can get complex easily as you start adding more than two colours (maybe at this point a 1×256 pixel texture could be used as a lookup). That said I have not got round to this yet so it might not be too bad.
  • TRIANGULATION!!!!!! This is the real problem here… (at least to me it is). So from the output of the readGraphicsData we have extracted all the curves and along the way we have collected a series of points. These points make up the triangles that we use to fill the solid sections of the shape. Things get tricky however because these points do NOT automatically make up a nice sequence of triangles, you will get overlapping issues and a whole host of other problems. So this is where the triangulation comes in. At first I tried Delaunay triangulation but it was too greedy making triangles outside of the actual shape, so no good. Then I tried some ear clipping examples that I found but only 1 of the 3 I tried kind of worked and I say “kind of” because  it goes into a number of infinite loops that I had to hardcode exceptions for :( and as such it misses a few triangles. (Also every now and again it would reject a complete path for no apparent reason). Not only that but the ear clipping algorithm is SLOW and doesn’t scale very well.
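On the first issue in the list (concave versus convex), one way to approach it is to compare which side of the chord the control point sits on against the winding of the whole path. The TypeScript sketch below is only an illustration under some assumptions: the path is closed and does not self-intersect, and the winding is estimated from the anchor points alone. All names are made up for the example.

```typescript
type Point = { x: number; y: number };

// Shoelace formula over the anchor points: the sign gives the winding
// direction of the closed path.
function signedArea(path: Point[]): number {
  let sum = 0;
  for (let i = 0; i < path.length; i++) {
    const p = path[i];
    const q = path[(i + 1) % path.length];
    sum += p.x * q.y - q.x * p.y;
  }
  return sum / 2;
}

// a0 and a1 are the curve's anchors in path order, control is its control point.
function classifyCurve(
  a0: Point,
  control: Point,
  a1: Point,
  pathArea: number
): "convex" | "concave" | "straight" {
  // cross product of the chord with the vector to the control point
  const cross = (a1.x - a0.x) * (control.y - a0.y) - (a1.y - a0.y) * (control.x - a0.x);
  if (cross === 0 || pathArea === 0) return "straight";
  // control point on the outside of the shape means the curve bulges outwards
  return cross * pathArea < 0 ? "convex" : "concave";
}
```

Because both the cross product and the area flip sign together when the y axis is flipped, the same test should hold in Flash's y-down coordinates.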

True vector graphics are great, they are something we love about the platform and will miss when the next major release of actionscript comes about. This is the reason why I thought it would be great if we could emulate it on the GPU, and I am sure it’s possible. It would however be 100 times easier if Adobe could extend their api to expose the result of the internal triangulation that they have already implemented in the player. That way they are still leaving it up to us to handle the rendering but we won’t have to spend an AGE trying to do work that they already have done a long time ago.

 

Anyway am bored of typing so I will post a small demo video (best viewed @ high resolution + full screen):

..Interactive demo coming soon.

 

 

If anyone wants to know more, discuss the topic, contribute, point me at an awesome triangulation library for as3, or anything else.. just drop me a message or reply to this post.

Related link(s) of justice:
http://www.bytearray.org/?p=5013

 

TODOs:

Use edge lists to enable anti aliasing on straight edges of standard (non curve) triangles.

Use edges to also determine if curve is convex or concave.

Look into cubic curve to quadratic bezier curve conversion

 

UPDATE:

In an ideal world I would like this to become a small Open Source project. One that is not geared to any specific engine or renderer, just a simple tool that can be used to generate the required data from any flash display object. If anyone is interested get in touch, I will be more than happy to share the code once it’s a bit more optimized and would love to see this become something useful not just for the developers of a particular engine but for all flash devs :-)

3D Procedural Geometry using revolutions

August 9th, 2012

Many of the basic primitives we utilize in 3D can be constructed procedurally. All the way from simple planes to complex shapes.

Today I thought I would really quickly share something I use a lot – a revolution mesh. The idea has been around for ages and can be found in most 3D packages, e.g. the lathe modifier in 3DSMax, and it’s in as3 3D libs like Away3D I think. The good thing about this technique is that it can be used to produce a number of different shapes all with one function (which makes for a smaller code base as well as flexibility).

The idea is simple, pass in a list of 2D points and rotate them about an axis for any given number of times and then construct a solid mesh out of the result.

First up, a demo:

link to standalone version: here

Okay so not very impressive but all of those objects were created in the same way, just revolving a series of points about an axis (in this instance the y/up axis).

So the code to generate them looks like this:

numRevolutions = 25;
revolutions = new Vector.<Object3D>();
 
revolutions.push(generateRevolutionObject(SplineBuilder.generateTorus(5,2,10)));
revolutions.push(generateRevolutionObject(SplineBuilder.generateCircle(5,15)));
revolutions.push(generateRevolutionObject(SplineBuilder.generateCone()));
revolutions.push(generateRevolutionObject(SplineBuilder.generateTube()));
revolutions.push(generateRevolutionObject(SplineBuilder.generateDisc()));
revolutions.push(generateRevolutionObject(SplineBuilder.generateCylinder()));
revolutions.push(generateRevolutionObject(SplineBuilder.generateArc(5, 5, 0, 0.25)));
revolutions.push(generateRevolutionObject(SplineBuilder.generateWave(10,20,5,0.5)));
revolutions.push(generateRevolutionObject(SplineBuilder.generateWaveAbs(10,20,5,0.5)));
revolutions.push(generateRevolutionObject(SplineBuilder.generateSquareWave(10,20,5,0.5)));
revolutions.push(generateRevolutionObject(SplineBuilder.generateLine(5,5,5,-5)));
 
scene.addObjects(revolutions);
 
private function generateRevolutionObject(spline:Vector.<Number>):Object3D
{
	//builds the mesh based on the number of revolutions and the points
	var mesh:RevolutionMesh = new RevolutionMesh(numRevolutions, spline);	
	var object:Object3D = new Object3D().build(mesh, material);
	//gives the object a parent transform so can be rotated about the scene origin
	object.transform.parent = transformer;					
	return object;
}

The magic happens in the SplineBuilder and the RevolutionMesh. The SplineBuilder is just a really simple class with static methods to generate lists of points:

//couple of example functions from SplineBuilder
public static function generateCone(height:Number = 10, radius:Number = 5):Vector.<Number>
{
	var spline:Vector.<Number> = new Vector.<Number>();
	spline.push(0, height, 0);
	spline.push(radius, 0, 0);
	spline.push(0, 0, 0);
	return spline;
}
public static function generateCylinder(height:Number = 10, radius:Number = 5):Vector.<Number>
{
	var spline:Vector.<Number> = new Vector.<Number>();
	spline.push(0,  height/2, 0);
	spline.push(radius, height/2, 0);
	spline.push(radius, -height/2, 0);
	spline.push(0, -height/2, 0);
	return spline;
}
public static function generateTube(height:Number = 10, radius:Number = 5):Vector.<Number>
{
	var spline:Vector.<Number> = new Vector.<Number>();
	spline.push(radius/2,  height/2, 0);
	spline.push(radius, height/2, 0);
	spline.push(radius, -height/2, 0);
	spline.push(radius/2, -height/2, 0);
	spline.push(radius/2,  height/2, 0);
	return spline;
}

See, super simple! You can of course come up with these however you like such as drawn user input or curve interpolations.
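As an example of a profile built from a curve rather than typed in by hand, here is a hypothetical extra generator in the same spirit, sketched in TypeScript. It is not part of the SplineBuilder shown above. It lays out a half circle from the top pole to the bottom pole using the same flat x, y, z layout, so revolving it about the y axis should give a sphere.

```typescript
// Hypothetical extra profile in the style of SplineBuilder: a half circle
// from the top pole to the bottom pole, stored as a flat x, y, z list.
function generateHalfCircle(radius: number = 5, segments: number = 8): number[] {
  const spline: number[] = [];
  for (let i = 0; i <= segments; i++) {
    const angle = (Math.PI * i) / segments; // 0 at the top, PI at the bottom
    spline.push(Math.sin(angle) * radius, Math.cos(angle) * radius, 0);
  }
  return spline;
}
```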

So the final piece is the RevolutionMesh (or lathe or whatever you want to call it). Its job is to take the input, transform it around an axis and store the vertices, then build the uvs and indices. (You can then go on to generate normals and tangents etc…)

private function build():void
{
	if(vertices)		vertices.length = 0;
	else 			vertices = new Vector.<Number>();			
	if(uvs)		uvs.length = 0;
	else 			uvs = new Vector.<Number>();
	if(ids)		ids.length = 0;
	else 			ids = new Vector.<uint>();
	if(normals)		normals.length = 0;
	else 			normals = new Vector.<Number>();	
 
	//doesn't have to revolve the whole way round, _start and _end are values from 0-1 so a start of 0 and end of 0.5 would mean a 0 - 180 degrees of revolution
	var totalAngle:Number = (Math.PI*2)*(_end-_start);
	var angle:Number = totalAngle/_divisions;
	var startAngle:Number = (Math.PI*2)*_start;
 
	for (var i : int = 0; i < _divisions+1; i++)
	{
		var vout:Vector.<Number> = new Vector.<Number>();
		transformer.identity();
		transformer.appendRotation((startAngle*Math3D.RADIANS_TO_DEGREES) + (angle*i*Math3D.RADIANS_TO_DEGREES), _axix);	
		transformer.transformVectors(_spline, vout);
		vertices = vertices.concat(vout);
		//could do the same for normals if supplied ;)
	}
	var splineLength:int = _spline.length/3;
	for (i = 0; i < _divisions+1; i++)
	{
		for (var j:int = 0; j < splineLength; j++)
		{
			uvs.push(i/(_divisions), j/(splineLength-1));
		}
	}
	for (i = 0; i < _divisions; i++)
	{
		for (j = 0; j < splineLength-1; j++)
		{
			var id0:uint = ((i+1)*splineLength)+j;
			var id1:uint = (i*splineLength)+j;
			var id2:uint = ((i+1)*splineLength)+j+1;
			var id3:uint = (i*splineLength)+j+1;
			ids.push(id0, id2, id1);
			ids.push(id3, id1, id2);
		}				
	}			
	//go forth and generate normals etc...
}

It works by first using a Matrix3D to rotate the points and copy them into a vertex list. Next it calculates the uvs based upon each vertex’s index in the original point list and the revolution index (u based on the revolution index and v based on the position in the point list). Once that is done it’s a simple case of assigning the indices!
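If you want to sanity check the counts, here is a straight TypeScript port of just the uv and index loops (the function name is mine). Note that there are divisions + 1 rings of points: the first ring is duplicated at the seam so that u can run all the way from 0 to 1.

```typescript
// TypeScript port of the uv and index loops from build(), for checking counts.
// Expects splineLength >= 2, otherwise the v coordinate divides by zero.
function buildGrid(divisions: number, splineLength: number): { uvs: number[]; ids: number[] } {
  const uvs: number[] = [];
  const ids: number[] = [];
  for (let i = 0; i <= divisions; i++) {
    for (let j = 0; j < splineLength; j++) {
      uvs.push(i / divisions, j / (splineLength - 1));
    }
  }
  for (let i = 0; i < divisions; i++) {
    for (let j = 0; j < splineLength - 1; j++) {
      const id0 = (i + 1) * splineLength + j;
      const id1 = i * splineLength + j;
      const id2 = (i + 1) * splineLength + j + 1;
      const id3 = i * splineLength + j + 1;
      // two triangles per quad between neighbouring rings
      ids.push(id0, id2, id1);
      ids.push(id3, id1, id2);
    }
  }
  return { uvs, ids };
}
```

With 25 divisions and a 3 point profile (the cone, for example) that works out at 78 vertices and 300 indices, i.e. 100 triangles.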

There you have it, some fairly simple code which can be reused to create a crap load of geometry.

Hope that helps someone out, shout if you have any questions or spot any problems etc…

b

Frustum Culling – on steroids!

June 18th, 2012

Am not going to really explain how frustum culling works here or how to set it up; there are already some good articles out there and I will link to a couple of Actionscript ones at the end. This article will explain how to use caching to very easily boost the speed of your frustum culling. As far as testing goes I haven’t yet seen anything as fast as this, so I thought I would share it so anyone else can benefit from it or maybe some bright spark can find a way to improve it… there is always room for improvement where flash is concerned :-)

Just a super quick reasoning why frustum culling is a good thing if done properly:

If the sphere’s position is the same object as (i.e. a pointer to) the position of the object it describes, then NO transformations are needed to move the sphere into world space. Consider trying to check a bounding box: while it might be a tighter fit (woop), it would require that the frustum be transformed into local space or the box into world space. Either way that is going to be seriously slow for large numbers of objects when compared to the bounding sphere checks. Frustum checking is easy to implement and can save you a vast amount of time: no rendering of off screen objects!

This is what the standard frustum culling code looks like:

//loop through your objects or spheres
for (var i : int = 0; i < length; i++)
{
	var object:Object3D = objects[i];
	object.cullFlag = 1;		//object.visible = true;
 
	var sphere:BoundingSphere = object.sphere;
	var position:Vector3D = sphere.position;
	var radius:Number = sphere.radius;
 
	var plane:Plane3D;
	var distance:Number;
 
	for (var j : int = 0; j < 6; j++)	//6 planes in the frustum (planes.length)
	{
		plane = planes[j];
		distance = plane.a * position.x + plane.b * position.y + plane.c * position.z + plane.d;
		if (distance + radius < 0.0){
			object.cullFlag = -1;		//object.visible = false;
			break;
		}
	}
}

What the above code is doing is looping through all of your objects or bounding spheres and, for each one, projecting its position onto each plane that makes up the frustum. If any of the tests show the sphere to be outside of the frustum then it can be marked as not visible and it will not need to be rendered. As soon as it fails on one plane you can move on to test the next object/sphere, but if it passes you will need to go on to check against the next plane until you are sure that it passes all tests. (You can see why this is inefficient for objects IN the frustum, as 6 checks are required just to confirm that, whereas the best case is when you can discard an object on the first plane check*)

Now this could be sped up a little bit by unrolling the inner loop but it will only make a very small difference in speed, but it will help.

Introducing the “cache” variable into the mix! Okay here’s where the huge speed up can be gained. In the vast majority of cases, if an object is culled by a frustum plane in one frame and is still culled in the next frame the likelihood that it was culled by the same plane is extremely high! With that in mind if we store the plane that the sphere failed the test on and then test against that one first, for a large number of cases we can eradicate a number of unnecessary plane tests.

Bring on the updated code:

//loop through your objects or spheres
for (var i : int = 0; i < length; i++)
{
	var object:Object3D = objects[i];
	object.cullFlag = 1;		//object.visible = true;
 
	var sphere:BoundingSphere = object.sphere;
	var position:Vector3D = sphere.position;
	var radius:Number = sphere.radius;
 
	var plane:Plane3D;
	var distance:Number;
	var cache:int = sphere.cache;		//the id of the last plane failed on (-1 if none)
 
	if(cache != -1)
	{
		plane = planes[cache];
		distance = plane.a * position.x + plane.b * position.y + plane.c * position.z + plane.d;
		if (distance + radius < 0.0)
		{
			object.cullFlag = -1;	//object.visible = false;
			continue;			//object still culled so move to next object
		}
	}
	for (var j : int = 0; j < 6; j++)	//6 planes in the frustum (planes.length)
	{
		if(j == cache) continue;
		plane = planes[j];
		distance = plane.a * position.x + plane.b * position.y + plane.c * position.z + plane.d;
		if (distance + radius < 0.0){
			object.cullFlag = -1;	//object.visible = false;
			sphere.cache = j;	//set the cache to match the current plane
			break;
		}
	}
	//only reset the cache if the object passed every plane
	//(the break above also lands here and must not wipe the plane it just stored)
	if(object.cullFlag == 1) sphere.cache = -1;
}

This time round the code checks to see if a plane has been cached and if so then it checks that one first. This means in the best case (a stationary camera with a stationary scene) all the objects, once they have been culled once already will be culled again with only 1 plane test instead of a maximum of 6! Muchos muchos faster!
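To see the saving in numbers, here is a small standalone TypeScript version of the same idea with a counter for the plane tests. The types and names are made up for the sketch, and it assumes planes whose normals point into the frustum (as the distance + radius < 0 test implies).

```typescript
type Plane = { a: number; b: number; c: number; d: number };
type Sphere = { x: number; y: number; z: number; radius: number; cache: number };

function isOutside(p: Plane, s: Sphere): boolean {
  return p.a * s.x + p.b * s.y + p.c * s.z + p.d + s.radius < 0;
}

// Returns true if the sphere is culled. tests.count goes up once per plane test.
function cullSphere(planes: Plane[], s: Sphere, tests: { count: number }): boolean {
  const cached = s.cache;
  if (cached !== -1) {
    tests.count++;
    if (isOutside(planes[cached], s)) return true; // still behind the cached plane
  }
  for (let j = 0; j < planes.length; j++) {
    if (j === cached) continue; // already tested above
    tests.count++;
    if (isOutside(planes[j], s)) {
      s.cache = j; // remember which plane did the culling
      return true;
    }
  }
  s.cache = -1; // passed every plane, nothing to remember
  return false;
}
```

A sphere that gets culled by the last plane costs 6 tests the first time round and only 1 on every following frame, for as long as it stays behind that plane.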

There are still things that can be optimised. The second section already skips the cached plane if it passed the check in the first section, but you could also unroll all the loops to remove any array/vector lookups for a further speed up.

The method can also easily be extended to check for intersections as well as a 1/0, true/false or hit/not hit result, allowing for other checks such as a bounding box or something more tightly fitting.
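A three state version might look something like this TypeScript sketch (names are illustrative, and the planes are again assumed to be normalised with normals pointing inwards). Anything that comes back as "intersecting" is a candidate for a second, tighter test, while "inside" objects can skip it.

```typescript
type Plane = { a: number; b: number; c: number; d: number };
type CullResult = "outside" | "intersecting" | "inside";

function classifySphere(planes: Plane[], x: number, y: number, z: number, radius: number): CullResult {
  let result: CullResult = "inside";
  for (const p of planes) {
    const distance = p.a * x + p.b * y + p.c * z + p.d;
    if (distance < -radius) return "outside";       // fully behind this plane
    if (distance < radius) result = "intersecting"; // straddles this plane
  }
  return result;
}
```

Like the plain sphere test it is conservative: a sphere just off a corner of the frustum can be reported as intersecting even though it is not actually visible.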

I have found this approach to be the fastest yet and my implementation can chew through 10,000 objects in under 1 ms without a problem and this is without any bounding volume hierarchies. (Most flash games will probably not even have that many objects in them anyway but still good to know).

Any questions at all then do ask! Or if I have made any booboos then do let me know.

couple o links:

http://blogs.aerys.in/jeanmarc-leroux/2009/12/13/frustum-culling-in-flash-10/ <– nice code (lacking optimisations but I am sure it is just to show the implementation)
http://jacksondunstan.com/articles/1811 <– includes some camera code too so you can get up and running
http://www.flipcode.com/archives/Frustum_Culling.shtml
http://www.crownandcutlass.com/features/technicaldetails/frustum.html

 

*With that in mind maybe one should first check the plane most likely to fail most objects, probably the near and far planes, but not necessarily. You could easily count the number of culls against each plane as you navigate around your scene and get an average to work out which one is the best culler and check that one first… it will depend on your needs though and could be total overkill, but hey this is flash right and flash IS slow, so every little helps!
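The bookkeeping for that could be as small as the TypeScript sketch below (function names are made up). Bump a counter whenever a plane culls something, then every so often rebuild the order in which the planes get tested.

```typescript
// Hypothetical bookkeeping: count how often each plane culls an object,
// then test the planes in order of how often they succeed.
function recordCull(cullCounts: number[], planeIndex: number): void {
  cullCounts[planeIndex]++;
}

function planeOrder(cullCounts: number[]): number[] {
  return cullCounts
    .map((count, index) => ({ count, index }))
    .sort((p, q) => q.count - p.count || p.index - q.index) // best culler first
    .map((entry) => entry.index);
}
```

The inner loop would then walk the planes via that order list instead of 0 to 5.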

SSAO in stage 3D (source added)

June 15th, 2012

I had developed my own simple implementation of SSAO (Screen Space Ambient Occlusion) and thought I would try some other implementations to get better results.

I’ll use this post to share videos of progress, with demos to follow as well.

 

Video 1:

SSAO pre blur with 8 samples per pixel (hits the agal instruction limit real fast). Still some kinks to iron out for sure. The depth buffer is currently really short to help with the accuracy, but I should be able to improve this now that I have been able to encode it across 4 channels rather than 1.

Runs pretty quick, 60fps is still achievable.

 

 

 

 

 

related/useful links:

SSAO

http://www.gamerendering.com/category/lighting/ssao-lighting/
http://www.gamedev.net/page/resources/_/technical/graphics-programming-and-theory/a-simple-and-practical-approach-to-ssao-r2753
http://www.john-chapman.net/content.php?id=8
http://blog.nextrevision.com/?p=76
http://www.gamerendering.com/2008/09/28/linear-depth-texture/
http://www.theorangeduck.com/page/pure-depth-ssao

DEPTH ENCODING

http://www.gamedev.net/topic/516146-how-to-encode-the-depth-value-in-a-32bit-rgba-texture/
http://www.gamedev.net/topic/486847-encoding-16-and-32-bit-floating-point-value-into-rgba-byte-texture/
http://aras-p.info/blog/2009/07/30/encoding-floats-to-rgba-the-final/
http://aras-p.info/blog/2007/03/03/a-day-well-spent-encoding-floats-to-rgba/
http://asgl.googlecode.com/svn-history/r490/trunk/ASGL_FP11/src/asgl/shaders/agal/AGALCodec.as
http://www.gamerendering.com/2008/09/25/packing-a-float-value-in-rgba/
http://www.gamedev.net/topic/585330-packing-a-generic-float-into-a-rgba-texture/     <– nice function at the end of this one
http://www.gamedev.net/topic/442138-packing-a-float-into-a-a8r8g8b8-texture-shader/

SOURCE CODE::

Here it is folks, later than I would have liked but I have been busier than a badger in Spring.
I have done my best to comment what is going on.
It took a lot of playing around with to get it working and muchos muchos trial and error.

PLEASE do take this and try and make it better I am sure there is room for improvement!!!

Also feel free to ask any questions at all and I will try and answer them.
Cheers to Daniel Holden (the orange duck) for the code from which this is ported:
pure depth ssao
(Although I use a normal texture to save on operations)

 
var flags:String = (smooth) ? "linear" : "nearest";
 
//varying registers
var uv_in:String = "v0";
 
//samplers
var tex_normal:String = "fs1";
var tex_depth:String = "fs2";
var tex_noise:String = "fs3";
 
//float3 - float4
var colour:String = "ft0";
var normal:String = "ft1";
var position:String = "ft2";
var random:String = "ft3";
var ray:String = "ft4";
var hemi_ray:String = "ft5";
 
//float1
var depth:String = "ft6.x";
var radiusDepth:String = "ft6.y";
var occlusion:String = "ft6.z";
var occ_depth:String = "ft6.w";
var difference:String = "ft7.x";
var temp:String = "ft7.y";
var temp2:String = "ft7.z";
var temp3:String = "ft7.w";			//NOT USED
var fallOff:String = "ft0.x";		//NOT USED
 
//float2
var uv:String = "ft0";
 
//constants
var radius:String = "fc0.z";
var scale:String = "fc0.w";
var decoder:String = "fc2.z";
var zero:String = "fc0.x";
var one:String = "fc0.y";
var two:String = "fc1.z";			//NOT USED
var thresh:String = "fc1.w";
var neg_one:String = "fc2.y";		//NOT USED
 
var depth_decoder:String = "fc3.xyzw";
 
var area:String = "fc1.x";			//NOT USED
var falloff:String = "fc1.y";
var total_strength:String = "fc2.x";
var base:String = "fc4.x";
 
var invSamples:String = "fc2.w";
 
 
//SHADER OF DOOOOOOOOOM
AGAL.init();
//sample normal at current fragment, and decode
AGAL.tex(normal, uv_in, tex_normal, "2d", "clamp", flags);
AGAL.decode(normal, normal, decoder);
//sample depth at current fragment, and decode
AGAL.tex(colour, uv_in, tex_depth, "2d", "clamp", flags);
AGAL.decodeFloatFromRGBA(depth, colour, depth_decoder);
 
//use this instead if depth is not encoded
//AGAL.mov("oc.a", one);
//AGAL.mov("oc", depth);//col+".xyz");
 
//sample random vector
AGAL.mov(uv, uv_in);
AGAL.mul(uv, uv, scale);					
AGAL.tex(random, uv, tex_noise, "2d", "wrap", flags);
//AGAL.mul(random+".z", random+".z", neg_one);		//not sure if negation needed?
 
//position
AGAL.mov(position+".xy", uv_in+".xy");
AGAL.mov(position+".z", depth);
 
//radiusDepth
AGAL.div(radiusDepth, radius, depth);
 
//occlusion
AGAL.mov(occlusion, zero);					
 
for(var i:int=0; i < 8; i++)
{
	//reflect the random normal against the current normal and size according to depth, further should be larger
	AGAL.reflect(ray,"fc"+(5+(i*2)), random);	//could just add but will look crap?
	AGAL.mul(ray, ray, radiusDepth);
 
	//dot the ray against normal
	AGAL.dp3(hemi_ray, ray, normal);
	AGAL.sign(hemi_ray, hemi_ray, temp);
	AGAL.mul(hemi_ray, hemi_ray, ray);
	AGAL.add(hemi_ray, hemi_ray, position+".xyz");
 
	//use position to sample from 
	AGAL.sat(hemi_ray+".xy", hemi_ray+".xy");
	AGAL.tex(colour, hemi_ray+".xy", tex_depth, "2d", "clamp", flags);
	AGAL.decodeFloatFromRGBA(occ_depth, colour, depth_decoder);
 
	//gets the difference in depth between the current depth and sampled depth
	AGAL.sub(difference, depth, occ_depth);				
	AGAL.sge(temp, difference, thresh);	// 1 if difference is bigger than the threshold, 0 otherwise
	AGAL.slt(temp2, difference, falloff);	// 1 if difference is less than the falloff, 0 otherwise
 
	//set difference to range 0 - 1 (and clamp)
	AGAL.div(difference, difference, falloff);
	AGAL.mul(difference, temp, difference);						
	AGAL.mul(difference, temp2, difference);
 
	//accumulate the occlusion
	AGAL.add(occlusion, occlusion, difference);
}
//bring back into range 0-1
AGAL.mul(occlusion, occlusion, invSamples);
//apply any multiplier
AGAL.mul(occlusion, occlusion, total_strength);
//add it to a base value
AGAL.add(occlusion, occlusion, base);
//invert and boom headshot
AGAL.sub("oc", one, occlusion);
 
var fragmentShader:String = AGAL.code;
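When tweaking the constants it can help to run the per sample maths on the CPU first. Below is a TypeScript reference of what the loop and the final few instructions compute, as far as I can read them from the shader above. The function names are mine; the parameter names follow the shader aliases.

```typescript
// CPU reference for the per-sample maths in the shader loop.
function sampleOcclusion(depth: number, occDepth: number, thresh: number, falloff: number): number {
  const difference = depth - occDepth;
  const aboveThresh = difference >= thresh ? 1 : 0;  // sge
  const belowFalloff = difference < falloff ? 1 : 0; // slt
  // scaled by the falloff, zeroed if either comparison fails
  return (difference / falloff) * aboveThresh * belowFalloff;
}

// Mirrors the last four instructions: average, strength, base, invert.
function finalValue(
  depth: number,
  sampledDepths: number[],
  thresh: number,
  falloff: number,
  strength: number,
  base: number
): number {
  let occlusion = 0;
  for (const occDepth of sampledDepths) {
    occlusion += sampleOcclusion(depth, occDepth, thresh, falloff);
  }
  occlusion = occlusion * (1 / sampledDepths.length) * strength + base;
  return 1 - occlusion;
}
```

So a sample only contributes when the depth difference sits between the threshold and the falloff, and within that window it contributes more the bigger the difference gets.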

here are the constants used:: (some of them didn’t end up being used so they can be left out – too busy/lazy to do that myself yet)

//uv sample offset
var radius:Number = 0.002;		//you should derive from texture size
//noise uv scale
var scaler : Number = 24;		//much smaller and the noise blocks become more apparent
//depth difference beyond which a sample no longer counts (read by the shader as fc1.y)
var falloff : Number = 0.05;
//unused at the mo
var area : Number = 5;			//not using this so ignore it / remove it
//the depth difference threshold
var depthThresh:Number = 0.0001;
//strength of the effect
var total_strength : Number = 1;
//base value for the effect
var base : Number = 0;
 
var sample_sphere:Vector.<Number> = new Vector.<Number>();
 
sample_sphere.push( 0.5381, 0.1856,-0.4319, 0);
sample_sphere.push( 0.1379, 0.2486, 0.4430, 0);
sample_sphere.push( 0.3371, 0.5679,-0.0057, 0); 
sample_sphere.push(-0.6999,-0.0451,-0.0019, 0);
sample_sphere.push( 0.0689,-0.1598,-0.8547, 0); 
sample_sphere.push( 0.0560, 0.0069,-0.1843, 0);
sample_sphere.push(-0.0146, 0.1402, 0.0762, 0); 
sample_sphere.push( 0.0100,-0.1924,-0.0344, 0);
sample_sphere.push(-0.3577,-0.5301,-0.4358, 0); 
sample_sphere.push(-0.3169, 0.1063, 0.0158, 0);
sample_sphere.push( 0.0103,-0.5869, 0.0046, 0); 
sample_sphere.push(-0.0897,-0.4940, 0.3287, 0);
sample_sphere.push( 0.7119,-0.0154,-0.0918, 0); 
sample_sphere.push(-0.0533, 0.0596,-0.5411, 0);
sample_sphere.push( 0.0352,-0.0631, 0.5460, 0); 
sample_sphere.push(-0.4776, 0.2847,-0.0271, 0);
 
//...
 
context3D.setProgramConstantsFromVector(Context3DProgramType.FRAGMENT, 0, Vector.<Number>([0, 1, radius, scaler]));
context3D.setProgramConstantsFromVector(Context3DProgramType.FRAGMENT, 1, Vector.<Number>([area, falloff, 2, depthThresh]));
context3D.setProgramConstantsFromVector(Context3DProgramType.FRAGMENT, 2, Vector.<Number>([total_strength, -1, 0.5, 1/8]));
context3D.setProgramConstantsFromVector(Context3DProgramType.FRAGMENT, 3, Vector.<Number>([1/(255*255*255), 1/(255*255), 1/255, 1]));
context3D.setProgramConstantsFromVector(Context3DProgramType.FRAGMENT, 4, Vector.<Number>([base, 0, 0, 0]));
context3D.setProgramConstantsFromVector(Context3DProgramType.FRAGMENT, 5, sample_sphere);
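The values in fc3 are the decode side of the depth packing. The encode side lives in the depth pass and is not shown in this post, so the TypeScript sketch below is only an assumption of what a matching encoder could look like (x carrying the finest detail and w the coarsest, to line up with the decoder constants). It follows the usual frac and subtract approach from the depth encoding links above and expects a depth in the range 0 to just under 1.

```typescript
// Decoder weights in the same order as the fc3 constants (x finest, w coarsest).
const DECODER = [1 / (255 * 255 * 255), 1 / (255 * 255), 1 / 255, 1];

function frac(v: number): number {
  return v - Math.floor(v);
}

// Assumed encoder: packs a depth in [0, 1) into four 8 bit channels.
function encodeDepth(depth: number): number[] {
  const x = frac(depth * 255 * 255 * 255);
  const y = frac(depth * 255 * 255);
  const z = frac(depth * 255);
  const w = frac(depth);
  // remove from each channel the part the next finer channel already carries
  const rgba = [x, y - x / 255, z - y / 255, w - z / 255];
  // quantise to 8 bits like a real render target would
  return rgba.map((channel) => Math.round(channel * 255) / 255);
}

// Same maths as a dp4 of the sampled colour against the fc3 constants.
function decodeDepth(rgba: number[]): number {
  return rgba.reduce((sum, channel, i) => sum + channel * DECODER[i], 0);
}
```

Even after quantising to 8 bits per channel the round trip should come back far more accurately than a single channel could manage, which is where the extra depth range comes from.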

Enjoy! On-line demo to follow soonish, and please do feed back with any comments or improvements.

IMPORTANT NOTE:
the values in this shader are very VERY important, tiny tweaks/mistakes can throw the whole thing off so do be careful and don’t be surprised if the world explodes when you play around with it.

:)

b

3D turntables using b3d (a stage3d engine)

May 30th, 2012

Hooked up b3d to bdj, a little mp3 wrapper for advanced mp3 playback and this was the result:

 

Added another video (higher res)…


demo features:

(runs at 60 fps on my machine, just down to 25-26 while recording)

bdj

  • crossfade
  • volume
  • pitch adjustment
  • addition bend
  • and all the usual stuff like: play pause stop seek cue…

b3d

  • mouse ray generation
  • ray plane intersections (for the mouse dragging)
  • as3 native rendering for the interactive areas
  • screen space anti aliasing (experimental)
  • a fast bloom filter
  • a fast DOF filter
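For reference, the ray plane intersection used for this sort of mouse dragging is only a few lines. This is a generic TypeScript sketch rather than the b3d code, with the plane written as dot(normal, p) + d = 0 and all names chosen for the example.

```typescript
type Vec3 = { x: number; y: number; z: number };

function dot(a: Vec3, b: Vec3): number {
  return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Returns the point where the ray origin + t * direction (t >= 0) hits the
// plane, or null if the ray is parallel to it or points away from it.
function intersectRayPlane(origin: Vec3, direction: Vec3, normal: Vec3, d: number): Vec3 | null {
  const denominator = dot(normal, direction);
  if (Math.abs(denominator) < 1e-8) return null; // parallel, no single hit point
  const t = -(dot(normal, origin) + d) / denominator;
  if (t < 0) return null; // plane is behind the ray origin
  return {
    x: origin.x + direction.x * t,
    y: origin.y + direction.y * t,
    z: origin.z + direction.z * t,
  };
}
```

Feed it the mouse ray and the plane of the record and the hit point tells you where on the deck the cursor is.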

If anyone wants more info, or for me to upload a better/longer video or a live playable demo, just shout and I will see what I can do.

 

things to note:

in order to play tracks backwards and pitch them smoothly I had to cache the whole track as a bytearray first. Although you can extract on the fly, it takes about twice as long (not an issue on higher end machines but a problem all the same… oh for a faster as3).
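Once the whole track is cached, pitching and reversing both come down to stepping a read position through the samples at a different rate. The TypeScript sketch below shows the general idea with linear interpolation. It is not the bdj code: `samples` stands in for one channel of the decoded track and the function names are made up.

```typescript
// Reads one sample at a fractional position using linear interpolation.
function readSample(samples: Float32Array, position: number): number {
  const index = Math.floor(position);
  const next = Math.min(index + 1, samples.length - 1);
  const t = position - index;
  return samples[index] * (1 - t) + samples[next] * t;
}

// Fills one output block. speed 1 is normal playback, 0.5 is half speed
// (lower pitch) and negative values play backwards.
function renderBlock(
  samples: Float32Array,
  startPosition: number,
  speed: number,
  blockSize: number
): { output: Float32Array; position: number } {
  const output = new Float32Array(blockSize);
  let position = startPosition;
  for (let i = 0; i < blockSize; i++) {
    if (position < 0 || position > samples.length - 1) break; // ran off either end
    output[i] = readSample(samples, position);
    position += speed;
  }
  return { output, position };
}
```

The returned position is fed back in as the start of the next block, so the speed can be changed between blocks to bend the pitch.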

 

 

Acknowledgements:

Lee Brimelow (gave me the idea when he first posted this…)

http://www.leebrimelow.com/?p=1129

The code for playing an mp3 in reverse came from a response to Lee’s post.

Andre Michelle (some great code)

http://blog.andre-michelle.com/2009/pitch-mp3/

DanceDreemer for the model on turbosquid

Music: Usher – Yeah (Speedbreaker remix), Scatman 2003

Updated Stress Test (4000+ individually transformed normal mapped meshes at 60FPS)

May 24th, 2012

Should run a bit faster across all machines.

 

update includes:

  • faster matrix composition
  • bare minimum uploads to GPU by combining the various techniques used in the material into one on the fly.
  • added another material into the demo (normal mapped)
  • added other meshes so you can try and identify where the limits are.. geometry bound or draw call bound

 

link to updated demo:

http://bwhiting.co.uk/b3d/stress2/

 

controls (as before but also):

p – turns off mouse look

n – changes the mesh being rendered

 

Quick tip, go full screen in your browser to maintain access to all the keyboard short-cuts. (click the title bar and hit F11 is the usual command I think)

As before any feedback on performance would be great, there is still room for gains but for the time being I am happy. Would love to hear what someone with an i7 and a beast of a graphics card can achieve.

 

b

Stage3D optimisation

March 19th, 2012

Having been playing with Stage3D for a while now, I thought I would write a small piece on optimisation.

With great power comes great responsibility!

Stage3D gives you GPU access, which can expose some serious rendering horsepower, but if you don’t treat it with respect you’re going to find you run into limitations pretty quick!

So what follows is a rough (very) guide on how to squeeze the most out of the new 3D apis.

Rule 1 (of 1)
CPUs are fast, GPUs are faster; communication between the two, however, is probably the biggest bottleneck you will face.
Therefore: reduce this wherever possible.
This means minimise calls to the following Context3D functions:
Context3D.drawTriangles();
Context3D.setProgram();
Context3D.setProgramConstants…();
(actually most of Context3D’s functions but the above are the real doozies)

drawTriangles()
GPUs can draw triangles fast, and lots of them, millions of them every second without breaking into a sweat.
So you might think, “I can call drawTriangles() 50,000* times no sweat as long as I am only drawing a few triangles in each call.”
WRONG!! This command is a mighty expensive one so use it very wisely!
How: When you call drawTriangles you pass it an index buffer whose indices, taken three at a time, each describe one triangle. Given that this call is expensive, it makes sense to pass it as many triangles as possible in one call.
Sadly this doesn’t quite mean you can just group your geometry into big chunks, as you cannot change state (alter the material or any parameters) during the call, so everything that is sent through will be rendered with the same program and set of constants. It does mean, however, that static (non-moving) elements that share the same program/material can be combined into one list. Things such as trees, grass and any other repeatable geometry are good candidates for this.
You can do this for dynamic geometry too, but it gets complex as you have to upload transformation data in a separate buffer; this is one way particle systems can be created. The downside is that each time any of the objects changes, the whole buffer will need to be re-uploaded.
It is also vital that you only try to draw objects that will be seen on screen, so don’t draw anything that is out of view of the camera: frustum culling saves the day.
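As a rough illustration of combining static geometry, here is a minimal sketch in Python (not ActionScript, and not taken from any engine). The mesh layout, a list of vertices plus a flat index list, and the function name are assumptions made up for the example. The only real trick is offsetting each mesh’s indices by the number of vertices already merged.

```python
def merge_static_meshes(meshes):
    """Combine (vertices, indices) pairs into one vertex list and one index list.

    Each mesh is a tuple: (list of (x, y, z) vertices, flat list of indices).
    Indices are shifted by the number of vertices already merged so they
    still point at the right vertices in the combined list.
    """
    merged_vertices = []
    merged_indices = []
    for vertices, indices in meshes:
        if len(indices) % 3 != 0:
            raise ValueError("index count must be a multiple of 3")
        base = len(merged_vertices)
        merged_vertices.extend(vertices)
        merged_indices.extend(base + i for i in indices)
    return merged_vertices, merged_indices


# Two hypothetical single-triangle "trees" that share one material.
tree_a = ([(0, 0, 0), (1, 0, 0), (0, 1, 0)], [0, 1, 2])
tree_b = ([(5, 0, 0), (6, 0, 0), (5, 1, 0)], [0, 1, 2])
vertices, indices = merge_static_meshes([tree_a, tree_b])
triangle_count = len(indices) // 3  # both triangles can now go out in one draw call
```

The vertices here are assumed to be already in world space, which is why this only suits geometry that never moves.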

side note:
Even high-end games rarely want to be issuing more than 1000-2000 draw calls, but the likes of Battlefield 3 can get up to the 3000 mark in some of the environments. Newer consoles, however, can issue over 10,000 draw calls and do it much faster due to more direct access to the hardware.

setProgram()
Assuming you have now done everything in your power to reduce the number of draw calls you issue, next thing to look at is state changes (changing the current program).
Changing state on the GPU might seem like a trivial thing but it is actually something you want to keep to a minimum to be able to squeeze the most out of your graphics card.
There are a few things you can do here to reduce this problem.
1. Group the objects that require drawing by their material/program! For example suppose you had 100 cubes, 50 of them with one material and 50 of them with another. Now if you had a list containing all of those cubes and blindly sent them to be rendered you could end up having to change state a large number of times. If it so happened that each cube in the list had a different material to the object before it, then the program would have to be updated for every draw call. Not good. If that list however was sorted so that the 50 cubes sharing each material sat next to each other, the program would only need to be set twice.
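To put rough numbers on the cube example, here is a small sketch in Python (not ActionScript) that counts how many times the program would have to change for a worst-case list, and for the same list sorted by material. The material names and the tuple layout are made up for illustration.

```python
def count_program_changes(render_list):
    """Count how often the program must change when drawing in list order."""
    changes = 0
    current = None
    for material, _name in render_list:
        if material != current:
            changes += 1
            current = material
    return changes


# 100 hypothetical cubes, alternating between two materials (the worst case).
cubes = [("red" if i % 2 == 0 else "blue", "cube%d" % i) for i in range(100)]

unsorted_changes = count_program_changes(cubes)
# sorted() is stable, so cubes keep their relative order within a material.
sorted_changes = count_program_changes(sorted(cubes, key=lambda c: c[0]))
```

With the alternating list every draw needs a program change (100 in total). After sorting, there are only 2.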

- even if you are only drawing one triangle with this call there

*there is an actual limit of 32,768 drawTriangles() calls per present() call.

setProgramConstants…()
This function is what allows us to upload constants to the gpu. It is how we upload our matrices and any float1/2/3/4..s that we want to utilise in our shaders. While it may not be a huge bottleneck it still has a noticeable impact on performance in my experiments.
So how to optimise?
Upload any constant that is likely to be reused by different materials only once per frame, not once per object rendered. So what are the likely culprits?
The view projection matrix! This is 16 float values that will not change between objects so it makes sense to upload it once! 16 numbers vs 16,000 for 1,000 objects, and 999 fewer calls to setProgramConstants, and that is a good thing!
The same applies to anything else that will not vary between objects, camera and light positions or common numbers used in shaders (0,0.5,1,-1..).
What this shows is that it is important to have some sort of system to manage uploads so you can keep track of what is already uploaded and only upload data that isn’t already there!
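One possible shape for such a system, sketched in Python rather than ActionScript: a small cache that remembers the last values sent to each constant register and skips the upload when nothing has changed. The class name, the register numbering and the `upload` callback are all assumptions for the sake of the example; the callback stands in for whatever actually talks to the GPU.

```python
class ConstantCache:
    """Remember what is in each constant register and skip repeat uploads."""

    def __init__(self, upload):
        self._upload = upload      # callable(register, values) doing the real upload
        self._registers = {}       # register index -> tuple of last uploaded values
        self.uploads = 0

    def set(self, register, values):
        values = tuple(values)
        if self._registers.get(register) == values:
            return False           # already uploaded, nothing to do
        self._upload(register, values)
        self._registers[register] = values
        self.uploads += 1
        return True


sent = []
cache = ConstantCache(lambda register, values: sent.append((register, values)))
view_projection = [1.0, 0.0, 0.0, 0.0,
                   0.0, 1.0, 0.0, 0.0,
                   0.0, 0.0, 1.0, 0.0,
                   0.0, 0.0, 0.0, 1.0]

for _ in range(1000):              # 1,000 objects drawn in one frame
    cache.set(0, view_projection)  # only the first call uploads anything
```

Comparing 16 floats on the CPU is not free either, so whether this wins in practice is something to measure rather than assume.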

side note:
This also translates into how you write shaders. Knowing that each constant requires an upload should make you rethink how to achieve something whilst using minimal constants. Take unpacking a normal from a texture: usually you would multiply the value from the texture by 2 then subtract 1 (2 floats required), but the same result can also be achieved by subtracting 0.5 then dividing by 0.5 (1 float required). Perhaps not the best example but I am sure you get the idea. REUSE is your friend!
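The two forms are the same thing rearranged, since (t - 0.5) / 0.5 = 2t - 1. A quick check in Python (plain arithmetic, not shader code) over every value an 8-bit texture channel can hold:

```python
def unpack_two_constants(t):
    return t * 2.0 - 1.0          # needs the constants 2 and 1


def unpack_one_constant(t):
    return (t - 0.5) / 0.5        # needs only the constant 0.5


# Every possible 8-bit channel value, normalised to 0..1.
samples = [i / 255.0 for i in range(256)]
max_difference = max(abs(unpack_two_constants(t) - unpack_one_constant(t))
                     for t in samples)
```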

—————————————–

While I only focused on 3 methods of Context3D, almost all of them will incur some penalty, but those highlighted are the ones I have found to be the more serious problems.

Quick additions:
Drawing to a BitmapData from the GPU is slow, so if you have to do it, ensure it is at as small a resolution as possible. In theory a 1×1 pixel readback should be big enough for picking!
Resizing the back buffer is slow!
Creating textures is slow, don’t do it on the fly. Pre allocate if possible then pick from a pool (more relevant for post processing).
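A texture pool can be very small. The sketch below is Python rather than ActionScript, and the class, its method names and the `create_texture` callback are hypothetical; a dict stands in for a real texture. The idea is just to create everything up front and hand the same objects back out each frame.

```python
class TexturePool:
    """Hand out pre-allocated textures by size instead of creating new ones."""

    def __init__(self, create_texture):
        self._create = create_texture   # callable(width, height) -> texture
        self._free = {}                 # (width, height) -> list of idle textures
        self.created = 0

    def preallocate(self, width, height, count):
        for _ in range(count):
            self.release(width, height, self._new(width, height))

    def acquire(self, width, height):
        idle = self._free.get((width, height))
        if idle:
            return idle.pop()
        return self._new(width, height)  # pool ran dry: fall back to creating one

    def release(self, width, height, texture):
        self._free.setdefault((width, height), []).append(texture)

    def _new(self, width, height):
        self.created += 1
        return self._create(width, height)


# Stand-in for a real texture: just a dict describing its size.
pool = TexturePool(lambda w, h: {"width": w, "height": h})
pool.preallocate(512, 512, 2)           # done once, at start-up

for _frame in range(100):               # e.g. two post-processing passes per frame
    a = pool.acquire(512, 512)
    b = pool.acquire(512, 512)
    pool.release(512, 512, a)
    pool.release(512, 512, b)
```

After 100 frames only the two pre-allocated textures have ever been created.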

At some point in the future, I hope to write some test examples that highlight the cost of the functions mentioned above (have already done quite a few but they have been tied into other things rather than dedicated standalone tests).

I hope all of that makes some form of sense to someone :) If you have any additions, corrections or questions… fire away. Will probably update this from time to time to add in more that I have missed out, it’s a broad area with many possible optimisations!

 

Super Quick Summary:

REDUCE DRAW CALLS! Group items where possible and use culling to ensure you are only drawing what is necessary.

REUSE MATERIALS and BATCH RENDERING by material if you can.

UPLOAD THE MINIMUM NUMBER OF CONSTANTS you can get away with :)

 

Stage3D stress test (b3d engine) 20,000 primitives ~ 15,000,000 triangles (updated with playable demo)

February 23rd, 2012

key:
Click to start.
WASD or LEFT/RIGHT/UP/DOWN to fly
Mouse move to look
Shift to fly faster
Space to toggle rotation
+ to add 500 doughnuts
- to remove 500 doughnuts
m to change material (4 available)
Once started double click to toggle fullscreen (NOTE all keys bar arrow keys will be disabled)

demo (requires flash player 11 download):

link to play standalone version (better experience):
here

Let me know how it runs (assuming it does) if you can: frame-rate, number of objects it can handle etcetera.

video 1 (looks like crap so am uploading another one):

video 2 (hopefully looks a bit better..still looks balls, ah well)

Starts to slow down with 2000 – 3000 and above primitives on screen :( there is still room for some efficiency improvements but not bad for now

Demonstrates how essential good culling is. Despite what the stats say whilst recording, I am able to cull a full scene of 10,000 objects in under 1 ms on the release player, no problems. Without that it would probably die; with it you can happily navigate a scene with 50,000 objects in it at 60fps on a good machine, as the majority are culled away, leaving perhaps only 500-1500 visible at any one time when they are spread out like they are in the demo.
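For anyone wondering what a culling test can look like, here is one common approach sketched in Python (not ActionScript, and not necessarily what the demo does): test each object’s bounding sphere against the planes of the view frustum and skip anything that is fully behind one of them. The plane format and the box-shaped “frustum” below are assumptions chosen to keep the numbers easy to follow.

```python
def sphere_in_frustum(center, radius, planes):
    """True if a bounding sphere is at least partly inside every plane.

    Each plane is (nx, ny, nz, d) with the normal pointing into the frustum,
    so a point p is inside when nx*px + ny*py + nz*pz + d >= 0.
    """
    cx, cy, cz = center
    for nx, ny, nz, d in planes:
        if nx * cx + ny * cy + nz * cz + d < -radius:
            return False      # completely behind this plane: cull it
    return True


# A hypothetical box standing in for a frustum: -10..10 in x and y, 1..100 in z.
planes = [
    (1, 0, 0, 10), (-1, 0, 0, 10),    # left, right
    (0, 1, 0, 10), (0, -1, 0, 10),    # bottom, top
    (0, 0, 1, -1), (0, 0, -1, 100),   # near, far
]

# (center, radius) pairs: inside, far off to the right, straddling the right
# plane, and behind the near plane.
objects = [((0, 0, 50), 1.0), ((30, 0, 50), 1.0), ((10.5, 0, 50), 1.0), ((0, 0, -5), 1.0)]
visible = [obj for obj in objects if sphere_in_frustum(obj[0], obj[1], planes)]
```

The test bails out at the first plane that rejects the sphere, so objects far outside the view are usually dismissed after one or two dot products.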