**Introduction of the problem**

I am trying to speed up the intersection code of a (2d) ray tracer that I am writing. I am using C# and the System.Numerics library to bring the speed of SIMD instructions.

The problem is that I am getting strange results, with over-the-roof speedups and rather low speedups. My question is, why is one over-the-roof whereas the other is rather low?

Context:

- The RayPack struct is a series of (different) rays, packed in Vectors of System.Numerics.
- The BoundingBoxPack and CirclePack struct is a single bb / circle, packed in vectors of System.Numerics.
- The CPU used is an i7-4710HQ (Haswell), with SSE 4.2, AVX(2), and FMA(3) instructions.
- Running in release mode (64 bit). The project runs in .Net Framework 472. No additional options set.

**Attempts**

I've already tried looking up whether some operations may or may not be properly supported (Take note: these are for c++. https://fgiesen.wordpress.com/2016/04/03/sse-mind-the-gap/ or http://sci.tuomastonteri.fi/programming/sse), but it seems that is not the case because the laptop I work on supports SSE 4.2.

In the current code, the following changes are applied:

- Using more proper instructions (packed min, for example).
- Not using the float * vector instruction (causes a lot of additional operations, see the assembly of the original).

**Code ... snippets?**

Apologies for the large amount of code, but I am not sure how we can discuss this concretely without this amount of code.

Code of Ray -> BoundingBox

```
public bool CollidesWith(Ray ray, out float t)
{
// https://gamedev.stackexchange.com/questions/18436/most-efficient-aabb-vs-ray-collision-algorithms
// r.dir is unit direction vector of ray
float dirfracx = 1.0f / ray.direction.X;
float dirfracy = 1.0f / ray.direction.Y;
// lb is the corner of AABB with minimal coordinates - left bottom, rt is maximal corner
// r.org is origin of ray
float t1 = (this.rx.min - ray.origin.X) * dirfracx;
float t2 = (this.rx.max - ray.origin.X) * dirfracx;
float t3 = (this.ry.min - ray.origin.Y) * dirfracy;
float t4 = (this.ry.max - ray.origin.Y) * dirfracy;
float tmin = Math.Max(Math.Min(t1, t2), Math.Min(t3, t4));
float tmax = Math.Min(Math.Max(t1, t2), Math.Max(t3, t4));
// if tmax < 0, ray (line) is intersecting AABB, but the whole AABB is behind us
if (tmax < 0)
{
t = tmax;
return false;
}
// if tmin > tmax, ray doesn't intersect AABB
if (tmin > tmax)
{
t = tmax;
return false;
}
t = tmin;
return true;
}
```

Code of RayPack -> BoundingBoxPack

```
public Vector<int> CollidesWith(ref RayPack ray, out Vector<float> t)
{
// ------------------------------------------------------- \\
// compute the collision. \\
Vector<float> dirfracx = Constants.ones / ray.direction.x;
Vector<float> dirfracy = Constants.ones / ray.direction.y;
Vector<float> t1 = (this.rx.min - ray.origin.x) * dirfracx;
Vector<float> t2 = (this.rx.max - ray.origin.x) * dirfracx;
Vector<float> t3 = (this.ry.min - ray.origin.y) * dirfracy;
Vector<float> t4 = (this.ry.max - ray.origin.y) * dirfracy;
Vector<float> tmin = Vector.Max(Vector.Min(t1, t2), Vector.Min(t3, t4));
Vector<float> tmax = Vector.Min(Vector.Max(t1, t2), Vector.Max(t3, t4));
Vector<int> lessThanZeroMask = Vector.GreaterThan(tmax, Constants.zeros);
Vector<int> greaterMask = Vector.LessThan(tmin, tmax);
Vector<int> combinedMask = Vector.BitwiseOr(lessThanZeroMask, greaterMask);
// ------------------------------------------------------- \\
// Keep track of the t's that collided. \\
t = Vector.ConditionalSelect(combinedMask, tmin, Constants.floatMax);
return combinedMask;
}
```

Code of Ray -> Circle

```
public bool Intersect(Circle other)
{
// Step 0: Work everything out on paper!
// Step 1: Gather all the relevant data.
float ox = this.origin.X;
float dx = this.direction.X;
float oy = this.origin.Y;
float dy = this.direction.Y;
float x0 = other.origin.X;
float y0 = other.origin.Y;
float cr = other.radius;
// Step 2: compute the substitutions.
float p = ox - x0;
float q = oy - y0;
float r = 2 * p * dx;
float s = 2 * q * dy;
// Step 3: compute the substitutions, check if there is a collision.
float a = dx * dx + dy * dy;
float b = r + s;
float c = p * p + q * q - cr * cr;
float DSqrt = b * b - 4 * a * c;
// no collision possible! Commented out to make the benchmark more fair
//if (DSqrt < 0)
//{ return false; }
// Step 4: compute the substitutions.
float D = (float)Math.Sqrt(DSqrt);
float t0 = (-b + D) / (2 * a);
float t1 = (-b - D) / (2 * a);
float ti = Math.Min(t0, t1);
if(ti > 0 && ti < t)
{
t = ti;
return true;
}
return false;
}
```

Code of RayPack -> CirclePack Original, unedited, code can be found at: https://pastebin.com/87nYgZrv

```
public Vector<int> Intersect(CirclePack other)
{
// ------------------------------------------------------- \\
// Put all the data on the stack. \\
Vector<float> zeros = Constants.zeros;
Vector<float> twos = Constants.twos;
Vector<float> fours = Constants.fours;
// ------------------------------------------------------- \\
// Compute whether the ray collides with the circle. This \\
// is stored in the 'mask' vector. \\
Vector<float> p = this.origin.x - other.origin.x; ;
Vector<float> q = this.origin.y - other.origin.y;
Vector<float> r = twos * p * this.direction.x;
Vector<float> s = twos * q * this.direction.y; ;
Vector<float> a = this.direction.x * this.direction.x + this.direction.y * this.direction.y;
Vector<float> b = r + s;
Vector<float> c = p * p + q * q - other.radius * other.radius;
Vector<float> DSqrt = b * b - fours * a * c;
Vector<int> maskCollision = Vector.GreaterThan(DSqrt, zeros);
// Commented out to make the benchmark more fair.
//if (Vector.Dot(maskCollision, maskCollision) == 0)
//{ return maskCollision; }
// ------------------------------------------------------- \\
// Update t if and only if there is a collision. Take \\
// note of the conditional where we compute t. \\
Vector<float> D = Vector.SquareRoot(DSqrt);
Vector<float> bMinus = Vector.Negate(b);
Vector<float> twoA = twos * a;
Vector<float> t0 = (bMinus + D) / twoA;
Vector<float> t1 = (bMinus - D) / twoA;
Vector<float> tm = Vector.ConditionalSelect(Vector.LessThan(t1, t0), t1, t0);
Vector<int> maskBiggerThanZero = Vector.GreaterThan(tm, zeros);
Vector<int> maskSmallerThanT = Vector.LessThan(tm, this.t);
Vector<int> mask = Vector.BitwiseAnd(
maskCollision,
Vector.BitwiseAnd(
maskBiggerThanZero,
maskSmallerThanT)
);
this.t = Vector.ConditionalSelect(
mask, // the bit mask that allows us to choose.
tm, // the smallest of the t's.
t); // if the bit mask is false (0), then we get our original t.
return mask;
}
```

**Assembly code**

These can be found on pastebin. Take note that there is some boilerplate assembly from the benchmark tool. You need to look at the function calls.

- BoundingBox(Pack): https://pastebin.com/RYMQdZMh
- Circle(Pack) Tweaked: https://pastebin.com/YZHjc1vY
- Circle(Pack) Original: https://pastebin.com/87nYgZrv

**Benchmarking**

I've been benchmarking the situation with BenchmarkDotNet.

Results for Circle / CirclePack (updated):

```
| Method | Mean | Error | StdDev |
|------------------- |---------:|----------:|----------:|
| Intersection | 9.710 ms | 0.0540 ms | 0.0505 ms |
| IntersectionPacked | 3.296 ms | 0.0055 ms | 0.0051 ms |
```

Results for BoundingBox / BoundingBoxPacked:

```
| Method | Mean | Error | StdDev |
|------------------- |----------:|----------:|----------:|
| Intersection | 24.269 ms | 0.2663 ms | 0.2491 ms |
| IntersectionPacked | 1.152 ms | 0.0229 ms | 0.0264 ms |x
```

Due to AVX, a speedup of roughly 6x-8x is expected. The speedup of the boundingbox is *significant*, whereas the speedup of the circle is rather low.

Revisiting the question at the top: Why is one speedup over-the-roof and the other rather low? And how can the lower of the two (CirclePack) become faster?

**Edit(s) with regard to Peter Cordes (comments)**

- Made the benchmark more fair: the single ray version does not early-branch-out as soon as the ray can no longer collide. Now the speedup is roughly 2.5x.
- Added the assembly code as a separate header.
- With regard to the square root: This does have impact, but not as much as it seems. Removing the vector square root reduces the total time with about 0.3ms. The single ray code now always performs the square root too.
- Question about FMA (Fused Multiply Addition) in C#. I think it does for scalars (see Can C# make use of fused multiply-add?), but I haven't found a similar operation within the System.Numerics.Vector struct.
- About a C# instruction for packed min: Yes it does. Silly me. I even used it already.

`if()`

statements might avoid that, depending on how well the JIT does at using instructions like`cmpss`

instead of FP compare into integer FLAGS with`ucomiss`

6more comments