Introduction of the problem

I am trying to speed up the intersection code of a (2d) ray tracer that I am writing. I am using C# and the System.Numerics library to bring the speed of SIMD instructions.

The problem is that I am getting strange results, with over-the-roof speedups and rather low speedups. My question is, why is one over-the-roof whereas the other is rather low?


  • The RayPack struct is a series of (different) rays, packed in Vectors of System.Numerics.
  • The BoundingBoxPack and CirclePack struct is a single bb / circle, packed in vectors of System.Numerics.
  • The CPU used is an i7-4710HQ (Haswell), with SSE 4.2, AVX(2), and FMA(3) instructions.
  • Running in release mode (64 bit). The project runs in .Net Framework 472. No additional options set.


I've already tried looking up whether some operations may or may not be properly supported (Take note: these are for c++. https://fgiesen.wordpress.com/2016/04/03/sse-mind-the-gap/ or http://sci.tuomastonteri.fi/programming/sse), but it seems that is not the case because the laptop I work on supports SSE 4.2.

In the current code, the following changes are applied:

  • Using more proper instructions (packed min, for example).
  • Not using the float * vector instruction (causes a lot of additional operations, see the assembly of the original).

Code ... snippets?

Apologies for the large amount of code, but I am not sure how we can discuss this concretely without this amount of code.

Code of Ray -> BoundingBox

public bool CollidesWith(Ray ray, out float t)
    // https://gamedev.stackexchange.com/questions/18436/most-efficient-aabb-vs-ray-collision-algorithms
    // r.dir is unit direction vector of ray
    float dirfracx = 1.0f / ray.direction.X;
    float dirfracy = 1.0f / ray.direction.Y;
    // lb is the corner of AABB with minimal coordinates - left bottom, rt is maximal corner
    // r.org is origin of ray
    float t1 = (this.rx.min - ray.origin.X) * dirfracx;
    float t2 = (this.rx.max - ray.origin.X) * dirfracx;
    float t3 = (this.ry.min - ray.origin.Y) * dirfracy;
    float t4 = (this.ry.max - ray.origin.Y) * dirfracy;

    float tmin = Math.Max(Math.Min(t1, t2), Math.Min(t3, t4));
    float tmax = Math.Min(Math.Max(t1, t2), Math.Max(t3, t4));

    // if tmax < 0, ray (line) is intersecting AABB, but the whole AABB is behind us
    if (tmax < 0)
        t = tmax;
        return false;

    // if tmin > tmax, ray doesn't intersect AABB
    if (tmin > tmax)
        t = tmax;
        return false;

    t = tmin;
    return true;

Code of RayPack -> BoundingBoxPack

public Vector<int> CollidesWith(ref RayPack ray, out Vector<float> t)
    // ------------------------------------------------------- \\
    // compute the collision.                                  \\

    Vector<float> dirfracx = Constants.ones / ray.direction.x;
    Vector<float> dirfracy = Constants.ones / ray.direction.y;

    Vector<float> t1 = (this.rx.min - ray.origin.x) * dirfracx;
    Vector<float> t2 = (this.rx.max - ray.origin.x) * dirfracx;
    Vector<float> t3 = (this.ry.min - ray.origin.y) * dirfracy;
    Vector<float> t4 = (this.ry.max - ray.origin.y) * dirfracy;

    Vector<float> tmin = Vector.Max(Vector.Min(t1, t2), Vector.Min(t3, t4));
    Vector<float> tmax = Vector.Min(Vector.Max(t1, t2), Vector.Max(t3, t4));

    Vector<int> lessThanZeroMask = Vector.GreaterThan(tmax, Constants.zeros);
    Vector<int> greaterMask = Vector.LessThan(tmin, tmax);
    Vector<int> combinedMask = Vector.BitwiseOr(lessThanZeroMask, greaterMask);

    // ------------------------------------------------------- \\
    // Keep track of the t's that collided.                    \\

    t = Vector.ConditionalSelect(combinedMask, tmin, Constants.floatMax);

    return combinedMask;

Code of Ray -> Circle

public bool Intersect(Circle other)
    // Step 0: Work everything out on paper!

    // Step 1: Gather all the relevant data.
    float ox = this.origin.X;
    float dx = this.direction.X;

    float oy = this.origin.Y;
    float dy = this.direction.Y;

    float x0 = other.origin.X;
    float y0 = other.origin.Y;
    float cr = other.radius;

    // Step 2: compute the substitutions.
    float p = ox - x0;
    float q = oy - y0;

    float r = 2 * p * dx;
    float s = 2 * q * dy;

    // Step 3: compute the substitutions, check if there is a collision.
    float a = dx * dx + dy * dy;
    float b = r + s;
    float c = p * p + q * q - cr * cr;

    float DSqrt = b * b - 4 * a * c;

    // no collision possible! Commented out to make the benchmark more fair
    //if (DSqrt < 0)
    //{ return false; }

    // Step 4: compute the substitutions.
    float D = (float)Math.Sqrt(DSqrt);

    float t0 = (-b + D) / (2 * a);
    float t1 = (-b - D) / (2 * a);

    float ti = Math.Min(t0, t1);
    if(ti > 0 && ti < t)
        t = ti;
        return true;

    return false;

Code of RayPack -> CirclePack Original, unedited, code can be found at: https://pastebin.com/87nYgZrv

public Vector<int> Intersect(CirclePack other)
    // ------------------------------------------------------- \\
    // Put all the data on the stack.                          \\

    Vector<float> zeros = Constants.zeros;
    Vector<float> twos = Constants.twos;
    Vector<float> fours = Constants.fours;

    // ------------------------------------------------------- \\
    // Compute whether the ray collides with the circle. This  \\
    // is stored in the 'mask' vector.                         \\

    Vector<float> p = this.origin.x - other.origin.x; ;
    Vector<float> q = this.origin.y - other.origin.y;

    Vector<float> r = twos * p * this.direction.x;
    Vector<float> s = twos * q * this.direction.y; ;

    Vector<float> a = this.direction.x * this.direction.x + this.direction.y * this.direction.y;
    Vector<float> b = r + s;
    Vector<float> c = p * p + q * q - other.radius * other.radius;

    Vector<float> DSqrt = b * b - fours * a * c;

    Vector<int> maskCollision = Vector.GreaterThan(DSqrt, zeros);

    // Commented out to make the benchmark more fair.
    //if (Vector.Dot(maskCollision, maskCollision) == 0)
    //{ return maskCollision; }

    // ------------------------------------------------------- \\
    // Update t if and only if there is a collision. Take      \\
    // note of the conditional where we compute t.             \\

    Vector<float> D = Vector.SquareRoot(DSqrt);

    Vector<float> bMinus = Vector.Negate(b);
    Vector<float> twoA = twos * a;
    Vector<float> t0 = (bMinus + D) / twoA;
    Vector<float> t1 = (bMinus - D) / twoA;
    Vector<float> tm = Vector.ConditionalSelect(Vector.LessThan(t1, t0), t1, t0);

    Vector<int> maskBiggerThanZero = Vector.GreaterThan(tm, zeros);
    Vector<int> maskSmallerThanT = Vector.LessThan(tm, this.t);

    Vector<int> mask = Vector.BitwiseAnd(

    this.t = Vector.ConditionalSelect(
        mask,                                                           // the bit mask that allows us to choose.
        tm,                                                             // the smallest of the t's.
        t);                                                             // if the bit mask is false (0), then we get our original t.

    return mask;

Assembly code

These can be found on pastebin. Take note that there is some boilerplate assembly from the benchmark tool. You need to look at the function calls.


I've been benchmarking the situation with BenchmarkDotNet.

Results for Circle / CirclePack (updated):

|             Method |     Mean |     Error |    StdDev |
|------------------- |---------:|----------:|----------:|
|       Intersection | 9.710 ms | 0.0540 ms | 0.0505 ms |
| IntersectionPacked | 3.296 ms | 0.0055 ms | 0.0051 ms |

Results for BoundingBox / BoundingBoxPacked:

|             Method |      Mean |     Error |    StdDev |
|------------------- |----------:|----------:|----------:|
|       Intersection | 24.269 ms | 0.2663 ms | 0.2491 ms |
| IntersectionPacked |  1.152 ms | 0.0229 ms | 0.0264 ms |x

Due to AVX, a speedup of roughly 6x-8x is expected. The speedup of the boundingbox is significant, whereas the speedup of the circle is rather low.

Revisiting the question at the top: Why is one speedup over-the-roof and the other rather low? And how can the lower of the two (CirclePack) become faster?

Edit(s) with regard to Peter Cordes (comments)

  • Made the benchmark more fair: the single ray version does not early-branch-out as soon as the ray can no longer collide. Now the speedup is roughly 2.5x.
  • Added the assembly code as a separate header.
  • With regard to the square root: This does have impact, but not as much as it seems. Removing the vector square root reduces the total time with about 0.3ms. The single ray code now always performs the square root too.
  • Question about FMA (Fused Multiply Addition) in C#. I think it does for scalars (see Can C# make use of fused multiply-add?), but I haven't found a similar operation within the System.Numerics.Vector struct.
  • About a C# instruction for packed min: Yes it does. Silly me. I even used it already.
  • 1
    What hardware? (CPU microarchitecture = Haswell? Skylake? Ryzen? And don't just say "i7" or something, that would be useless.) What compiler+runtime / version / options? Presumably you're using Release mode if you were able to get good speedups with some code. Jul 9, 2019 at 11:56
  • My apologies! I've been over this for a few days now, the question appeared narrow to me but I think a form of tunnel vision got to me. It is indeed rather broad. With regard to the major rules of SO, is this the topic I should be referring to: stackoverflow.com/help/how-to-ask for in the future?
    – Willem124
    Jul 9, 2019 at 11:58
  • @Hille: looks pretty much fine to me (the problem is missing key details, not generally too broad); it's a specific performance question about exactly why one thing only gets a small speedup over scalar, and how to optimize Circle intersection better with SIMD. That's definitely answerable if we have the JIT-compiled asm and the details of what CPU it runs on. (This is a fairly good performance question: including a scalar baseline for each is very good, and using a good microbenchmark framework) Jul 9, 2019 at 12:04
  • 1
    Your speedup of > 8x for the BoundingBox function indicates that the scalar version wasn't doing well. Probably you were getting branch mispredicts, so branchless was a huge win there. Writing the scalar source differently (maybe with ternary or boolean operations on compare results) instead of if() statements might avoid that, depending on how well the JIT does at using instructions like cmpss instead of FP compare into integer FLAGS with ucomiss Jul 9, 2019 at 12:31
  • 1
    @PeterCordes: Thank you for your elaborate response! I've read the stackoverflow article that you linked too. Quite informative! The speed up is still not x8, but with the more realistic benchmark (no early-branch-out) it is already quite closer. Also learning about the assembly code was quite informative too. Do you have any further suggestions?
    – Willem124
    Jul 9, 2019 at 16:44


Your Answer

By clicking “Post Your Answer”, you agree to our terms of service, privacy policy and cookie policy

Browse other questions tagged or ask your own question.