Questions tagged [fma]

Fused Multiply Add or Multiply-Accumulate

47 votes
2 answers
36k views

How to use Fused Multiply-Add (FMA) instructions with SSE/AVX

I have learned that some Intel/AMD CPUs can do simultaneous multiply and add with SSE/AVX: FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2. I'd like to know how to do this best in code and I ...
47 votes
1 answer
8k views

Obtaining peak bandwidth on Haswell in the L1 cache: only getting 62%

I'm attempting to obtain full bandwidth in the L1 cache for the following function on Intel processors float triad(float *x, float *y, float *z, const int n) { float k = 3.14159f; for(int i=0;...
36 votes
2 answers
3k views

Significant FMA performance anomaly experienced in the Intel Broadwell processor

Code1: vzeroall mov rcx, 1000000 startLabel1: vfmadd231ps ymm0, ymm0, ymm0 vfmadd231ps ymm1, ymm1, ymm1 vfmadd231ps ymm2, ymm2, ymm2 vfmadd231ps ymm3, ymm3, ymm3 ...
24 votes
2 answers
16k views

FMA3 in GCC: how to enable

I have an i5-4250U which has AVX2 and FMA3. I am testing some dense matrix multiplication code in GCC 4.8.1 on Linux which I wrote. Below is a list of three different ways I compile. SSE2: gcc ...
18 votes
1 answer
5k views

AVX2: Computing dot product of 512 float arrays

I will preface this by saying that I am a complete beginner at SIMD intrinsics. Essentially, I have a CPU which supports the AVX2 intrinsics (Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz). I would like ...
16 votes
6 answers
4k views

Which algorithms benefit most from fused multiply add?

fma(a,b,c) is equivalent to a*b+c except that it doesn't round the intermediate result. Could you give me some examples of algorithms that non-trivially benefit from avoiding this rounding? It's not obvious, ...
15 votes
4 answers
10k views

How to get data out of AVX registers?

Using MSVC 2013 and AVX 1, I've got 8 floats in a register: __m256 foo = _mm256_fmadd_ps(a,b,c); Now I want to call inline void print(float) {...} for all 8 floats. It looks like the Intel AVX ...
15 votes
2 answers
2k views

Fused multiply add and default rounding modes

With GCC 5.3 the following code compiled with -O3 -mfma float mul_add(float a, float b, float c) { return a*b + c; } produces the following assembly vfmadd132ss %xmm1, %xmm2, %xmm0 ret I ...
14 votes
3 answers
2k views

Can I use the AVX FMA units to do bit-exact 52 bit integer multiplications?

AVX2 doesn't have any integer multiplications with sources larger than 32-bit. It does offer 32 x 32 -> 32 multiplies, as well as 32 x 32 -> 64 multiplies, but nothing with 64-bit sources. Let's say ...
14 votes
2 answers
6k views

Why does the FMA _mm256_fmadd_pd() intrinsic have 3 asm mnemonics, "vfmadd132pd", "231" and "213"?

Could someone explain to me why there are 3 variants of the fused multiply-accumulate instruction: vfmadd132pd, vfmadd231pd and vfmadd213pd, while there is only one C intrinsic, _mm256_fmadd_pd? To ...
10 votes
1 answer
796 views

Do FMA (fused multiply-add) instructions always produce the same result as a mul then add instruction?

I have this assembly (AT&T syntax): mulsd %xmm0, %xmm1 addsd %xmm1, %xmm2 I want to replace it with: vfmadd231sd %xmm0, %xmm1, %xmm2 Will this transformation always leave equivalent state ...
10 votes
3 answers
1k views

Optimize for fast multiplication but slow addition: FMA and doubledouble

When I first got a Haswell processor I tried implementing FMA to determine the Mandelbrot set. The main algorithm is this: intn = 0; for(int32_t i=0; i<maxiter; i++) { floatn x2 = square(x), ...
9 votes
2 answers
10k views

Preventing GCC from automatically using AVX and FMA instructions when compiled with -mavx and -mfma

How can I disable auto-vectorization with AVX and FMA instructions? I would still prefer the compiler to employ SSE and SSE2 automatically, but not FMA and AVX. My code that uses AVX checks for its ...
9 votes
2 answers
3k views

Automatically generate FMA instructions in MSVC

MSVC has supported AVX/AVX2 instructions for years now and, according to this msdn blog post, it can automatically generate fused-multiply-add (FMA) instructions. Yet neither of the following functions ...
8 votes
2 answers
11k views

How do I know if I can compile with FMA instruction sets?

I have seen questions about how to use the FMA instruction set, but before I start using it, I'd first like to know if I can (does my processor support it). I found a post saying that I needed ...
