Questions tagged [avx2]

AVX2 (Advanced Vector Extensions 2) is an instruction set extension for x86. It adds 256bit versions of integer instructions (where AVX only provided 256b floating point).

Filter by
Sorted by
Tagged with
42votes
6answers
18kviews

AVX2 what is the most efficient way to pack left based on a mask?

If you have an input array, and an output array, but you only want to write those elements which pass a certain condition, what would be the most efficient way to do this in AVX2? I've seen in SSE ...
user avatar
  • 985
38votes
4answers
63kviews

How to tell if a Linux machine supports AVX/AVX2 instructions?

I'm on SUSE Linux Enterprise 10/11 machines. I launch my regressions to a farm of machines running Intel processors. Some of my tests fail because my tools are built using a library which requires AVX/...
user avatar
  • 2,737
37votes
2answers
2kviews

Why is Intel Haswell XEON CPU sporadically miscomputing FFTs and ART?

During the last days I observed a behaviour of my new workstation I couldn't explain. Doing some research on this problem, there might be a possible bug in the INTEL Haswell architecture as well as in ...
user avatar
  • 937
33votes
2answers
7kviews

In what situation would the AVX2 gather instructions be faster than individually loading the data?

I have been investigating the use of the new gather instructions of the AVX2 instruction set. Specifically, I decided to benchmark a simple problem, where one floating point array is permuted and ...
user avatar
26votes
2answers
7kviews

How are the gather instructions in AVX2 implemented?

Suppose I'm using AVX2's VGATHERDPS - this should load 8 single-precision floats using 8 DWORD indices. What happens when the data to be loaded exists in different cache-lines? Is the instruction ...
user avatar
24votes
5answers
7kviews

How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?

The intrinsic: int mask = _mm256_movemask_epi8(__m256i s1) creates a mask, with its 32 bits corresponding to the most significant bit of each byte of s1. After manipulating the mask using bit ...
user avatar
22votes
2answers
4kviews

AVX 256-bit code performing slightly worse than equivalent 128-bit SSSE3 code

I am trying to write very efficient Hamming-distance code. Inspired by Wojciech Muła's extremely clever SSE3 popcount implementation, I coded an AVX2 equivalent solution, this time using 256 bit ...
user avatar
  • 1,902
20votes
2answers
3kviews

Fastest way to multiply an array of int64_t?

I want to vectorize the multiplication of two memory aligned arrays. I didn't find any way to multiply 64*64 bit in AVX/AVX2, so I just did loop-unroll and AVX2 loads/stores. Is there a faster way to ...
user avatar
19votes
2answers
2kviews

Haswell memory access

I was experimenting with AVX -AVX2 instruction sets to see the performance of streaming on consecutive arrays. So I have below example, where I do basic memory read and store. #include <iostream&...
user avatar
  • 275
18votes
3answers
9kviews

How to find the horizontal maximum in a 256-bit AVX vector

I have a __m256d vector packed with four 64-bit floating-point values. I need to find the horizontal maximum of the vector's elements and store the result in a double-precision scalar value; My ...
user avatar
18votes
1answer
5kviews

AVX2: Computing dot product of 512 float arrays

I will preface this by saying that I am a complete beginner at SIMD intrinsics. Essentially, I have a CPU which supports the AVX2 instrinsic (Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz). I would like ...
user avatar
  • 919
16votes
5answers
10kviews

Transpose an 8x8 float using AVX/AVX2

Transposing a 8x8 matrix can be achieved by making four 4x4 matrices, and transposing each of them. This is not want I'm going for. In another question, one answer gave a solution that would only ...
user avatar
  • 1,560
14votes
2answers
6kviews

Scatter intrinsics in AVX

I can't find them in the Intel Intrinsic Guide v2.7. Do you know if AVX or AVX2 instruction sets support them?
user avatar
  • 11.8k
14votes
3answers
2kviews

Can I use the AVX FMA units to do bit-exact 52 bit integer multiplications?

AXV2 doesn't have any integer multiplications with sources larger than 32-bit. It does offer 32 x 32 -> 32 multiplies, as well as 32 x 32 -> 64 multiplies1, but nothing with 64-bit sources. Let's say ...
user avatar
  • 55.9k
14votes
3answers
5kviews

Load address calculation when using AVX2 gather instructions

Looking at the AVX2 intrinsics documentation there are gathered load instructions such as VPGATHERDD: __m128i _mm_i32gather_epi32 (int const * base, __m128i index, const int scale); What isn't clear ...
user avatar
  • 201k

15 30 50 per page
1
2 3 4 5
38