# Questions tagged [floating-point]

Floating point numbers are approximations of real numbers that can represent larger ranges than integers but use the same amount of memory, at the cost of lower precision. If your question is about small arithmetic errors (e.g. why does 0.2 + 0.1 equal 0.300000001?) or decimal conversion errors, please read the "info" page linked below before posting.
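
As a quick illustration of the small arithmetic errors mentioned above, here is a minimal Python sketch (the choice of Python and the variable names are ours, not part of the tag description). It assumes IEEE-754 binary64 floats, which is what CPython uses on common platforms, and shows that the sum of 0.1 and 0.2 is not exactly the float nearest to 0.3, while a tolerance-based comparison still succeeds.

```python
import math
from fractions import Fraction

total = 0.1 + 0.2

# Exact comparison fails because neither operand is exactly representable in binary.
exactly_equal = (total == 0.3)

# A tolerance-based comparison is the usual workaround (default rel_tol is 1e-09).
close_enough = math.isclose(total, 0.3)

# Fraction(float) converts without rounding, so this is the exact gap between the two floats.
gap = Fraction(total) - Fraction(0.3)

print(total, exactly_equal, close_enough, gap)
```

The gap here is one unit in the last place for values in [0.25, 0.5), which is why the two results print differently even though they are adjacent floats.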

1,636 questions with no upvoted or accepted answers
610 views

### Problem with Vulkan floating-point behavior

I am trying to implement the paper "Extended-Precision Floating-Point Numbers for GPU Computation" by Andrew Thall (Alma College) in a GLSL Vulkan compute shader. I need this because some of my devices don't ...
• 111
596 views

### Are .NET Decimal type computations deterministic?

I have two questions regarding .NET's decimal data type determinism: Are decimal type computations cross-platform deterministic? Or in other words, will math operations on decimal type produce ...
• 460
774 views

### How to avoid NumPy type conversions?

Is it possible to avoid, or emit warnings for, automatic NumPy type conversions from integer and 32-bit float arrays to 64-bit float arrays? My use case for this is that I'm developing a large analysis ...
• 2,420
1k views

### Representing a float or a binary as a 32-bit signed integer in R

I've been given a task to write an API for the AR.Drone 2.0 in R. I know it's probably not the wisest choice of language as there are good validated APIs written in Python and JS, but I took the ...
• 355
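
The title above describes reinterpreting the bits of a single-precision float as a 32-bit signed integer. The question asks about R; the sketch below uses Python's standard `struct` module only to illustrate the bit-level idea, and the helper names `float_to_int32` and `int32_to_float` are hypothetical.

```python
import struct

def float_to_int32(x):
    """Reinterpret the IEEE-754 binary32 bits of x as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<f", x))[0]

def int32_to_float(i):
    """Inverse operation: reinterpret a signed 32-bit integer as a binary32 float."""
    return struct.unpack("<f", struct.pack("<i", i))[0]

print(float_to_int32(-0.8))                   # bits 0xBF4CCCCD read as a signed integer
print(int32_to_float(float_to_int32(0.5)))    # 0.5 is exactly representable, so it round-trips
```

Note that the value is first rounded to binary32, so a Python float that is not exactly representable in single precision does not round-trip to the same double.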
37 views

### How to parse floating point infinity from std::istream

I have written a superdumb serialization library for a project that I am working on. I just got bitten by floating point infinity, which I illustrate with the sample program below. I expect the ...
• 345
123 views

### C (MIPS) - How to tell the compiler to load single-precision float immediates with GPRs?

Recently I have been trying to write some utilities for the N64 with GCC and have run into some problems with its optimization strategy. Please consider the following example: // cctest.c extern struct { float x; ...
• 81
116 views

### Probable bug in MSVC with compile-time NaN comparison

My colleague was doing some basic experiments with NaN and was puzzled by the behavior on Visual Studio that did not match his expectations. After discussion, it seems that he uncovered a probable ...
• 5,477
146 views

### How to catch floating point errors early (right where they occur)?

When developing floating-point heavy code, it is very useful to enable FPU exceptions. When an operation results in a NaN/inf, we could catch it immediately. For example, on Linux, I can enable this ...
• 27.4k
365 views

### Metal SIMD Min and Max operations fail for floats

Question in short: Why am I getting undefined behavior from simd_min and simd_max functions in Metal 2.1 with floats? Update: Seems this only occurs on the Radeon Pro 560X GPU, but not on the Intel ...
• 133
104 views

### Rationale for range restriction of IEEE-754 compound function

The IEEE Std 754-2008 lists in Table 9.1 the recommended function compound(x,n) = (1+x)^n, with real x, integer n (where ^ is the power operator). The domain is specified as x in [-1, +infinity] and ...
• 1,111
112 views

### Is there a way to force numpy.set_printoptions to show the exact float value?

Following question 59674518, is there a way for numpy.set_printoptions to ensure the EXACT float value is displayed, without displaying trailing zeros, and without knowing the value a priori? I have ...
• 571
34 views

### fpclassify(): what are examples of other implementation-defined categories?

N2479 C17..C2x working draft — February 5, 2020 ISO/IEC 9899:202x (E) (emphasis added): The fpclassify macro classifies its argument value as NaN, infinite, normal, subnormal, zero, or into another ...
• 4,132
115 views

### Is there a bug in controlled rounding using `exp`?

I'm observing incorrect (IMO) rounding behaviour on some platforms as follows: Calculate the value of log(2) under the rounding modes FE_DOWNWARD and FE_UPWARD (see <fenv.h>). In all cases I've ...
384 views

### How to preserve raster dataType in raster processing?

When doing raster math, for example raster1-raster2, the datatype of the output raster is 'FLT4S', even if the datatype of both raster1 and raster2 is 'INT2S'. How can I force the output to be 'INT2S'...
• 41
357 views

### Convert list of floating-point numbers to bytearray and back in Python

I am trying to convert a list of floating-point numbers to a bytearray and convert it back to the original list. My list looks like this: [-0.055999, -0.054000, -0.049, -0.040999, -0.037000] I am trying to ...
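
One possible approach to the round trip described above, sketched with Python's standard `struct` module. This is an illustration, not the asker's code: the variable names and the choice of little-endian layout are assumptions. Packing as 8-byte doubles (`d`) preserves every value exactly, whereas 4-byte floats (`f`) round each value to single precision.

```python
import struct

values = [-0.055999, -0.054000, -0.049, -0.040999, -0.037000]

# Pack as little-endian 8-byte doubles; Python floats are doubles, so this is lossless.
fmt64 = "<%dd" % len(values)
buf64 = bytearray(struct.pack(fmt64, *values))
restored64 = list(struct.unpack(fmt64, buf64))

# Packing as 4-byte floats halves the size but rounds each value to binary32.
fmt32 = "<%df" % len(values)
buf32 = bytearray(struct.pack(fmt32, *values))
restored32 = list(struct.unpack(fmt32, buf32))

print(restored64 == values)   # doubles round-trip exactly
print(restored32 == values)   # single precision does not
```

If the values only differ after the round trip in the last few digits, a single-precision format code is a likely cause.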
110 views

### add3 instruction for a+b+c with one single rounding

Background: It is well known that the exact product of two floating point numbers is not always a floating point number, but the error exact(a*b) - float(a*b) is. Some codes for exact multiplication ...
• 44.8k
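
The claim in the background above can be checked for a specific pair of operands using exact rational arithmetic. The sketch below is only a spot check under stated assumptions: it uses Python's `fractions` module, binary64 floats, round-to-nearest, and operands far from underflow or overflow. The helper name `product_error` is hypothetical.

```python
from fractions import Fraction

def product_error(a, b):
    """Return exact(a*b) - float(a*b) as an exact rational number."""
    return Fraction(a) * Fraction(b) - Fraction(a * b)

a, b = 0.1, 0.3
err = product_error(a, b)

# float(Fraction) is correctly rounded, so this holds only if err is itself a float.
err_is_float = Fraction(float(err)) == err

print(err != 0, err_is_float)
```

For these operands the rounded product is inexact, yet the rounding error is exactly representable, which is the property that error-free product transformations rely on.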
502 views

### Why does complex floating-point division underflow weirdly with NumPy?

Consider this code: import numpy numpy.seterr(under='warn') x1 = 1 + 1j / (1 << 533) x2 = 1 - 1j / (1 << 533) y1 = x1 * 1.1 y2 = x2 * 1.1 z1 = x1 / 1.1 z2 = x2 / 1.1 print(numpy.divide(1, ...
• 196k
149 views

### Two different kinds of floating-point overflow in Python

I am testing the calculation of (1e308)**2 and (1e308)*2 in Python. I expect that either both yield overflow, or both yield inf. However, (1e308)**2 raises an overflow exception while (1e308)*...
• 8,673
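
The difference described above can be reproduced with a few lines. This is a minimal sketch assuming CPython with binary64 floats; the variable names are ours.

```python
x = 1e308

# Multiplication follows IEEE-754 semantics and silently returns infinity.
doubled = x * 2

# Exponentiation on floats reports the out-of-range result as an exception instead.
try:
    squared = x ** 2
except OverflowError as exc:
    squared = None
    message = str(exc)

print(doubled, squared)
```

Code that needs uniform behaviour can catch `OverflowError` and substitute `float("inf")`, or check operands with `math.isinf` after multiplication.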
211 views

### Any insights on this Microsoft C 5.1 floating point and DOSBox weirdness?

This is a fantastically strange bug that has been tweaking my noodle for the better part of a day; it took me some time to boil it down to this. The setup: Microsoft C 5.10 (~1988) DOSBox 0.74 ...
• 16.9k
384 views

### C# Change FPU rounding mode

I'm attempting to write an interval arithmetic library in C# .NET, but in order to do this accurately I need to be able to control the rounding mode of floating point operations. After a bit of ...
• 639
2k views

### pragma STDC FENV_ACCESS ON is not supported

I tried to slightly modify the example from the article: #include <iostream> #include <cfenv> #pragma STDC FENV_ACCESS ON int main() { std::feclearexcept(FE_ALL_EXCEPT); //int r ...
• 14.6k
1k views

### How to correctly pass a float from C# to C++ (dll)

I'm getting huge differences when I pass a float from C# to C++. I'm passing a dynamic float which changes over time. With a debugger I get this: c++ lonVel -0.036019072 float c# lonVel -0....
187 views

### Create a program that returns the smallest cube which exceeds a non-negative integer n

So I'm trying to create a program which generates the smallest cube greater than an integer n. def first_cube_above(n): #Return the smallest cube which exceeds the non-negative integer n. ...
• 41
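
The excerpt above is truncated, so the asker's actual implementation is unknown. One way to avoid floating-point cube roots altogether is to search over integers, as in this sketch; only the function name `first_cube_above` comes from the excerpt, the body is our assumption.

```python
def first_cube_above(n):
    """Return the smallest cube which exceeds the non-negative integer n."""
    # Grow an upper bound until its cube exceeds n.
    lo, hi = 0, 1
    while hi ** 3 <= n:
        hi *= 2
    # Invariant: lo**3 <= n < hi**3. Binary search narrows the gap to 1.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid
    return hi ** 3

print(first_cube_above(0), first_cube_above(7), first_cube_above(8))
```

Because it uses only integer arithmetic, this stays exact for arbitrarily large `n`, where an expression such as `n ** (1/3)` would round or overflow.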
81 views

### Wrong result on addition of numbers larger than epsilon using numpy.float128

Given that epsilon is the smallest number that you can add to one, I'm getting 1 instead of 1+epsilon when I perform the addition and print the result. I've implemented a getEpsilon function. I ...
• 441
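
For reference, the behaviour of machine epsilon can be checked directly. The sketch below uses ordinary Python floats (binary64) rather than `numpy.float128`, so it illustrates the concept only and does not reproduce the asker's setup; the variable names are ours. It also shows that epsilon is the gap between 1 and the next float, which is not quite the smallest number that changes 1 when added.

```python
import sys

eps = sys.float_info.epsilon          # gap between 1.0 and the next float, 2**-52

one_plus_eps = 1.0 + eps              # exactly representable, differs from 1.0
one_plus_half = 1.0 + eps / 2         # a tie, rounded to even, which is 1.0
one_plus_three_quarters = 1.0 + 0.75 * eps   # nearer to 1.0 + eps, so it rounds up

print(one_plus_eps > 1.0, one_plus_half == 1.0, one_plus_three_quarters == one_plus_eps)
```

When a sum such as 1 + epsilon appears to print as 1, it is worth checking whether the printing routine is rounding the output rather than the addition losing the term.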
111 views

### Arithmetic operations on floating point numbers giving unexpected results

I know that with binary representation it is not possible to exactly represent a floating-point number (and I also understand why 0.1 + 0.2 == 0.3 is false). Now here is where I got stuck while I ...
• 311
105 views

### FLT_HAS_SUBNORM is 0: does execution of fpclassify() with manually constructed subnormal lead to UB or lead to WDB returning FP_SUBNORMAL?

In case of FLT_HAS_SUBNORM == 0 (or any XXX_HAS_SUBNORM == 0 in general) does execution of fpclassify macro with manually constructed subnormal (constructed using type punning via union, using memcpy,...
• 4,132
48 views

### In Python, are there hidden rules that control the displayed precision of decimal numbers?

For Python, do read this link: https://docs.python.org/3/tutorial/floatingpoint.html, "Floating Point Arithmetic: Issues and Limitations" I do understand that there is a mismatch (tiny ...
• 41
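
The display rule in question can be observed directly. In Python 3, `repr` of a float produces the shortest decimal string that converts back to the same float, while format specifiers and `decimal.Decimal` can reveal more of the stored value. A minimal sketch, with variable names of our choosing:

```python
from decimal import Decimal

x = 0.1

shortest = repr(x)                 # shortest string that round-trips to x
seventeen = format(x, ".17g")      # 17 significant digits always identify a binary64 value
exact = str(Decimal(x))            # Decimal(float) converts without rounding

round_trips = float(shortest) == x

print(shortest, seventeen, exact, round_trips)
```

So the short output is a display choice: the stored binary value is unchanged, and only the number of digits needed to identify it uniquely is printed.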
120 views

### Denormalized floating point numbers: which operations trigger expensive special cases?

Denormalized floating point numbers require expensive special handling in some operations (additions, multiplications). While this is well-known, it seems to me that there are also many comparably ...
• 457
544 views

### Find smallest integer that satisfies floating point inequality equation

I am looking for a fast algorithm that finds the smallest integer N that will satisfy the following inequality where s, q, u, and p are float numbers (using the IEEE-754 binary32 format): s > q + ...
400 views

### Efficiently represent 16777217 as a float

Browsing job advertisements, I saw the following question: Do you understand what it takes to efficiently represent 16,777,217 as a float? [Siemens] I don't understand the question. I know that ...
• 22.9k
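
What makes 16,777,217 notable is that it equals 2^24 + 1, the first positive integer that a 32-bit float cannot hold exactly. The sketch below demonstrates this by rounding through binary32 with Python's `struct` module; the helper name `to_binary32` is hypothetical, and the sketch does not attempt to answer what the job advertisement meant by "efficiently".

```python
import struct

def to_binary32(x):
    """Round a Python float to the nearest IEEE-754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

n = 16777217                          # 2**24 + 1

as_float64 = float(n)                 # binary64 has 53 significand bits, so this is exact
as_float32 = to_binary32(float(n))    # binary32 has 24, so the value lands on a neighbour

print(as_float64 == n, as_float32)
```

The value sits exactly halfway between 16777216 and 16777218, and round-to-nearest-even selects 16777216.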
