Questions tagged [floating-point]

Floating-point numbers are approximations of real numbers that can represent larger ranges than integers in the same amount of memory, at the cost of lower precision. If your question is about small arithmetic errors (e.g. why does 0.2 + 0.1 equal 0.30000000000000004?) or decimal conversion errors, please read the "info" page linked below before posting.
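A minimal sketch of the kind of small arithmetic error the tag description mentions, assuming CPython with IEEE-754 binary64 floats. The variable name `total` and the tolerance are illustrative choices, not part of the tag wiki.

```python
import math

total = 0.2 + 0.1

# Neither 0.1 nor 0.2 is exactly representable in binary,
# so the rounded sum differs slightly from the literal 0.3.
print(total)         # 0.30000000000000004
print(total == 0.3)  # False

# Comparing with a tolerance is usually more robust than exact equality.
print(math.isclose(total, 0.3, rel_tol=1e-9))  # True
```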

1,636 questions with no upvoted or accepted answers
11 votes
0 answers
610 views

Problem with Vulkan floating-point behavior

I am trying to implement the paper "Extended-Precision Floating-Point Numbers for GPU Computation" by Andrew Thall (Alma College) in a GLSL Vulkan compute shader. I need this because some of my devices don't ...
9 votes
1 answer
596 views

Are .NET Decimal type computations deterministic?

I have two questions regarding .NET's decimal data type determinism: Are decimal type computations cross-platform deterministic? Or in other words, will math operations on decimal type produce ...
7 votes
1 answer
774 views

How to avoid Numpy type conversions?

Is it possible to avoid or emit warnings for automatic Numpy type conversions from integer and 32 bit float arrays to 64 bit float arrays? My use case for this is that I'm developing a large analysis ...
6 votes
0 answers
1k views

Representing a float or a binary as a 32 bit signed integer in R

I've been given a task to write an API for the AR.Drone 2.0 in R. I know it's probably not the wisest choice of language as there are good validated APIs written in Python and JS, but I took the ...
5 votes
0 answers
37 views

How to parse floating point infinity from std::istream

I have written a superdumb serialization library for a project that I am working on. I just got bitten by floating point infinity, which I illustrate with the sample program below. I expect the ...
5 votes
0 answers
123 views

C (MIPS) - How to tell the compiler to load single-precision float immediates with GPRs?

Recently, I have been trying to write some utilities for the N64 with gcc and have some problems with its optimization strategy. Please consider the following example: // cctest.c extern struct { float x; ...
5 votes
0 answers
116 views

Probable bug in MSVC with compile-time NaN comparison

My colleague was doing some basic experiments with NaN and was puzzled by the behavior on Visual Studio that did not match his expectations. After discussion, it seems that he uncovered a probable ...
5 votes
0 answers
146 views

How to catch floating point errors early (right at where they occur)?

When developing floating-point heavy code, it is very useful to enable FPU exceptions. When an operation results in a NaN/inf, we could catch it immediately. For example, on Linux, I can enable this ...
5 votes
0 answers
365 views

Metal SIMD Min and Max operations fail for floats

Question in short: Why am I getting undefined behavior from simd_min and simd_max functions in Metal 2.1 with floats? Update: Seems this only occurs on the Radeon Pro 560X GPU, but not on the Intel ...
5 votes
0 answers
104 views

Rationale for range restriction of IEEE-754 compound function

The IEEE Std 754-2008 lists in Table 9.1 the recommended function compound(x,n) = (1+x)^n, with real x, integer n (where ^ is the power operator). The domain is specified as x in [-1, +infinity] and ...
4 votes
0 answers
112 views

Is there a way to force numpy.set_printoptions to show the exact float value?

Following question 59674518, is there a way for numpy.set_printoptions to ensure the EXACT float value is displayed, without displaying trailing zeros, and without knowing the value a priori? I have ...
4 votes
0 answers
34 views

fpclassify(): what are examples of other implementation-defined categories?

N2479 C17..C2x working draft — February 5, 2020 ISO/IEC 9899:202x (E) (emphasis added): The fpclassify macro classifies its argument value as NaN, infinite, normal, subnormal, zero, or into another ...
4 votes
3 answers
115 views

Is there a bug in controlled rounding using `exp`?

I'm observing incorrect (IMO) rounding behaviour on some platforms as follows: Calculate the value of log(2) under rounding modes to FE_DOWNWARD and FE_UPWARD (see <fenv.h>). In all cases I've ...
4 votes
3 answers
384 views

How to preserve raster dataType in raster processing?

When doing raster math, for example raster1-raster2, the datatype of the output raster is 'FLT4S', even if the datatype of both raster1 and raster2 is 'INT2S'. How can I force the output to be 'INT2S'...
4 votes
0 answers
357 views

Convert List of Floating point to bytearray and back in Python

I am trying to convert a list of floating-point numbers to a bytearray and convert it back to the original list. My list looks like this: [-0.055999, -0.054000, -0.049, -0.040999, -0.037000] I am trying to ...
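One possible way to do the round trip described in this excerpt is the standard `struct` module. This is only a sketch: the choice of little-endian 64-bit doubles and the names `fmt`, `buf`, and `restored` are assumptions, since the asker's own attempt is cut off.

```python
import struct

values = [-0.055999, -0.054000, -0.049, -0.040999, -0.037000]

# "<" selects little-endian, "d" is a 64-bit double; the count repeats the code.
fmt = f"<{len(values)}d"
buf = bytearray(struct.pack(fmt, *values))

# Unpacking with the same format string recovers the original numbers.
restored = list(struct.unpack(fmt, buf))

print(len(buf))            # 40 (5 values * 8 bytes each)
print(restored == values)  # True
```

Packing as 32-bit floats (`"f"`) would halve the size but would not, in general, return the original values exactly.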
4 votes
0 answers
110 views

add3 instruction for a+b+c with one single rounding

Background: It is well known that the exact product of two floating point numbers is not always a floating point number, but the error exact(a*b) - float(a*b) is. Some codes for exact multiplication ...
4 votes
1 answer
501 views

Why does complex floating-point division underflow weirdly with NumPy?

Consider this code: import numpy numpy.seterr(under='warn') x1 = 1 + 1j / (1 << 533) x2 = 1 - 1j / (1 << 533) y1 = x1 * 1.1 y2 = x2 * 1.1 z1 = x1 / 1.1 z2 = x2 / 1.1 print(numpy.divide(1, ...
4 votes
0 answers
149 views

Two different kinds of floating-point overflow in Python

I am testing with calculating (1e308)**2 and (1e308)*2 in python. I expect that either both yield overflow, or both yield inf. However, (1e308)**2 manifests an overflow exception while (1e308)*...
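The difference described in this excerpt can be reproduced with a short script, assuming CPython on a platform with IEEE-754 doubles. The names `big`, `product`, and `raised` are illustrative.

```python
big = 1e308

# Float multiplication follows IEEE-754 and quietly produces infinity.
product = big * 2
print(product)  # inf

# Float exponentiation checks the result and raises on overflow instead.
try:
    big ** 2
    raised = False
except OverflowError:
    raised = True
print(raised)  # True
```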
4 votes
0 answers
211 views

Any insights on this Microsoft C 5.1 floating point and DOSBox weirdness?

This is a fantastically strange bug that has been tweaking my noodle for the better part of a day; it took me some time to boil it down to this. The setup: Microsoft C 5.10 (~1988) DOSBox 0.74 ...
4 votes
0 answers
384 views

C# Change FPU rounding mode

I'm attempting to write an interval arithmetic library in C# .NET, but in order to do this accurately I need to be able to control the rounding mode of floating point operations. After a bit of ...
4 votes
0 answers
2k views

pragma STDC FENV_ACCESS ON is not supported

I tried to slightly modify the example from the article: #include <iostream> #include <cfenv> #pragma STDC FENV_ACCESS ON int main() { std::feclearexcept(FE_ALL_EXCEPT); //int r ...
4 votes
2 answers
1k views

How to correctly pass a float from C# to C++ (dll)

I'm getting huge differences when I pass a float from C# to C++. I'm passing a dynamic float which changes over time. With a debugger I get this: c++ lonVel -0.036019072 float c# lonVel -0....
4 votes
2 answers
187 views

Create a program that returns the smallest cube which exceeds a non-negative integer n

So I'm trying to create a program which generates the smallest cube greater than an integer n. def first_cube_above(n): #Return the smallest cube which exceeds the non-negative integer n. ...
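The asker's function body is truncated, so the following is not their code. It is a sketch of one way to avoid floating-point error in this task: take a floating-point cube-root estimate, then correct it with exact integer arithmetic. It assumes `n` is small enough to convert to a float.

```python
def first_cube_above(n: int) -> int:
    """Return the smallest perfect cube strictly greater than non-negative n."""
    # The float estimate of the cube root may be off by one in either direction.
    k = round(n ** (1 / 3))
    # Correct the estimate with exact integer comparisons.
    while k ** 3 > n:
        k -= 1
    while (k + 1) ** 3 <= n:
        k += 1
    # Here k**3 <= n < (k + 1)**3, so the next cube is the answer.
    return (k + 1) ** 3


print(first_cube_above(63))  # 64
print(first_cube_above(64))  # 125
```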
3 votes
0 answers
81 views

Wrong result on addition of numbers larger than epsilon using numpy.float128

Considering that epsilon is the smallest number that you can add to one. I'm getting 1 instead of 1+epsilon when I perform the addition and print the result. I've implemented a getEpsilon function. I ...
3 votes
1 answer
111 views

Arithmetic operations on floating point numbers giving unexpected results

I know that with binary representation it is not possible to exactly represent a floating-point number (and I also understand why 0.1 + 0.2 == 0.3 is false). Now here is where I got stuck while I ...
3 votes
0 answers
105 views

FLT_HAS_SUBNORM is 0: does execution of fpclassify() with manually constructed subnormal lead to UB or lead to WDB returning FP_SUBNORMAL?

In case of FLT_HAS_SUBNORM == 0 (or any XXX_HAS_SUBNORM == 0 in general) does execution of fpclassify macro with manually constructed subnormal (constructed using type punning via union, using memcpy,...
3 votes
1 answer
48 views

In Python, are there hidden rules that control the displayed precision of decimal numbers?

For Python, I did read this link: https://docs.python.org/3/tutorial/floatingpoint.html, "Floating Point Arithmetic: Issues and Limitations" I do understand that there is mismatch(tiny ...
3 votes
0 answers
120 views

Denormalized floating point numbers: which operations trigger expensive special cases?

Denormalized floating point numbers require expensive special handling in some operations (additions, multiplications). While this is well-known, it seems to me that there are also many comparably ...
3 votes
2 answers
544 views

Find smallest integer that satisfies floating point inequality equation

I am looking for a fast algorithm that finds the smallest integer N that will satisfy the following inequality where s, q, u, and p are float numbers (using the IEEE-754 binary32 format): s > q + ...
3 votes
0 answers
400 views

Efficiently represent 16777217 as a float

Browsing job advertisements, I saw the following question: Do you understand what it takes to efficiently represent 16,777,217 as a float? [Siemens] I don't understand the question. I know that ...
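As background to this question: 16,777,217 is 2**24 + 1, which needs 25 significand bits, one more than IEEE-754 binary32 provides. A small sketch, using a hypothetical helper `to_float32` built on the standard `struct` module to round a Python float to single precision:

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


print(to_float32(16777216.0))  # 16777216.0 (exactly 2**24)
print(to_float32(16777217.0))  # 16777216.0 (2**24 + 1 is not representable)
print(to_float32(16777218.0))  # 16777218.0 (representable again)
```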
3 votes
1 answer
146 views

What guarantees does System.Numerics.Vectors provide about size and bit order?

I have implemented a vector-based C# approximation of Log. It includes unsafe code. It's been working fine in a number of environments, but on a recent deployment has fallen over. The implementation ...
3 votes
0 answers
151 views

Correctly rounding a trigonometric function for single-precision

I want a correctly rounded (round to nearest ties to even) single-precision trigonometric function (0.5 ulp error). I can use either the CORDIC algorithm or one of the polynomial approximation ...
3 votes
0 answers
107 views

Is there a way to disable denormals in numpy? (Enabling ftz and daz flags)

I'm trying to perform a few calculations on floating point numbers that are close to the float32 min. I want the numbers to be flushed to zero when they drop below the float32 minimum instead of ...
3 votes
0 answers
155 views

Floating point [in]accuracy of C program, when running on the same machine, changed over last two weeks

The following C code was compiled today on two systems with Microsoft's compiler (installed with Visual Studio 2017 Community), both of which had modern 64-bit Intel processors and were running ...
3 votes
1 answer
62 views

Dealing with floating point inaccuracy in very small numbers efficiently

The program I am working with takes OpenStreetMap data to render a map. The data consists of 4 coordinates, that make up the bounds of the data. I am drawing lines, that sometimes exceed these bounds ...
3 votes
0 answers
85 views

R not working properly with big numbers because of default options?

Seems like R coerces big numbers and cannot compare them effectively: x = 123412415124231251233213 x == 123412415124231251233214 [1] TRUE x == 123412415124231251233217 [1] TRUE Any idea why (maybe a ...
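The same effect can be reproduced outside R, on the assumption that R stores these literals as IEEE-754 doubles. The sketch below uses Python (3.9 or later for `math.ulp`) and converts explicitly with `float()`, because Python integers are otherwise compared exactly.

```python
import math

x = float(123412415124231251233213)
y = float(123412415124231251233214)

# Both integers round to the same double, so they compare equal.
print(x == y)  # True

# The gap between adjacent doubles at this magnitude is 2**24.
print(math.ulp(x))  # 16777216.0
```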
3 votes
0 answers
700 views

Unexpected result with kotlin contentEquals on DoubleArray

I have a need to compare two DoubleArrays in order to determine if they have the same values in the same order. To do so I have used the contentEquals extension function, however, it treats 0 and -0 ...
3 votes
0 answers
2k views

How to format float to 4 decimal places within json dumps?

Have difficulty in converting pf_stats output to have 4 decimal places: import json import numpy as np def run_simulation(H, P, B, C, mu, sigma, T, L): x = normal(mu, sigma, (L, T)) pf_all ...
3 votes
0 answers
1k views

Float16 (HalfTensor) in pytorch + cuda

Can I set torch.HalfTensor as default and use it with CUDA? I can't even create usual Conv2D: In [1]: import torch In [2]: torch.__version__ Out[2]: '0.2.0_3' In [3]: from torch import nn In [4]: ...
3 votes
1 answer
431 views

Lack of precision of the toFixed method in javascript

I have done some tests with the Number.prototype.toFixed method in the Chrome (v60.0.3112.101) console and found something that puzzled me. Why does 1.15.toFixed(1) return "1.1" and not "1.2"? Why does 1.05.toFixed(1) return "1....
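A likely ingredient here is that the literals are not stored exactly. JavaScript numbers and Python floats are both IEEE-754 binary64, so the stored values can be inspected from Python with `decimal.Decimal`. This is a sketch of that inspection, not a description of how `toFixed` is implemented.

```python
from decimal import Decimal

# Decimal(float) shows the exact binary value stored for each literal.
stored_115 = Decimal(1.15)
stored_105 = Decimal(1.05)
print(stored_115)  # slightly below 1.15
print(stored_105)  # slightly above 1.05

print(stored_115 < Decimal("1.15"))  # True
print(stored_105 > Decimal("1.05"))  # True
```

A stored value just under 1.15 would be consistent with rounding down to "1.1" at one decimal place.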
3 votes
0 answers
169 views

Type-preserving rounding in Haskell

Is there a built-in function in Haskell that rounds a real floating-point number to the nearest whole number, without changing the type of said number? sameTypeRound f == fromIntegral (round f)
3 votes
1 answer
484 views

Getting FloatingPointError instead of ZeroDivisionError when dividing by zero

I'm running a very time-consuming post-processor in Python and have encountered a FloatingPointError where I was expecting a ZeroDivisionError. My code captured the possibility of a ZeroDivisionError ...
3 votes
1 answer
312 views

Is it safe to cast Math.Round result to float?

A colleague has written some code along these lines: var roundedNumber = (float) Math.Round(someFloat, 2); Console.WriteLine(roundedNumber); I have an uncertainty about this code - is the number ...
3 votes
1 answer
1k views

How to preserve float precision in CSV to JSON conversion (via pandas.read_csv)?

NB: My question is not a duplicate of Format floats with standard json module. In fact, Mark Dickinson provided a good answer to my question in one of his comments, and this answer is all about ...
3 votes
1 answer
699 views

Python pyvttbl ANOVA error

I am trying to perform ANOVA with pyvttbl over my dataset but I get a strange error. Here is my code: import pyvttbl df = pyvttbl.DataFrame() df.read_tbl("ANOVA_MWE_input.csv") print df print type(...
3 votes
1 answer
124 views

Javascript wrongfully changes the result of a simple multiplication. How can I fix it?

function roundUp(num, precision) { return Math.ceil(num * precision) / precision; } var num = 0.07; var precision = 100; console.log(roundUp(num, precision)); When the arguments to the ...
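The behaviour in this excerpt carries over to any language using binary64, so here is a Python port of the asker's `roundUp` (named `round_up` below) next to one possible workaround. The helper `round_up_decimal` and its string input are assumptions for illustration, not something from the question.

```python
import math
from decimal import Decimal, ROUND_CEILING


def round_up(num: float, precision: int) -> float:
    """Direct port of the roundUp function from the question."""
    return math.ceil(num * precision) / precision


# 0.07 * 100 lands just above 7, so ceil() jumps to 8.
print(0.07 * 100)           # 7.000000000000001
print(round_up(0.07, 100))  # 0.08


def round_up_decimal(text: str, places: str = "0.01") -> Decimal:
    """Round a decimal string up to the given step using exact decimal arithmetic."""
    return Decimal(text).quantize(Decimal(places), rounding=ROUND_CEILING)


print(round_up_decimal("0.07"))  # 0.07
```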
3 votes
0 answers
569 views

Why is 0.1 + 0.3 = 0.4 in JavaScript and Python?

I know why 0.1 + 0.2 !== 0.3, because 0.1 cannot be represented exactly in a binary floating point representation, but why 0.1 + 0.3 === 0.4 in JavaScript? I think 0.1, 0.3 both cannot be represented ...
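One way to explore this question is to compare the exact sum of the stored values with the stored value of 0.4, using `fractions.Fraction` for exact arithmetic. A sketch, assuming binary64 floats; `exact_sum` is an illustrative name.

```python
from fractions import Fraction

# Fraction(float) converts the stored binary value exactly.
exact_sum = Fraction(0.1) + Fraction(0.3)

# The exact sum of the stored values is not the stored value of 0.4 ...
print(exact_sum == Fraction(0.4))  # False

# ... but after rounding to the nearest double, the results coincide.
print(0.1 + 0.3 == 0.4)  # True

# For 0.1 + 0.2 the rounded sum lands on a neighbour of the stored 0.3.
print(0.1 + 0.2 == 0.3)  # False
```

In other words, all three literals are inexact in both cases; what differs is where the rounded sum happens to fall.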
3 votes
0 answers
5k views

How to solve...ValueError: cannot convert float NaN to integer

I'm running quite complex code, so I won't bother with details, as I've had it working before, but now I'm getting this error. Particle is a 3D tuple filled with 0 or 255, and I am using the scipy ...
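The asker's code is not shown, so the following only demonstrates where the error message in the title comes from and one defensive pattern. The helper `safe_int` and its default value are hypothetical.

```python
import math

value = float("nan")

# int() refuses NaN, which is the error quoted in the question title.
try:
    int(value)
    message = ""
except ValueError as exc:
    message = str(exc)
print(message)  # cannot convert float NaN to integer


def safe_int(x: float, default: int = 0) -> int:
    """Convert to int, substituting a default when x is NaN."""
    return default if math.isnan(x) else int(x)


print(safe_int(value))  # 0
print(safe_int(3.7))    # 3
```

Substituting a default hides the NaN, so in practice it is usually worth finding the operation that produced it first.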
3 votes
0 answers
115 views

Behaviour of floating point precision for division

While working with various floating point number solutions I have been logging the values to compare the outputs. e.g. console.log(3 * 0.1) //0.30000000000000004 console.log(3 * 0.2) //0....
3 votes
0 answers
1k views

What's a floating-point operation and how to count them (in MATLAB)?

I have an assignment where I basically need to count the number of floating point operations in a simple program, which involves a loop, a matrix, and operations such as *, + and ^. From my ...