# Questions tagged [floating-point]

Floating point numbers are approximations of real numbers that can represent a wider range than integers in the same amount of memory, at the cost of lower precision. If your question is about small arithmetic errors (e.g. why does 0.2 + 0.1 equal 0.30000000000000004?) or decimal conversion errors, please read the "info" page linked below before posting.
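As a minimal sketch of the kind of small arithmetic error the tag description refers to, the following Python snippet assumes IEEE 754 double precision, which is what CPython's `float` uses on common platforms. The variable names are illustrative only.

```python
import math
from decimal import Decimal

# Neither 0.1 nor 0.2 has an exact binary representation, so the sum
# carries a tiny rounding error relative to the decimal value 0.3.
total = 0.2 + 0.1
print(total)         # 0.30000000000000004
print(total == 0.3)  # False

# Compare with a tolerance instead of exact equality.
print(math.isclose(total, 0.3))  # True

# Decimal(float) shows the exact value that is actually stored for 0.1.
stored_tenth = Decimal(0.1)
print(stored_tenth)

# Decimal arithmetic on string literals avoids the binary rounding step.
exact_total = Decimal("0.2") + Decimal("0.1")
print(exact_total == Decimal("0.3"))  # True
```

Tolerance-based comparison suits measured or computed quantities; decimal arithmetic is usually the better fit when the inputs are exact decimal values, such as money amounts.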

1,636 questions with no upvoted or accepted answers
569views

### Facing undefined symbols linker issue with Diab compiler when I type cast array of data from float to long long

I wrote a small example code and executed in both GCC and DIAB compilers. #include<stdio.h> int main() { float a[10]; long long int b[10]; int i; for (i =0;i<10;i++) { ...
106views

### How can I specify tolerance for floating point operation in CGAL library?

Is there a way to specify tolerance for CGAL geometry operations? I use exact_predicates_inexact_constructions kernel for computational speed. However sometimes operations such as determining a point ...
• 791
1kviews

### default to floats in python

I'm writing some code that uses GPS coordinates but I'm running into a headache with python 2. I'm using the dronekit python library to be specific. Here's a sample of what I'm trying to do. ...
• 21
1kviews

### Inconsistent type conversion in python/numpy when using scalars or lists/arrays

I have a question about the strange way python/numpy performs type conversion. When I perform an arithmetic operation between a float32 and a float64 number, the lower precision is converted to ...
• 75
59views

### Solving the strict inequality with integer math and accounting for rounding errors

I have an array of floating point numbers of type T, where T could be either float or double T x[n]; These numbers are strictly positive and sorted, i.e. 0 < x[0] < x[1] < x[2] < ...
• 1,945
236views

### failure of floating point interrupt MFC Windows 10

We have an MFC product built in Visual Studio 2012/13 in C++ that performs many mathematical calculations. We use floating point interrupts through the coprocessor to help speed up the calculations, ...
222views

### Understanding floating point precision range

Despite reading numerous posts here regarding floating point, I couldn't find an answer to this basic question: What is the calculation that leads to this fact: The IEEE 754 standard specifies a ...
• 6,609
120views

### Floating point relative error bounds, clarification needed

I'm reading David Goldberg's What Every Computer Scientist Should Know About Floating Point Arithmetic paper, and I'm confused by one of the inequalities (2): (1/2)B^-p <= (1/2)ulp <= (B/2)B^-p ...
• 19.5k
519views

### ARM: Floating point (VFP) instructions won't work when pushing into the stack

I am trying to print some floating point values in ARM Assembly. As long as I don't push stuff into the stack the value gets printed correctly. If I push something it just prints 0. This code works ...
379views

### Why Floating point exception handling doesn't work in Embarcadero C++ Builder in 64 bit compilation?

I can't get floating point exception handling to work in Embarcadero C++ Builder XE8 in 64 bit compilation. 32-bit compilation works fine. (MS Visual C++ works fine in both 32-bit and 64-bit ...
• 31
359views

### Scala Range.Double missing last element

I am trying to create a list of numBins numbers evenly spaced in the range [lower,upper). Of course, there are floating point issues and this approach is not the best. The result of using Range.Double,...
99views

### Floating Point Number issue JavaScript

I'm struggling to resolve a Floating Point Number issue where var change returns as 0.0999 recurring, and I need to return 0.01 (one penny). The code works fine, except the very last penny, because of ...
422views

### 16-bit floating point on fpga

I try to use Altera's floating point IP to generate half precision instead of single (32-bit) blocks for addition, multiplication etc. However when configuring the IP it seems that half precision fp ...
• 157
57views

### Generate all numbers of the binary system (B=2, t=3, L=-2, U=3)

Suppose we have the following binary system (B=2, t=3, L=-2, U=3) where B is the base of the system, since it's a binary system, B is of course 2. t is the precision of the number, usually refers to ...
• 13.6k
212views

### How can I get accurate results from math in scriptengine for java?

So I'm working on a big Java project and in it I have a part with user submitted equations, both mathematical and for things like If/else. I've been using a scriptengine with javascript to get the ...
64views

### Is it possible to check the "long double" type being used?

I'd like an application to be able to know if it's running on a system whose "long double" type is a synonym for "double", or a true 128bit IEEE quad precision, or an 80bit Intel extended ...
• 695
60views

### how to format a float for output so the result is readable?

My question pertains to formatting a floating point number in a form that is human readable. For instance in the following example, the result is a float, and I think I have made a mistake on the ...
416views

### Using Windows Structured Exception Handling (SEH) to catch floating point exceptions

I am attempting to catch floating point exceptions in code compiled using Visual Studio 2008, similar to this post: Visual C++ / Weird behavior after enabling floating-point exceptions (compiler bug ?)...
• 182
234views

### Can computing tan(x)=sin(x)/cos(x) cause a loss of precision?

After a call to sincos(x,&s,&c) from the unix m math library it would be natural to get the tangent as s/c. Is this safe, or may there be (ill) cases in which the (supposedly) more expensive ...
• 5,175
183views

### Conservative AABB for voxel-triangle intersection

The problem I am trying to solve is to generate an AABB for the intersection of a triangle and a cube. In 2D, the required volume is the green square shown here: The input points and output bounds ...
• 123
240views

### Optimization flags for floating point arithmetic and GPU for android

I am performing floating point and GPU operations using C++ on android. I would like to know what are the various compiler optimization flags to improve speed of execution of these operations, I just ...
160views

### How to tell if up to round-off error in floating point, a collection of 2-d double precision floating point pairs might lie on some ellipse?

So arbitrary ellipses seem to have two more degrees of freedom than circles, because in addition to a circle's radius and center there is the angle of rotation as well as the scaling of the ratio of ...
• 4,523
136views

### Using Matlab to read binary data from file

I'm currently trying to read data from a .surf file using Matlab. (I realise there will probably be quite a lot of other questions similar to this, but each is specific to its own problem, and I wasn'...
• 358
2kviews

### convert int 32 to q31 or f32

I am trying to understand exactly how to do this. I know how fixed point and floating point notations work, but I was wondering how I can convert from int32 to q31 or f32. If I understand q31 ...
• 185
199views

### 64 bit FP hardware with two 32 bit FP units

When a GPU claims different performance for FP64 vs FP32, does it mean it has separate circuits for FP32 and FP64? Or is it combining FP32 units into FP64 units? I'm asking this because I'm wondering ...
• 2,945
62views

### ActionScript: Number.toExponential returns incorrect results for some values

I found at least two numbers for which ActionScript's Number.toExponential(20) returns incorrect results. The most obvious value is 0. trace(Number(0).toExponential(20)); // 0.00000000000000000000e-...
• 1,316
65views

### Tools to detect denormal numbers during execution

I have a program which I suspect is slow due to the presence of denormal numbers. Is there a tool available for Linux which will allow me to monitor the execution of a program and print statistics on ...
• 3,433
146views

### Why are these two calculations which are exactly same giving different results in Fortran using gfortran?

real, dimension(3), parameter :: boxlen = [4.0, 5.0, 7.0] real, parameter :: mindist = 0.1 integer ::i write(*,"(A)") "Operation on array" print*, floor(boxlen/mindist) write(*,"(/A)") "...
• 384
3kviews

### 32-bit Grayscale Tiff with floating point pixel values to array using LibTIFF.NET C#

I just started using LibTIFF.NET in my c# application to read Tiff images as heightmaps obtained from ArcGIS servers. All I need is to populate an array with image's pixel values for terrain ...
143views

### Stalled cycles due to fldz instruction

I am trying to interpret some perf results on a Xeon x5675 processor. I have a program where a large percentage of the cycles are stalls (from perf stat). Using perf record -e stalled-cycles-...
• 1,855
2kviews

### Encoding and decoding floats in json with PHP without losing precision

I want to decode a json string to PHP object and then the object back again to json string without losing precision for floats numbers in json. If you run the sample below the output would be: JSON: ...
• 460
141views

### How to fix floating point artifacts in StdDev calculation?

I am trying to calculate the standard deviation with the following method: private static double? StdDev(IReadOnlyCollection<double> items) { if(items == null) { throw new ArgumentNullException("items")...
• 1,469
6kviews

### JSON.parse fails for negative floating point numbers

I have a simple script like this: request = \$.ajax({ url: "/getmesomefloats.php", type: "post", }); request.done(function (response, textStatus, jqXHR){ ...
7kviews

### GCC Cortex-M4 -mfpu=vfpv4 vs. -mfpu=fpv4-sp-d16

I'm using a Freescale K22 (Cortex-M4F) with Kinetis Design Studio, which includes a GNU toolchain. I'm trying to use a binary-only library provided by Invensense, and they compiled it with GCC for use ...
• 308
1kviews

### Reliably detect integer overflow/underflow

I'm working on code that has to do the following with the result of a calculation: If the result exceeds the limit that can be represented in PHP's integer type then throw an exception. If the ...
• 30.1k
134views

### Convert float to bits in elisp

How do I get the IEEE 754 binary representation (single or double precision) of a float in elisp? The only thing I found so far is this: http://lists.gnu.org/archive/html/help-gnu-emacs/2002-10/...
• 2,396
590views

### Converting float to hexadecimal in C++

I am new to C++, and programming, and I want to write a C++ program to convert a float to hexadecimal with the help of pointers. I've looked on other threads and really tried to get a hold of this but ...
• 2,168
668views

### numpy/pandas: test float64 arrays are equal up to significant digits

I have two pandas data frames in which I store money amounts, i.e. decimal numbers with at most 15 significant decimal digits. Since float64 has a precision of 15 significant decimal digits, this ...
• 6,788
104views

### Enforcing rational-based conversions between custom-unit-based quantities in Boost.Units

I have a custom unit system defined, which derives from boost::units::si::time. Child units are defined using boost::units::make_scaled_unit, hence the conversion factors are specified using boost::...
817views

### Implementing unordered set of triplets

I have two 3D data-sets. Each element in these sets is a triplet of type (float,float,float). These data-sets have some duplicate elements. I want to merge these two data-sets in such a way so that ...
• 475
232views

### Bitwise creation of 64-bit float

The situation is that I'm on a 32-bit embedded platform (Cortex-M4F) which has a hardware FPU. I'd really like to use the FPU, but the platform provides no hardware implementation of 64-bit float ...
• 1,488
59views

### NSUserDefaults float time value not saving

I'm trying to make a game in which the user completes a task in the fastest time possible, and I want to display the user's best time, and update it every time he gets a better time. I have the ...
175views

### How can I find out if my JVM is using hardware square root?

If you read the top of the incredibly-hard-to-find native sqrt method for Java, which is located at jdk1.6\src\jdk\src\share\native\java\lang\fdlibm\src\e_sqrt.c you will find this: /* __ieee754_sqrt(...
• 10.6k
736views

### Imagemagick: Move image to float pixel

I would like to move an image to a floating point position, so that the imagemagick take care of simulating the non integer position. What it does now is to round the coordinates. So if I try: ...
• 192
94views

### Floating point addition and division

For this code segment double count = 0.0; while( count != 1.0) {count += 1.0/3;} I was wondering what parts of (IEEE 754) would cause count 3: to be 1.0 instead of .9999999999999999. I realize that ...
2kviews

### Upper interval limits in function 'cut'

I’d like to class a data frame in a certain way in R. Assume to have a data frame like the following: > data = sample(1:500, 5000, replace = TRUE) In order to class this data frame I’m making ...
4kviews

### Save float * images in C++

I wanted to understand how I can save an image of type float: float * image; Allocated in this way: int size = width * height; image = (float *)malloc(size * sizeof(float)); I tried using the ...
3kviews

### Python float types vs Decimal

I am an intern at TCD, Physics. I wrote a code to perform some data analysis on random particle packings. The code is written in Python. The code reads in columns of data from a .txt file, provided ...
2kviews

### SSE: convert from const __m128 * to const float *

I'm trying to write a little SSE code but can't continue because of this error: error C2664: '_mm_loadu_ps' : cannot convert parameter 1 from 'const __m128 *' to 'const float *' I've to load ...
• 156