Questions tagged [floating-point]

Floating-point numbers are approximations of real numbers that can represent a much larger range than integers using the same amount of memory, at the cost of lower precision. If your question is about small arithmetic errors (e.g. why does 0.2 + 0.1 equal 0.300000001?) or decimal conversion errors, please read the "info" page linked below before posting.
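
As a rough illustration of the kind of small arithmetic error the tag description refers to, here is a minimal Python sketch. It assumes IEEE 754 double precision (what CPython's `float` uses on common platforms), so the printed digits differ from the rounded figure quoted above. The variable names are made up for this example.

```python
import math
from decimal import Decimal

# Binary floating point cannot represent 0.1 or 0.2 exactly,
# so their sum differs slightly from the double closest to 0.3.
total = 0.2 + 0.1
exactly_equal = (total == 0.3)

# Compare with a tolerance instead of testing for exact equality.
close_enough = math.isclose(total, 0.3, rel_tol=1e-9)

# Decimal arithmetic on string inputs keeps base-10 fractions exact.
decimal_total = Decimal("0.2") + Decimal("0.1")

print(repr(total))
print(exactly_equal, close_enough, decimal_total == Decimal("0.3"))
```

Whether a tolerance-based comparison or a decimal type is the better fit depends on the application; the tolerance used here is an arbitrary choice for the sketch.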

1,636 questions with no upvoted or accepted answers
2 votes
0 answers
569 views

Facing undefined symbols linker issue with Diab compiler when I type cast array of data from float to long long

I wrote a small example code and executed in both GCC and DIAB compilers. #include<stdio.h> int main() { float a[10]; long long int b[10]; int i; for (i =0;i<10;i++) { ...
2 votes
0 answers
106 views

How can I specify tolerance for floating point operation in CGAL library?

Is there a way to specify tolerance for CGAL geometry operations? I use exact_predicates_inexact_constructions kernel for computational speed. However sometimes operations such as determining a point ...
2 votes
0 answers
1k views

default to floats in python

I'm writing some code that uses GPS coordinates but I'm running into a headache with python 2. I'm using the dronekit python library to be specific. Here's a sample of what I'm trying to do. ...
2 votes
3 answers
1k views

Inconsistent type conversion in python/numpy when using scalars or lists/arrays

I have a question about the strange way python/numpy performs type conversion. When I perform an arithmetic operation between a float32 and a float64 number, the lower precision is converted to ...
2 votes
0 answers
59 views

Solving the strict inequality with integer math and accounting for rounding errors

I have an array of floating point numbers of type T, where T could be either float or double: T x[n]; These numbers are strictly positive and sorted, i.e. 0 < x[0] < x[1] < x[2] < ... <...
2 votes
0 answers
236 views

failure of floating point interrupt MFC Windows 10

We have an MFC product built in Visual Studio 2012/13 in C++ that performs many mathematical calculations. We use floating point interrupts through the coprocessor to help speed up the calculations, ...
2 votes
0 answers
222 views

Understanding floating point precision range

Despite reading numerous posts here regarding floating point, I couldn't find an answer to this basic question: What is the calculation that leads to this fact: The IEEE 754 standard specifies a ...
2 votes
0 answers
120 views

Floating point relative error bounds, clarification needed

I'm reading David Goldberg's What Every Computer Scientist Should Know About Floating Point Arithmetic paper, and I'm confused by one of the inequalities (2): (1/2)B^-p <= (1/2)ulp <= (B/2)B^-p ...
2 votes
0 answers
519 views

ARM: Floating point (VFP) instructions won't work when pushing into the stack

I am trying to print some floating point values in ARM Assembly. As long as I don't push stuff into the stack the value gets printed correctly. If I push something it just prints 0. This code works ...
2 votes
0 answers
379 views

Why doesn't floating point exception handling work in Embarcadero C++ Builder in 64-bit compilation?

I can't get floating point exception handling to work in Embarcadero C++ Builder XE8 in 64-bit compilation. 32-bit compilation works fine. (MS Visual C++ works fine in both 32-bit and 64-bit ...
2 votes
1 answer
359 views

Scala Range.Double missing last element

I am trying to create a list of numBins numbers evenly spaced in the range [lower,upper). Of course, there are floating point issues and this approach is not the best. The result of using Range.Double,...
2 votes
1 answer
99 views

Floating Point Number issue JavaScript

I'm struggling to resolve a floating point number issue where var change returns as 0.0999 recurring, and I need to return 0.01 (one penny). The code works fine, except the very last penny, because of ...
2 votes
0 answers
422 views

16-bit floating point on fpga

I am trying to use Altera's floating point IP to generate half precision instead of single (32-bit) blocks for addition, multiplication, etc. However, when configuring the IP it seems that half precision fp ...
2 votes
0 answers
57 views

Generate all numbers of the binary system (B=2, t=3, L=-2, U=3)

Suppose we have the following binary system (B=2, t=3, L=-2, U=3) where B is the base of the system, since it's a binary system, B is of course 2. t is the precision of the number, usually refers to ...
2 votes
0 answers
212 views

How can I get accurate results from math in scriptengine for java?

So I'm working on a big Java project and in it I have a part with user submitted equations, both mathematical and for things like If/else. I've been using a scriptengine with javascript to get the ...
2 votes
0 answers
64 views

Is it possible to check the "long double" type being used?

I'd like an application to be able to tell whether it's running on a system whose "long double" type is a synonym for "double", a true 128-bit IEEE quad precision, or an 80-bit Intel extended ...
2 votes
2 answers
60 views

how to format a float for output so the result is readable?

My question pertains to formatting a floating point number in a form that is human readable. For instance in the following example, the result is a float, and I think I have made a mistake on the ...
2 votes
0 answers
416 views

Using Windows Structured Exception Handling (SEH) to catch floating point exceptions

I am attempting to catch floating point exceptions in code compiled using Visual Studio 2008, similar to this post: Visual C++ / Weird behavior after enabling floating-point exceptions (compiler bug ?)...
2 votes
1 answer
234 views

Can computing tan(x)=sin(x)/cos(x) cause a loss of precision?

After a call to sincos(x,&s,&c) from the Unix math library (libm), it would be natural to get the tangent as s/c. Is this safe, or might there be (ill) cases in which the (supposedly) more expensive ...
2 votes
0 answers
183 views

Conservative AABB for voxel-triangle intersection

The problem I am trying to solve is to generate an AABB for the intersection of a triangle and a cube. In 2D, the required volume is the green square shown here: The input points and output bounds ...
2 votes
1 answer
240 views

Optimization flags for floating point arithmetic and GPU for android

I am performing floating point and GPU operations using C++ on Android. I would like to know which compiler optimization flags improve the speed of execution of these operations. I just ...
2 votes
2 answers
160 views

How to tell if up to round-off error in floating point, a collection of 2-d double precision floating point pairs might lie on some ellipse?

So arbitrary ellipses seem to have two more degrees of freedom than circles, because in addition to a circle's radius and center there is the angle of rotation as well as the scaling of the ratio of ...
2 votes
0 answers
136 views

Using Matlab to read binary data from file

I'm currently trying to read data from a .surf file using Matlab. (I realise there will probably be quite a lot of other questions similar to this, but each is specific to its own problem, and I wasn'...
2 votes
0 answers
2k views

convert int 32 to q31 or f32

I am trying to understand exactly how to do this. I know how fixed point and floating point notations work, but I was wondering how I can convert from int32 to q31 or f32. If I understand q31 ...
2 votes
0 answers
199 views

64 bit FP hardware with two 32 bit FP units

When a GPU claims different performance for FP64 vs FP32, does it mean it has separate circuits for FP32 and FP64? Or is it combining FP32 units into FP64 units? I'm asking this because I'm wondering ...
2 votes
0 answers
62 views

ActionScript: Number.toExponential returns incorrect results for some values

I found at least two numbers for which ActionScript's Number.toExponential(20) returns incorrect results. The most obvious value is 0. trace(Number(0).toExponential(20)); // 0.00000000000000000000e-...
2 votes
0 answers
65 views

Tools to detect denormal numbers during execution

I have a program which I suspect is slow due to the presence of denormal numbers. Is there a tool available for Linux which will allow me to monitor the execution of a program and print statistics on ...
2 votes
1 answer
146 views

Why are these two calculations, which are exactly the same, giving different results in Fortran using gfortran?

real, dimension(3), parameter :: boxlen = [4.0, 5.0, 7.0] real, parameter :: mindist = 0.1 integer ::i write(*,"(A)") "Operation on array" print*, floor(boxlen/mindist) write(*,"(/A)") "...
2 votes
3 answers
3k views

32-bit Grayscale Tiff with floating point pixel values to array using LibTIFF.NET C#

I just started using LibTIFF.NET in my c# application to read Tiff images as heightmaps obtained from ArcGIS servers. All I need is to populate an array with image's pixel values for terrain ...
2 votes
0 answers
143 views

Stalled cycles due to fldz instruction

I am trying to interpret some perf results on a Xeon x5675 processor. I have a program where a large percentage of the cycles are stalls (from perf stat). Using perf record -e stalled-cycles-...
2 votes
1 answer
2k views

Encoding and decoding floats in json with PHP without losing precision

I want to decode a JSON string to a PHP object and then encode the object back to a JSON string without losing precision for the float numbers in the JSON. If you run the sample below the output would be: JSON: ...
2 votes
1 answer
141 views

How to fix floating point artifacts in StdDev calculation?

I am trying to calculate the standard deviation with the following method: private static double? StdDev(IReadOnlyCollection<double> items) { if(items == null) { throw new ArgumentNullException("items")...
2 votes
2 answers
6k views

JSON.parse fails for negative floating point numbers

I have a simple script like this: request = $.ajax({ url: "/getmesomefloats.php", type: "post", }); request.done(function (response, textStatus, jqXHR){ ...
2 votes
0 answers
7k views

GCC Cortex-M4 -mfpu=vfpv4 vs. -mfpu=fpv4-sp-d16

I'm using a Freescale K22 (Cortex-M4F) with Kinetis Design Studio, which includes a GNU toolchain. I'm trying to use a binary-only library provided by Invensense, and they compiled it with GCC for use ...
2 votes
1 answer
1k views

Reliably detect integer overflow/underflow

I'm working on code that has to do the following with the result of a calculation: If the result exceeds the limit that can be represented in PHP's integer type then throw an exception. If the ...
2 votes
0 answers
134 views

Convert float to bits in elisp

How do I get the IEEE 754 binary representation (single or double precision) of a float in elisp? The only thing I have found so far is this: http://lists.gnu.org/archive/html/help-gnu-emacs/2002-10/...
2 votes
3 answers
590 views

Converting float to hexadecimal in C++

I am new to C++ and to programming, and I want to write a C++ program to convert a float to hexadecimal with the help of pointers. I've looked at other threads and really tried to get a hold of this but ...
2 votes
1 answer
668 views

numpy/pandas: test float64 arrays are equal up to significant digits

I have two pandas data frames in which I store money amounts, i.e. decimal numbers with at most 15 significant decimal digits. Since float64 has a precision of 15 significant decimal digits, this ...
2 votes
0 answers
104 views

Enforcing rational-based conversions between custom-unit-based quantities in Boost.Units

I have a custom unit system defined, which derives from boost::units::si::time. Child units are defined using boost::units::make_scaled_unit, hence the conversion factors are specified using boost::...
2 votes
0 answers
817 views

Implementing unordered set of triplets

I have two 3D data-sets. Each element in these sets is a triplet of type (float,float,float). These data-sets have some duplicate elements. I want to merge these two data-sets in such a way so that ...
2 votes
1 answer
232 views

Bitwise creation of 64-bit float

The situation is that I'm on a 32-bit embedded platform (Cortex-M4F) which has a hardware FPU. I'd really like to use the FPU, but the platform provides no hardware implementation of 64-bit float ...
2 votes
0 answers
59 views

NSUserDefaults float time value not saving

I'm trying to make a game in which the user completes a task in the fastest time possible, and I want to display the user's best time, and update it every time he gets a better time. I have the ...
2 votes
0 answers
175 views

How can I find out if my JVM is using hardware square root?

If you read the top of the incredibly-hard-to-find native sqrt method for Java, which is located at jdk1.6\src\jdk\src\share\native\java\lang\fdlibm\src\e_sqrt.c you will find this: /* __ieee754_sqrt(...
2 votes
1 answer
736 views

Imagemagick: Move image to float pixel

I would like to move an image to a floating point position, so that ImageMagick takes care of simulating the non-integer position. What it does now is round the coordinates. So if I try: ...
2 votes
1 answer
94 views

Floating point addition and division

For this code segment double count = 0.0; while( count != 1.0) {count += 1.0/3;} I was wondering what parts of (IEEE 754) would cause count 3: to be 1.0 instead of .9999999999999999. I realize that ...
2 votes
0 answers
2k views

Upper interval limits in function 'cut'

I’d like to class a data frame in a certain way in R. Assume to have a data frame like the following: > data = sample(1:500, 5000, replace = TRUE) In order to class this data frame I’m making ...
2 votes
5 answers
4k views

Save float * images in C++

I wanted to understand how I can save an image of type float: float * image; Allocated in this way: int size = width * height; image = (float *)malloc(size * sizeof(float)); I tried using the ...
2 votes
0 answers
3k views

Python float types vs Decimal

I am an intern at TCD, Physics. I wrote some code to perform data analysis on random particle packings. The code is written in Python. It reads in columns of data from a .txt file, provided ...
2 votes
1 answer
2k views

SSE: convert from const __m128 * to const float *

I'm trying to write a little SSE code but can't continue because of this error: error C2664: '_mm_loadu_ps' : cannot convert parameter 1 from 'const __m128 *' to 'const float *' I have to load ...
2 votes
0 answers
243 views

Sum of large real numbers is integer instead of real

I created a table with some values: db.execSQL("create table table_test (" + "id integer primary key autoincrement," + "money real not null" + ");"); ContentValues ...