Lesson 2: IEEE Standards





This lesson is the next and final step before we start to code. It covers how real numbers are encoded and stored in a computer using the IEEE 754 formats for standard (single) and double precision. The normalization procedure is shown step by step and is easy to follow.



Representation of Real Numbers in a Computer

Standard precision: 32 bits (4 bytes)

Double precision: 64 bits (8 bytes)


Real Numbers of Standard Precision

Declaration in the C programming language:

float

IEEE (Institute of Electrical and Electronics Engineers) standard 754 for representing real numbers in standard precision:



P

sign bit (P = 1 negative, P = 0 positive), 1 bit

Characteristic

binary exponent + 127 (so that no negative exponent has to be stored), 8 bits

Mantissa

normalized (exactly one bit in front of the binary point), 23 bits stored


Example: representation of the decimal number 5.75

$5.75_{10} = 101.11_2 \times 2^0 = 1.0111_2 \times 2^2$

Because every normalized binary number (except zero) has the form 1.xxxxx, storing the leading 1 is unnecessary. The leading 1 is therefore not stored in the computer and is called the hidden bit. This frees one extra bit for the mantissa and so gives higher precision.

P = sign = 0 (positive number)

Binary exponent = 2

K = 2 + 127 = 129 = $(1000\,0001)_2$

Mantissa (whole): 1.0111

Mantissa (without hidden bit): 0111

Result: 0 10000001 01110000000000000000000

or 0100 0000 1011 1000 0000 0000 0000 0000

= 40B8 0000 hex


Examples:

$2 = 10_2 \times 2^0 = 1_2 \times 2^1$ = 0100 0000 0000 0000 ... 0000 0000 = 4000 0000 hex

P = 0, K = 1 + 127 = 128 (10000000), M = (1.)000 0000 ... 0000 0000

$-2 = -10_2 \times 2^0 = -1_2 \times 2^1$ = 1100 0000 0000 0000 ... 0000 0000 = C000 0000 hex

Same as 2, but P = 1

$4 = 100_2 \times 2^0 = 1_2 \times 2^2$ = 0100 0000 1000 0000 ... 0000 0000 = 4080 0000 hex

Same mantissa, binary exponent BE = 2, K = 2 + 127 = 129 (10000001)

$6 = 110_2 \times 2^0 = 1.1_2 \times 2^2$ = 0100 0000 1100 0000 ... 0000 0000 = 40C0 0000 hex

$1 = 1_2 \times 2^0$ = 0011 1111 1000 0000 ... 0000 0000 = 3F80 0000 hex

K = 0 + 127 = 127 (01111111)

$0.75 = 0.11_2 \times 2^0 = 1.1_2 \times 2^{-1}$ = 0011 1111 0100 0000 ... 0000 0000 = 3F40 0000 hex


Special case: 0

Zero cannot be normalized to the form 1.xxxxx.

0 = 0 00000000 0000 ... 0000 (all bits zero), the pattern that would otherwise be read as $1.0_2 \times 2^{-127}$


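Range and Precision of Real Numbers:

For a real number of standard precision, the characteristic (8 bits) lies in the interval [0, 255].

K = 0 is reserved for representing zero (IEEE 754 also uses it for very small denormalized numbers)

K = 255 is reserved for representing infinity (IEEE 754 also uses it for NaN, "not a number")

Since BE = K - 127 and the values K = 0 and K = 255 are reserved, BE lies in the interval [-126, 127].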


Smallest positive normalized number that can be represented:

$1.0_2 \times 2^{-126} \approx 1.175494350822 \times 10^{-38}$


and the largest is:

$1.11111111111111111111111_2 \times 2^{127} \approx 2^{128} \approx 3.402823669209 \times 10^{38}$


Precision: 24 binary digits

$2^{24} \approx 10^x \Rightarrow 24 \log 2 \approx x \log 10 \Rightarrow x \approx 24 \log 2 \approx 7.224719895936$

So about the first 7 decimal digits are valid.


Representation on the number line: [figure not available]



Numerical error:

Not all bits of the mantissa can be used in a calculation.


Example: $0.0001_{10} + 0.9900_{10}$

$0.0001_{10}$: $(1.)10100011011011100010111_2 \times 2^{-14}$

$0.9900_{10}$: $(1.)11111010111000010100011_2 \times 2^{-1}$


Before adding, the binary points must be aligned one under the other:

  $0.000000000000011010001101_2 \times 2^0$ (only 11 of the 24 bits remain!)

+ $0.111111010111000010100011_2 \times 2^0$

= $0.111111010111011100110000_2 \times 2^0 = 0.9900999069214_{10}$


Real Numbers of Double Precision

Declaration in the C programming language:


double


P

sign bit (P = 1 negative, P = 0 positive), 1 bit

Characteristic

binary exponent + 1023 (11 bits)

Mantissa

normalized (52 + 1 bits; the hidden bit is not stored).


Range:

K lies in the interval [0, 2047].

K = 0 is reserved for representing zero

K = 2047 is reserved for representing infinity

BE = K - 1023

BE lies in the interval [-1022, 1023]


Smallest positive normalized number that can be represented:

$1.0_2 \times 2^{-1022} \approx 2.225073858507 \times 10^{-308}$


and the largest is:

$1.1111\ldots111111_2 \times 2^{1023} \approx 2^{1024} \approx 1.797693134862316 \times 10^{308}$


Precision: 53 binary digits

$2^{53} \approx 10^x \Rightarrow 53 \log 2 \approx x \log 10 \Rightarrow x \approx 53 \log 2 \approx 15.95458977019$

So nearly the first 16 decimal digits are valid.


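There is also:

long double: 80 bits (on compilers that provide the extended format; the C standard leaves the actual size to the implementation)

Characteristic: 15 bits

Binary exponent: Characteristic - 16383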


Real constants

1.     2.34     9e-8        8.345e+25     double

2.f    2.34F    -1.34e5f                  float

1.L    2.34L    -2.5e-37L                 long double






9 Responses to “Lesson 2: IEEE Standards”

  1. Anonymous

    sorry for delay...reupped lesson 2 with pictures for better understanding (float and double precision display)

  2. Anonymous

    I am searching for a site like this but has materials/tutorials for writing J2SE Java App. I will be grateful if anyone can help me

  3. Anonymous

    Nice short overview. Thx for that

    the polarizer

  4. Anonymous

    Guess I'll have to look elsewhere for a programming tutorial. This one assumes the user already understands many concepts.

  5. Anonymous

    Well, actually it doesn't. Try following it from Lesson 1 (found on this website) and take a few days for all the tutorials. Go slowly through one lesson at a time, and when you're stuck, just post the question and I'll be more than glad to help you! It was written in a Way so you don't need to have any pre programming knowledge.

  6. Anonymous

    [quote]
    It was written in a Way so you don't need to have any pre programming knowledge.
    [/quote]

    binary exponent, mantissa, real number and normalised are not terms that you often hear down the pub. I agree with chris, this is more of a reference manual for seasoned programmers than a way to introduce basic concepts to a newbie.

  7. Anonymous

    spent a couple of hours over lessons 1 + 2 and cross ref Wikipedia for definitions and further examples; it all hangs together, thanks

  8. Anonymous

    I kinda have to agree with the others. I have spent hours looking over lessons 1 & 2 and I don't understand it. You need more definitions. Thanks for the other lessons though, they are good.

    Grath

  9. Anonymous

    Hi!

    I'm looking at a pretty simple problem that involves reading numbers from a file and storing them in a different file with a small amount of processing and in a somewhat different order, but in IEEE standard integer and floating point formats. I'd like to use a compiler that already stores the numbers in the correct formats, so that in most cases I can just copy the bytes, but in a few cases some bit fiddling will still be needed. So the question is, which Windows XP based C and C++ compilers that support the IEEE standard number formats would you recommend?

    Most of my programming experience is with DOS based Turbo Pascal, but I did one big application in DOS based C a long time ago, and have intended to upgrade to a Windows based C compiler for years, but until now never had any applications (all small and simple) for which C would be better than my very old versions of Turbo Pascal. So now is the time to upgrade because the application would be much more convenient to write with a C compiler that already stores the numbers in the proper format. With the Turbo Pascal compiler, I would have to reformat every number, which is very easy for the integers, and not too hard for the floats, but why waste time this way? The switch to C is well overdue.


Daily Lessons for programming in Visual Studio, using C code.