## Tuesday, February 17, 2015

### FIXED POINT ARITHMETIC:

Fixed-point arithmetic is widely used in hardware implementations. It is a method of representing real numbers (ones with an integer part and a fractional part) using only integer values: the bits of an integer are divided into an integer part and a fractional part, with the binary point at a fixed position.

In Qm.n notation, m is the number of bits in the integer portion and n is the number of bits in the fractional portion; m+n is known as the word length (WL). For signed numbers, one additional sign bit gives a total of N = m+n+1 bits.
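As a rough illustration of what a Qm.n format can hold, the sketch below computes the total bit count, resolution, and value range. It assumes the convention stated above (the sign bit is not counted in m) and a two's-complement encoding; the function name `q_format_info` is made up for this example.

```python
def q_format_info(m, n):
    """Describe a signed Qm.n format (m integer bits, n fractional bits, 1 sign bit)."""
    total_bits = m + n + 1             # N = m + n + 1 for signed numbers
    resolution = 2.0 ** -n             # value of one least-significant bit
    min_value = -(2.0 ** m)            # assumes two's complement
    max_value = 2.0 ** m - resolution  # largest positive value
    return {
        "total_bits": total_bits,
        "resolution": resolution,
        "min_value": min_value,
        "max_value": max_value,
    }


if __name__ == "__main__":
    print(q_format_info(4, 8))
```

Under these assumptions, a signed Q4.8 value occupies 13 bits, has a resolution of 2^-8, and covers the range from -16 up to just under 16.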

•       A fixed-point value is calculated as:

Fixed-point value = real number * scale, where scale = 2^n for a Qm.n format

•       To convert from fixed point back into a real number:

Real number = fixed-point value / scale

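•      Convert a real number to a fixed-point number using the scale factor of the chosen format:

| m.n  | Integer bits (m) | Fractional bits (n) | Scale factor   |
|------|------------------|---------------------|----------------|
| 4.8  | 4                | 8                   | 2^8 = 256      |
| 8.8  | 8                | 8                   | 2^8 = 256      |
| 2.14 | 2                | 14                  | 2^14 = 16384   |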
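The two conversion formulas and the scale factors listed above can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the helper names `to_fixed` and `to_real` are invented here, rounding to the nearest integer is an assumption (truncation is another common choice), and no overflow check is performed.

```python
def to_fixed(real, n):
    """Fixed-point value = real number * scale, with scale = 2^n."""
    scale = 1 << n
    # Assumption: round to nearest (Python rounds exact ties to even).
    return round(real * scale)


def to_real(fixed, n):
    """Real number = fixed-point value / scale."""
    return fixed / (1 << n)


if __name__ == "__main__":
    x = 3.14159
    for m, n in [(4, 8), (8, 8), (2, 14)]:
        fixed = to_fixed(x, n)
        back = to_real(fixed, n)
        print(f"Q{m}.{n}: fixed={fixed} back={back} error={abs(back - x)}")
```

The round trip is not exact in general: the recovered value differs from the original by at most half of one least-significant bit, that is, 2^-(n+1).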

•          Conversion from a fractional value to an integer value:

Step 1: Normalize the decimal fractional number to the range determined by the desired Q format.
Step 2: Multiply the normalized fractional number by 2^n.
Step 3: Round the product to the nearest integer.
Step 4: Write the decimal integer value in binary using N bits.
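The four steps above might be sketched as follows. Two points here are assumptions rather than something the steps specify: "normalize" is interpreted as clamping (saturating) the input to the representable range, and negative values are written in two's complement. The function name `real_to_q_bits` is hypothetical.

```python
def real_to_q_bits(value, m, n):
    """Convert a real number to an N-bit binary string in signed Qm.n format."""
    total_bits = m + n + 1  # N = m + n + 1

    # Step 1: bring the value into range (assumption: clamp to the limits).
    lowest = -(2.0 ** m)
    highest = 2.0 ** m - 2.0 ** -n
    clamped = min(max(value, lowest), highest)

    # Step 2: multiply by 2^n.
    scaled = clamped * (1 << n)

    # Step 3: round to the nearest integer.
    integer = round(scaled)

    # Step 4: write the integer in binary using N bits
    # (masking yields the two's-complement pattern for negatives).
    mask = (1 << total_bits) - 1
    return format(integer & mask, f"0{total_bits}b")


if __name__ == "__main__":
    print(real_to_q_bits(1.5, 2, 5))
    print(real_to_q_bits(-1.5, 2, 5))
```

For example, with m = 2 and n = 5 (N = 8), the value 1.5 scales to 48 and is written as `00110000`, while -1.5 becomes `11010000`.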

### FLOATING POINT NUMBER:

The term floating point is derived from the fact that there is no fixed number of digits before and after the decimal point; the radix point can "float". For a given word length, floating-point arithmetic is generally slower or costlier to implement in hardware than fixed-point arithmetic, and it leaves fewer bits for the significand, but it can represent a much wider range of numbers. A floating-point number is represented approximately, to a fixed number of significant digits, and scaled using an exponent; the base for the scaling is normally two, ten, or sixteen. A number that can be represented exactly has the following form:

significand × base^exponent

For example: 1.2345 = 12345 × 10^-4
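The same decomposition can be inspected with Python's standard library. In this small sketch, `decimal.Decimal.as_tuple` exposes the base-10 significand and exponent of the example above, and `math.frexp` gives a base-2 split of the nearest binary float. The wrapper names `decimal_parts` and `binary_parts` are made up for illustration.

```python
import math
from decimal import Decimal


def decimal_parts(text):
    """Return (significand, exponent) with value = significand * 10**exponent."""
    sign, digits, exponent = Decimal(text).as_tuple()
    significand = int("".join(str(d) for d in digits))
    if sign:
        significand = -significand
    return significand, exponent


def binary_parts(x):
    """Return (mantissa, exponent) with x = mantissa * 2**exponent, 0.5 <= |mantissa| < 1."""
    return math.frexp(x)


if __name__ == "__main__":
    print(decimal_parts("1.2345"))  # (12345, -4)
    print(binary_parts(1.2345))
```

Note that `math.frexp` normalizes the mantissa to the interval [0.5, 1), which is a different normalization from the integer significand used in the decimal example; both describe the same form.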

Floating-point representation is similar in concept to scientific notation. Logically, a floating-point number consists of:

•       A signed (meaning negative or non-negative) digit string of a given length in a given base (or radix). The digit string is referred to as the significand, mantissa, or coefficient. The length of the significand determines the precision to which numbers can be represented. The radix point position is assumed always to be somewhere within the significand.
•       A signed integer exponent, which modifies the magnitude of the number.

Nearly all hardware and programming languages use floating-point numbers in the same binary formats, which are defined in the IEEE 754 standard. The usual floating-point formats are 32 or 64 bits in total length:

Single Precision – 32 bits in total: 1 sign bit, 8 exponent bits, and 23 significand bits.
Double Precision – 64 bits in total: 1 sign bit, 11 exponent bits, and 52 significand bits.
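To make the single-precision layout concrete, the sketch below packs a Python float into its 32-bit IEEE 754 encoding with the standard `struct` module and separates the three fields. The function name `decode_float32` is hypothetical; the exponent it returns is the stored (biased) value, which for single precision is the true exponent plus 127.

```python
import struct


def decode_float32(x):
    """Split the IEEE 754 single-precision encoding of x into its bit fields."""
    # Pack as a big-endian 32-bit float, then reinterpret as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 sign bit
    exponent = (bits >> 23) & 0xFF   # 8 exponent bits (stored with bias 127)
    fraction = bits & 0x7FFFFF       # 23 stored significand bits
    return sign, exponent, fraction


if __name__ == "__main__":
    for value in (1.0, -2.5, 0.15625):
        sign, exponent, fraction = decode_float32(value)
        print(f"{value}: sign={sign} exponent={exponent} fraction={fraction:#08x}")
```

For normalized numbers the leading 1 of the significand is implicit rather than stored, so the 23 stored bits provide 24 bits of precision.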