@pearl3344 2018-02-23T20:41:40.000000Z

Floating-point numbers (floating point system)

Floating-point numbers

IEEE standard 754-2008

$v=(-1)^s \times b^e \times m$
Take the 32-bit binary format as an example.

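b=2: the base (radix)
k=32=1+w+t: the storage width in bits
s=0,1: the sign field, 1 bit wide; s=0 means a positive number, s=1 a negative one
w=8: the width of the exponent field in bits
e: the exponent
E $\in[0,2^w-1]=[0,255]$: the integer stored in the exponent field, the biased exponent.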

$bias=2^{w-1}-1=127$

t=23: the number of digits in the trailing significand field.
p=t+1=24: the number of digits in the significand (precision), including one implicit leading digit; the leading bit of the significand is implicitly encoded in the biased exponent E.
m: the significand (mantissa) in scientific form, 1.xxx for normal numbers or 0.xxx for subnormal numbers
T $\in[0,2^t-1]$: the trailing significand field digit string (the digits after the binary point), read as an integer
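
As a rough illustration of the field layout above, the following Python sketch splits a value into s, E and T. The helper name `float32_fields` is made up for this note, and the sketch assumes that `struct` packs the `f` format as IEEE binary32, which is the case on common platforms.

```python
import struct

def float32_fields(x):
    """Split x (rounded to binary32) into sign s, biased exponent E and trailing significand T."""
    # Pack as big-endian binary32, then reinterpret the same 4 bytes as an unsigned integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31            # 1 sign bit
    E = (bits >> 23) & 0xFF   # w = 8 exponent bits
    T = bits & 0x7FFFFF       # t = 23 trailing significand bits
    return s, E, T

print(float32_fields(1.0))    # (0, 127, 0)
print(float32_fields(-0.75))  # (1, 126, 4194304)
```

Here -0.75 is $-1.5\times 2^{-1}$, so E=126 and T=$2^{22}$ (the single trailing digit 0.5).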

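The 8-bit exponent field E can take 2^8=256 values, which are interpreted in three different ways:

1. Exponent field all zeros, E=0: zero and subnormal floating-point numbers

If the significand field is all zeros, T=0, the value is zero,
$v=0$
If the significand field is not all zeros, the value is a subnormal floating-point number. Its exponent is the one that E'=1 would give: $e=e_{\min}=-126=E'-bias, E'=1$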

$v=(-1)^s\times 2^{e_{\min}}\times 0.xxx\\ 0.xxx=m=2^{-t}\times T=\sum_{i=1}^t 2^{-i}d_i$

The smallest nonzero significand is $m_{\min}=2^{-t}$.

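2. Exponent field all ones, E=255: infinity (T=0) and NaN (T nonzero)

3. Exponent field E $\in[1,254]$: normal floating-point numbers

The w-bit exponent field can encode $2^w$ integers. Excluding the all-zeros and all-ones patterns, the remaining $2^w-2$ values are split into two equal halves: the positive exponents on one side, the negative exponents and zero on the other. This gives the minimum exponent and the maximum exponent: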

$e_{\min}=-126=- (\frac{2^w-2}{2}-1), \\ e_{\max}=127=\frac{2^w-2}{2}$

$E\in[1,2^8-2]=[1,254]$,
$e=E-bias\in[e_{\min},e_{\max}]$,

$v=(-1)^s \times 2^e \times 1.xxx \\ 1.xxx=m=1+2^{-t}\times T= 1+ \sum_{i=1}^t 2^{-i} d_i \in [1,b)$
m stays strictly below 2; with t trailing digits its largest value is

$m_{\max}=1+1-2^{-t}$
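
The three cases for E can be put together in one small decoder. This is only a sketch under the definitions above: `float32_value` is a hypothetical helper, and the arithmetic is done in Python floats (binary64), which can hold every binary32 value exactly.

```python
import math

BIAS = 127    # 2**(w-1) - 1 with w = 8
T_BITS = 23   # t

def float32_value(s, E, T):
    """Value encoded by sign s, biased exponent E and trailing significand T."""
    sign = -1.0 if s else 1.0
    if E == 255:                       # all ones: infinity or NaN
        return sign * math.inf if T == 0 else math.nan
    if E == 0:                         # all zeros: zero or subnormal, m = 0.xxx
        m = T * 2.0 ** -T_BITS
        e = 1 - BIAS                   # e_min = -126
    else:                              # normal, m = 1.xxx
        m = 1 + T * 2.0 ** -T_BITS
        e = E - BIAS
    return sign * m * 2.0 ** e

print(float32_value(0, 127, 0))        # 1.0
print(float32_value(1, 126, 1 << 22))  # -0.75
print(float32_value(0, 0, 1) == 2.0 ** -149)  # True, the smallest subnormal
```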

The exponent field determines overflow and underflow

overflow: the largest finite floating-point number that can be represented

$2^{e_{\max}}\times m_{\max}=2^{e_{\max}}\times (2-2^{-t})=2^{e_{\max}+1}-2^{e_{\max}-t}\approx 3.4\times 10^{38}$
underflow: the smallest absolute value a normal floating-point number can represent (smallest positive normalized floating-point number, smallest normal magnitude)

$2^{e_{\min}}=2^{-126}\approx 1.18\times 10^{-38}$
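
The two limits can be checked against the corresponding bit patterns. In this sketch `from_bits` is a hypothetical helper, and `struct` is assumed to use IEEE binary32 for the `f` format.

```python
import struct

def from_bits(bits):
    """Reinterpret a 32-bit pattern as a binary32 value (returned as a Python float)."""
    (x,) = struct.unpack(">f", struct.pack(">I", bits))
    return x

t, e_max, e_min = 23, 127, -126
largest = (2 - 2.0 ** -t) * 2.0 ** e_max   # m_max * 2**e_max
smallest_normal = 2.0 ** e_min

# 0x7F7FFFFF: E = 254, T all ones.  0x00800000: E = 1, T = 0.
print(largest == from_bits(0x7F7FFFFF))          # True
print(smallest_normal == from_bits(0x00800000))  # True
```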

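The significand field determines the unit roundoff (machine precision, machine epsilon)

The floating-point numbers a machine can represent exactly form a finite, discrete set; these exactly representable numbers are called machine numbers.
When a real number x cannot be represented exactly, it is approximated by a machine number fl(x). Approximating a real number by a floating-point number introduces an error, the rounding error (roundoff error).

The unit roundoff (machine precision, machine epsilon) is an upper bound on the relative error (maximum relative error) that a floating-point system can incur when representing any nonzero real number within the normal range. The error is measured relatively because floating-point numbers are not uniformly spaced: the significand is still multiplied by a power of the base.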

$\left|\frac{fl(x)-x}{x}\right|\leq \epsilon_{\rm mach}$
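There are several rounding rules, and each has its own error bound.

round toward zero (chop): simply truncate the extra digits; for positive x, $0<fl(x)\leq x$
With t=2, 0.751=0.5+0.25+0.001 is stored as 0.5+0.25=0.75.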

$\epsilon_{\rm mach}=2^{-t}$
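
Chopping can be imitated for a toy precision with `math.frexp` and `math.ldexp`. The function `chop` below is a hypothetical helper written for this note; it keeps the leading 1 plus t trailing digits of a positive x and ignores overflow and subnormals.

```python
import math

def chop(x, t):
    """Round x > 0 toward zero, keeping t digits after the leading 1 of the significand."""
    m, e = math.frexp(x)                 # x = m * 2**e with 0.5 <= m < 1
    kept = math.floor(m * 2 ** (t + 1))  # leading 1 plus t trailing digits, as an integer
    return math.ldexp(kept, e - (t + 1))

x = 0.751
fx = chop(x, 2)
print(fx)                         # 0.75
print((x - fx) / x <= 2.0 ** -2)  # True
```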
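round to nearest (even): approximate x by the nearest fl(x); if x lies exactly halfway between two machine numbers, pick the one whose last significand digit is even. This rule gives a smaller error and is the IEEE default.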

$\epsilon_{\rm mach}=\frac{1}{2}\times 2^{-t}$
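
The bound for round to nearest can be observed by rounding Python floats to binary32. This is a sketch, not a proof: `fl` and `rel_error` are hypothetical helpers, the binary64 input plays the role of the real number x, and it is assumed that `struct.pack` with the `f` format rounds to nearest even, as CPython does on IEEE hardware.

```python
import struct
from fractions import Fraction

def fl(x):
    """Round a Python float to the nearest binary32 value (returned as a float)."""
    (y,) = struct.unpack(">f", struct.pack(">f", x))
    return y

def rel_error(x):
    """Exact relative error |fl(x) - x| / |x| as a Fraction, for x > 0."""
    exact = Fraction(x)
    return abs(Fraction(fl(x)) - exact) / exact

U = Fraction(1, 2 ** 24)   # unit roundoff (1/2) * 2**-t with t = 23
samples = [0.1, 1 / 3, 3.141592653589793, 1e10, 123456.789]
for x in samples:
    print(x, float(rel_error(x)), rel_error(x) <= U)

# Ties go to the neighbour whose last significand digit is even.
print(fl(1 + 2.0 ** -24) == 1.0)                  # True
print(fl(1 + 3 * 2.0 ** -24) == 1 + 2.0 ** -22)   # True
```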

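Smallest subnormal magnitude

The smallest positive value that can be represented, the smallest subnormal magnitude:

$2^{e_{\min}}\times 2^{-t}$
Every finite floating-point number is an integer multiple of the smallest subnormal magnitude.
The gap between two adjacent floating-point numbers depends on the exponent; the smallest subnormal magnitude is the smallest such gap.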
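
A quick way to see the changing gap is to step from one bit pattern to the next. The helpers `from_bits` and `gap_above` are hypothetical names for this sketch, which again assumes that `struct` uses IEEE binary32 for the `f` format.

```python
import struct

def from_bits(bits):
    """Reinterpret a 32-bit pattern as a binary32 value (returned as a Python float)."""
    (x,) = struct.unpack(">f", struct.pack(">I", bits))
    return x

def gap_above(bits):
    """Distance from the value with this bit pattern to the next larger one.

    Only meaningful for non-negative finite patterns below 0x7F7FFFFF.
    """
    return from_bits(bits + 1) - from_bits(bits)

TINY = 2.0 ** (-126 - 23)   # smallest subnormal magnitude 2**(e_min - t)

# Patterns for 0, the smallest normal number, 1.0 and 2.0.
for bits in (0x00000000, 0x00800000, 0x3F800000, 0x40000000):
    x = from_bits(bits)
    # value, gap in units of TINY, and whether the value is a whole multiple of TINY
    print(x, gap_above(bits) / TINY, (x / TINY).is_integer())
```

The gap equals the smallest subnormal magnitude throughout the subnormal range and the first normal binade, then doubles with every increase of the exponent.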

subnormal floating-point numbers

$2^{e_{\min}}\times \sum_{i=1}^t d_i 2^{-i} =2^{e_{\min}}\times 2^{-t}\times \sum_{i=1}^t d_i 2^{t-i} =2^{e_{\min}}\times 2^{-t}\times \sum_{i=0}^{t-1} d_{t-i} 2^i$

normal floating-point numbers, writing the implicit leading digit as $d_0=1$

$2^e\times( 1+\sum_{i=1}^t d_i 2^{-i}) =2^{e_{\min}-t}\times 2^{e-e_{\min}}\times (2^t+\sum_{i=1}^t d_i 2^{t-i}) =2^{e_{\min}-t}\times 2^{e-e_{\min}}\times \sum_{i=0}^{t} d_{t-i} 2^i$
Both are integer multiples of the smallest magnitude $2^{e_{\min}-t}$.

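For example, with w=3, $e\in [-2,3]$, and t=4,
the integer multiples covered by normal numbers are only 16-31,
2(16-31), 4(16-31), 8(16-31), 16(16-31), 32(16-31), while the subnormal numbers cover 1-15.
Odd multiples above 31, such as 33, 35, 37, ..., still cannot be represented exactly by a machine number.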
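
The toy system with w=3 and t=4 is small enough to enumerate. The sketch below lists every positive value as an integer multiple of $2^{e_{\min}-t}$; the function name `multiples` is made up for this note.

```python
W, T_BITS = 3, 4
BIAS = 2 ** (W - 1) - 1      # 3
E_MIN = 1 - BIAS             # -2
E_MAX = 2 ** W - 2 - BIAS    # 3

def multiples():
    """Positive values of the toy system, in units of 2**(E_MIN - T_BITS)."""
    out = set(range(1, 2 ** T_BITS))          # subnormals: 1..15
    for e in range(E_MIN, E_MAX + 1):
        scale = 2 ** (e - E_MIN)              # 1, 2, 4, 8, 16, 32
        # normal significands 1.xxxx correspond to the integers 16..31
        out.update(scale * k for k in range(2 ** T_BITS, 2 ** (T_BITS + 1)))
    return out

ms = multiples()
missing = [k for k in range(1, 64) if k not in ms]
print(missing)   # the odd numbers from 33 to 63
```

Every multiple up to 31 is present, and from 32 on only every second one, then every fourth one, and so on.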

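normal floating-point numbers

$2^e\times( 1+\sum_{i=1}^t d_i 2^{-i}) =2^{-t}\times 2^{e}\times (2^t+\sum_{i=1}^t d_i 2^{t-i}) =2^{-t}\times 2^e\times \sum_{i=0}^t d_i 2^{t-i}$
(with $d_0=1$) is not guaranteed to be an integer multiple of $2^{-t}$ once e<0.

With t=4: if e=-1 and T=13, the value is 0.5*(16+13)=14.5 times $2^{-t}$; truncated to a whole multiple it would become 14 times $2^{-t}$, 14=0.5*(16+12), i.e. T=12.
If e=-2 and T=13, the value is 0.25*(16+13)=7.25 times $2^{-t}$; truncated to a whole multiple it would become 7 times $2^{-t}$, 7=0.25*(16+12), i.e. T=12.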
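
These two cases can be reproduced with exact rational arithmetic. The helper `in_units_of_ulp1` is a hypothetical name; it expresses a normal value $2^e\times(1+T\,2^{-t})$ in units of $2^{-t}$ for the toy precision t=4.

```python
from fractions import Fraction

t = 4

def in_units_of_ulp1(e, T):
    """Value 2**e * (1 + T / 2**t), expressed as a multiple of 2**-t."""
    value = Fraction(2) ** e * (1 + Fraction(T, 2 ** t))
    return value / Fraction(1, 2 ** t)

print(in_units_of_ulp1(-1, 13))  # 29/2, i.e. 14.5
print(in_units_of_ulp1(-2, 13))  # 29/4, i.e. 7.25
print(in_units_of_ulp1(-1, 12))  # 14
print(in_units_of_ulp1(-2, 12))  # 7
```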