## 编程知识 cdmana.com

### Reading notes of understanding computer system in depth Chapter 2 representation and processing of information

This chapter mainly studies the sign less number in computer , Complement code , The encoding of floating-point numbers , By studying the actual coding of numbers , We can understand the range of values that can be represented by different types of data in a computer , Properties of different arithmetic operations , You can know how computers handle data overflow . Learn how to code computers , For us to write that can span different machines , Code with different operating systems and compiler combinations can be of great help .

@[TOC]

### Information storage

#### Why there's binary ？ What's the meaning and advantage of binary ？

For having 10 For a human with a finger , It's natural to use decimal notation , But when building machines that store and process information , Binary values work better . Binary signals can be easily represented 、 Storage and transmission . for example , It can be expressed as a perforated card with or without holes 、 High or low voltage on a wire , Or clockwise or counterclockwise magnetic fields . The electronic circuit for storing and calculating binary signals is very simple and reliable , Manufacturers can integrate millions or even billions of such circuits on a single silicon chip . Speaking in isolation , Single bits are not very useful . However , When you put bits together , With some kind of explanation , That is to give meaning to different possible bit patterns , We can represent the elements of any finite set . such as , Using a binary number system , We can Encode nonnegative numbers in bits . By using standard character codes, we can Encode the letters and symbols in the document .

#### Three coding methods of computers

Unsigned ： Unsigned （unsigned） The coding is based on the traditional binary representation , A number that is greater than or equal to zero .

Complement code ： Complement code （two' s-complement） Encoding is the most common way to represent signed integers , Signed integers are numbers that can be positive or negative .

Floating point numbers ： Floating point numbers （ floating-point） Coding is the scientific notation of real numbers 2 Base version .

#### Integers & Floating point numbers

In the computer , The operation of integers conforms to the commutative and associative laws of operators , The result of the overflow will be expressed as a negative number . The encoding range of integers is relatively small , But the result is accurate .

The operation of floating-point number is not combined , And its overflow produces special values —— It's just infinite . Floating point numbers have a wide coding range , But the result is approximate .

The main reason for the difference is that the encoding formats of integers and floating-point numbers are different .

#### Virtual memory & Virtual address space

Most computers use 8 Block of bits , Or bytes （byte）, As The smallest addressable unit of memory , Instead of accessing individual bits in memory . Machine level programs treat memory as a very large Byte array , be called Virtual memory （ virtual memory）. Each byte of memory is identified by a unique number , It's called Address （address）, The set of all possible addresses is called Virtual address space （ virtual address space）.

The pointer is made by Data types and pointer values Composed of , Its value represents something The location of the object , And its Type represents the type of the object stored at that location （ Like integers or floating-point numbers ）.C The pointer value of any type in the language corresponds to a Virtual address .C Language compilers can be based on Different types of pointer values generate different machine codes to access the values stored at the location the pointer points to . But the reality that it generates Machine level programs do not contain information about data types .

#### Binary system & Decimal system & Hexadecimal

##### Binary to hexadecimal （ Group switching ）

A four bit binary can represent one hexadecimal bit . Binary and hexadecimal conversion methods are shown in the table below . There will be no further explanation .

Hexadecimal 1 7 3 A 4 C
Binary system 0001 0111 0011 1010 0100 1100

Gamma Formula display $\Gamma(n) = (n-1)!\quad\forall n\in\mathbb N$ It's through Euler integral

set up x by 2 Of Non-negative integer n The next power when , That is to say $x = {2^n}$. We can easily put x In hexadecimal form , Just remember that x The binary representation of is 1 Followed by n individual 0（ such as
$1024 = {2^{10}}$, Binary for 10000000000）. Hexadecimal number 0 representative 4 It's binary 0. therefore , When n Expressed as i+4j In the form of , among 0≤i≤3, We can x The hexadecimal number written at the beginning is 1（i=0）、2（i=1）、4（i=2） perhaps 8（i=3）, Followed by j Hexadecimal 0. such as ,$2048 = {2^{11}}$, We have n=11=3+4*2, So the hexadecimal representation is 0x800. Here are a few more examples .

n ${2^{n}}$（ Decimal system ） ${2^{n}}$（ Hexadecimal ）
9 512 0x200
19（3+4*4） 524288 0x80000
14（2+4*2） 16384 0x4000
16（0+4*4） 65536 0x10000
17（1+4*4） 131072 0x20000
5（1+4*1） 32 0x20
7（3+4*1） 128 0x80

There is another way to convert decimal to hexadecimal ： division . In turn, , The conversion from hexadecimal to decimal can be done with the corresponding 16 Times each hexadecimal digit .

#### The range of virtual addresses

Every computer has a word length （ word size）, Indicates the nominal size of the pointer data （ nominal size）. Because the virtual address is encoded in such a word , therefore The most important system parameter determined by word length is the maximum size of the virtual address space . in other words , For a word is w position As far as the machine is concerned , The virtual address range is 0~${2^{w}}$-1 . Most of the programs visit ${2^{w}}$ Bytes .

16 Address range of bit length machine ：0~65535(FFFF)

32 Address range of bit length machine ：0~4294967296（FFFFFFFF,4GB）

64 Address range of bit length machine ：0~18446744073709551616（1999999999999998,16EB）

32 Bit compilation instruction ：gcc -m32 main.c

64 Bit compilation instruction ：gcc -m64 main.c

#### C Typical size of language basic data types （ Bytes are units ）

A signed Unsigned 32 position 64 position
[signed] char unsigned char 1 1
short unsigned short 2 2
int unsigned int 4 4
long unsigned long 4 8
int32_t uint32_t 4 4
int64_t uint64_t 8 8
char* 4 8
float 4 4
double 8 8

Be careful ： basic C The typical size of a data type. The number of bytes allocated is determined by how the compiler compiles , It's not determined by the number of machines . This table shows 32 Bit and 64 Typical values of bit programs .

In order to avoid being dependent on “ A typical ” Strange behavior caused by size and different compiler settings ,ISOC99 Class data type is introduced , The data size is fixed , Does not change with compiler and machine settings . There are data types int32_t and int64_t, They are 4 Bytes and 8 Bytes . Using a fixed size integer type is the best way for programmers to control data representation accurately .

In terms of the order of keywords and whether to include or omit optional keywords ,C Language allows for many forms . such as , All the statements below mean the same thing ：

unsigned long
unsigned long int
long unsigned
long unsigned int

#### Big end & The small end

Big end ： The high byte of data is stored in the low address of memory , The low byte of data is stored in the high address of memory , This kind of storage mode is a bit similar to processing data in string order ： The address increases from small to large , And the data goes from high to low .

The small end ： The high byte of data is stored in the high address of memory , And the low byte of data is stored in the low address of memory , This storage mode effectively combines the high and low address with the data bit weight , The weight of high address part is high , The weight of low address part is low , Consistent with our logical approach .

for instance , Hypothetical variables x The type of int, At the address 0x100 It's about , Its hexadecimal value is 0x01234567. Address range 0x100~0x103 The byte order of depends on the type of machine .

Big end method

data 01 23 45 67

Small end method

data 67 45 23 01
Be careful , In the word 0x01234567 in , The hexadecimal value of the high byte is 0x01, And the low byte value is 0x67.

Memory mode ：

Big end == High tail , That's the end （67） Put it on the high address （0x103）.

The small end == Low end , That's the end （67） Put it at a low address （0x100）.

###### Expand ： What's the meaning of the big and small ends ？

1. Data transmission between different devices

A The device is in small end mode ,B The device is in big end mode . When the Internet will A Data from the device is transmitted to B Equipment time , There will be problems .（B How the device switches A The data of the device will be explained in the following chapter ）

hypothesis Intel x86-64(x86 All belong to small end ) The code for generating a program is as follows ：

4004d3:01 05 43 0b 20 00              add   %eax,0x200b43(%rip)

This instruction is to add a word length of data to a value , The storage address of the value is from 0x200b43 Add the value of the current program counter to , The value of the current program counter is the address of the next instruction to be executed .

We are used to reading in the order of the lowest on the left , The highest bit is on the right ,0x00200b43. The lowest bit of the anti sink code generated by the small end mode is on the right side , The highest position is on the left ,01 05 43 0b 20 00. It's the opposite of our reading order .

3. Write a general program that conforms to various systems

/* Print the byte representation of the program object . This code uses cast to circumvent the type system . It's easy to define similar functions for other data types */
#include <stdio.h>
typedef unsigned char* byte_pointer;
/* Pass to  show_bytes They point to a parameter x The pointer to &x, And the pointer is cast to “unsigned char*”. This cast tells the compiler , The program should think of this pointer as pointing to a sequence of bytes , Instead of pointing to an object of primitive data type .*/
void show_bytes(byte_pointer start, size_t len){
size-t i;
for (i=0;i<len;i++)
printf("%.2x",start [i]);
printf("\n");
}
void show_int (int x){
show_bytes ((byte_pointer)&x,sizeof(int));
}
void show_float (float x){
show_bytes ((byte_pointer)&x,sizeof(float));
}
void show_pointer (void* x){
show_bytes ((byte_pointer)&x,sizeof(void* x));
}
void test_show_bytes (int val){
int ival = val;
float fval =(float)ival;
int *pval = &ival;
show_int(ival);
show_float(fval);
show_pointer(pval);
}

The above code prints the byte representation of the sample data object in the following table ：

machine value type byte （ Hexadecimal ）
Linux32 12345 int 39 30 00 00
Windows 12345 int 39 30 00 00
Linux64 12345 int 39 30 00 00
Linux32 12345.0 float 00 e4 40 46
Windows 12345.0 float 00 e4 40 46
Linux64 12345.0 float 00 e4 40 46
Linux32 &ival int * e4 f9 ff bf
Windows &ival int * b4 cc 22 00
Linux64 &ival int * b8 11 e5 ff ff 7f 00 00

notes ：Linux64 by x86-64 processor .

Besides byte order ,int and float The result is the same . Pointer values are related to the machine type . Parameters 12345 The hexadecimal representation of is 0x00003039. about int Data of type , Besides byte order , We get the same results on all machines . Besides , The pointer values are completely different . Different machines / Operating system configuration uses different storage allocation rules .（ Linux32、 Windows Machine use 4 Byte address , and Linux64 Use 8 Byte address ）

Can be observed , Even though floating-point and integer data are logarithms 12345 code , But they have very different byte patterns ： The integer is 0x00003039, And floating point numbers are 0x4640E400. generally speaking , The two formats use different encoding methods .

#### An operator & Logical operators

An operator ：& | ~ ^. Logical operators ：&& || !. Especially ~ and ！ The difference between , See the following example .

expression result
!0x41 0x00
!!0x41 0x01
!0x00 0x01
~0x41 0x3E
~0x00 0xff

“！” Logical nonoperator , Logical operators generally treat their operands as conditional expressions , The return result is Bool type ：“！true” The condition is true （true）.“！false ” The condition is false （false）.

"~" An operator , The negation of the representative position , For shaping variables , Reverse each bit ,0 change 1,1 change 0.

^ Is the exclusive or operator , There is an important property ：a ^ a = 0,a ^ 0= a. That is, any number and its own exclusive or result in 0, and 0 The XOR result is still the original number . Take advantage of this property , We can find that there is only one occurrence in the array / two / Three times the number . How to find it ？

example 1： Suppose you give an array arr, Except that an element only appears once , Each of the other elements occurs twice . Find the element that only appears once .

Ideas ： The rest of the elements appear twice , therefore , XOR all elements in the array in turn , The final result is an element that appears only once . such as ,arr = [0,0,1,1,8,8,12],0 ^ 0 ^ 1 ^ 1^ 8^ 8^ 12 = 12. Interested in their own programming to try .

example 2： Given an array of integers arr, There are exactly two elements that only appear once , All the other elements appear twice . Find the two elements that only appear once .

Ideas ： First, you can get the XOR value of two numbers that appear once by XOR , The XOR of the XOR is 1 Of bit Bits must come from one of these two numbers . Then you can choose any one as 1 Of bit position , According to this bit position , Put all the bits as 1 The numbers are divided into groups , All the bits are 0 The numbers are divided into groups , This becomes a search for numbers that only appear once in two subarrays .

example 3： Suppose you give an array arr, Except that an element only appears once , Each of the other elements appears three times . Find the element that only appears once .

Ideas ： You can think about it for yourself .

Logical operators && and || There is also a property of short-circuit evaluation . As follows .

If you evaluate the first parameter, you can determine the result of the expression , Then the logical operator does not evaluate the second parameter . Common examples are as follows

int main()
{
int a=3,b=3;
(a=0)&&(b=5);
printf("a=%d,b=%d\n",a,b); // a = 0 ,b = 3
(a=1)||(b=5);
printf("a=%d,b=%d",a,b); // a = 1 ,b = 3
} 

a=0 It's false, so it's not true B To operate

a=1 It's true , So there's no right b To operate

#### Logical left shift and arithmetic left shift

Logic shift left （SHL） And the arithmetic moves left (SAL), Same rules , Add... To the right 0

Logical shift right (SHR), Add unity to the left 0

Count right (SAR), The number added to the left is related to the sign ( Positive complement 0, Negative numbers make up 1)

For example, a sign bit 8 Bit binary number 11001101, Logical shift right regardless of sign bit , If you move a bit, it becomes 01100110. The arithmetic shift right depends on the sign bit , Shift one bit to the right 11100110.

e.g:1010101010, among [] Bits are added numbers

Logical shift left one bit ：010101010

Count left one bit ：010101010

Logic shifts one bit to the right ：101010101

Move the number one bit to the right ：101010101

Shift symbol （<<, >>, >>>）

<<, Signed left shift , Move the binary integer of the operands to the left by a specified number of digits , In the low position 0 A filling .

>>, Sign shift right , Shift the binary whole of the operand to the right by specifying the number of digits , The high position of a positive number is used 0 A filling , The high position of a negative number is used 1 A filling （ Keep the sign of negative numbers unchanged ）.

##### Expand ： What to do when the number of moving digits is greater than the actual number of digits ？

For a by w Bit data type , If you want to move k≥w What's the result ？ for example , What is the result of evaluating the following expression , Suppose the data type int by w=32.

int lval=OxFEDCBA98 << 32;
int aval=0xFEDCBA98 >> 36;
unsigned uval = OxFEDCBA98u >>40;

C Language standards are careful to circumvent the specification of what to do in this case . On many machines , When moving a w Bit value , The shift instruction only considers the low displacement $[{\log _2}w]$ position , So the actual displacement is calculated by k mod w Got . for example , When w=32 when , The above three shift operations are respectively moving 0、4 and 8 position , Get the results :

lval OxFEDCBA98
aval OxFFEDCBA9
uvaL OXOOFEDCBA

But this kind of behavior is very important to C There is no guarantee for the program , Therefore, we should keep the displacement less than the number of digits to be shifted .

### An integer represents

Some of the agreed terms are as follows

Symbol type meaning
$[B2{T_w}]$ function Binary complement code
$[B2{U_w}]$ function Binary to unsigned number
$[U2{B_w}]$ function To convert an unsigned number to a binary number
$[U2{T_w}]$ function No sign to complement code
$[T2{B_w}]$ function Complement to binary
$[T2{U_w}]$ function From complement to signed number
$T{Min_w}$ constant Minimum complement value
$T{Max_w}$ constant Maximum complement value
$U{Max_w}$ constant The maximum signed number
$+ _w^t$ operation Complement addition
$+ _w^u$ operation The addition of unsigned numbers
$* _w^t$ operation Complement multiplication
$* _w^u$ operation Multiplication of unsigned numbers
$- _w^t$ operation The complement is reversed
$- _w^u$ operation Negate the signed number

#### The coding of an unsigned number

The definition of the code of the unsigned number ：

For vectors $\vec x = [{x_{w - 1}},{x_{w - 2}}, \cdots ,{x_0}]$:$B2{U_w}(\vec x) = \sum\limits_{i = 0}^{w - 1} {{x_i}{2^i}}$.

among ,$\vec x$ As a binary representation of a number , Each bit ${x_i}$ The value is 0 or 1. Here's an example .

$B2{U_4}() = 0*{2^3} + 0*{2^2} + 0*{2^1} + 1*{2^0} = 0 + 0 + 0 + 1 = 1$

$B2{U_4}() = 1*{2^3} + 1*{2^2} + 1*{2^1} + 1*{2^0} = 8 + 4 + 2 + 1 = 15$

The maximum value that an unsigned number can represent is ：$UMa{x_w} = \sum\limits_{i = 0}^{w - 1} {{2^w} - 1}$

#### The encoding of the complement code

The definition of complement code ：

For vectors $\vec x = [{x_{w - 1}},{x_{w - 2}}, \cdots ,{x_0}]$:$B2{T_w}(\vec x) = - {x_{w - 1}}{2^{w - 1}} + \sum\limits_{i = 0}^{w - 2} {{x_i}{2^i}}$.

Most significant bit ${x_{w - 1}}$ Also known as sign bit , its “ The weight ” by $- {2^{w - 1}}$, Is the negative number of the weight in the unsigned representation . The sign bit is set to 1 when , Indicates that the value is negative , And when set to 0 when , The value is nonnegative . for instance

$B2{T_4}() = - 0*{2^3} + 0*{2^2} + 0*{2^1} + 1*{2^0} = 0 + 0 + 0 + 1 = 1$

$B2{T_4}() = - 1*{2^3} + 1*{2^2} + 1*{2^1} + 1*{2^0} = - 8 + 4 + 2 + 1 = - 1$

w The range that a bit complement can represent ：$TMi{n_w} = - {2^{w - 1}}$,$TMa{x_w} = {2^{w - 1}} - 1$.

#### Note the need for complements

First of all , The range of complement is asymmetric ：$\left| {TMin} \right| = \left| {TMax} \right| + 1$, in other words ,$TMin$ There is no corresponding positive number . The reason for this asymmetry , Because half of the bit patterns （ The sign bit is set to 1 Number of numbers ） A negative number , And the other half （ The sign bit is set to 0 Number of numbers ） It means a non negative number . because 0 It is a non negative number , It means that there is one less integer than a negative number .

second , The largest unsigned value is just twice the maximum of the complement ：$UMa{x_w} = 2TMa{x_w} + 1$. All bit patterns of negative numbers in the complement representation become positive in the unsigned representation .

notes ： Complements are not the only way for computers to express negative numbers , It's just that everyone has adopted this approach . Computers can also express negative numbers in other ways , Like the original code and the reverse code . If you are interested, you can continue to learn more about .

#### Determine the size of the data type

In systems of different bit lengths ,int,double,long The number of digits occupied by etc. is different , The size of the range it can represent is also different , How to write a universal program ？ISO C99 The standard is in the document stdint.h The class of integer type is introduced in . This file defines a set of data types , Their statement is in the form of ：intN_t,uintN_t(N The value is generally 8,16,32,64). such as ,uint16_t In any operating system, you can express a 16 The signed variable of bit .int32_t Express 32 Bits have signed variables .

alike , The maximum and minimum values of these data types are represented by a set of macro definitions , such as INTN_MIN,INTN_MAX and UINTN_MAX.

When printing certain types of content , You need to use a macro .

such as , Print int32_t,uint64_t, You can do this in the following way ：

printf("x=%"PRId32",y=%"PRIu64"\n",x,y);

Compiled into 64 Bit program , macro PRId32 Expand it into a string “d”, macro PRIu64 Then expand it into two strings “l”“u”. When C Preprocessor encountered only spaces （ Or other blank characters ） When separating a sequence of string constants , Just connect them in series . therefore , above printf The call becomes ：printf("x=%d.y=%lu\n",x,y);

Using macros ensures that ： No matter how the code is compiled , Can generate the correct format string .

#### The conversion between the signed number and the complement

The complement is converted to an unsigned number ：

For satisfying Examples are as follows ：

x $T2{U_4}(x)$ explain
-8 8 -8<0,-8+${2^4}$=16,$T2{U_4}(-8)$=16
-3 13 -3<0,-3+${2^4}$=13,$T2{U_4}(-3)$=16
-2 14 -2<0,-2+${2^4}$=14,$T2{U_4}(-2)$=16
-1 15 -1<0,-1+${2^4}$=15,$T2{U_4}(-1)$=16
0 0 $T2{U_4}(0)$=0
5 5 $T2{U_4}(5)$=5

Signed number to complement code ：

For satisfying $0 \le u \le UMa{x_w}$ Of u Yes ： Combine the following two pictures to understand ：

The conversion from a complement to an unsigned number . function T2U Convert negative numbers to large positive numbers The conversion from an unsigned number to a complement . function U2T The greater than ${2^{w - 1}} - 1$ To a negative value #### The conversion between signed and unsigned numbers

As mentioned earlier , Complements are not the only way for computers to express negative numbers , But almost all computers use complements to represent negative numbers . Therefore, to convert an unsigned number to a signed number is to use a function $U2{T_w}$, And from signed number to unsigned number is the application function $T2{U_w}$.

Be careful ： When performing an operation , If one of its operands is signed and the other is unsigned , that C Language implicitly Cast a signed parameter to an unsigned number , And suppose that both numbers are nonnegative , To perform this operation .

such as , Suppose the data type int Expressed as 32 Bit complement , Find the expression -1<0U Value . Because the second operands are unsigned , The first operand is implicitly converted to an unsigned number , So the expression is equivalent to 4294967295U<0U, So the value of the expression is 0.

#### Expand the numbers

To convert an unsigned number to a larger data type , We simply add... At the beginning of the representation 0. This operation is called Zero expansion （ zero extension）.

such as , take 16 Signed number of bits 12（0xC）, Expand to 32 Position as 0x0000000C.

To convert a complement number to a larger data type , You can execute a Symbol extension （ sign exten sion）, That is, the extended symbol bit .

such as , take 16 Signed number of bits -25（0x8019,1000000000011001）, Expand to 32 Position as 0xffff8019.

#### Truncate numbers

The truncation formula of the signed number is ：$B2{U_k}(x) = B2{U_w}(X)mod{2^k}$ Equivalent to $x' = x\bmod {2^k}$,$x'$ Cut it off k The result of a .

such as , take 9 from int Convert to short, It needs to be truncated 16 position ,k=16.$9\bmod {2^{16}} = 9$. therefore ,9 from int Convert to short As the result of the 9.

The truncation formula of signed numbers is ：$B2{T_k}(x) = U2{T_k}(B2{U_w}(X)mod{2^k})$ Equivalent to $x' = U2{T_k}(x{\kern 1pt} {\kern 1pt} {\kern 1pt} mod{2^k})$,$x'$ Cut it off k The result of a .

such as , take 53791 from int Convert to short, It needs to be truncated 16 position ,k=16.$53791\bmod {2^{16}} = 53791$,$U2{T_{16}}(53791) = 53791 - 65536 = - 12345$. therefore ,53791 from int Convert to short As the result of the -12345.

Several examples of the truncation of an unsigned number （ take 4 The bit value is truncated to 3 position ）

Original value Cutoff value
0 0
2 2
9 1
11 3
15 7

Several examples of truncated signed numbers （ take 4 The bit value is truncated to 3 position ）

Original value Cutoff value
0 0
2 2
-7 1
-5 3
-1 -1

#### Summary

On the conversion of signed and unsigned numbers , The expansion and truncation of numbers , It often happens in Different types of , Different bit length The conversion of numbers , These operations are usually done automatically by the computer , But we'd better know how computers do the conversion , This is for us to check BUG It's especially useful . We don't have to remember them all , But when something goes wrong , We need to know where to check .

### Integer operation

#### The addition of unsigned numbers

To satisfy , Under normal circumstances ,x+y The value of remains unchanged , In the case of spillover, we should subtract from the sum ${2^{\rm{w}}}$ Result .

such as , Consider one 4 Bit number representation （ The maximum value is 15）,x=9,y=12, And for 21, Out of range . that x+y As the result of the 9+12-15=6.

To satisfy When ${{2^{w - 1}} \le x + y}$, There is a positive overflow , When ${w + y < - {2^{w - 1}}}$, There is a negative overflow . When ${ - {2^{w - 1}} \le x + y < {2^{w - 1}}}$, normal . Refer to the following figure for details. . An example is shown in the table below （ With 4 Take the addition of bit complements as an example ）

x y x+y $x + _4^ty$ situation
-8 -5 -13 3 1
-8 -8 -16 0 1
-8 5 -3 -3 2
2 5 7 7 3
5 5 10 -6 4

#### The complement is not

To satisfy $TMi{n_w} \le x \le TMa{x_w}$ Of x, Its complement is not $- _w^tx$ Given by the following formula （ Make complaints about CSDN, Use typora Write well latex The formula , Paste it and report an error , original CSDN Of Markdown Yes, it is Katex Rendered . Isn't it an increase in workload ？）

in other words , Yes w In addition to the complement of bits ,${TMi{n_w}}$ It's the inverse of one's own addition , And for any other number x There are -x As the inverse of its addition .

#### Multiplication of unsigned numbers

To satisfy $0 \le x,y \le UMa{x_w}$ Of x and y Yes ：$x*_w^uy = (x*y)mod{2^w}$.

#### Multiplication of complements

To satisfy $TMi{n_w} \le {\rm{x}}{\rm{y}} \le TM{\rm{a}}{{\rm{x}}_{\rm{w}}}$ Of x and y Yes ：$x*_w^ty = U2{T_w}((x*y)mod{2^w})$.

give an example ,3 The result of digit multiplication

Pattern x y x * y Truncated x * y
Unsigned 4  5  20  4 
Complement code -4  -3  12  -4 
Unsigned 2  7  14  6 
Complement code 2  -1  -2  -2 
Unsigned 6  6  36  4 
Complement code -2  -2  4  -4 

#### Multiplication of constants and sign numbers

On most machines , Integer multiplication instructions are quite slow , need 10 Or more clock cycles , But other integer operations （ Such as addition 、 Subtraction 、 Bit level operations and shifts ） It only needs 1 Clock cycles . therefore , The compiler uses an important optimization , Try a combination of shift and addition instead of multiplication by constant factors .

Because integer multiplication is much more expensive than shift and addition , many C The language compiler attempts to shift 、 The combination of addition and subtraction eliminates many cases where integers are multiplied by constants . for example , Suppose a program contains expressions x*14. utilize $14 = {2^3} + {2^2} + {2^1}$, The compiler rewrites the multiplication as （x<<3）+（x<<2）+（x<1）, Replace a multiplication with three shifts and two additions . No matter what x Is it unsigned or a complement , Even when multiplication leads to overflow , Both calculations give the same result .（ This can be proved by the properties of integer operations .） What's better is , The compiler can also take advantage of properties $14 = {2^4} - 1$, Rewrite multiplication to （x<<4）-（x<<1）, In this case, only two shifts and one subtraction are needed .

Sum up the following , For a constant K The expression of x * K The generated code . We can calculate the effect of these bits on the product in one of two different forms ：

form A:$(x < < n) + (x < < (n - 1)) + \cdots + (x < < m)$

form B:$(x < < (n + 1)) - (x < < m)$

For embedded development , We often use this way to operate registers . In programming , We should get used to using shift operation instead of multiplication , Can greatly improve the efficiency of the code .

#### The division of constants and signed numbers

On most machines , Division of integers is slower than multiplication of integers — need 30 Or more clock cycles . Divide 2 The power of can also be realized by shift operation , It's just that we're moving right , Instead of moving left . No sign and complement are used respectively Logical shift and Arithmetic shift To achieve the goal .

##### The division of unsigned numbers

It's very simple to use shift for unsigned operations , The reason for the shift to the right must be partly due to the absence of sign Logical shift right . At the same time pay attention to , Shifts are always rounded to zero .

Examples are as follows , With 12340 Of 16 Bits represent a logical shift to the right k The result of a . The zeros moved in at the left end are shown in bold .

k >>k( Binary system ) Decimal system $12340/{2^k}$
0 0011000000110100 12340 12340.0
1 0001100000011010 6170 6170.0
4 0000001100000011 771 771.25
8 0000000000110000 48 48.203125
##### Division of complements （ Round down ）

For dividing by 2 For the complement operation of the power of , It's a little more complicated . First , To ensure that negative numbers are still negative , What the shift does is Arithmetic shift right .

about x≥0, Variable x The most significant bit of is 0, So the effect is the same as logical shift right . therefore , For non negative numbers , Arithmetic shift right k Bit sum divided by ${2^k}$ It's the same .

An example is shown below , Yes -12340 Of 16 Bits represent arithmetic shift right k position . For cases where rounding is not required （k=1）, The result is $x/{2^k}$. When rounding is required , Displacement results in Round down . for example , Move right 4 A bit will put -771.25 Round down to -772. We need to adjust our strategy to deal with negative numbers x Division .

k >>k（ Binary system ） Decimal system $-12340/{2^k}$
0 1100111111001100 -12340 -12340.0
1 1110011111100110 -6170 -6170.0
4 1111110011111100 -772 -771.25
8 1111111111001111 -49 -48.203125
##### Division of complements （ Round up ）

We can do this by “ bias （ biasing）” This value , To correct this improper rounding .

The following table shows how adding an appropriate offset before performing an arithmetic right shift can result in the result being rounded correctly . In the 3 Column , We have given -12340 The result after adding the partial value , low k position （ The ones that move out to the right ） In italics . We can see , low k The bit to the left may add 1, Maybe not 1. For cases where rounding is not required （k=1）, Adding biases only affects the bits that have been removed . For situations where rounding is required , The addition of bias results in a higher bit addition 1, So it turns out that Round to zero .

k Biased quantity -12340+ Biased quantity >>k（ Binary system ） Decimal system $-12340/{2^k}$
0 0 1100111111001100 1100111111001100 -12340 -12340.0
1 1 1100111111001101 1110011111100110 -6170 -6170.0
4 15 1100111111011011 1111110011111100 -771 -771.25
8 255 1101000011001011 1111111111001111 -48 -48.203125

#### summary

Now we see , Divide 2 The power of can be realized by logical or arithmetic right shift . That's why most machines offer both types of right shift . Unfortunately , This method cannot be extended to divide by any constant . Different from multiplication , We can't divide by 2 To divide by an arbitrary constant K Division .

### Floating point numbers

#### Binary decimals

A decimal code for binary numbers ：$b = \sum\limits_{i = - n}^m {{2^i} \times {b_i}}$. Moving the binary decimal point one bit to the left is equivalent to the number being 2 except , Moving the binary decimal point one bit to the right is equivalent to multiplying a number by 2.

#### IEEE Floating point means

IEEE Floating point standard uses $V = {( - 1)^s} \times M \times {2^E}$ To express a number

• Symbol （sign）s Decide that this number is negative （s=1） Still positive （s=0）, And for numbers 0 The symbolic bit interpretation of is treated as a special case .
• mantissa （ significand）M It's a binary decimal , Its scope is $1 \sim 2 - \varepsilon$, Or is it $0 \sim 1 - \varepsilon$.
• Order code （ exponent）E Its function is to weight floating-point numbers , The weight is 2 Of E The next power （ It could be a negative number ）. Divide the bit representation of a floating-point number into three fields , Code these values separately ：
• A single sign bit s Direct encoding symbols s
• k Step code field of bit $\exp = {e_{k - 1}} \cdots {e_1}{e_0}$ Code level code E.
• n Decimal field $frac = {f_{n - 1}} \cdots {f_1}{f_0}$ Code mantissa M, But the encoded value also depends on whether the value of the order code field is equal to 0.

C How to code in a language ：

Single precision floating point format （float） —— s、exp and frac The fields are 1 position 、k = 8 Bit and n = 23 position , Get one 32 Who said .

Double precision floating point format （double） —— s、exp and frac The fields are 1 position 、k = 11 Bit and n = 52 position , Get one 64 Who said . according to exp Value , The encoded values can be divided into three different situations ： situation 1： Normalized values —— exp Bit patterns of ： It's not all about 0（ The number 0）, Not all for 1（ The single precision value is 255, The double precision value is 2047）.

The value of the step code ：E = e - Bias（ Offset coding ）

e yes An unsigned number , Its position is expressed as ${e_{k - 1}} \cdots {e_1}{e_0}$, Under single precision, the value range is 1~254. The value range of double precision is 1 ~ 2047.

Offset value $Bias = {2^{k - 1}} - 1$, In single precision, it is 127, Double precision is 1023.

So the stage code E Value range of ： In single precision, it is -126 ~ +127. Double precision is -1022 ~ 1024.

e The scope of the ：1~254

Bias Value ：127

E The scope of the ：-126~127

The value of the mantissa ：M=1+f（ Implicit encoding , Because there's an implicit 1, So it's impossible to say 0）

among $0 \le f \le 1.0$,f The binary representation of $0.{f_{n - 1}} \cdots {f_1}{f_0}$, That is, the binary decimal point is to the left of the most significant digit .

So an implicit 1,M The value range of is $1.0 \le M \le 2.0$.

Why not exp Fields are encoded with complements ？ Why use offset coding form ？

exp If the field is a complement code , Comparing two floating-point numbers is to compare the sign bit of the complement , And compare the encoding bits .

And in the exp Offset coding is used in the field , We only need to compare the unsigned numbers once e The value of .

give an example ：float f = 15213.0

$\begin{array}{l} {15213_{10}} = {11101101101101_2}\ \quad {\kern 1pt} \;\;\, = {1.1101101101101_2} \times {2^{13}} \end{array}$

$\begin{array}{l} M = {1.1101101101101_2}\ frac = {11011011011010000000000_2} \end{array}$

be ：

$\begin{array}{l} E = 13\ Bias = 127\ \exp = 140 = {10001100_2} \end{array}$ situation 2： Denormalized values —— exp The bit pattern of is full 0.

The value of the step code ：E = 1 - Bias.

The value of the mantissa ：M = f（ There is no implied 1, Can be said 0）

Denormalized numbers have two uses ：

Represents the value 0 —— Just the mantissa M = 0.

Very close to 0.0 Number of numbers

situation 3： Special values —— exp The bit pattern of is full 1.

When the decimal fields are all 0 when , The resulting value represents infinity , When s = 0 When is $+ \infty$, When s=1 When is $-\infty$. When the decimal field is nonzero 0, obtain NaN(Not a Number).

#### The operation rules of floating point numbers

##### Multiply integers and floating-point numbers

The rules ：$x{ + _f}y = Round(x + y)$,$x{ \times _f}y = Round(x \times y)$, among $Round(x \times y)$ Follow the rounding rules in the following table .

1.4 1.6 1.5 2.5 -1.5
towards 0 Round off 1 1 1 2 -1
Round to infinity 1 1 1 2 -2
Round to infinity 2 2 2 3 -1
Even rounding （ rounding ） 1 2 2 2 -2
##### Multiply two floating-point numbers

Two floating point multiplication rules ：$({( - 1)^{s1}} \times M1 \times {2^{E1}}) \times ({( - 1)^{s2}} \times M2 \times {2^{E2}}) = {( - 1)^S} \times M \times {2^E}$

S:s1^s2

M:M1+M2

E:E1+E2

Floating point addition rules ：${( - 1)^{s1}} \times M1 \times {2^{E1}} + {( - 1)^{s2}} \times M2 \times {2^{E2}} = {( - 1)^S} \times M \times {2^E}$

S and M The value of is the result of the addition of two floating-point decimal points .

E:E1 ( hypothesis E1>E2) #### Even rounding of floating-point numbers

For example, the excess number of significant digits exceeding the specified number is 1001, it More than half of the limit （ namely 0.5）, Therefore, the lowest position goes into 1. If the extra number is 0111, It's less than half the lowest , Then leave out the extra numbers （ Truncate the mantissa 、 truncation ） that will do . For extra numbers, it's 1000、 It happens to be the special case of the lowest half , The lowest is 0 Then we give up the extra places , The lowest is 1 Then carry 1、 So that the lowest position is still 0（ even numbers ）.

Note that the digits here are binary numbers .

give an example ： After the decimal point is required 3 position .

about 1.0011001, After rounding, it is 1.010（ Get rid of the excess 4 position , Add 0.001）
about 1.0010111, After rounding, it is 1.001（ Get rid of the excess 4 position ）
about 1.0011000, After rounding, it is 1.010（ Get rid of the excess 4 position , Add 0.001, So that the lowest position is 0）

about 1.1001001, After rounding, it is 1.101（ Get rid of the excess 4 position , Add 0.001）
about 1.1000111, After rounding, it is 1.100（ Get rid of the excess 4 position ）
about 1.1001000, After rounding, it is 1.100（ Get rid of the excess 4 position , No addition , Because the lowest position is already 0）

about 1.01011, After rounding, it is 1.011（ Get rid of the excess 2 position , Add 0.001）
about 1.01001, After rounding, it is 1.010（ Get rid of the excess 2 position ）
about 1.01010, After rounding, it is 1.010（ Get rid of the excess 2 position , No addition ）

##### Be careful

Floating point operations do not support associative law .

give an example ：(1e10+3.14)-1e10=0,3.14+（1e10-1e10）=3.14. Because of rounding , The first expression will be lost 3.14.

give an example ：（1e20 1e20）1e-20 It evaluates to positive infinity , and 1e20 （1e201e-20） = 1e20.

#### C Floating point numbers in languages

stay C In language , When in int、float and double When casting between formats , The principle of changing numerical and bit patterns is as follows （ hypothesis int yes 32 Bit ）

• from int convert to float, Numbers don't spill over , But it can be rounded .
• from int or float convert to double, because double There is a wider range of （ That is, the range of values that can be represented ）, It also has higher accuracy （ That's the number of significant digits ）, So we can keep the exact values .
• from double convert to float, Because the scope is smaller , So the value may overflow into $+ \infty$ or $- \infty$. in addition , Because of its low accuracy , It may also be rounded off from float perhaps double convert to int, The value will be rounded to zero . for example ,1.999 Will be converted to 1, and -1.999 Will be converted to -1. Further on , Values may overflow .C The language standard does not specify a fixed result for this situation . A conversion from floating point numbers to integers , If you can't find a reasonable integer approximation for the floating-point number , It will produce such a value . therefore , expression （int）+1e10 You'll get -21483648, From a positive value to a negative value .

give an example ：int x = ...; float f = ....;double d =... ;

expression Yes / wrong remarks
x == (int)(float)x wrong float There are not enough bits to indicate int, Conversion can cause precision loss
x == (int)(double)x Yes
f ==(float)(double)f Yes
d == (double)(float)d wrong float->double Insufficient precision
f == -(-f) Yes
2/3 == 2/3.0 wrong
d<0.0 ==> ((d*2)<0.0) Yes
d>f ==> -f >-d Yes
d*d >=0.0 Yes
(d+f)-d == f wrong There is no law of Union

### summary

In this chapter, we need to master the main content of ： An unsigned number , Complement code , The encoding of signed numbers , The size of the range that can be represented , The rules of conversion , Operational rules . The encoding method of floating point numbers can be understood , This part is a little difficult to understand , If it's useful later, I'll come back and have a closer look , But for C The conversion rules from other data types to floating point numbers in the language should be mastered .

Develop habits , Praise first and then watch ！ If you think it's good , Welcome to your attention , give the thumbs-up , forward , Thank you for your time ！ 