← Notes

Data Representation

Number Systems

Decimal System

The Decimal System is a number system that uses base 10, characterized by two fundamental properties:

More formally, any $N$-digit number can be expressed using positional notation as:

$$ (d_{N-1} \cdot 10^{N-1}) + (d_{N-2} \cdot 10^{N-2}) + \dots + (d_2 \cdot 10^2) + (d_1 \cdot 10^1) + (d_0 \cdot 10^0) $$

Binary System

Binary is a number system that uses base 2, where:

Binary Groupings

Hexadecimal System

Hexadecimal uses base 16, where:

Note

Binary numbers are typically prefixed with 0b, while hexadecimal numbers with 0x.

Converting Between Number Systems

Decimal to Any Base (Division method)

  1. Repeatedly divide the decimal number by the target base
  2. Record the remainder at each step
  3. Continue until the quotient becomes 0
  4. Read the remainders from bottom to top to get the result

Any Base to Decimal (Addition of Positional Notation)

For a number with digits $d_{N - 1}, d_{N - 2}, \dots, d_1, d_0$ in base $B$, the converted number to base $C$ is calculated using the following formula:

$$ C = (d_{N-1} \cdot B^{N-1}) + (d_{N-2} \cdot B^{N-2}) + \dots + (d_1 \cdot B^1) + (d_0 \cdot B^0) $$

Binary to Hexadecimal

  1. Group binary digits into sets of 4 bits from right to left
  2. Pad the leftmost group with leading zeros if necessary
  3. Convert each 4-bit group to its hexadecimal equivalent using the table below

Hexadecimal to Binary

  1. Convert each hexadecimal digit to its 4-bit binary equivalent using the table below
  2. Concatenate all binary groups
HexBinaryDecimal
000000
100011
200102
300113
401004
501015
601106
701117
810008
910019
A101010
B101111
C110012
D110113
E111014
F111115

Computer Data Units

Memory and Storage Capacities

UnitValue
1 KiB (Kibibyte)$2^{10}$ bytes (1,024 bytes)
1 MiB (Mebibyte)$2^{20}$ bytes (1,048,576 bytes)
1 GiB (Gibibyte)$2^{30}$ bytes (1,073,741,824 bytes)
1 TiB (Tebibyte)$2^{40}$ bytes (1,099,511,627,776 bytes)
1 PiB (Pebibyte)$2^{50}$ bytes (1,125,899,906,842,624 bytes)
1 EiB (Exbibyte)$2^{60}$ bytes (1,152,921,504,606,846,976 bytes)

Data Transfer Units

UnitValue
1 Kilobit (Kb)1,000 bits
1 Megabit (Mb)1,000,000 bits
1 Gigabit (Gb)1,000,000,000 bits
1 Terabit (Tb)1,000,000,000,000 bits

Character Representation

ASCII

The American Standard Code for Information Interchange (ASCII) is a character encoding standard which uses 7 bits per character.

Types of ASCII Characters

ASCII Table

Searchable ASCII Table

Unicode

Unicode is a character set standard that assigns a unique hexadecimal identifier, called a code point, to every character across all writing systems. These code points follow the format U+<....>. It is also a superset of ASCII.

Code Point Categories

Character Encoding Standards

Unicode defines three primary character encoding standards for converting code points into bytes for storage and transmission. These standards use code units as their basic storage elements to represent characters.

Note

Code points can be encoded in another encoding scheme. However, when there is equivalent Unicode code point in the other encoding scheme, the character will appear as a �.

Note

Since UTF-16 and UTF-32 stores multi-byte code points, characters whose Unicode code points fall in the ASCII range (U+0000 to U+007F) gets zero-padded. This leads to Both big-endian and little-endian valid orderings. For instance, “Hello”, corresponding to U+0048 U+0065 U+006C U+006C U+006F, which can represented as 00 48 00 65 00 6C 00 6C 00 6F (big-endian) or 48 00 65 00 6C 00 6C 00 6F 00 (little-endian). To help decoders detect the byte order, Unicode introduced the Byte Order Mark U+FEFF which encodes as FE FF in big-endian and reads as FF FE in little-endian. Modern systems that do use UTF-16 typically just assume little-endian and skip the BOM entirely, which is its own subtle gotcha.

Grapheme clusters

A grapheme cluster represents what users typically think of as a “character”, that is, the smallest unit of written language that has semantic meaning.

Example

The grapheme clusters in the Hindi word "क्षत्रिय" are ["क्ष", "त्रि", "य"], where each cluster can comprise multiple Unicode code points:

Note

The example above highlights whysimply counting Unicode code points does not always correspond to what users perceive as individual characters!

Integer Representation

Unsigned Integer Representation

Unsigned integers use all available bits to represent positive values (including zero). No bit is reserved for a sign, allowing for a larger range of positive values compared to signed integers of the same bit width.

An $N$-bit unsigned integer can take up values in the range of $[0, 2^N - 1]$.

Signed Integer Representation

Signed integers reserve one bit (typically the MSB) to indicate sign, reducing the range of representable positive values but enabling representation of negative numbers.

Two’s complement is the most common encoding system to represent signed integers. A number encoded under two’s complement has its MSB exclusively as a sign bit:

An $N$-bit unsigned integer using two’s complement can take up values from $-2^{N-1}$ to $2^{N-1} - 1$.

To represent a positive number using two’s complement, no change is required.

To represent a negative number using two’s complement:

  1. Start with the binary representation of the positive number and toggle all the bits
  2. Add 1 to the result from step 1

To understand why this works, think of these steps as a way of finding the additive inverse $-b$ such that $b + (-b) = 1000\dots0$. We target $1000\dots0$ ($n + 1$ bits wide) rather than $0000\dots0$ because in a $n$ fixed-width register, the leading $1$ is discarded as overflow, making them equivalent. It also sidesteps the problem with one’s complement, where $b + \tilde{b} = 1111\dots1$ introduces a “negative zero” ($1111\dots1$) alongside the usual $0000\dots0$.

Step 1 exploits the fact that toggling all the bits of $b$ produces a number $\tilde{b}$ such that every bit position sums to $1$, giving $b + \tilde{b} = 1111\dots1$. Step 2 then adds $1$ to both sides of the equation: the right-hand side carries all the way through, flipping $1111\dots1$ into $1000\dots0$ (with the leading $1$ overflowing out of the register), and the left-hand side tells us the additive inverse is $\tilde{b} + 1$.

Note

When converting from binary to decimal, if the MSB is a 1, treat it as a negative number, then proceed as discussed in the “Any Base to Decimal” section.

Floating-Point Representation

IEE 754 Structure

For some number $x$:

$$ x = (-1)^\text{sign} \times (\text{integer}.\text{fraction})_2 \times 2^\text{actual exponent} $$
| 1 bit | 8 bits         |           23 bits                   |
  sign    biased exponent            fraction
| 1 bit |     11 bits      |             52 bits                |
  sign    biased exponent               fraction
| 1 bit |     15 bits      |             112 bits                |
  sign    biased exponent               fraction
Note

It may not be possible to store a given $x$ exactly with such a scheme whenever the actual exponent is outside of the possible range, or if the fraction field can’t fit in the allocated number of bits (i.e., say for single precision, bits 24 and bits 25, where bit 0 is the implicit integer, are ones)

Fields

Normal Numbers

$$ x = (-1)^\text{sign} \times (1.\text{fraction})_2 \times 2^\text{actual exponent} $$
| sign = any | exponent != 00000000 or exponent != 11111111 | fraction = any

Special Numbers

$+\infty$ and $-\infty$

| sign = any | exponent = 11111111 | fraction = 000...0 |

NaN

Key properties:

Types of $\text{NaN}$:

Quiet $\text{NaN}$

| sign = any | exponent = 11111111 | fraction = 1<no restriction> |

Signalling $\text{NaN}$

| sign = any | exponent = 11111111 | fraction = 0<no restriction> |

0

| sign = any | exponent = 00000000 | fraction = 000000...0 |

Denormalized (Subnormal) Numbers

$$ x = (-1)^\text{sign} \times (0.\text{fraction})_2 \times 2^\text{smallest possible actual exponent} $$
| sign = any | exponent = 00000000 | fraction != 000...0 |
Note

$+\infty$ and $-\infty$, $+0$ and $-0$ are not interchangeable!

Converting Decimal to IEEE 754

To convert $12.375_{10}$ to single precision IEEE 754:

Step 1: Convert to binary

$$ 12.375_{10} = 1100.011_2 $$

Step 2: Determine the sign

Since it’s a positive number, the sign bit is 0.

Step 3: Normalize the fraction

$$ 1100.011_2 = 1.100011_2 \times 2^3 $$

The normalized fraction, padded to 23 bits, is: $10001100000000000000000_2$

Step 4: Calculate the biased exponent

$$ \text{biased exponent} = \text{actual exponent} + \text{bias} = 3 + 127 = 130 = 10000010_2 $$

Step 5: Combine all fields

| Sign | Exponent | Mantissa                |
|  0   | 10000010 | 10001100000000000000000 |

Final result: $01000001010001100000000000000000_2$

Integer Overflow and Integer Underflow

$N$-bit signed and unsigned integers have a certain range of values they may represent:

RepresentationRange
Unsigned $N$-bit integer$[0, 2^n − 1]$
Signed $N$-bit integer$[−2^{n - 1}, 2^{n - 1} − 1]$

Integer overflow occurs when an arithmetic operation produces a result larger than the maximum value representable with the given number of bits. Integer underflow occurs when an arithmetic operation produces a result smaller than the minimum value representable with the given number of bits.

In Rust, integer overflow and underflow produces panics, while release builds produce the following behaviour:

Integer Byte Order

Byte order, also known as endianness of a system defines how multi $N$-byte chunks ($N \gt 1$) are assign to memory addresses.

Example: For the string “Hello” in UTF-16 encoding, in big-endian format is 00 48 00 65 00 6C 00 6C 00 6F

while “Hello” in little-endian format is 48 00 00 65 00 6C 00 6C 00 6F 00 since within every 2-byte chunk, the least significant byte, the non-zero byte, will be at the lowest memory address.

Note

On x86-64 systems (and most systems today), the byte ordering is little endian.

Note

The computer’s preference for the layout of individual bits is known as bit endianness.