Data Representation
Number Systems
Decimal System
Decimal is a number system that uses base 10, characterized by two fundamental properties:
- Each digit position can hold one of 10 unique values (0 through 9). Values greater than 9 require carrying to an additional digit position to the left.
- Each digit's position determines its contribution to the overall value through positional notation. Digits are labeled from right to left as \(d_0, d_1, d_2, \dots, d_{N - 1}\) where each successive position represents ten times the value of the position to its right.
More formally, any \(N\)-digit decimal number can be expressed using positional notation as:
$$ (d_{N-1} \cdot 10^{N-1}) + (d_{N-2} \cdot 10^{N-2}) + \dots + (d_2 \cdot 10^2) + (d_1 \cdot 10^1) + (d_0 \cdot 10^0) $$
Decimal numbers are typically written without prefixes, though they may be denoted as \(N_{10}\) when clarification is needed.
Binary System
Binary is a number system that uses base 2, characterized by:
- Each digit position (called a bit) can hold one of two unique values: 0 or 1. Values greater than 1 require carrying to an additional bit position to the left.
- Each bit's position determines its contribution through powers of 2: \(2^{N-1}, \dots, 2^1, 2^0\).
Binary numbers are typically prefixed with `0b`.
Binary Terminology and Groupings
Since individual bits represent limited information, they are commonly grouped:
- Byte: 8 bits (the smallest addressable unit of memory)
- Nibble: 4 bits (half a byte)
- Word: The natural data width of a CPU (typically 32 or 64 bits)
In any binary sequence, we identify:
- Most Significant Bit (MSB): The leftmost bit
- Least Significant Bit (LSB): The rightmost bit
Hexadecimal System
Hexadecimal uses base 16, characterized by:
- Each digit position can hold one of 16 unique values. Since we only have 10 decimal digits (0 to 9), letters A to F represent values 10 to 15 respectively.
- Each digit's position determines its contribution through powers of 16: \(16^{N-1}, \dots, 16^1, 16^0\).
Hexadecimal numbers are prefixed with `0x`.
Hexadecimal-Binary-Decimal Conversion Table
Hex | Binary | Decimal |
---|---|---|
0 | 0000 | 0 |
1 | 0001 | 1 |
2 | 0010 | 2 |
3 | 0011 | 3 |
4 | 0100 | 4 |
5 | 0101 | 5 |
6 | 0110 | 6 |
7 | 0111 | 7 |
8 | 1000 | 8 |
9 | 1001 | 9 |
A | 1010 | 10 |
B | 1011 | 11 |
C | 1100 | 12 |
D | 1101 | 13 |
E | 1110 | 14 |
F | 1111 | 15 |
Converting Between Number Systems
Decimal to Any Base (Division method)
- Repeatedly divide the decimal number by the target base
- Record the remainder at each step
- Continue until the quotient becomes 0
- Read the remainders from bottom to top to get the result
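A minimal Python sketch of the division method; the function name `decimal_to_base` and the digit alphabet `DIGITS` are assumptions made for this example (covering bases up to 16):

```python
DIGITS = "0123456789ABCDEF"  # digit alphabet for bases up to 16

def decimal_to_base(n: int, base: int) -> str:
    """Convert a non-negative decimal integer to the given base (2-16)."""
    if n == 0:
        return "0"
    remainders = []
    while n > 0:                              # continue until the quotient becomes 0
        remainders.append(DIGITS[n % base])   # record the remainder at each step
        n //= base                            # repeatedly divide by the target base
    return "".join(reversed(remainders))      # read the remainders from bottom to top

print(decimal_to_base(45, 2))   # 101101
print(decimal_to_base(45, 16))  # 2D
```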
Any Base to Decimal (Addition of Positional Notation)
For a number with digits \(d_{N-1}, d_{N-2}, \dots, d_1, d_0\) in base \(B\), the decimal value is calculated using the following formula:
$$ (d_{N-1} \cdot B^{N-1}) + (d_{N-2} \cdot B^{N-2}) + \dots + (d_1 \cdot B^1) + (d_0 \cdot B^0) $$
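The reverse direction, as a Python sketch of the positional-notation sum; `base_to_decimal` is an illustrative name, and `int(d, 16)` is used only to map the digit characters 0-9 and A-F to their values:

```python
def base_to_decimal(digits: str, base: int) -> int:
    """Evaluate d_{N-1}*B^{N-1} + ... + d_1*B^1 + d_0*B^0 left to right."""
    value = 0
    for d in digits:
        value = value * base + int(d, 16)  # shift the running sum by one position, add the digit
    return value

print(base_to_decimal("101101", 2))  # 45
print(base_to_decimal("2D", 16))     # 45
```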
Binary to Hexadecimal
- Group binary digits into sets of 4 bits from right to left
- Pad the leftmost group with leading zeros if necessary
- Convert each 4-bit group to its hexadecimal equivalent
Hexadecimal to Binary
- Convert each hexadecimal digit to its 4-bit binary equivalent
- Concatenate all binary groups
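Both directions as a short Python sketch; the function names are assumed for this example:

```python
def binary_to_hex(bits: str) -> str:
    """Group bits into nibbles from right to left and map each group to one hex digit."""
    padded = bits.zfill((len(bits) + 3) // 4 * 4)  # pad the leftmost group with zeros
    groups = [padded[i:i + 4] for i in range(0, len(padded), 4)]
    return "".join(format(int(g, 2), "X") for g in groups)

def hex_to_binary(hex_digits: str) -> str:
    """Expand each hex digit into its 4-bit binary equivalent and concatenate."""
    return "".join(format(int(h, 16), "04b") for h in hex_digits)

print(binary_to_hex("101101"))  # 2D
print(hex_to_binary("2D"))      # 00101101
```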
Computer Data Units
Memory and Storage Capacities
Unit | Value |
---|---|
1 KB (Kilobyte) | \(2^{10}\) bytes (1,024 bytes) |
1 MB (Megabyte) | \(2^{20}\) bytes (1,048,576 bytes) |
1 GB (Gigabyte) | \(2^{30}\) bytes (1,073,741,824 bytes) |
1 TB (Terabyte) | \(2^{40}\) bytes (1,099,511,627,776 bytes) |
1 PB (Petabyte) | \(2^{50}\) bytes (1,125,899,906,842,624 bytes) |
1 EB (Exabyte) | \(2^{60}\) bytes (1,152,921,504,606,846,976 bytes) |
Data Transfer Units
Unit | Value |
---|---|
1 Kilobit (Kb) | 1,000 bits |
1 Megabit (Mb) | 1,000,000 bits |
1 Gigabit (Gb) | 1,000,000,000 bits |
1 Terabit (Tb) | 1,000,000,000,000 bits |
Character Representation
ASCII
ASCII is a character encoding standard that uses 8 bits per character: 7 bits encode the character itself, and the eighth bit serves as a parity bit for error detection, set so that the complete byte contains either an even number of 1s (even parity) or an odd number of 1s (odd parity).
Types of ASCII Characters
- Non-printable characters
  - Control characters, from `0x00` to `0x1F` (i.e. \(0_{10}\) to \(31_{10}\))
  - Delete character, `0x7F` (i.e. \(127_{10}\))
- Printable characters, from `0x20` to `0x7E` (\(32_{10}\) to \(126_{10}\)).
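These ranges can be checked directly from a character's code point; a small Python sketch (the helper name `ascii_category` is made up for this example):

```python
def ascii_category(ch: str) -> str:
    """Classify a single ASCII character by its code point range."""
    code = ord(ch)
    if 0x00 <= code <= 0x1F:
        return "control"
    if code == 0x7F:
        return "delete"
    if 0x20 <= code <= 0x7E:
        return "printable"
    return "not ASCII"

print(ascii_category("A"))     # printable
print(ascii_category("\n"))    # control
print(ascii_category("\x7f"))  # delete
```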
ASCII Table
Hex | Char | Description |
---|---|---|
00 | NUL | Null character |
01 | SOH | Start of Heading |
02 | STX | Start of Text |
03 | ETX | End of Text |
04 | EOT | End of Transmission |
05 | ENQ | Enquiry |
06 | ACK | Acknowledge |
07 | BEL | Bell |
08 | BS | Backspace |
09 | HT | Horizontal Tab |
0A | LF | Line Feed |
0B | VT | Vertical Tab |
0C | FF | Form Feed |
0D | CR | Carriage Return |
0E | SO | Shift Out |
0F | SI | Shift In |
10 | DLE | Data Link Escape |
11 | DC1 | Device Control 1 (XON) |
12 | DC2 | Device Control 2 |
13 | DC3 | Device Control 3 (XOFF) |
14 | DC4 | Device Control 4 |
15 | NAK | Negative Acknowledge |
16 | SYN | Synchronous Idle |
17 | ETB | End of Transmission Block |
18 | CAN | Cancel |
19 | EM | End of Medium |
1A | SUB | Substitute |
1B | ESC | Escape |
1C | FS | File Separator |
1D | GS | Group Separator |
1E | RS | Record Separator |
1F | US | Unit Separator |
20 | SP | Space |
21 | ! | Exclamation mark |
22 | " | Double quote |
23 | # | Number sign |
24 | $ | Dollar sign |
25 | % | Percent sign |
26 | & | Ampersand |
27 | ' | Apostrophe |
28 | ( | Left parenthesis |
29 | ) | Right parenthesis |
2A | * | Asterisk |
2B | + | Plus sign |
2C | , | Comma |
2D | - | Hyphen-minus |
2E | . | Period |
2F | / | Slash |
30 | 0 | Digit 0 |
31 | 1 | Digit 1 |
32 | 2 | Digit 2 |
33 | 3 | Digit 3 |
34 | 4 | Digit 4 |
35 | 5 | Digit 5 |
36 | 6 | Digit 6 |
37 | 7 | Digit 7 |
38 | 8 | Digit 8 |
39 | 9 | Digit 9 |
3A | : | Colon |
3B | ; | Semicolon |
3C | < | Less than |
3D | = | Equal sign |
3E | > | Greater than |
3F | ? | Question mark |
40 | @ | At sign |
41 | A | Uppercase A |
42 | B | Uppercase B |
43 | C | Uppercase C |
44 | D | Uppercase D |
45 | E | Uppercase E |
46 | F | Uppercase F |
47 | G | Uppercase G |
48 | H | Uppercase H |
49 | I | Uppercase I |
4A | J | Uppercase J |
4B | K | Uppercase K |
4C | L | Uppercase L |
4D | M | Uppercase M |
4E | N | Uppercase N |
4F | O | Uppercase O |
50 | P | Uppercase P |
51 | Q | Uppercase Q |
52 | R | Uppercase R |
53 | S | Uppercase S |
54 | T | Uppercase T |
55 | U | Uppercase U |
56 | V | Uppercase V |
57 | W | Uppercase W |
58 | X | Uppercase X |
59 | Y | Uppercase Y |
5A | Z | Uppercase Z |
5B | [ | Left square bracket |
5C | \ | Backslash |
5D | ] | Right square bracket |
5E | ^ | Caret |
5F | _ | Underscore |
60 | ` | Grave accent |
61 | a | Lowercase a |
62 | b | Lowercase b |
63 | c | Lowercase c |
64 | d | Lowercase d |
65 | e | Lowercase e |
66 | f | Lowercase f |
67 | g | Lowercase g |
68 | h | Lowercase h |
69 | i | Lowercase i |
6A | j | Lowercase j |
6B | k | Lowercase k |
6C | l | Lowercase l |
6D | m | Lowercase m |
6E | n | Lowercase n |
6F | o | Lowercase o |
70 | p | Lowercase p |
71 | q | Lowercase q |
72 | r | Lowercase r |
73 | s | Lowercase s |
74 | t | Lowercase t |
75 | u | Lowercase u |
76 | v | Lowercase v |
77 | w | Lowercase w |
78 | x | Lowercase x |
79 | y | Lowercase y |
7A | z | Lowercase z |
7B | { | Left curly brace |
7C | | | Vertical bar |
7D | } | Right curly brace |
7E | ~ | Tilde |
7F | DEL | Delete |
Unicode
Unicode is a character set standard that assigns a unique hexadecimal identifier, called a code point, to every character across all writing systems. These code points follow the format `U+<....>`.
Code Point Categories
- Surrogate code points: Code points in the range `U+D800` to `U+DFFF`, which are reserved for use in UTF-16 encoding
- Scalar values: All other Unicode code points (excluding surrogate code points)
Character Encoding Standards
Unicode defines three primary character encoding standards for converting code points into bytes for storage and transmission. These standards use code units as their basic storage elements to represent characters.
- UTF-8
  - Code unit size: 8 bits (1 byte)
  - Character representation: 1 to 4 code units per character
  - Encoding type: Variable-length encoding
- UTF-16
  - Code unit size: 16 bits (2 bytes)
  - Character representation: 1 code unit for most characters, or 2 code units for code points above `U+FFFF`; these 2 code units are called a surrogate pair
- UTF-32
  - Code unit size: 32 bits (4 bytes)
  - Character representation: Exactly 1 code unit per character
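A small Python sketch of these code unit sizes, encoding a few characters with the standard library and dividing the byte length by the code unit size (the `-le` encoding variants are used only to avoid a byte order mark):

```python
# Code units needed per character in each encoding standard.
for ch in ["A", "é", "€", "🙂"]:
    utf8_units = len(ch.encode("utf-8"))            # 1-byte code units
    utf16_units = len(ch.encode("utf-16-le")) // 2  # 2-byte code units
    utf32_units = len(ch.encode("utf-32-le")) // 4  # 4-byte code units
    print(ch, utf8_units, utf16_units, utf32_units)

# "🙂" (U+1F642) takes 4 UTF-8 code units, a UTF-16 surrogate pair (2 code units),
# and exactly 1 UTF-32 code unit.
```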
Grapheme clusters
A grapheme cluster represents what users typically think of as a "character", that is, the smallest unit of written language that has semantic meaning.
Example
The grapheme clusters in the Hindi word "क्षत्रिय" are ["क्ष", "त्रि", "य"], where each cluster can comprise multiple Unicode code points. While "य" is a single code point, "क्ष" is a conjunct consonant formed from three code points: the base character "क" (`U+0915`), the virama "्" (`U+094D`), and "ष" (`U+0937`), which combine to create one visual unit. The cluster "त्रि" is even more complex, consisting of four code points: "त" (`U+0924`), "्" (`U+094D`), "र" (`U+0930`), and the vowel sign "ि" (`U+093F`), where the vowel mark appears visually before the consonant cluster despite following it in the Unicode sequence.
The example above demonstrates why simply counting Unicode code points doesn't always correspond to what users perceive as individual characters, making grapheme cluster boundaries crucial for correct string processing.
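A short sketch of the difference between counting code points and counting grapheme clusters, assuming the third-party `regex` package is installed (its `\X` pattern matches one grapheme cluster; the standard library has no grapheme segmentation, and the exact grouping of Indic conjuncts depends on the Unicode rules the installed version implements):

```python
import regex  # third-party package; its \X pattern matches one grapheme cluster

word = "क्षत्रिय"
print(len(word))                   # 8 — len() counts code points, not perceived characters
print(regex.findall(r"\X", word))  # grapheme clusters as segmented by the library

# A simpler case: 'e' + combining acute accent is 2 code points but 1 cluster.
print(regex.findall(r"\X", "e\u0301"))  # one cluster
```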
Integer Representation
Unsigned Integer Representation
Unsigned integers use all available bits to represent positive values (including zero). No bit is reserved for a sign, allowing for a larger range of positive values compared to signed integers of the same bit width.
An \(N\)-bit unsigned integer can take values from \(0\) to \(2^N - 1\).
Signed Integer Representation
Signed integers reserve one bit (typically the MSB) to indicate sign, reducing the range of representable positive values but enabling representation of negative numbers.
Two's Complement
Two's complement is the most common encoding system to represent signed integers. It treats the MSB exclusively as a sign bit:
- MSB = 0: positive number (or zero)
- MSB = 1: negative number
An \(N\)-bit signed integer using two's complement can take values from \(-2^{N-1}\) to \(2^{N-1} - 1\).
To represent a negative number using two's complement:
- Start with the binary representation of the positive number
- Toggle all the bits
- Add 1 to the result
Note
When converting from binary to decimal, if the MSB is 1, treat its digit as \(-1\) (so its contribution is \(-2^{N-1}\)), then proceed as discussed in the "Any Base to Decimal" section.
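A minimal Python sketch of encoding and decoding two's complement values; `twos_complement` and `from_twos_complement` are illustrative names, and Python's arbitrary-precision integers are used to simulate a fixed bit width:

```python
def twos_complement(value: int, bits: int) -> str:
    """Encode a signed integer as an N-bit two's complement bit string."""
    if value < 0:
        value = (1 << bits) + value   # same result as toggling all bits and adding 1
    return format(value, f"0{bits}b")

def from_twos_complement(bit_string: str) -> int:
    """Decode: if the MSB is 1, its weight is -2^(N-1) instead of +2^(N-1)."""
    n = len(bit_string)
    value = int(bit_string, 2)
    if bit_string[0] == "1":
        value -= 1 << n               # subtract 2^N to apply the negative MSB weight
    return value

print(twos_complement(-5, 8))            # 11111011
print(from_twos_complement("11111011"))  # -5
```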
Floating-Point Representation
IEEE 754 Structure
Single Precision (32-bit)
| Sign | Biased Exponent | Normalized Mantissa |
|---|---|---|
| 1 bit | 8 bits | 23 bits |
Double Precision (64-bit)
| Sign | Biased Exponent | Normalized Mantissa |
|---|---|---|
| 1 bit | 11 bits | 52 bits |
Fields
- Sign: Represents the sign of the floating-point number using signed magnitude encoding (0 = positive, 1 = negative)
- Normalized Mantissa (Significand): Represents the significant digits of the number. The mantissa is normalized, meaning the binary point is moved so the value falls in the range \([1.0, 2.0)\); the leading 1 is stored implicitly, and only the digits after the binary point (i.e., the fractional part) are stored explicitly.
- Biased Exponent: Determines the power of 2 by which to scale the mantissa. The exponent is biased, meaning a constant bias is added to the actual exponent before it is stored (and subtracted when decoding). For single precision, the bias is \(127_{10}\); for double precision, the bias is \(1023_{10}\)
Note
Counting the implicit leading 1 before the binary point, the mantissa is effectively 24 bits in single precision and 53 bits in double precision.
Converting Decimal to IEEE 754
To convert \(12.375_{10}\) to single precision IEEE 754:
Step 1: Convert to binary
\(12.375_{10} = 1100.011_2\)
Step 2: Determine the sign
Since it's a positive number, the sign bit is 0.
Step 3: Normalize the mantissa
\(1100.011_2 = 1.100011_2 \times 2^3\)
The normalized mantissa (fractional part only), padded to 23 bits, is: \(10001100000000000000000_2\)
Step 4: Calculate the biased exponent
Biased exponent = actual exponent + bias = \(3 + 127 = 130 = 10000010_2\)
Step 5: Combine all fields
| Sign | Exponent | Mantissa |
|---|---|---|
| 0 | 10000010 | 10001100000000000000000 |
Final result: \(01000001010001100000000000000000_2\)
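The worked example can be checked with Python's `struct` module, which packs a float into its IEEE 754 single-precision byte representation (a sketch; the slice offsets simply follow the 1/8/23 field layout above):

```python
import struct

raw = struct.pack(">f", 12.375)                    # big-endian single-precision bytes
bits = format(int.from_bytes(raw, "big"), "032b")  # the 32-bit pattern as a string

print(bits)        # 01000001010001100000000000000000
print(bits[0])     # sign:     0
print(bits[1:9])   # exponent: 10000010  (3 + bias 127 = 130)
print(bits[9:])    # mantissa: 10001100000000000000000
```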
Special Numbers
Denormalized Numbers
When calculations produce results that are too small for normal representation (i.e., below \(1.0 \times 2^{1-127}\) for single precision and \(1.0 \times 2^{1-1023}\) for double precision) but are non-zero, we use denormalized representation. This is achieved by:
- Setting the exponent field to all zeros (which corresponds to using the fixed exponent \(1-127\) for single precision or \(1-1023\) for double precision)
- Using an implicit leading 0 instead of 1 before the binary point in the mantissa
- Choosing the mantissa bits such that \(0.\text{mantissa} \times 2^{1-127}\) (or \(\times 2^{1-1023}\) for double precision) equals the value being represented
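As a quick check, the smallest positive double is a denormalized number; a Python sketch that inspects its bit fields with `struct` (`5e-324` is how Python prints \(2^{-1074}\)):

```python
import struct

tiny = 5e-324  # smallest positive double, equal to 2**-1074 (denormalized)
bits = format(int.from_bytes(struct.pack(">d", tiny), "big"), "064b")

print(bits[1:12])  # exponent field: 00000000000 (all zeros marks a denormalized value)
print(bits[12:])   # mantissa: 51 zeros followed by a 1, read with an implicit leading 0
```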
Infinity
When a calculation exceeds the representable range (overflow) or a non-zero value is divided by zero, the result is infinity. The exponent field is all ones and the mantissa is all zeros; the sign bit determines positive or negative infinity.
This allows programs to continue running with meaningful results rather than crashing on overflow.
NaN (Not a Number)
Represents mathematically undefined results. Like infinity, the exponent is all ones, but now the mantissa is non-zero. Any operation with NaN produces NaN, and NaN is never equal to anything, including itself. It also has two types:
- Quiet NaN: Silently propagates through calculations
- Signaling NaN: Triggers an exception/error when encountered in operations
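A brief Python sketch of how infinity and NaN behave in double-precision arithmetic:

```python
import math

inf = float("inf")
nan = float("nan")

print(1e308 * 10)           # inf — overflow produces infinity instead of crashing
print(inf + 1, -1 / inf)    # inf -0.0 — arithmetic with infinity stays meaningful
print(nan == nan)           # False — NaN is never equal to anything, including itself
print(math.isnan(nan + 1))  # True — NaN propagates through calculations
```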
Integer Overflow
Integer overflow occurs when you try to store a number that exceeds the maximum value that can be represented with the available number of bits, causing the result to wrap around to an unexpected value.
The overflowed result can be found as follows:
- Unsigned integer: \(\text{(expected value)} \bmod 2^N\)
- Signed integer:
  - Positive overflow (i.e., \(\text{expected value} > 2^{N-1} - 1\)): \(\text{expected value} - 2^N\)
  - Negative overflow (i.e., \(\text{expected value} < -2^{N-1}\)): \(\text{expected value} + 2^N\)
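Since Python integers never overflow, the sketch below simulates the wraparound formulas above for an assumed \(N\)-bit integer (`wrap_unsigned` and `wrap_signed` are illustrative names):

```python
def wrap_unsigned(expected: int, bits: int) -> int:
    """Unsigned overflow: the stored result is (expected value) mod 2^N."""
    return expected % (1 << bits)

def wrap_signed(expected: int, bits: int) -> int:
    """Signed overflow: wrap the result into the range [-2^(N-1), 2^(N-1) - 1]."""
    wrapped = expected % (1 << bits)
    if wrapped >= 1 << (bits - 1):
        wrapped -= 1 << bits        # expected value - 2^N for positive overflow
    return wrapped

print(wrap_unsigned(255 + 1, 8))  # 0
print(wrap_signed(127 + 1, 8))    # -128
print(wrap_signed(-128 - 1, 8))   # 127
```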
Integer Byte Order
Byte order, also known as the endianness of a system, defines how the bytes of a multi-byte value are assigned to memory addresses.
- Big-endian byte order: The most significant byte is stored at the lowest memory address.
- Little-endian byte order: The least significant byte is stored at the lowest memory address.
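A small Python sketch using the `struct` module to show both byte orders for the 32-bit value `0x12345678` (the `>` and `<` format prefixes select big- and little-endian packing):

```python
import struct
import sys

value = 0x12345678

print(struct.pack(">I", value).hex())  # 12345678 — big-endian: MSB at the lowest address
print(struct.pack("<I", value).hex())  # 78563412 — little-endian: LSB at the lowest address
print(sys.byteorder)                   # native byte order of the machine running this
```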