Data Structures
The structure of these notes takes inspiration from my computer vision notes.
at a glance:
| Data Structure | Average Access | Average Search | Average Insertion | Average Deletion | Worst Access | Worst Search | Worst Insertion | Worst Deletion | Space Complexity (Worst) |
|---|---|---|---|---|---|---|---|---|---|
| Array | Θ(1) | Θ(n) | Θ(n) | Θ(n) | O(1) | O(n) | O(n) | O(n) | O(n) |
| Stack | Θ(n) | Θ(n) | Θ(1) | Θ(1) | O(n) | O(n) | O(1) | O(1) | O(n) |
| Queue | Θ(n) | Θ(n) | Θ(1) | Θ(1) | O(n) | O(n) | O(1) | O(1) | O(n) |
| Singly-Linked List | Θ(n) | Θ(n) | Θ(1) | Θ(1) | O(n) | O(n) | O(1) | O(1) | O(n) |
| Doubly-Linked List | Θ(n) | Θ(n) | Θ(1) | Θ(1) | O(n) | O(n) | O(1) | O(1) | O(n) |
| Hash Table | N/A | Θ(1) | Θ(1) | Θ(1) | N/A | O(n) | O(n) | O(n) | O(n) |
| Binary Search Tree | Θ(log(n)) | Θ(log(n)) | Θ(log(n)) | Θ(log(n)) | O(n) | O(n) | O(n) | O(n) | O(n) |
| B-Tree | Θ(log(n)) | Θ(log(n)) | Θ(log(n)) | Θ(log(n)) | O(log(n)) | O(log(n)) | O(log(n)) | O(log(n)) | O(n) |
| Red-Black Tree | Θ(log(n)) | Θ(log(n)) | Θ(log(n)) | Θ(log(n)) | O(log(n)) | O(log(n)) | O(log(n)) | O(log(n)) | O(n) |
| AVL Tree | Θ(log(n)) | Θ(log(n)) | Θ(log(n)) | Θ(log(n)) | O(log(n)) | O(log(n)) | O(log(n)) | O(log(n)) | O(n) |
primitives
schematic/s
python
| type | example |
|---|---|
| `int` | `32` |
| `float` | `3.2` |
| `complex` | `(3+2j)` |
| `bool` | `False` |
| `str` | `"Hello World!"` |
| `bytes` | `b'\x00\x00\x00'` |
| `NoneType` | `None` |
java
| type | size (bits) | description | usage |
|---|---|---|---|
| `byte` | 8 | -128 to 127 | `byte b = 100;` |
| `short` | 16 | -32,768 to 32,767 | `short s = 1000;` |
| `int` | 32 | -$2^{31}$ to $2^{31}-1$ | `int i = 12345;` |
| `long` | 64 | -$2^{63}$ to $2^{63}-1$ | `long l = 123456789L;` |
| `float` | 32 | single-precision floating-point number | `float f = 3.14f;` |
| `double` | 64 | double-precision floating-point number | `double d = 3.14159;` |
| `char` | 16 | single 16-bit unicode character | `char c = 'A';` |
| `boolean` | 1 bit or 1 byte (JVM-dependent) | `true` or `false` | `boolean flag = true;` |
c
| type | size (bits) | description | usage |
|---|---|---|---|
| `char` | typically 8 | character type (smallest addressable unit) | `char c = 'A';` |
| `short` | typically 16 | short integer | `short s = 1000;` |
| `int` | typically 32 | integer number | `int i = 42;` |
| `long` | typically 32 or 64 | long integer | `long l = 123456L;` |
| `long long` | typically $\geq$ 64 | extended-precision integer | `long long ll = 123456789LL;` |
| `float` | typically 32 | single-precision floating-point number | `float f = 3.14f;` |
| `double` | typically 64 | double-precision floating-point number | `double d = 3.14159;` |
| `long double` | $\geq$ 80 (platform-dependent) | extended-precision floating-point number | `long double ld = 3.141592653589793L;` |
| `_Bool` | typically 8 | boolean value (true or false, since C99) | `_Bool flag = 1;` or `bool flag = true;` |
| `void` | — | represents "no value" or "no type" | `void functionName();` |
arrays
schematic
time complexity
| | Access | Search | Insertion | Deletion |
|---|---|---|---|---|
| Average | $\mathcal{O}(1)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ |
| Worst | $\mathcal{O}(1)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ |
explanations are warranted for these. access will be thought of as the time complexity required to get at the $k$th item in the data structure.
to "access" the $k$th item we can index into the array: constant time.
searching for a particular key is not something we can do intelligently in this contiguous block of memory, so we must check all $n$ items.
insertion would on average take $\dfrac n2$ steps, but because we are working with asymptotics the constant factor disappears. the worst case is insertion at the front of the array, where every existing item has to be shifted one position along.
deletion follows a similar argument, with the worst case being deletion of the first element, and the average case decaying to the worst-case bound asymptotically.
space complexity
this is less of a question when you have a data structure, as opposed to an algorithm, because with $n$ elements you will have to store all of them uniquely. as such, for arrays and all data structures on this page we have $\mathcal{O}(n)$ space complexity.
operations
naturally, whilst we are considering access, search, insertion and deletion operations on this page, an array is represented in Python as a list. these lists have the following methods:
| method | explanation |
|---|---|
| `append(arg)` | adds an element to the end of the list |
| `clear()` | removes all the elements from the list |
| `copy()` | returns a copy of the list |
| `count(arg)` | returns the number of elements with value `arg` |
| `extend(another_list)` | adds the elements of another list to this list |
| `index(arg)` | returns the index of the first element with value `arg` |
| `insert(i, arg)` | adds element `arg` at position `i` |
| `pop(i)` | removes and returns the element at position `i` (last element by default) |
| `remove(arg)` | removes the first item with value `arg` |
| `reverse()` | reverses the order of the list in place |
| `sort()` | sorts the list in place. mutates the array |
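
a quick illustration of a few of these methods (the values here are arbitrary):

```python
# a few of the list methods above in action
nums = [3, 1, 4]

nums.append(1)        # [3, 1, 4, 1]
nums.extend([5, 9])   # [3, 1, 4, 1, 5, 9]
nums.insert(0, 2)     # [2, 3, 1, 4, 1, 5, 9] -- insert at the front: everything shifts (the worst case above)
nums.remove(1)        # [2, 3, 4, 1, 5, 9]    -- removes the first 1 only
last = nums.pop()     # returns 9, list is now [2, 3, 4, 1, 5]

print(nums.count(1))  # 1
print(nums.index(4))  # 2

nums.sort()           # [1, 2, 3, 4, 5] -- in place
nums.reverse()        # [5, 4, 3, 2, 1]
```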
linked lists
singly linked list
schematic
time complexity
| | Access | Search | Insertion | Deletion |
|---|---|---|---|---|
| Average | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ |
| Worst | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ |
once again, explanations are necessary:
- to access the $k$th item, we need to start at the `head` and make our way to this item; hence $\mathcal{O}(n)$ for access
- same for search; even if you know which item you want, there is no way of knowing where it is. you have to traverse from the `head`.
- the very act of insertion will take constant time. if you wish to find a "middle" item and then insert there, the complexity would be $\mathcal{O}(n)$, but here we are decoupling the operations (see the sketch below)
- same as above. simply freeing memory / deleting a node will take constant time.
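
a minimal sketch of a singly linked list (the class and method names here are my own, purely illustrative), showing why inserting at the `head` is constant time while access and search require traversal:

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

class SinglyLinkedList:
    def __init__(self):
        self.head = None

    def insert_at_head(self, value):   # O(1): no traversal needed
        node = Node(value)
        node.next = self.head
        self.head = node

    def access(self, k):               # O(n): walk from head to the kth node
        node = self.head
        for _ in range(k):
            node = node.next
        return node.value

    def search(self, target):          # O(n): must traverse from head
        node = self.head
        while node is not None:
            if node.value == target:
                return True
            node = node.next
        return False

lst = SinglyLinkedList()
for v in [3, 2, 1]:
    lst.insert_at_head(v)
print(lst.access(0), lst.search(3))    # 1 True
```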
doubly linked lists
schematic
time complexity
| | Access | Search | Insertion | Deletion |
|---|---|---|---|---|
| Average | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ |
| Worst | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ |
with respect to the asymptotics of operations, the doubly linked list provides no advantage over the singly linked list.
tradeoffs
non-contiguous use of memory is an advantage in terms of finding more memory for nodes, but is also a disadvantage in terms of traversal.
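
for completeness, a sketch of a doubly linked node: the extra `prev` pointer is a practical (not asymptotic) convenience, e.g. unlinking a node you already hold a reference to:

```python
class DNode:
    def __init__(self, value):
        self.value = value
        self.prev = None
        self.next = None

def unlink(node):
    """remove a node we already have a reference to, in O(1),
    by rewiring its neighbours (head/tail bookkeeping omitted)."""
    if node.prev is not None:
        node.prev.next = node.next
    if node.next is not None:
        node.next.prev = node.prev
    node.prev = node.next = None
```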
strings
strings really are just special cases of arrays. but because their operations vary so wildly from the usual array operations:
- concatenation
- joining
- comparison
- splitting
- searching for substrings
it makes sense for this topic to have its own little nook.
this page is not about algorithms, and so there is nothing really novel to add here at the moment, but a number of clever algorithms will relate back to this heading.
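
still, a few of the operations listed above, in Python (the strings are arbitrary):

```python
s = "hello"
t = "world"

print(s + " " + t)              # concatenation: "hello world"
print(", ".join(["a", "b"]))    # joining: "a, b"
print(s == t, s < t)            # comparison: False True (lexicographic)
print("a,b,c".split(","))       # splitting: ['a', 'b', 'c']
print("lo" in s, s.find("lo"))  # substring search: True 3
```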
stacks & queues
stack
schematic
time complexity
| | Access | Search | Insertion | Deletion |
|---|---|---|---|---|
| Average | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ |
| Worst | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ |
as expected:
- random access will take $n$ steps; presumably a linked-list implementation is being used here, so arbitrary accesses will always take (asymptotically) $n$ steps
- similarly search as in an array or linked list requires $n$ steps
- the advantage comes from inserting into the stack which always takes constant time
- and deletion costs constant time. this is the use-case for this data structure anyhow
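
a Python list already behaves as a stack if we only ever `append` and `pop` at the end (one possible implementation, not the only one):

```python
stack = []

stack.append(1)     # push: O(1)
stack.append(2)
stack.append(3)

print(stack.pop())  # 3 -- LIFO: last in, first out
print(stack.pop())  # 2
print(3 in stack)   # False -- search still requires scanning: O(n)
```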
queue
schematic
time complexity
| | Access | Search | Insertion | Deletion |
|---|---|---|---|---|
| Average | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ |
| Worst | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ |
this is the same as the stack:
- underlying implementation details would be identical, so access behaviour wouldn't change
- neither would search functionality
- only the location of insertion
- and location of deletion would change
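
for a queue, `collections.deque` gives constant-time operations at both ends (a plain Python list would make dequeuing from the front $\mathcal{O}(n)$):

```python
from collections import deque

queue = deque()

queue.append("a")       # enqueue at the back: O(1)
queue.append("b")
queue.append("c")

print(queue.popleft())  # "a" -- FIFO: first in, first out
print(queue.popleft())  # "b"
```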
double ended queue (deque)
this is an interesting data structure. at first I thought it was short for dequeue, but it is not. instead this structure is pronounced "deck", and is a list with two pointers, one to the front and one to the back, so elements can be inserted or removed at either end in constant time.
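
`collections.deque` is exactly this in Python:

```python
from collections import deque

d = deque([2, 3])

d.appendleft(1)   # add at the front: O(1)
d.append(4)       # add at the back: O(1)
print(d)          # deque([1, 2, 3, 4])

d.popleft()       # remove from the front: O(1)
d.pop()           # remove from the back: O(1)
print(d)          # deque([2, 3])
```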
hash tables / hashmaps
Note that insertion order is an implementation detail rather than a property of the data structure: Python's `dict` preserves it (since 3.7), but a hash table in general does not guarantee it.
schematic
time complexity
| | Access | Search | Insertion | Deletion |
|---|---|---|---|---|
| Average | n/a | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ |
| Worst | n/a | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ |
obviously the time complexity depends on how the hash table is implemented and on what we treat as an atomic operation.
having said this, hash-tables are sort of a mystery to me at the moment. they have varying implementations and even then I need to study each to understand what the associated compute would look like for each method.
for now I have just copied down what was given at bigocheatsheet.com
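
for some intuition in the meantime, here is a toy sketch of one common implementation, separate chaining (the class name and bucket count are arbitrary choices of mine):

```python
# toy hash table using separate chaining: hash the key to pick a bucket,
# then scan that bucket's (hopefully short) list of (key, value) pairs
class ToyHashTable:
    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):      # average O(1); worst O(n) if every key collides
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # key already present: overwrite
                return
        bucket.append((key, value))

    def search(self, key):             # average O(1); worst O(n)
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None

    def delete(self, key):             # average O(1); worst O(n)
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket.pop(i)
                return

table = ToyHashTable()
table.insert("apples", 3)
table.insert("pears", 5)
print(table.search("apples"))   # 3
table.delete("apples")
print(table.search("apples"))   # None
```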
trees
schematic
Binary Search Tree
schematic
the critical distinguishing property that makes a binary tree a binary search tree is:
- every key in a node's left subtree must be less than the node's key, and;
- every key in its right subtree must be greater than the node's key.
time complexity
note that the following time complexities are for a binary search tree, because there is really no benefit to performing these operations on a regular binary tree. the time complexities are all $\mathcal{O}(n)$ for the unsorted tree.
| | Access | Search | Insertion | Deletion |
|---|---|---|---|---|
| Average | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ |
| Worst | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ |
naturally, this table warrants explanation.
the average case of $\log(n)$ steps comes from the height of a reasonably balanced tree. the worst case is not the same because a plain binary search tree can degenerate into a chain (for example when keys are inserted in sorted order), so its height, and therefore every operation, becomes $\mathcal{O}(n)$.
insertion into a plain binary search tree causes no rebalancing at all; a binary search tree is not necessarily balanced, which is precisely what the self-balancing variants below (red-black and AVL trees) address.
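
a minimal sketch of insertion and search in a plain (unbalanced) binary search tree; the lack of rebalancing is exactly why the worst case above degrades to $\mathcal{O}(n)$:

```python
class BSTNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """insert key, returning the (possibly new) root. no rebalancing is done."""
    if root is None:
        return BSTNode(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root                       # duplicates are simply ignored here

def search(root, key):
    """O(height): O(log n) if the tree is balanced, O(n) if it is a chain."""
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root is not None

root = None
for k in [8, 3, 10, 1, 6]:
    root = insert(root, k)
print(search(root, 6), search(root, 7))   # True False
```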
full vs. complete vs. balanced vs. perfect
- Complete binary tree
  - every level is completely filled except possibly the last, and all nodes on the last level appear as far left as possible.
  - use-cases: binary-heap, array representations
- Full binary tree
  - every node has either \(0\) or \(2\) children (no node has exactly one child).
  - use-cases: huffman-encoding
- Perfect binary tree
  - both full and complete; equivalently, all internal nodes have \(2\) children and all leaves lie at the same depth.
- Balanced binary tree
  - take every node and find the heights of its left and right subtrees. if the difference in heights exceeds $k=1$, where $k$ can be any criterion we choose, at any node, then the tree is unbalanced. a sketch of this check follows below.
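
a sketch of that balance check with $k=1$; it works on any node object with `left` and `right` attributes (e.g. the `BSTNode` sketch above):

```python
def is_balanced(node):
    """return (balanced?, height); a missing child has height 0."""
    if node is None:
        return True, 0
    left_ok, left_h = is_balanced(node.left)
    right_ok, right_h = is_balanced(node.right)
    balanced = left_ok and right_ok and abs(left_h - right_h) <= 1   # k = 1
    return balanced, 1 + max(left_h, right_h)
```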
Red Black Trees
schematic
time complexity
| | Access | Search | Insertion | Deletion |
|---|---|---|---|---|
| Average | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ |
| Worst | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ |
I've just copied these down from bigocheatsheet.com blindly to close that tab and be done with it for now.
AVL Trees
An AVL tree is a binary search tree that is height-balanced; for each node $x$, the heights of the left and right subtrees of $x$ differ by at most 1.
To implement such a tree, maintain an extra attribute $h$ in each node such that $x.h$ is the height of node $x$.
| | Access | Search | Insertion | Deletion |
|---|---|---|---|---|
| Average | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ |
| Worst | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ |
same thing as above, here. I've just copied the time-complexities down without understanding them.
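
a small sketch of the bookkeeping described above: each node stores its height in `h`, and a balance factor is derived from the children's heights (the rotations that restore balance are omitted):

```python
class AVLNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.h = 1                    # height of a leaf node

def height(node):
    return node.h if node is not None else 0

def update_height(node):
    node.h = 1 + max(height(node.left), height(node.right))

def balance_factor(node):
    # AVL property: this must stay in {-1, 0, 1} for every node;
    # insertions/deletions that violate it trigger rotations (not shown)
    return height(node.left) - height(node.right)
```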
I also worry about the additional 5 'data-structures':
- skip lists
- cartesian tree
- b-tree (this one makes sense)
- splay tree
- kd tree
and then the short few that I have found of my own accord:
- tries
- heaps
- priority queue (of which a heap seems to be one common implementation)
heaps
schematic
graphs
schematic
representations
adjacency matrix, adjacency lists
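
a small sketch of both representations for the same (arbitrary) undirected graph:

```python
# the same 4-vertex undirected graph in both representations
# edges: (0,1), (0,2), (1,2), (2,3)
n = 4
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

# adjacency matrix: O(n^2) space, O(1) edge lookup
matrix = [[0] * n for _ in range(n)]
for u, v in edges:
    matrix[u][v] = matrix[v][u] = 1

# adjacency list: O(n + e) space, cheap iteration over neighbours
adj = {u: [] for u in range(n)}
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

print(matrix[0][1], adj[2])   # 1 [0, 1, 3]
```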