Data structures are methods of storing data in a way that optimizes certain important operations.
You won’t get far without any of these.
Note
|
This lesson is intended mainly as a reference, and as such contain summary sections. There are, of course, lesson portions; but if you are learning these for the first time, do not only read the summaries, as they will be of no help. |
Sequence Containers
Sequence containers are used to store groups of similar objects in some undefined sequence, usually chronological.
Vector | Linked List | Deque | |
---|---|---|---|
STL analogue |
|||
C analogue |
|
||
Random Access |
\(\text{O}(1)\) |
\(\text{O}(n)\) |
\(\text{O}(1)\) |
Push/Pop Back |
\(\text{O}(1)\) |
\(\text{O}(1)\) |
\(\text{O}(1)\) |
Push/Pop Front |
\(\text{O}(1)\) |
\(\text{O}(1)\) |
|
Insert/Erase |
\(\text{O}(n)\) |
\(\text{O}(1)\) |
\(\text{O}(n)\) |
Merge |
\(\text{O}(n)\) |
\(\text{O}(1)\) |
\(\text{O}(n)\) |
Vector
STL analogue |
|
C analogue |
|
Random Access |
Push/Pop Back |
Insert/Erase |
Merge |
\(\text{O}(1)\) |
\(\text{O}(1)\) |
\(\text{O}(n)\) |
\(\text{O}(n)\) |
Vectors are the most basic container: an array of elements.
Accessing an element at index \(i\) only requires looking \(i\) elements in memory after the first element. Since this process is so simple, we can do this in \(\text{O}(1)\). This ability is called random access.
We can resize vectors by reallocating memory and copying the data over. As this is also a simple step, this takes \(\text{O}(1)\) time.
To push[1] or pop[2], we need only to resize if necessary before assigning to the last element. This, again, takes \(\text{O}(1)\) time.
To insert or erase at any position, however, requires that every element after is shifted[3] so that there is space or the gap is filled. Since up to all the elements might need to be shifted, and each shift must be made with a swap, this takes \(\text{O}(n)\) time.
If we needed to merge two vectors and combine their data, then we would have to push up as many elements as are in both. So, once again, this operation can occur in \(\text{O}(n)\) time.
Vectors are your most basic container. Most applications for storing data will use these.
Linked List
STL analogue |
|
Random Access |
Push/Pop |
Insert/Erase |
Merge |
\(\text{O}(n)\) |
\(\text{O}(1)\) |
\(\text{O}(1)\) |
\(\text{O}(1)\) |
Linked lists use a very different approach from the vector. Instead of storing every element contiguously in memory, the location of the first element is stored and each element stores the location of the next element.
In this way, we get a "linked" list, since each element "links" to the next.
A doubly-linked list stores also the last element and every element also stores the location of the previous element.
We must traverse linked lists, however, element-by-element, since otherwise we don’t know where anything is. Access, then, is \(\text{O}(n)\).
To insert or erase at any point[6], we only need to manipulate a few pointers; as such, we can achieve \(\text{O}(1)\) time.
To push or pop from the front or back[7], we do exactly what we did with insertion and erasure, and once again achieve \(\text{O}(1)\) time.
And to merge, all we need to do is link the last element of one linked list to the first element of the other linked list — then traversal will go through both lists. As such, we can achieve \(\text{O}(1)\) merging.
Now, seeing that more operations here have \(\text{O}(1)\) time than with vectors, you may be asking, "Why not use linked lists everywhere?"
The reason why you don’t see them more is because most of the time, we need to do a lot of random-access, and that wouldn’t be good with an \(\text{O}(n)\) random-access speed.
They are, however, useful where you need to combine or separate lists in different orders; for example, if you were to rotate an array by \(k\) elements, you’d need to do \(n\) swaps in a vector compared to a break and merge with a linked list.
Double-Ended Queue (Deque)
STL analogue |
Random Access |
Push/Pop |
Insert/Erase |
Merge |
\(\text{O}(1)\) |
\(\text{O}(1)\) |
\(\text{O}(n)\) |
\(\text{O}(n)\) |
Double ended queues, or deques (pronounced like deck), were containers made for the purpose of quick pushing and popping from both the front and the back.
Typically, they take the form of a collection of vectors: pushing to the back adds to the end vector and pushing to the front adds to the begin vector. When the end or begin vector runs out of space, a new one is declared and instantiated to have the respective position.
The collection of vectors is usually handled like both a vector and a linked list; each vector links to the next and previous vectors for the sake of traversal, and a master vector of pointers to each vector is kept for the sake of random access.
As such, access takes two redirections, one to find where the desired vector is and one to find which element of the vector is the desired element, achieving \(\text{O}(1)\) time.
Pushing and popping is trivial, and once again takes \(\text{O}(1)\) time.
Insertion and Erasure, like the vector, requires the shifting of all elements, and as expected takes \(\text{O}(n)\) time.
Merging is, too, just like the vector[8]: it takes \(\text{O}(n)\) time to push every element.
Deques have operations which run exactly at the same time complexity as vectors, but they have more functionality. So once again, you might ask, "Why should I ever use the verctor if I can use a deque?"
Deques use much more overhead processing: the same operation on a deque usually takes twice as long as one on a vector. And freeing all the memory from a deque is an \(\text{O}(n)\) operation, which I learned from bitter experience.
Container Adaptors
Container adaptors take base sequence containers and specialize them for a specific purpose. These are the most simple of what are considered proper data structures.
Stack
STL analogue |
Peek |
Push |
Pop |
top |
top |
top |
Stacks are first-in, last-out containers. Think of a stack of objects, like paper: you can put stuff on and take stuff off the top, but it’s messy to do it from anywhere else.
You may have heard of the "stack" and "heap" when it comes to memory allocation; the "stack" refers to the stack of function calls. A function cannot return until all the functions it was calling returns; the first function to be called is the last one to be returned from, and vice-versa.
We say that where we can push and pop is the top of the stack. Most implementations also allow a peek, which is to return the value of the top element.
These have many applications in client-side interfaces, such as the call stack or in menus. These are typically implemented with vectors.
Queue
STL analogue |
Peek |
Push |
Pop |
front, back |
back |
front |
Queues are first-in, first-out containers. Think of a a queue of customers waiting for the cashier at Foody Mart: the first one in the line gets out of the line first. If someone cuts in, we get undesirable results.
We say that the front of the queue is where we pop, and the back of the queue is where we push. You can typically peek at the front, and sometimes at the back.
These have many applications in network-side interfaces; for example, you’d want to process the first packet you receive first rather than last. They’re also useful when dealing with multithreading (which is kind of like a network) to avoid desyncs.
These are typically implemented with deques.
Heap
STL analogue |
Peek | Push | Pop |
---|---|---|
max |
dependent |
max |
\(\text{O}(1)\) |
\(\text{O}(\log{n})\) |
\(\text{O}(\log{n})\) |
Heaps are containers which "automatically sort" data.
While their mechanisms are somewhat out of the scope of this article since you can just use std::priority_queue<T>
, the result is that the top element is always the maximum (or minimum if you use the opposing predicate).
These are very useful in graph theory, which will be the focus of the next lesson.
Trees
Binary Search Tree
STL analogue |
Find[9] |
Insert |
Erase |
Memory Usage |
\(\text{O}(\log{n})\) |
\(\text{O}(\log{n})\) |
\(\text{O}(1)\) |
\(\Theta(n)\) |
You’ve heard of binary search. The binary search tree constructs a tree optimized for searching.
Whereas a heap keeps one element in the right place, the binary tree keeps all of them in the right place. Unfortunately, because of its structure, the minimum/maximum element is almost never going to be near the beginning of the search, and so it isn’t just a better version of the heap.
Prefix Tree (Trie)
Find |
Insert |
Erase |
Memory Usage |
\(\text{O}(m)\)[10] |
\(\text{O}(m)\) |
\(\text{O}(m)\) |
\(\Theta(m\log{n})\) |
A trie stores prefixes (hence the name "prefix tree"). The name "trie" comes from the middle syllable of "retrieval." I’m not quite sure why.
This is a method for storing large numbers of strings in a way that’s both easy on memory and fast for retrieval.
Instead of making a giant array, we construct a tree.
Its root has all the first letters as children, and each node has all the next relevant letters as children.
A special character, usually nul
, denotes the end of string.
We most likely won’t run into this, but it’s useful to know about. Perhaps it’ll inspire a solution to some other problem.
Disjoint-Set
Find |
Union |
Memory Usage |
\(\text{O}(\alpha(n))\)[11] |
\(\text{O}(\alpha(n))\) |
\(\Theta(n)\) |
You will be asked to check if two things are in the same group. You will be told to merge groups, given two individual things: one from each group.
Enter the disjoint-set data structure!
We begin with an array as long as the number of elements.
We use two rules to determine which elements are in which sets:
-
If the value of an element is their index \(i\), then we say that that element is in set \(i\).
-
If the value \(n\) of an element is not their index, then they are in the set that the element at index \(n\) is at.
For example, take the following array:
<span id="L1" class="line"><span class="n">ind</span><span class="p">.</span> <span class="mi">0</span> <span class="mi">1</span> <span class="mi">2</span> <span class="mi">3</span> <span class="mi">4</span> <span class="mi">5</span> <span class="mi">6</span> <span class="mi">7</span> <span class="mi">8</span> <span class="mi">9</span></span>
<span id="L2" class="line"><span class="n">val</span><span class="p">.</span> <span class="mi">0</span> <span class="mi">1</span> <span class="mi">2</span> <span class="mi">2</span> <span class="mi">4</span> <span class="mi">5</span> <span class="mi">2</span> <span class="mi">3</span> <span class="mi">8</span> <span class="mi">9</span></span>
Elements \(\{0, 1, 2, 4, 5, 8, 9\}\) are in sets \(\{0, 1, 2, 4, 5, 8, 9\}\), respectively, by rule 1. Elements \(\{3, 6\}\) are in set \(2\) by rule 2. And element \(7\) is in the same set as element \(3\) by rule 2, so it’s also in set 2.
You can represent this situation mathematically as \(\{\{0\},\{1\},\{2,3,6,7\},\{4\},\{5\},\{8\},\{9\}\}\). We can also represent this in a forest of trees (hence why this is under the tree section) like so:
<span id="L1" class="line"><span class="mi">0</span> <span class="mi">1</span> <span class="mi">2</span> <span class="mi">4</span> <span class="mi">5</span> <span class="mi">8</span> <span class="mi">9</span></span>
<span id="L2" class="line"> <span class="o">|</span>\</span>
<span id="L3" class="line"> <span class="mi">3</span> <span class="mi">6</span></span>
<span id="L4" class="line"> <span class="o">|</span></span>
<span id="L5" class="line"> <span class="mi">7</span></span>
When we find, we simply traverse the tree. We can use this to do an optimization by flattening the tree: for every element we pass through, we set it to the root element. For example, if we passed a find operation through element \(7\) in our previous example, we would change \(7\) to point to \(2\), shortening later running times.
<span id="L1" class="line"><span class="mi">0</span> <span class="mi">1</span> <span class="mi">2</span> <span class="mi">4</span> <span class="mi">5</span> <span class="mi">8</span> <span class="mi">9</span></span>
<span id="L2" class="line"> <span class="o">/|</span>\</span>
<span id="L3" class="line"> <span class="mi">3</span> <span class="mi">6</span> <span class="mi">7</span></span>
When we merge, we only need to find the roots of both elements before setting one root to another root. For example, if we merged the sets of elements \(1\) and \(2\), we would set the value of element \(2\) to \(1\) (or vice versa, depending on implementation).
<span id="L1" class="line"><span class="mi">0</span> <span class="mi">1</span> <span class="mi">4</span> <span class="mi">5</span> <span class="mi">8</span> <span class="mi">9</span></span>
<span id="L2" class="line"> <span class="o">|</span></span>
<span id="L3" class="line"> <span class="mi">2</span></span>
<span id="L4" class="line"> <span class="o">/|</span>\</span>
<span id="L5" class="line"><span class="mi">3</span> <span class="mi">6</span> <span class="mi">7</span></span>
If we ran a merge now on \(7\) and \(5\), we would first do a find, and then a merge:
<span id="L1" class="line"><span class="mi">0</span> <span class="mi">1</span> <span class="mi">4</span> <span class="mi">5</span> <span class="mi">8</span> <span class="mi">9</span> <span class="mi">0</span> <span class="mi">1</span> <span class="mi">4</span> <span class="mi">8</span> <span class="mi">9</span></span>
<span id="L2" class="line"> <span class="o">|</span><span class="err">\</span> <span class="o">/|</span>\</span>
<span id="L3" class="line"> <span class="mi">2</span> <span class="mi">7</span> <span class="o">--></span> <span class="mi">2</span> <span class="mi">5</span> <span class="mi">7</span></span>
<span id="L4" class="line"> <span class="o">|</span><span class="err">\</span> <span class="o">|</span>\</span>
<span id="L5" class="line"> <span class="mi">3</span> <span class="mi">6</span> <span class="mi">3</span> <span class="mi">6</span></span>
Naturally, when asked for something that requires millions of elements, this becomes very useful.
Other
Hash Table
STL analogue |
|
Associative |
Keys can be of any type |
Find |
Insert |
Erase |
Memory Usage |
\(\text{O}(1)\) |
\(\text{O}(1)\) |
\(\text{O}(1)\) |
\(\Omega(n)\)[14] |
A hash table is an associative container: a value is found based on its key.
This is somewhat similar to that of a how you can access elements in vector by its index, but here, the keys can be of any type and, if an int
, can take any value, including negatives and numbers in the billions (which normally wouldn’t fit on a 32-bit pointer).
It does this by implementing a hash function that takes the key, does some voodoo magic on it, and returns an index small enough to keep in a smaller array. For example, if there’s fifty elements in a hash table, the table’s capacity might actually be something like seventy-three.
Doing this allows you to use large, negative, and non-integral keys.
For example, it’s very useful to use these to convert from std::string
[15] to int
.
Practice
Structure | Problems |
---|---|
Linked List |
|
Deque |
|
Stack |
|
Queue |
|
Heap |
|
Binary Search Tree |
|
Trie |
|
Disjoint-Set |
|
Hash Table |