This lesson is a draft.

We're currently in the process of finishing this lesson. Check back frequently as we continue to add more content and fix typos.

Data structures are methods of storing data in a way that optimizes certain important operations.

You won’t get far without these.

Note

This lesson is intended mainly as a reference, and as such, it contains summary sections.

There are, of course, lesson portions; but if you are learning these structures for the first time, do not read only the summaries, as they will not be enough on their own.

Sequence Containers

Sequence containers store groups of objects of the same type in a linear sequence, usually in the order they were inserted.

| | Vector | Linked List | Deque |
| --- | --- | --- | --- |
| STL analogue | std::vector<T> | std::list<T>, std::forward_list<T> | std::deque<T> |
| C analogue | T* | | |
| Random Access | \(\text{O}(1)\) | \(\text{O}(n)\) | \(\text{O}(1)\) |
| Push/Pop Back | \(\text{O}(1)\) | \(\text{O}(1)\) | \(\text{O}(1)\) |
| Push/Pop Front | | \(\text{O}(1)\) | \(\text{O}(1)\) |
| Insert/Erase | \(\text{O}(n)\) | \(\text{O}(1)\) | \(\text{O}(n)\) |
| Merge | \(\text{O}(n)\) | \(\text{O}(1)\) | \(\text{O}(n)\) |

Vector

STL analogue: std::vector<T>

C analogue: T*

| Random Access | Push/Pop Back | Insert/Erase | Merge |
| --- | --- | --- | --- |
| \(\text{O}(1)\) | \(\text{O}(1)\) | \(\text{O}(n)\) | \(\text{O}(n)\) |

Vectors are the most basic container: an array of elements.

Accessing an element at index \(i\) only requires looking \(i\) elements in memory after the first element. Since this process is so simple, we can do this in \(\text{O}(1)\). This ability is called random access.

We can resize vectors by reallocating memory and copying the data over. Copying every element costs \(\text{O}(n)\), but by growing the capacity geometrically (doubling it, for example), resizes become rare enough that the cost averages out.

To push[1] or pop[2], we need only resize if necessary before assigning to the last element. Thanks to the growth strategy above, this takes amortized \(\text{O}(1)\) time.

Inserting or erasing at an arbitrary position, however, requires that every element after it is shifted[3] so that space is made or the gap is filled. Since up to all the elements might need to be shifted, and each shift must be made with a swap, this takes \(\text{O}(n)\) time.

If we needed to merge two vectors and combine their data, then we would have to push as many elements as are in both. So, once again, this operation takes \(\text{O}(n)\) time.

Vectors are your most basic container. Most applications for storing data will use these.
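As a quick illustration, here is a minimal sketch of these operations with std::vector (the element values are arbitrary):

```cpp
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v = {1, 2, 3};

    v.push_back(4);             // amortized O(1): append to the back
    v.pop_back();               // O(1): remove from the back
    std::cout << v[1] << '\n';  // O(1) random access: prints 2

    v.insert(v.begin() + 1, 9); // O(n): shifts {2, 3} to the right
    v.erase(v.begin());         // O(n): shifts {9, 2, 3} to the left
}
```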

Linked List

STL analogue: std::list<T>[4], std::forward_list<T>[5] (C++11)

| Random Access | Push/Pop | Insert/Erase | Merge |
| --- | --- | --- | --- |
| \(\text{O}(n)\) | \(\text{O}(1)\) | \(\text{O}(1)\) | \(\text{O}(1)\) |

Linked lists use a very different approach from the vector. Instead of storing every element contiguously in memory, the location of the first element is stored and each element stores the location of the next element.

In this way, we get a "linked" list, since each element "links" to the next.

A doubly-linked list also stores the location of the last element, and each element additionally stores the location of the previous element.

Linked lists, however, must be traversed element-by-element, since we otherwise don’t know where anything is. Access, then, is \(\text{O}(n)\).

To insert or erase at any point[6], we only need to manipulate a few pointers; as such, we can achieve \(\text{O}(1)\) time.

To push or pop from the front or back[7], we do exactly what we did with insertion and erasure, and once again achieve \(\text{O}(1)\) time.

And to merge, all we need to do is link the last element of one linked list to the first element of the other linked list — then traversal will go through both lists. As such, we can achieve \(\text{O}(1)\) merging.

Now, seeing that more operations here have \(\text{O}(1)\) time than with vectors, you may be asking, "Why not use linked lists everywhere?"

The reason you don’t see them more often is that most of the time, we need to do a lot of random access, which isn’t pleasant at \(\text{O}(n)\) per access.

They are, however, useful where you need to combine or separate lists in different orders; for example, to rotate an array by \(k\) elements, you’d need \(n\) swaps with a vector, compared to a single break and merge with a linked list.
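The \(\text{O}(1)\) merge is exactly what std::list::splice does: it relinks one list’s nodes onto another without copying anything. A minimal sketch:

```cpp
#include <iostream>
#include <list>

int main() {
    std::list<int> a = {1, 2, 3};
    std::list<int> b = {4, 5, 6};

    // O(1): relink b's nodes onto the end of a; no elements are copied
    a.splice(a.end(), b);

    for (int x : a) std::cout << x << ' '; // prints 1 2 3 4 5 6
    std::cout << '\n';
}
```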

Double-Ended Queue (Deque)

STL analogue: std::deque<T>

| Random Access | Push/Pop | Insert/Erase | Merge |
| --- | --- | --- | --- |
| \(\text{O}(1)\) | \(\text{O}(1)\) | \(\text{O}(n)\) | \(\text{O}(n)\) |

Double-ended queues, or deques (pronounced like "deck"), are containers made for the purpose of quick pushing and popping at both the front and the back.

Typically, they take the form of a collection of vectors: pushing to the back adds to the last vector, and pushing to the front adds to the first. When the first or last vector runs out of space, a new one is allocated at the respective end.

The collection of vectors is usually handled like both a vector and a linked list; each vector links to the next and previous vectors for the sake of traversal, and a master vector of pointers to each vector is kept for the sake of random access.

As such, access takes two redirections: one to find the desired vector, and one to find the desired element within that vector. This still achieves \(\text{O}(1)\) time.

Pushing and popping is trivial, and once again takes \(\text{O}(1)\) time.

Insertion and erasure, as with the vector, require shifting all later elements, and as expected take \(\text{O}(n)\) time.

Merging, too, is just like the vector[8]: it takes \(\text{O}(n)\) time to push every element.

Deques run at exactly the same time complexities as vectors, but with more functionality. So once again, you might ask, "Why should I ever use the vector if I can use a deque?"

Deques carry much more processing overhead: the same operation on a deque usually takes about twice as long as on a vector. And freeing all the memory of a deque is an \(\text{O}(n)\) operation, which I learned from bitter experience.
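A minimal sketch of the front-and-back pushing with std::deque:

```cpp
#include <deque>
#include <iostream>

int main() {
    std::deque<int> d = {2, 3};

    d.push_front(1);           // O(1): grows the front block
    d.push_back(4);            // O(1): grows the back block
    std::cout << d[2] << '\n'; // O(1) random access: prints 3

    d.pop_front();             // O(1)
    d.pop_back();              // O(1)
}
```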

Container Adaptors

Container adaptors take the base sequence containers and specialize them for a specific purpose. They are the simplest of what are considered proper data structures.

Stack

STL analogue: std::stack<T>

| Peek | Push | Pop |
| --- | --- | --- |
| top | top | top |

Stacks are first-in, last-out containers. Think of a stack of objects, like paper: you can put stuff on and take stuff off the top, but it’s messy to do it from anywhere else.

You may have heard of the "stack" and "heap" when it comes to memory allocation; the "stack" there refers to the stack of function calls. A function cannot return until all the functions it calls have returned; the first function to be called is the last one to return, and vice-versa.

We say that where we can push and pop is the top of the stack. Most implementations also allow a peek, which is to return the value of the top element.

These have many applications in client-side interfaces, such as the call stack or in menus. These are typically implemented with vectors.
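A minimal sketch with std::stack:

```cpp
#include <iostream>
#include <stack>

int main() {
    std::stack<int> s;
    s.push(1);
    s.push(2);
    s.push(3);

    // First in, last out: prints 3 2 1
    while (!s.empty()) {
        std::cout << s.top() << ' '; // peek at the top
        s.pop();                     // pop from the top
    }
    std::cout << '\n';
}
```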

Queue

STL analogue: std::queue<T>

| Peek | Push | Pop |
| --- | --- | --- |
| front, back | back | front |

Queues are first-in, first-out containers. Think of a queue of customers waiting for the cashier at Foody Mart: the first one in the line gets out of the line first. If someone cuts in, we get undesirable results.

We say that the front of the queue is where we pop, and the back of the queue is where we push. You can typically peek at the front, and sometimes at the back.

These have many applications in network-side interfaces; for example, you’d want to process the first packet you receive first rather than last. They’re also useful when dealing with multithreading (which is kind of like a network) to avoid desyncs.

These are typically implemented with deques.
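A minimal sketch with std::queue:

```cpp
#include <iostream>
#include <queue>

int main() {
    std::queue<int> q;
    q.push(1); // push to the back
    q.push(2);
    q.push(3);

    // First in, first out: prints 1 2 3
    while (!q.empty()) {
        std::cout << q.front() << ' '; // peek at the front
        q.pop();                       // pop from the front
    }
    std::cout << '\n';
}
```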

Heap

STL analogue: std::priority_queue<T>

| Peek | Push | Pop |
| --- | --- | --- |
| max | dependent | max |
| \(\text{O}(1)\) | \(\text{O}(\log{n})\) | \(\text{O}(\log{n})\) |

Heaps are containers which "automatically sort" data. Their mechanisms are somewhat out of the scope of this lesson, since you can just use std::priority_queue<T>; the result is that the top element is always the maximum (or the minimum, if you use the opposing predicate).

These are very useful in graph theory, which will be the focus of the next lesson.
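A minimal sketch with std::priority_queue:

```cpp
#include <functional>
#include <iostream>
#include <queue>
#include <vector>

int main() {
    std::priority_queue<int> pq; // a max-heap by default
    pq.push(3);                  // O(log n)
    pq.push(1);
    pq.push(4);

    std::cout << pq.top() << '\n'; // O(1) peek: prints 4
    pq.pop();                      // O(log n): removes the maximum
    std::cout << pq.top() << '\n'; // prints 3

    // With the opposing predicate, the top is the minimum instead
    std::priority_queue<int, std::vector<int>, std::greater<int>> min_pq;
}
```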

Trees

Binary Search Tree

STL analogue: std::set<T>

| Find[9] | Insert | Erase | Memory Usage |
| --- | --- | --- | --- |
| \(\text{O}(\log{n})\) | \(\text{O}(\log{n})\) | \(\text{O}(1)\) | \(\Theta(n)\) |

You’ve heard of binary search. The binary search tree constructs a tree optimized for searching.

Whereas a heap keeps only one element in the right place, the binary search tree keeps all of them in the right place. Unfortunately, because of its structure, the minimum/maximum element is almost never near the start of a search, so it isn’t just a better version of the heap.
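A minimal sketch with std::set, which keeps its elements in sorted order:

```cpp
#include <iostream>
#include <set>

int main() {
    std::set<int> s = {5, 1, 3};

    s.insert(4);                     // O(log n)
    s.erase(1);                      // O(log n) lookup by key
    std::cout << s.count(3) << '\n'; // O(log n) find: prints 1

    for (int x : s) std::cout << x << ' '; // prints 3 4 5, in sorted order
    std::cout << '\n';
}
```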

Prefix Tree (Trie)

| Find | Insert | Erase | Memory Usage |
| --- | --- | --- | --- |
| \(\text{O}(m)\)[10] | \(\text{O}(m)\) | \(\text{O}(m)\) | \(\Theta(m\log{n})\) |

A trie stores prefixes (hence the name "prefix tree"). The name "trie" comes from the middle syllable of "retrieval." I’m not quite sure why.

This is a method for storing large numbers of strings in a way that’s both easy on memory and fast for retrieval. Instead of making a giant array, we construct a tree: its root has all the first letters as children, and each node has all the relevant next letters as children. A special character, usually NUL, denotes the end of a string.

We most likely won’t run into this, but it’s useful to know about. Perhaps it’ll inspire a solution to some other problem.
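For the curious, here is a minimal trie sketch over lowercase words; the names are my own, and the `end` flag plays the role of the end-of-string marker described above:

```cpp
#include <array>
#include <iostream>
#include <memory>
#include <string>

// One node per prefix; next[c - 'a'] is the child for letter c.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> next;
    bool end = false; // marks that a stored string ends here
};

void insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) { // O(m) for a word of length m
        auto& child = node->next[c - 'a'];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->end = true;
}

bool find(const TrieNode& root, const std::string& word) {
    const TrieNode* node = &root;
    for (char c : word) { // O(m): one hop per character
        node = node->next[c - 'a'].get();
        if (!node) return false;
    }
    return node->end;
}

int main() {
    TrieNode root;
    insert(root, "cat");
    insert(root, "car");
    std::cout << find(root, "cat") << ' ' << find(root, "ca") << '\n'; // 1 0
}
```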

Disjoint-Set

| Find | Union | Memory Usage |
| --- | --- | --- |
| \(\text{O}(\alpha(n))\)[11] | \(\text{O}(\alpha(n))\) | \(\Theta(n)\) |

Say you have some large number of things[12], each of which constitutes its own group[13].

You will be asked to check whether two things are in the same group, and you will be told to merge two groups, given one thing from each.

Enter the disjoint-set data structure!

We begin with an array as long as the number of elements.

We use two rules to determine which elements are in which sets:

  1. If the value of an element is its index \(i\), then we say that that element is in set \(i\).

  2. If the value \(n\) of an element is not its index, then the element is in whatever set the element at index \(n\) is in.

For example, take the following array:

<span id="L1" class="line"><span class="n">ind</span><span class="p">.</span> <span class="mi">0</span> <span class="mi">1</span> <span class="mi">2</span> <span class="mi">3</span> <span class="mi">4</span> <span class="mi">5</span> <span class="mi">6</span> <span class="mi">7</span> <span class="mi">8</span> <span class="mi">9</span></span>
<span id="L2" class="line"><span class="n">val</span><span class="p">.</span> <span class="mi">0</span> <span class="mi">1</span> <span class="mi">2</span> <span class="mi">2</span> <span class="mi">4</span> <span class="mi">5</span> <span class="mi">2</span> <span class="mi">3</span> <span class="mi">8</span> <span class="mi">9</span></span>

Elements \(\{0, 1, 2, 4, 5, 8, 9\}\) are in sets \(\{0, 1, 2, 4, 5, 8, 9\}\), respectively, by rule 1. Elements \(\{3, 6\}\) are in set \(2\) by rule 2. And element \(7\) is in the same set as element \(3\) by rule 2, so it’s also in set 2.

You can represent this situation mathematically as \(\{\{0\},\{1\},\{2,3,6,7\},\{4\},\{5\},\{8\},\{9\}\}\). We can also represent this in a forest of trees (hence why this is under the tree section) like so:

<span id="L1" class="line"><span class="mi">0</span> <span class="mi">1</span> <span class="mi">2</span> <span class="mi">4</span> <span class="mi">5</span> <span class="mi">8</span> <span class="mi">9</span></span>
<span id="L2" class="line">    <span class="o">|</span>\</span>
<span id="L3" class="line">    <span class="mi">3</span> <span class="mi">6</span></span>
<span id="L4" class="line">    <span class="o">|</span></span>
<span id="L5" class="line">    <span class="mi">7</span></span>

When we find, we simply traverse the tree. We can use this to perform an optimization by flattening the tree (known as path compression): every element we pass through is set to point directly to the root element. For example, if we passed a find operation through element \(7\) in our previous example, we would change \(7\) to point to \(2\), shortening later running times.

<span id="L1" class="line"><span class="mi">0</span> <span class="mi">1</span> <span class="mi">2</span> <span class="mi">4</span> <span class="mi">5</span> <span class="mi">8</span> <span class="mi">9</span></span>
<span id="L2" class="line">   <span class="o">/|</span>\</span>
<span id="L3" class="line">  <span class="mi">3</span> <span class="mi">6</span> <span class="mi">7</span></span>

When we merge, we only need to find the roots of both elements before setting one root to another root. For example, if we merged the sets of elements \(1\) and \(2\), we would set the value of element \(2\) to \(1\) (or vice versa, depending on implementation).

<span id="L1" class="line"><span class="mi">0</span> <span class="mi">1</span> <span class="mi">4</span> <span class="mi">5</span> <span class="mi">8</span> <span class="mi">9</span></span>
<span id="L2" class="line">  <span class="o">|</span></span>
<span id="L3" class="line">  <span class="mi">2</span></span>
<span id="L4" class="line"> <span class="o">/|</span>\</span>
<span id="L5" class="line"><span class="mi">3</span> <span class="mi">6</span> <span class="mi">7</span></span>

If we ran a merge now on \(7\) and \(5\), we would first do a find, and then a merge:

<span id="L1" class="line"><span class="mi">0</span> <span class="mi">1</span> <span class="mi">4</span> <span class="mi">5</span> <span class="mi">8</span> <span class="mi">9</span>     <span class="mi">0</span> <span class="mi">1</span> <span class="mi">4</span> <span class="mi">8</span> <span class="mi">9</span></span>
<span id="L2" class="line">  <span class="o">|</span><span class="err">\</span>             <span class="o">/|</span>\</span>
<span id="L3" class="line">  <span class="mi">2</span> <span class="mi">7</span>     <span class="o">--&gt;</span>   <span class="mi">2</span> <span class="mi">5</span> <span class="mi">7</span></span>
<span id="L4" class="line">  <span class="o">|</span><span class="err">\</span>            <span class="o">|</span>\</span>
<span id="L5" class="line">  <span class="mi">3</span> <span class="mi">6</span>           <span class="mi">3</span> <span class="mi">6</span></span>

Naturally, when asked for something that requires millions of elements, this becomes very useful.
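Putting it all together, here is a minimal sketch of the structure with the flattening optimization described above:

```cpp
#include <iostream>
#include <numeric>
#include <vector>

struct DisjointSet {
    std::vector<int> parent;

    explicit DisjointSet(int n) : parent(n) {
        // Rule 1: initially, every element is its own set
        std::iota(parent.begin(), parent.end(), 0);
    }

    int find(int x) {
        if (parent[x] == x) return x;
        return parent[x] = find(parent[x]); // flatten: point straight at the root
    }

    void merge(int a, int b) {
        parent[find(a)] = find(b); // link one root to the other
    }
};

int main() {
    DisjointSet ds(10);
    ds.merge(3, 2);
    ds.merge(6, 2);
    ds.merge(7, 3);
    std::cout << (ds.find(6) == ds.find(7)) << '\n'; // 1: same set
    std::cout << (ds.find(0) == ds.find(7)) << '\n'; // 0: different sets
}
```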

Other

Hash Table

STL analogue: std::unordered_map<const T,T>

Associative: keys can be of any type

| Find | Insert | Erase | Memory Usage |
| --- | --- | --- | --- |
| \(\text{O}(1)\) | \(\text{O}(1)\) | \(\text{O}(1)\) | \(\Omega(n)\)[14] |

A hash table is an associative container: a value is found based on its key. This is somewhat similar to how you access an element of a vector by its index, but here, the keys can be of any type and, if ints, can take any value, including negatives and numbers in the billions (which normally wouldn’t fit in a 32-bit index).

It does this by implementing a hash function that takes the key, does some voodoo magic on it, and returns an index small enough to keep in a much smaller array. For example, if there are fifty elements in a hash table, the table’s capacity might actually be something like seventy-three.

Doing this allows you to use large, negative, and non-integral keys. For example, it’s very useful for converting from std::string[15] to int.
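A minimal sketch of exactly that conversion with std::unordered_map (the keys and values are arbitrary):

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    // Average O(1) insert, find, and erase with std::string keys
    std::unordered_map<std::string, int> ids;
    ids["alice"] = 1; // insert
    ids["bob"] = 2;

    std::cout << ids.count("carol") << '\n'; // find: prints 0
    std::cout << ids["alice"] << '\n';       // find: prints 1
    ids.erase("bob");                        // erase
}
```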

Practice

Structure Problems

Linked List

Deque

Stack

CCC15S1: Zero that Out

CCC14S3: The Geneva Confection

Queue

CCC13S2: Bridge Transport

Heap

Binary Search Tree

Trie

Disjoint-Set

CCC15S3: Gates[16]

Hash Table

CCC96S4: When in Rome

CCC17S3: Nailed It!


1. insert to back
2. erase at back
3. to the right for insertions, to the left for deletions
4. doubly-linked list
5. singly-linked list
6. at where the current iterator is
7. only front in a singly-linked list
8. it can be argued that with the correct conditions, we can append a deque as with a linked list, but the master vector still needs to be updated in \(\text{O}(n)\) time
9. find and insert complexities are best and average case; the worst case is \(\text{O}(n)\), but that doesn’t occur as often.
10. for \(m\), the length of the data
11. \(\alpha(n)\) is the inverse Ackermann function, which for all practical inputs is less than five.
12. elements
13. set
14. the Big Omega (\(\Omega\)) denotes the lower bound: it’ll use at least some constant multiple of \(n\) memory.
15. only use std::string with std::unordered_map<const T, T>; you’ll run into lots of problems with char*
16. this problem’s problem statement starts out with the best sentence ever