Data structure introduction
We have already seen data structures such as arrays. All data structures have two main aspects:
- operation they support (search item, find minimum, find maximum, insert new item, delete an existing item etc.). We can think of this as an interface to the data structure.
- implementation of such data structure
Depending on the what operations we need to perform on our data, a specific data structure is more suitable than others.
Array
Arrays are good for certain operations but bad for others. Here are some pros and cons of static (size of the array is fixed) arrays.
Pros:
- Constant time access given the index
- Only consists of data, in contrary for linked lists we need to store data and a pointer for each entry
- Physical continuity (memory locality) between successive elements help exploit the high-speed cache-memory in modern CPUs.
Cons:
- Size of array must be known beforehand
- Resizing array is expensive
- Insertion/
deletion requires shifting subsequent elements
Dynamic array
Dynamic arrays might start with a specific size, and double the size once run out of space. Although resizing is expensive, we hope that we won't do it too often (for elements we need to double times).
The apparent waste in this procedure involves the recopying of the old contents on each expansion. If half the elements move once, a quarter of the elements move twice, and so on, the total number of movements is:
Thus each of the n elements move an average of only twice, the total work of managing the dynamic array is the same as a simple array .
Another disadvantage is it will waste memory by allocating extremely large arrays.
Linked lists
Searching in a linked list can be done iteratively or recursively. Searching in
sorted array could be done by binary search in . Access is
since we need to traverser from the head; compared to
in case of arrays. Insertion/
Pros of linked list:
- No overflow
- Insertion and deletion are simpler than static arrays
- With large records, moving pointer is much more efficient than moving the data part.
Stacks and Queues
Containers to store items, we access them in the order they came. Stack: Last in first out (LIFO).
Queue: First in first out (FIFO).
Stacks and Queues can be implemented using arrays or linked lists.
Dictionary/Dynamic set operation
Supports operations:
Search(Set, k)
: returns pointer to an itemx
such thatkey[x] = k
, ornil
if no such element exists.Insert(S, x)
Delete(S, x)
Min(S)
,Max(S)
(smallest/largest key) Predecessor(S, x)
/Successor(S, x)
There are variety of implementation of these dictionary operations, each of which yield different time complexities of different operations.
Example: Array based sets: unsorted arrays
Search(S, k)
- sequential search,Insert(S, x)
- place in first empty spot (last element), , can use dynamic array, no need to worry about resizingDelete(S, x)
- copy nth (last item) to thex
-th spot,Min/Max
- sequential searchSuccessor/Predecessor
- sequential search (previous/next index is not successor/ predecessor)
If we use sorted array, searching could be done by binary search and complexity reduces to .
Insert/delete
becomes costly as we need to keep them sorted.
complexity. Min/Max
is easy . Successor/Predecessor
is also
.
We can implement dictionary using singly/
Dictionary Operation | Singly unsorted | Singly sorted | Doubly unsorted | Doubly sorted |
---|---|---|---|---|
Search(L, k) | , sequential search | |||
Insert(L, x) | , insert in the head | |||
Delete(L, x) | , sequential search to arrive | |||
Successor/ Predecessor(L, x) | , need to go through all items | Successor , Predecessor | ||
Min/ Max(L) | Min , Max |
Hash tables
Hash tables are a very practical way to maintain a dictionary. The idea is simply that looking an item up in an array is once you have its index.
A hash function is a mathematical function which maps keys to integers. Since we are mapping larger set to a smaller set, collisions might happen. If the keys are uniformly distributed, then each bucket should contain very few keys.
Hash functions
A good hash function has following properties:
- Is cheap to evaluate
- Tends to use all positions from with uniform frequency.
Here is an example of hash function: say we want to map text data into integers. The first step is usually to map the key to a big integer, remember ascii characters can have code up to 128 (27).
Here is the key length.
Modular arithmetic
Then we use use modular arithmetic to map large integers to smaller ones, whose size is between 1 and the size of our hash table.
Where M is best a large prime not too close to , which would mask off the high bits. is the remainder of . This works on the same principle as a roulette wheel.
Collisions
No matter how good is our hash function, we have to prepared for collisions
since we are mapping larger space into a smaller size space. Recall the birthday
paradox: if you are in a party of about 23 people, there is 50% probability that
two person share the same birthday. And if there are about 56 people, the
probability is very close to 1. If we have too many collisions operations on our
data structure becomes less efficient. For example look up takes
without collision, now it becomes , where x
is the number of collision in a particular bin, and in worst case it is
.
Priority Queue
Supports following three operations:
Insert(Q, x)
: Given an itemx
with keyk
, insert into the priority queue .Min/Max(Q)
Delete-Min/Max(Q)
We could use balanced binary trees for queues, and all three operations could be done in time.