cs20pf22as11: Implementing a Hash Table

Goals

Practice designing a hash table that works similarly to Python's built-in set, but is written in pure Python.
Get some experience creating a data structure that performs constant-time (O(1)) insertion, search, and deletion operations.

Prerequisites

This assignment requires familiarity with the lecture materials presented in class through week 11.

Assignment

You shall define a class named HashSet in a module named hash_set that re-implements a basic version of Python's built-in set() type as a hash table. The class must offer the following functionality:

As discussed in lecture, an initial table size of 8 (you'll want to use a list internally), a maximum load factor of ⅔, and a new table size of the set length * 3 when the load factor is exceeded ¹⁾.
1. Make sure the class doesn't hash an inserted key more than once, i.e. caches hashes to reuse when it is time to rebuild a bigger table.
Method __init__() that accepts an optional iterable object containing initial keys for the set, à la the built-in set() constructor.
A linear probing sequence that alternates looking first forward and then backward an increasing distance from the collision index. e.g. if there is a collision at index 0, the probing sequence to find an unused index is 1, -1, 2, -2, etc.
Method __bool__() that returns True if the set is non-empty, and False otherwise.
Method __contains__() that tests for membership in the set.
Method __getitem__() that returns the key at a given index of the internal table.
Method __iter__() that yields keys from the internal table, in the order encountered in the table.
Method __len__() that returns the number of keys in the set.
Method add() that adds a key to the set (keys must be hashable).
Method table_size() that returns the size of the internal table.

It should go without saying, but you may not use built-in set, dict, or anything built on top of those. Implement your hash table using a list, as described in lecture.

Demonstration

Here is a REPL session demonstrating the expected behavior of the class:

>>> h = HashSet()
>>> len(h)
0
>>> h.table_size()
8
>>> list(h)
[]
>>> list(enumerate(h[i] for i in range(h.table_size())))
[(0, None), (1, None), (2, None), (3, None), (4, None), (5, None), (6, None), (7, None)]
>>> 
>>> h.add(1)
>>> len(h)
1
>>> h.table_size()
8
>>> list(h)
[1]
>>> list(enumerate(h[i] for i in range(h.table_size())))
[(0, None), (1, 1), (2, None), (3, None), (4, None), (5, None), (6, None), (7, None)]
>>> 
>>> h.add(10)
>>> len(h)
2
>>> h.table_size()
8
>>> list(enumerate(h[i] for i in range(h.table_size())))
[(0, None), (1, 1), (2, 10), (3, None), (4, None), (5, None), (6, None), (7, None)]
>>> list(h)
[1, 10]
>>> 
>>> h.add(9)  # collision at index 1, should probe 2 (used) and 0 (unused)
>>> len(h)
3
>>> list(h)
[9, 1, 10]
>>> h.table_size()
8
>>> list(enumerate(h[i] for i in range(h.table_size())))
[(0, 9), (1, 1), (2, 10), (3, None), (4, None), (5, None), (6, None), (7, None)]
>>> 
>>> h.add(10)  # duplicate, should be ignored
>>> len(h)
3
>>> list(h)
[9, 1, 10]
>>> list(enumerate(h[i] for i in range(h.table_size())))
[(0, 9), (1, 1), (2, 10), (3, None), (4, None), (5, None), (6, None), (7, None)]
>>> 
>>> h.add(18)  # collision at index 2, should probe 3 (unused)
>>> list(enumerate(h[i] for i in range(h.table_size())))
[(0, 9), (1, 1), (2, 10), (3, 18), (4, None), (5, None), (6, None), (7, None)]
>>> len(h)
4
>>> list(h)
[9, 1, 10, 18]
>>> 
>>> h.add (123)  # load factor 5/8, almost exceeds maximum...
>>> list(h)
[9, 1, 10, 18, 123]
>>> list(enumerate(h[i] for i in range(h.table_size())))
[(0, 9), (1, 1), (2, 10), (3, 18), (4, 123), (5, None), (6, None), (7, None)]
>>> 
>>> h.add(456)  # max load factor will exceed 2/3, must rebuild table
>>> len(h)
6
>>> list(h)
[18, 1, 456, 9, 10, 123]
>>> h.table_size()
18
>>> list(enumerate(h[i] for i in range(h.table_size())))
[(0, 18), (1, 1), (2, None), (3, None), (4, None), (5, None), (6, 456), (7, None), (8, None), (9, 9), (10, 10), (11, None), (12, None), (13, None), (14, None), (15, 123), (16, None), (17, None)]
>>>
>>> words = HashSet(map(str.rstrip, open('/srv/datasets/many-english-words.txt')))
>>> len(words)
704240
>>> words.table_size()
1376253
>>> 'PALOMA' in words
True
>>> 'PELOTA' in words
True
>>> 'PERICO' in words
False
>>> 'PICNIC' in words
True

Submission

Submit your source-code file(s) via turnin.

Feedback Robot

This project has a feedback robot that will run some tests on your submission and provide you with a feedback report via email within roughly one minute.

Please read the feedback carefully.

Due Date and Point Value

Due at 23:59:59 on the date listed on the syllabus.

Assignment 11 is worth 60 points.

Possible point values per category:
---------------------------------------
Method functionality                 60
  (split roughly evenly)
Possible deductions:
  Style and practices            10–20%
Possible extra credit:
  Submission via Git                 5%
---------------------------------------

¹⁾

i.e. the table sizes as set size increases should be 8, 18, 39, 81, 165, 333, 669, 1341, 2685, 5373, 10749, 21501, 43005, 86013, 172029, 344061, 688125, 1376253, 2752509, 5505021, 11010045, 22020093, etc.

portfolio

Table of Contents