====== cs20pf22as11: Implementing a Hash Table ======
===== Goals =====
* Practice designing a hash table that works similarly to Python's built-in ''set'', but is written in pure Python.
* Get some experience creating a data structure that performs constant-time (O(1)) insertion, search, and deletion operations.
-----
====== Prerequisites ======
This assignment requires familiarity with the [[:lecture materials/]] presented in class through [[:lecture materials/week 11]].
-----
====== Assignment ======
You shall define a class named ''HashSet'' in a module named ''hash_set'' that re-implements a basic version of Python's built-in ''set()'' type as a hash table. The class must offer the following functionality:
- As discussed in lecture, an initial table size of 8 (you'll want to use a ''list'' internally), a maximum load factor of ⅔, and a new table size of the set length * 3 when the load factor is exceeded ((i.e. the table sizes as set size increases should be 8, 18, 39, 81, 165, 333, 669, 1341, 2685, 5373, 10749, 21501, 43005, 86013, 172029, 344061, 688125, 1376253, 2752509, 5505021, 11010045, 22020093, etc.)).
- Make sure the class doesn't hash an inserted key more than once, i.e. caches hashes to reuse when it is time to rebuild a bigger table.
- Method ''__init__()'' that accepts an optional iterable object containing initial keys for the set, à la the built-in ''[[https://docs.python.org/3/library/functions.html#func-set|set()]]'' constructor.
- A [[https://en.wikipedia.org/wiki/Linear_probing|linear probing]] sequence that alternates looking first forward and then backward an increasing distance from the collision index. e.g. if there is a collision at index 0, the probing sequence to find an unused index is 1, -1, 2, -2, etc.
- Method ''__bool__()'' that returns ''True'' if the set is non-empty, and ''False'' otherwise.
- Method ''__contains__()'' that tests for membership in the set.
- Method ''__getitem__()'' that returns the key at a given index of the internal table.
- Method ''__iter__()'' that yields keys from the internal table, in the order encountered in the table.
- Method ''__len__()'' that returns the number of keys in the set.
- Method ''add()'' that adds a key to the set (keys must be [[https://docs.python.org/3.1/glossary.html#term-hashable|hashable]]).
- Method ''table_size()'' that returns the size of the internal table.
It should go without saying, but you may not use built-in ''set'', ''dict'', or anything built on top of those. Implement your hash table using a ''list'', as described in lecture.
===== Demonstration =====
Here is a REPL session demonstrating the expected behavior of the class:
>>> h = HashSet()
>>> len(h)
0
>>> h.table_size()
8
>>> list(h)
[]
>>> list(enumerate(h[i] for i in range(h.table_size())))
[(0, None), (1, None), (2, None), (3, None), (4, None), (5, None), (6, None), (7, None)]
>>>
>>> h.add(1)
>>> len(h)
1
>>> h.table_size()
8
>>> list(h)
[1]
>>> list(enumerate(h[i] for i in range(h.table_size())))
[(0, None), (1, 1), (2, None), (3, None), (4, None), (5, None), (6, None), (7, None)]
>>>
>>> h.add(10)
>>> len(h)
2
>>> h.table_size()
8
>>> list(enumerate(h[i] for i in range(h.table_size())))
[(0, None), (1, 1), (2, 10), (3, None), (4, None), (5, None), (6, None), (7, None)]
>>> list(h)
[1, 10]
>>>
>>> h.add(9) # collision at index 1, should probe 2 (used) and 0 (unused)
>>> len(h)
3
>>> list(h)
[9, 1, 10]
>>> h.table_size()
8
>>> list(enumerate(h[i] for i in range(h.table_size())))
[(0, 9), (1, 1), (2, 10), (3, None), (4, None), (5, None), (6, None), (7, None)]
>>>
>>> h.add(10) # duplicate, should be ignored
>>> len(h)
3
>>> list(h)
[9, 1, 10]
>>> list(enumerate(h[i] for i in range(h.table_size())))
[(0, 9), (1, 1), (2, 10), (3, None), (4, None), (5, None), (6, None), (7, None)]
>>>
>>> h.add(18) # collision at index 2, should probe 3 (unused)
>>> list(enumerate(h[i] for i in range(h.table_size())))
[(0, 9), (1, 1), (2, 10), (3, 18), (4, None), (5, None), (6, None), (7, None)]
>>> len(h)
4
>>> list(h)
[9, 1, 10, 18]
>>>
>>> h.add (123) # load factor 5/8, almost exceeds maximum...
>>> list(h)
[9, 1, 10, 18, 123]
>>> list(enumerate(h[i] for i in range(h.table_size())))
[(0, 9), (1, 1), (2, 10), (3, 18), (4, 123), (5, None), (6, None), (7, None)]
>>>
>>> h.add(456) # max load factor will exceed 2/3, must rebuild table
>>> len(h)
6
>>> list(h)
[18, 1, 456, 9, 10, 123]
>>> h.table_size()
18
>>> list(enumerate(h[i] for i in range(h.table_size())))
[(0, 18), (1, 1), (2, None), (3, None), (4, None), (5, None), (6, 456), (7, None), (8, None), (9, 9), (10, 10), (11, None), (12, None), (13, None), (14, None), (15, 123), (16, None), (17, None)]
>>>
>>> words = HashSet(map(str.rstrip, open('/srv/datasets/many-english-words.txt')))
>>> len(words)
704240
>>> words.table_size()
1376253
>>> 'PALOMA' in words
True
>>> 'PELOTA' in words
True
>>> 'PERICO' in words
False
>>> 'PICNIC' in words
True
-----
====== Submission ======
Submit your source-code file(s) via [[info:turnin]].
{{https://jeff.cis.cabrillo.edu/images/feedback-robot.png?nolink }} //**Feedback Robot**//
This project has a feedback robot that will run some tests on your submission and provide you with a feedback report via email within roughly one minute.
Please read the feedback carefully.
====== Due Date and Point Value ======
Due at 23:59:59 on the date listed on the [[:syllabus|syllabus]].
''Assignment 11'' is worth 60 points.
Possible point values per category:
---------------------------------------
Method functionality 60
(split roughly evenly)
Possible deductions:
Style and practices 10–20%
Possible extra credit:
Submission via Git 5%
---------------------------------------