====== cs20ps23as03: n-grams! ======
===== Goals =====
* Practice writing a module that serves both as a library of functions and as a script.
* Practice with user-defined functions and a bit of text processing.
-----
====== Prerequisites ======
This assignment requires familiarity with the [[:lecture materials/]] presented in class through [[:lecture materials/week 03]].
-----
====== Background ======
The arbiter of all human information describes [[https://en.wikipedia.org/wiki/N-gram|n-grams]] (not to be confused with [[https://en.wikipedia.org/wiki/Engram_(neuropsychology)|engrams]] or—Xenu forbid—[[https://en.wikipedia.org/wiki/Engram_(Dianetics)|engrams]]) as follows:
> In the fields of computational linguistics and probability, an n-gram (sometimes also called Q-gram) is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.
>
> Using Latin numerical prefixes, an n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram". English cardinal numbers are sometimes used, e.g., "four-gram", "five-gram", and so on. In computational biology, a polymer or oligomer of a known size is called a k-mer instead of an n-gram, with specific names using Greek numerical prefixes such as "monomer", "dimer", "trimer", "tetramer", "pentamer", etc., or English cardinal numbers, "one-mer", "two-mer", "three-mer", etc.
If we consider words to be the n-gram items, note the 1-grams (//unigrams//), 2-grams (//bigrams//), 3-grams (//trigrams//), and 4-grams contained in the following sentence:
^ Sentence | 'This is a sentence.' |
^ 1-grams | 'This', 'is', 'a', 'sentence' |
^ 2-grams | ('This', 'is'), ('is', 'a'), ('a', 'sentence') |
^ 3-grams | ('This', 'is', 'a'), ('is', 'a', 'sentence') |
^ 4-grams | ('This', 'is', 'a', 'sentence') |
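To make the sliding-window idea concrete, here is a minimal sketch of extracting word n-grams from a list of words with ''zip'' (the variable names are illustrative only, not part of the assignment interface):
<code python>
# zip() pairs every word with the n_gram_len - 1 words that follow it,
# yielding one tuple per n-gram, exactly as in the table above.
words = ['this', 'is', 'a', 'sentence']
n_gram_len = 2
grams = list(zip(*(words[i:] for i in range(n_gram_len))))
print(grams)  # [('this', 'is'), ('is', 'a'), ('a', 'sentence')]
</code>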
Search engines and other data-mining operations make heavy use of n-grams for the purpose of indexing, categorizing, and guessing the meaning and context of text. For example, Amazon uses [[https://en.wikipedia.org/wiki/Statistically_improbable_phrase|statistically improbable phrases]] (word n-grams that occur far more often in a given text than in a larger corpus) in books to determine keywords and important topics that a book covers, and researchers have used similar analysis to develop software that can fairly reliably determine a person's sex, political persuasion, and sexual orientation by analyzing their social-media posts. Google Books has a fun [[https://books.google.com/ngrams/graph?content=work+ethic%2Ctrigger+warning%2Csafe+space%2Cpersonal+responsibility&year_start=1800&year_end=2019&corpus=en-2019&smoothing=3|n-gram viewer]] tracking the prevalence of n-grams in books over time.
-----
====== Assignment ======
You shall define a module named ''n_grams'' with the following attributes:
- A function named ''n_grams'' that returns all word n-grams of a specified length that occur a minimum number of times in a body of text.
- A function named ''most_frequent_n_grams'' that returns the most frequent word n-grams in a body of text.
- A function named ''main'' called in a main block that prints all word n-grams of a specified length that occur a minimum number of times in the text on standard input, in descending order of frequency.
Use the docstrings and examples below to understand the specifics of each function and the main block.
For the purposes of this assignment, a word shall be defined as a sequence of non-whitespace characters, converted to lowercase, with the following characters stripped from either end: ''!"#%&'()*,-./:;?@\_¡§¶·¿''
===== Tips =====
You may find the ''[[https://docs.python.org/3/library/collections.html#collections.Counter|Counter]]'' and ''[[https://docs.python.org/3/library/collections.html#collections.defaultdict|defaultdict]]'' types from the ''collections'' library module useful for this assignment!
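For instance, a minimal sketch of how the two types might cooperate, using hypothetical bigram data:
<code python>
from collections import Counter, defaultdict

# Hypothetical data for illustration only.
bigrams = [('of', 'the'), ('in', 'the'), ('of', 'the'), ('of', 'the'), ('in', 'the')]

# Counter tallies hashable items, including tuples.
counts = Counter(bigrams)  # Counter({('of', 'the'): 3, ('in', 'the'): 2})

# defaultdict(list) makes it easy to group n-grams by occurrence count,
# as in the return value of n_grams().
by_count = defaultdict(list)
for gram, count in counts.items():
  by_count[count].append(gram)
print(dict(by_count))  # {3: [('of', 'the')], 2: [('in', 'the')]}

# Descending count with ties broken by ascending word order, as in
# most_frequent_n_grams():
top = sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:5]
print(top)  # [(('of', 'the'), 3), (('in', 'the'), 2)]
</code>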
===== Starter Code =====
Please start with the following code, which includes signatures and docstrings for each function:
r"""
CS 20P n-grams!
For the purposes of this assignment, a word shall be defined as a sequence of non-whitespace
characters, with the following characters stripped from either end: !"#%&'()*,-./:;?@\_¡§¶·¿
"""
__author__ = 'Someone in CS 20P, someone@jeff.cis.cabrillo.edu'
def n_grams(text: str, n_gram_len: int, min_count: int = 2) -> dict[int, list[tuple[str]]]:
"""
Finds and returns all word n-grams of length `n_gram_len`
occurring at least `min_count` times in `text`.
:param text: the text to analyze
:param n_gram_len: the desired length of n-grams (e.g. 2 for 2-grams)
:param min_count: the minimum number of times a n-gram must appear in the text to be counted
:return a dictionary mapping n-gram occurrence counts to a list of the n-grams occurring that
number of times, as a list of n_gram_len-tuples of strings in ascending
lexicographic/alphabetical order of the n-gram words.
"""
pass # TODO
def most_frequent_n_grams(text: str,
min_len: int = 1,
max_len: int = 10,
limit: int = 5) -> dict[int, list[tuple[tuple[str], int]]]:
"""
Returns a dictionary mapping n-gram lengths to a list of the most frequently occurring word
n-grams of that length, along with their occurrence counts, for n-grams appearing at least twice.
:param text: the text to analyze
:param min_len: the minimum n-gram length
:param max_len: the maximum n-gram length
:param limit: the maximum number of n-grams to display for each length
:return a dictionary mapping n-gram lengths to a list of the most frequently occurring n-grams
of that length, along with their occurrence counts, as a list of 2-tuples, where
each tuple contains a tuple of the words in an n-gram and the n-gram's occurrence count.
The list shall be sorted in descending order of occurrence count, with ties broken in
ascending lexicographic/alphabetical order of the n-gram words.
"""
pass # TODO
def main():
"""
Expects one or two command-line arguments:
sys.argv[1]: A length of n-gram (e.g. 2 for bigrams)
sys.argv[2] (optional): A minimum occurrence count (2 if unspecified)
Then prints, in descending order of occurrence count, lines containing (a) the occurrence count
and (b) a comma-separated list of all n-grams with that occurrence count,
in ascending alphabetical/lexicographic order.
"""
pass # TODO
if __name__ == '__main__':
main()
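To make the printed output shape concrete, here is a minimal sketch of one possible main block, assuming the ''n_grams'' function above; rendering each n-gram as its words joined by spaces is an assumption, so check ''cs20p_n_grams'' below for the authoritative format:
<code python>
import sys


def main():
  # Sketch only: parse the arguments, read standard input, print counts descending.
  n_gram_len = int(sys.argv[1])
  min_count = int(sys.argv[2]) if len(sys.argv) > 2 else 2
  results = n_grams(sys.stdin.read(), n_gram_len, min_count)
  for count in sorted(results, reverse=True):
    # Assumption: each n-gram prints as its words joined by spaces.
    print(count, ', '.join(' '.join(gram) for gram in results[count]))
</code>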
===== Sample Executables =====
Two executables on the server are available for checking your work on this assignment:
''cs20p_n_grams'' demonstrates the expected behavior of your main block, and indirectly what your ''n_grams'' function should return. It accepts one or two command-line arguments (the desired n-gram length, and optionally a minimum occurrence count, which defaults to 2) and reads the text to analyze from standard input.
e.g. to find all 5-grams occurring at least 3 times in file ''/srv/datasets/communist-manifesto.txt'':
<code>
$ cs20p_n_grams 5 3 < /srv/datasets/communist-manifesto.txt
</code>
The equivalent function call and return value would look something like this:
<code python>
>>> import n_grams
>>> import pprint
>>> pprint.pprint(n_grams.n_grams(open('/srv/datasets/communist-manifesto.txt').read(), n_gram_len=5, min_count=3))
{3: [('for', 'the', 'benefit', 'of', 'the'),
     ('intending', 'to', 'do', 'away', 'with'),
     ('the', 'hands', 'of', 'the', 'state')],
 4: [('benefit', 'of', 'the', 'working', 'class'),
     ('in', 'place', 'of', 'the', 'old'),
     ('the', 'benefit', 'of', 'the', 'working')],
 5: [('in', 'the', 'hands', 'of', 'the')],
 6: [('means', 'of', 'production', 'and', 'of'),
     ('of', 'production', 'and', 'of', 'exchange')]}
</code>
''cs20p_most_frequent_n_grams'' illustrates the expected return value of your ''most_frequent_n_grams'' function. It accepts up to 3 command-line arguments (the minimum n-gram length, default 1; the maximum n-gram length, default 10; and the maximum number of n-grams for each length, default 5) and likewise reads the text to analyze from standard input.
e.g. to find the top 5 most-frequently occurring 1-through-4-grams in file ''/srv/datasets/communist-manifesto.txt'':
<code>
$ cs20p_most_frequent_n_grams 1 4 5 < /srv/datasets/communist-manifesto.txt
</code>
The equivalent function call and return value would look something like this:
<code python>
>>> import n_grams
>>> import pprint
>>> pprint.pprint(n_grams.most_frequent_n_grams(open('/srv/datasets/communist-manifesto.txt').read(), max_len=4))
{1: [(('the',), 1136),
     (('of',), 778),
     (('and',), 344),
     (('in',), 287),
     (('to',), 262)],
 2: [(('of', 'the'), 240),
     (('in', 'the'), 87),
     (('the', 'bourgeoisie'), 67),
     (('the', 'proletariat'), 51),
     (('to', 'the'), 43)],
 3: [(('of', 'the', 'proletariat'), 23),
     (('the', 'working', 'class'), 16),
     (('of', 'the', 'bourgeoisie'), 15),
     (('of', 'production', 'and'), 12),
     (('of', 'the', 'old'), 12)],
 4: [(('of', 'the', 'working', 'class'), 11),
     (('of', 'production', 'and', 'of'), 8),
     (('in', 'the', 'hands', 'of'), 7),
     (('means', 'of', 'production', 'and'), 6),
     (('production', 'and', 'of', 'exchange'), 6)]}
</code>
===== Leaderboard =====
As submissions are received, this leaderboard will be updated with the top-performing fully or nearly functional solutions, ranked by execution speed.
^ Rank ^ Test Time (s) ^ Memory Usage (kB) ^ SLOC (lines) ^ User ^
-----
====== Submission ======
Submit ''n_grams.py'' via [[info:turnin]].
{{https://jeff.cis.cabrillo.edu/images/feedback-robot.png?nolink }} //**Feedback Robot**//
This project has a feedback robot that will run some tests on your submission and provide you with a feedback report via email within roughly one minute.
Please read the feedback carefully.
====== Due Date and Point Value ======
Due at 23:59:59 on the date listed on the [[:syllabus|syllabus]].
''Assignment 03'' is worth 60 points, but you may earn up to 30 points of extra credit, for a total of 90 points possible.
Possible point values per category:
<code>
---------------------------------------
Function n_grams()                   30
Main block                           30
Function most_frequent_n_grams()
  (extra credit)                     30

Possible deductions:
  Style and practices            10–20%

Possible extra credit:
  Submission via Git                  5%
---------------------------------------
</code>