cs20ps23as03: n-grams!

Goals

  • Practice writing a module that serves both as a library of functions and as a script.
  • Practice with user-defined functions and a bit of text processing.

Prerequisites

This assignment requires familiarity with the lecture materials presented in class through week 03.


Background

The arbiter of all human information describes n-grams (not to be confused with neurological engrams or, Xenu forbid, Scientology engrams) as follows:

In the fields of computational linguistics and probability, an n-gram (sometimes also called Q-gram) is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.

Using Latin numerical prefixes, an n-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram” (or, less commonly, a “digram”); size 3 is a “trigram”. English cardinal numbers are sometimes used, e.g., “four-gram”, “five-gram”, and so on. In computational biology, a polymer or oligomer of a known size is called a k-mer instead of an n-gram, with specific names using Greek numerical prefixes such as “monomer”, “dimer”, “trimer”, “tetramer”, “pentamer”, etc., or English cardinal numbers, “one-mer”, “two-mer”, “three-mer”, etc.

If we consider words to be the n-gram items in the following sentence, note the 1-grams (unigrams), 2-grams (bigrams), 3-grams (trigrams), and 4-grams contained therein:

Sentence:  'This is a sentence.'
1-grams:   'This', 'is', 'a', 'sentence'
2-grams:   ('This', 'is'), ('is', 'a'), ('a', 'sentence')
3-grams:   ('This', 'is', 'a'), ('is', 'a', 'sentence')
4-grams:   ('This', 'is', 'a', 'sentence')
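
To make the sliding window concrete, the n-grams above can be enumerated with something like the following sketch (punctuation handling, defined under Assignment below, is omitted here):

# Sketch: an n-gram is n consecutive words, i.e. one "column" of n shifted token lists.
tokens = 'This is a sentence'.split()
for n in range(1, 5):
    print(f'{n}-grams:', list(zip(*(tokens[i:] for i in range(n)))))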

Search engines and other data-mining operations make heavy use of n-grams for indexing, categorizing, and guessing the meaning and context of text. For example, Amazon uses statistically improbable phrases (word n-grams that occur far more often than expected) in books to determine keywords and important topics that a book covers, and researchers have used similar analysis to develop software that can fairly reliably determine a person's sex, political persuasion, and sexual orientation by analyzing their social-media posts. Google Books has a fun n-gram viewer that tracks the prevalence of n-grams in books over time.


Assignment

You shall define a module named n_grams with the following attributes:

  1. A function named n_grams that returns all word n-grams of a specified length that occur a minimum number of times in a body of text.
  2. A function named most_frequent_n_grams that returns the most frequent word n-grams in a body of text.
  3. A function named main, called in a main block, that prints all word n-grams of a specified length that occur a minimum number of times in the text on standard input, in descending order of frequency.

Use the docstrings and examples below to understand the specifics of each function and the main block.

For the purposes of this assignment, a word shall be defined as a sequence of non-whitespace characters, converted to lowercase, with the following characters stripped from either end:

!"#%&'()*,-./:;?@\_¡§¶·¿

Tips

You may find the Counter and defaultdict types from the collections library module useful for this assignment!
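
For instance, Counter can tally n-grams directly from an iterable of tuples; a toy illustration, not part of the required interface:

from collections import Counter

tokens = ['the', 'cat', 'and', 'the', 'dog', 'and', 'the', 'cat']
bigrams = Counter(zip(tokens, tokens[1:]))  # each key is a 2-tuple of adjacent words
print(bigrams.most_common(2))  # e.g. [(('the', 'cat'), 2), (('and', 'the'), 2)]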

Starter Code

Please start with the following code, which includes signatures and docstrings for each function:

n_grams.py
r"""
CS 20P n-grams!
 
For the purposes of this assignment, a word shall be defined as a sequence of non-whitespace
characters, converted to lowercase, with the following characters stripped from either end:
!"#%&'()*,-./:;?@\_¡§¶·¿
"""
__author__ = 'Someone in CS 20P, someone@jeff.cis.cabrillo.edu'
 
 
def n_grams(text: str, n_gram_len: int, min_count: int = 2) -> dict[int, list[tuple[str, ...]]]:
  """
  Finds and returns all word n-grams of length `n_gram_len`
  occurring at least `min_count` times in `text`.
 
  :param text: the text to analyze
  :param n_gram_len: the desired length of n-grams (e.g. 2 for 2-grams)
  :param min_count: the minimum number of times an n-gram must appear in the text to be counted
  :return: a dictionary mapping n-gram occurrence counts to a list of the n-grams occurring that
           number of times, as a list of n_gram_len-tuples of strings in ascending
           lexicographic/alphabetical order of the n-gram words.
  """
  pass  # TODO
 
 
def most_frequent_n_grams(text: str,
                          min_len: int = 1,
                          max_len: int = 10,
                          limit: int = 5) -> dict[int, list[tuple[tuple[str, ...], int]]]:
  """
  Returns a dictionary mapping n-gram lengths to a list of the most frequently occurring word
  n-grams of that length, along with their occurrence counts, for n-grams appearing at least twice.
 
  :param text: the text to analyze
  :param min_len: the minimum n-gram length
  :param max_len: the maximum n-gram length
  :param limit: the maximum number of n-grams to display for each length
  :return: a dictionary mapping n-gram lengths to a list of the most frequently occurring n-grams
          of that length, along with their occurrence counts, as a list of 2-tuples, where
          each tuple contains a tuple of the words in an n-gram and the n-gram's occurrence count.
          The list shall be sorted in descending order of occurrence count, with ties broken in
          ascending lexicographic/alphabetical order of the n-gram words.
  """
  pass  # TODO
 
 
def main():
  """
  Expects one or two command-line arguments:
  sys.argv[1]: A length of n-gram (e.g. 2 for bigrams)
  sys.argv[2] (optional): A minimum occurrence count (2 if unspecified)
  Then prints, in descending order of occurrence count, lines containing (a) the occurrence count
  and (b) a comma-separated list of all n-grams with that occurrence count,
  in ascending alphabetical/lexicographic order.
  """
  pass  # TODO
 
 
if __name__ == '__main__':
  main()

Sample Executables

Two executables on the server exist for the purposes of checking your work on this assignment:

cs20p_n_grams demonstrates the expected behavior of your main block, and indirectly what your n_grams function should return. It accepts one or two command-line arguments, namely the desired n-gram length and optionally a minimum occurrence count (defaults to 2).

For example, to find all 5-grams occurring at least 3 times in the file /srv/datasets/communist-manifesto.txt:

$ cs20p_n_grams 5 3 </srv/datasets/communist-manifesto.txt 
6 means of production and of, of production and of exchange
5 in the hands of the
4 benefit of the working class, in place of the old, the benefit of the working
3 for the benefit of the, intending to do away with, the hands of the state

The equivalent function call and return value would look something like this:

>>> import n_grams
>>> import pprint
>>> pprint.pprint(n_grams.n_grams(open('/srv/datasets/communist-manifesto.txt').read(), n_gram_len=5, min_count=3))
{3: [('for', 'the', 'benefit', 'of', 'the'),
     ('intending', 'to', 'do', 'away', 'with'),
     ('the', 'hands', 'of', 'the', 'state')],
 4: [('benefit', 'of', 'the', 'working', 'class'),
     ('in', 'place', 'of', 'the', 'old'),
     ('the', 'benefit', 'of', 'the', 'working')],
 5: [('in', 'the', 'hands', 'of', 'the')],
 6: [('means', 'of', 'production', 'and', 'of'),
     ('of', 'production', 'and', 'of', 'exchange')]}

cs20p_most_frequent_n_grams illustrates the expected return value of your most_frequent_n_grams function. It accepts up to 3 command-line arguments, namely the minimum n-gram length (default 1), the maximum n-gram length (default 10), and the maximum requested number of n-grams of each length (default 5).

For example, to find the 5 most frequently occurring n-grams of each length from 1 through 4 in the file /srv/datasets/communist-manifesto.txt:

$ cs20p_most_frequent_n_grams 1 4 5 </srv/datasets/communist-manifesto.txt 
1-grams:
	the (1136)
	of (778)
	and (344)
	in (287)
	to (262)
2-grams:
	of the (240)
	in the (87)
	the bourgeoisie (67)
	the proletariat (51)
	to the (43)
3-grams:
	of the proletariat (23)
	the working class (16)
	of the bourgeoisie (15)
	of production and (12)
	of the old (12)
4-grams:
	of the working class (11)
	of production and of (8)
	in the hands of (7)
	means of production and (6)
	production and of exchange (6)

The equivalent function call and return value would look something like this:

>>> import n_grams
>>> import pprint
>>> pprint.pprint(n_grams.most_frequent_n_grams(open('/srv/datasets/communist-manifesto.txt').read(), max_len=4))
{1: [(('the',), 1136),
     (('of',), 778),
     (('and',), 344),
     (('in',), 287),
     (('to',), 262)],
 2: [(('of', 'the'), 240),
     (('in', 'the'), 87),
     (('the', 'bourgeoisie'), 67),
     (('the', 'proletariat'), 51),
     (('to', 'the'), 43)],
 3: [(('of', 'the', 'proletariat'), 23),
     (('the', 'working', 'class'), 16),
     (('of', 'the', 'bourgeoisie'), 15),
     (('of', 'production', 'and'), 12),
     (('of', 'the', 'old'), 12)],
 4: [(('of', 'the', 'working', 'class'), 11),
     (('of', 'production', 'and', 'of'), 8),
     (('in', 'the', 'hands', 'of'), 7),
     (('means', 'of', 'production', 'and'), 6),
     (('production', 'and', 'of', 'exchange'), 6)]}
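
The ordering visible above (occurrence count descending, ties broken in ascending word order) can be expressed with a single sort key; an illustrative sketch, not part of the required interface:

# Illustrative: sort (n-gram, count) pairs by count descending, then by words ascending.
pairs = [(('of', 'the', 'old'), 12), (('the', 'working', 'class'), 16),
         (('of', 'production', 'and'), 12)]
pairs.sort(key=lambda pair: (-pair[1], pair[0]))
# -> the 16 first, then ('of', 'production', 'and') before ('of', 'the', 'old')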

Leaderboard

As submissions are received, this leaderboard will be updated with the top-performing fully or nearly functional solutions, ranked by execution speed.

Leaderboard columns: Rank, Test Time (s), Memory Usage (kB), SLOC (lines), User.


Submission

Submit n_grams.py via turnin.

Feedback Robot

This project has a feedback robot that will run some tests on your submission and provide you with a feedback report via email within roughly one minute.

Please read the feedback carefully.

Due Date and Point Value

Due at 23:59:59 on the date listed on the syllabus.

Assignment 03 is worth 60 points, but you may earn up to 30 points of extra credit, for a total of 90 points possible.

Possible point values per category:
---------------------------------------
Function n_grams()                   30
Main block                           30
Function most_frequent_n_grams()     30  (extra credit)
Possible deductions:
  Style and practices            10–20%
Possible extra credit:
  Submission via Git                  5%
---------------------------------------