====== cs20pf22as03: IPA Pronunciation ======

===== Goals =====

  * Continue practicing with Python built-in types and structures.
  * Create a module that is both importable to provide functionality and executable as a script.
  * Learn a bit more about natural language processing.

-----

====== Prerequisites ======

This assignment requires familiarity with the [[:lecture materials/]] presented in class through [[:lecture materials/week 03]].

-----

====== Background ======

===== CMUdict =====

> [[http://www.speech.cs.cmu.edu/cgi-bin/cmudict|The Carnegie Mellon University Pronouncing Dictionary]] is an open-source machine-readable pronunciation dictionary for North American English that contains over 134,000 words and their pronunciations.

I have downloaded and slightly post-processed the [[http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/|CMUdict dataset]], available in file ''/srv/datasets/cmudict-0.7b-words'' (also available [[http://jeff.cis.cabrillo.edu/datasets/cmudict-0.7b-words|via HTTP]]).

The format of file ''cmudict-0.7b-words'' is as follows:

Lines consist of a word followed by its pronunciation as a sequence of [[https://en.wikipedia.org/wiki/Phoneme|phonemes]] in [[https://en.wikipedia.org/wiki/ARPABET|ARPAbet]] format, e.g.:

  THINKING TH IH1 NG K IH0 NG

Words with multiple pronunciations are suffixed with (n), e.g.:

  SUBJECTS S AH1 B JH IH0 K T S
  SUBJECTS(1) S AH0 B JH EH1 K T S

Vowel phonemes that indicate a stressed syllable have integer suffixes:

  * Unstressed syllables: ''0''
  * Primary stressed syllables: ''1''
  * Secondary stressed syllables: ''2''

===== International Phonetic Alphabet =====

> The [[https://en.wikipedia.org/wiki/International_Phonetic_Alphabet|International Phonetic Alphabet]] (IPA) is an alphabetic system of phonetic notation based primarily on the Latin script. It was devised by the International Phonetic Association in the late 19th century as a standardized representation of speech sounds in written form. The IPA is used by lexicographers, foreign language students and teachers, linguists, speech–language pathologists, singers, actors, constructed language creators and translators.
>
> The IPA is designed to represent those qualities of speech that are part of lexical (and to a limited extent prosodic) sounds in oral language: phones, phonemes, intonation and the separation of words and syllables. To represent additional qualities of speech, such as tooth gnashing, lisping, and sounds made with a cleft lip and cleft palate, an extended set of symbols, the extensions to the International Phonetic Alphabet, may be used.

File ''/srv/datasets/arpabet-to-ipa'' on our server (also available [[http://jeff.cis.cabrillo.edu/datasets/arpabet-to-ipa|via HTTP]]) describes a translation from ARPAbet codes to IPA symbols. Each line contains two whitespace-delimited tokens, viz. an ARPAbet code and its equivalent IPA symbol, e.g.:


SH ʃ
ZH ʒ
HH h
M m
N n
NG ŋ
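
As a rough illustration of how the two file formats fit together, the sketch below parses mapping lines into a ''dict'' and translates a single CMUdict line. The function names ''parse_mapping'' and ''translate_entry'' are hypothetical. The sketch also assumes that the mapping file lists codes without stress digits (so the digits are stripped before lookup) and that the sample ''ROOF'' lines resemble what the dataset contains; check both against the actual files.

```python
import re


def parse_mapping(lines):
    """Build {ARPAbet code: IPA symbol} from 'CODE SYMBOL' lines (hypothetical helper)."""
    mapping = {}
    for line in lines:
        tokens = line.split()
        if len(tokens) == 2:  # skip blank or malformed lines
            code, symbol = tokens
            mapping[code] = symbol
    return mapping


def translate_entry(line, mapping):
    """Return (lowercased word, IPA string) for one CMUdict line (hypothetical helper)."""
    head, *phonemes = line.split()
    # Drop a trailing "(n)" marker such as the one in "SUBJECTS(1)".
    word = re.sub(r'\(\d+\)$', '', head).lower()
    # Assumption: stress digits 0/1/2 are not part of the mapping keys.
    ipa_string = ''.join(mapping[p.rstrip('012')] for p in phonemes)
    return word, ipa_string


# Tiny in-memory stand-ins for the two data files (assumed contents).
sample_mapping = parse_mapping(['R r', 'UW u', 'UH ʊ', 'F f'])
print(translate_entry('ROOF R UW1 F', sample_mapping))
print(translate_entry('ROOF(1) R UH1 F', sample_mapping))
```

Splitting on arbitrary whitespace keeps the sketch indifferent to whether the files separate tokens with one space, several spaces, or tabs.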

-----

====== Assignment ======

You shall write a Python module named ''cmudict_ipa'' that offers the following functionality:

  * Function ''cmudict_ipa.ipa()'' returns a list of IPA pronunciations of a word.
  * Running ''cmudict_ipa.py'' as a script receives words as input and prints their pronunciations on standard output.

Both aspects are specified in detail below.

===== Function cmudict_ipa.ipa() =====

Here is a stub for this function:


def ipa(word: str) -> list[str]:
  '''
  Returns a list of IPA pronunciations of a given word, if available.
  Calling this function shall not result in opening any data files.

  :param word: the word in question, case-insensitive
  :return: a list of IPA pronunciations of the word in lexicographic order,
           e.g. ipa('roof') == ['ruf', 'rʊf']. Returns an empty list if the
           CMUdict dataset lacks an entry for the word.
  '''
  pass

Note that the term //lexicographic order// is fulfilled by the natural ordering of Python ''str''ings when sorted. More formally, in a sorted list of strings named ''pronunciations'', the following is ''True'':

  all(pronunciations[i] <= pronunciations[i+1] for i in range(len(pronunciations) - 1))

===== Script Functionality and Sample Executable =====

Ensure that your module has a main block that receives input words and prints their pronunciations to standard output (one pronunciation per line):

  - If there are [[https://docs.python.org/3/library/sys.html#sys.argv|command-line arguments]] present, use them as the input words.
  - Otherwise, use any whitespace-delimited tokens on [[https://docs.python.org/3/library/sys.html#sys.stdin|standard input]].

A sample executable named ''cs20p_cmudict_ipa'' is available on our server, and demonstrates the expected behavior of your script. In the following examples, substituting ''./cmudict_ipa.py'' for ''cs20p_cmudict_ipa'' should generate the same output.


% cs20p_cmudict_ipa <<


FYI, you can use the ''[[https://en.wikipedia.org/wiki/Diff|diff]]'' utility to compare your script's output with that of the sample executable. In the following example, we will use all the words from John Muir's essay "Mountain Thoughts" as input. If the outputs are identical, you will see no output on screen. If there are differences, lines beginning with ''-'' are unique to the sample executable, lines beginning with ''+'' are unique to your script, and lines without any prefix are identical.


diff -u \
  <(grep -oP "[\w']+" ~/datasets/muir-mountain-thoughts.txt | cs20p_cmudict_ipa) \
  <(grep -oP "[\w']+" ~/datasets/muir-mountain-thoughts.txt | ./cmudict_ipa.py)
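
For anyone who would rather compare outputs from within Python, a similar unified diff can be produced with the standard ''difflib'' module. This is only an illustrative sketch: the two lists stand in for captured output lines, and the labels ''sample'' and ''mine'' are arbitrary.

```python
import difflib

# Stand-ins for the output lines of the sample executable and of your script.
expected = ['ruf', 'rʊf']
actual = ['ruf']

# lineterm='' because the stand-in lines carry no trailing newlines.
diff_lines = list(difflib.unified_diff(
    expected, actual, fromfile='sample', tofile='mine', lineterm=''))

for line in diff_lines:
    print(line)
```

As with ''diff -u'', identical inputs yield no output at all. In unified format, lines common to both inputs are shown as context with a single leading space.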



===== Suggested/Possible Module Structure =====

Since this is a multi-purpose module meant to serve both as an importable module with a callable function and as a standalone script, it's worth thinking about various ways to structure the code.

I would suggest defining two or more //internal// functions (i.e., with names beginning with an underscore):

  * One or more functions responsible for populating ''dict''ionaries with relevant data from the two data files, for use within function ''ipa()'', to be called from anywhere within the module, likely returning values to be assigned to variables within the module.
  * A function that implements the script functionality, i.e. to be called from within the main block.
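
One possible shape for such a module is sketched below; it is not the structure shown in the screenshot. All underscore-prefixed names are hypothetical. In-memory sample lines stand in for the two data files so that the sketch runs anywhere, whereas a real module would pass the opened files at import time. Stripping stress digits before lookup is an assumption about the mapping file.

```python
import re
from collections import defaultdict

# Assumed sample content; a real module would read the two data files instead.
_SAMPLE_MAPPING = ['R r', 'UW u', 'UH ʊ', 'F f']
_SAMPLE_CMUDICT = ['ROOF R UW1 F', 'ROOF(1) R UH1 F']


def _load_symbols(lines):
    """Return {ARPAbet code: IPA symbol} from two-token lines."""
    return dict(line.split() for line in lines if line.strip())


def _load_pronunciations(lines, symbols):
    """Return {lowercased word: sorted list of IPA pronunciations}."""
    found = defaultdict(set)
    for line in lines:
        if not line.strip():
            continue
        head, *phonemes = line.split()
        word = re.sub(r'\(\d+\)$', '', head).lower()  # drop "(n)" suffix
        # Assumption: mapping keys carry no stress digits.
        found[word].add(''.join(symbols[p.rstrip('012')] for p in phonemes))
    return {word: sorted(ipas) for word, ipas in found.items()}


# Populated once at import time, so ipa() itself never opens a file.
_PRONUNCIATIONS = _load_pronunciations(_SAMPLE_CMUDICT,
                                       _load_symbols(_SAMPLE_MAPPING))


def ipa(word):
    """Look up a word case-insensitively; return a fresh list each call."""
    return list(_PRONUNCIATIONS.get(word.lower(), []))


print(ipa('Roof'))
print(ipa('xyzzy'))
```

Doing the sorting once while loading, rather than on every call, keeps each lookup down to a single dictionary access.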

Here is a screenshot of PyCharm's "Structure" view showing the **f**unction and **v**ariable attributes extant in my module:

{{:cmudict_ipa_structure.png?nolink|Screenshot from PyCharm's "Structure" view showing the function and variable attributes extant in my module}}

===== Leaderboard =====

As submissions are received, this leaderboard will be updated with the top-performing fully functional solutions, with regard to execution speed.





^ Rank ^ Test Time (s) ^ Memory Usage (kB) ^ SLOC (lines) ^ User ^







-----

====== Submission ======

Submit ''cmudict_ipa.py'' via [[info:turnin]].


{{https://jeff.cis.cabrillo.edu/images/feedback-robot.png?nolink }} //**Feedback Robot**//

This project has a feedback robot that will run some tests on your submission and provide you with a feedback report via email within roughly one minute.

Please read the feedback carefully.


====== Due Date and Point Value ======


Due at 23:59:59 on the date listed on the [[:syllabus|syllabus]].

''Assignment 03'' is worth 60 points.


Possible point values per category:
----------------------------------------
Correct output from cmudict_ipa.ipa() 30
Script functionality                  30
Possible deductions:
  Style and practices            10–20%
Possible extra credit:
  Submission via Git                 5%
----------------------------------------