cs20pf22as03: IPA Pronunciation
Goals
- Continue practicing with Python built-in types and structures.
- Create a module that is both importable to provide functionality and executable as a script.
- Learn a bit more about natural language processing.
Prerequisites
This assignment requires familiarity with the lecture materials presented in class through week 03.
Background
CMUdict
The Carnegie Mellon University Pronouncing Dictionary is an open-source machine-readable pronunciation dictionary for North American English that contains over 134,000 words and their pronunciations.
I have downloaded and slightly post-processed the CMUdict dataset, available in file /srv/datasets/cmudict-0.7b-words (also available via HTTP).
The format of file cmudict-0.7b-words is as follows:
Lines consist of a word followed by its pronunciation as a sequence of phonemes in ARPAbet format, e.g.:
THINKING TH IH1 NG K IH0 NG
Words with multiple pronunciations are suffixed with (n), e.g.:
SUBJECTS S AH1 B JH IH0 K T S
SUBJECTS(1) S AH0 B JH EH1 K T S
Vowel phonemes that indicate a stressed syllable have integer suffixes:
- Unstressed syllables: 0
- Primary stressed syllables: 1
- Secondary stressed syllables: 2
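The line format described above can be taken apart with ordinary string operations. The following is a minimal sketch, not a required approach; the names parse_cmudict_line and split_stress are hypothetical, and the only things assumed about the data are the "(n)" suffix and the 0/1/2 stress digits described above.

```python
import re

# Matches a trailing "(n)" variant marker such as the "(1)" in "SUBJECTS(1)".
_VARIANT_SUFFIX = re.compile(r'\(\d+\)$')


def parse_cmudict_line(line: str) -> tuple[str, list[str]]:
    """Split one CMUdict line into (word, phonemes), dropping any "(n)" suffix."""
    head, *phonemes = line.split()
    word = _VARIANT_SUFFIX.sub('', head)
    return word, phonemes


def split_stress(phoneme: str) -> tuple[str, str]:
    """Separate a trailing stress digit (0, 1, or 2) from an ARPAbet code, if any."""
    if phoneme[-1] in '012':
        return phoneme[:-1], phoneme[-1]
    return phoneme, ''


word, phonemes = parse_cmudict_line('SUBJECTS(1) S AH0 B JH EH1 K T S')
print(word, phonemes)
print([split_stress(p) for p in phonemes])
```

Both variants of a word then share the same key (here, SUBJECTS), which makes it straightforward to collect all pronunciations of a word under one dictionary entry.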
International Phonetic Alphabet
The International Phonetic Alphabet (IPA) is an alphabetic system of phonetic notation based primarily on the Latin script. It was devised by the International Phonetic Association in the late 19th century as a standardized representation of speech sounds in written form. The IPA is used by lexicographers, foreign language students and teachers, linguists, speech–language pathologists, singers, actors, constructed language creators and translators.
The IPA is designed to represent those qualities of speech that are part of lexical (and to a limited extent prosodic) sounds in oral language: phones, phonemes, intonation and the separation of words and syllables.[1] To represent additional qualities of speech, such as tooth gnashing, lisping, and sounds made with a cleft lip and cleft palate, an extended set of symbols, the extensions to the International Phonetic Alphabet, may be used.
File /srv/datasets/arpabet-to-ipa on our server (also available via HTTP) describes a translation from ARPAbet codes to IPA symbols. Each line contains two whitespace-delimited tokens, viz. an ARPAbet code and its equivalent IPA symbol, e.g.:
SH ʃ
ZH ʒ
HH h
M m
N n
NG ŋ
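Because each line holds exactly two tokens, the file maps naturally onto a Python dict. Here is a rough sketch under stated assumptions: the function names load_arpabet_to_ipa and to_ipa are hypothetical, the input is the sample entries shown above rather than the real file, and since this page does not say whether the file lists vowel codes with or without their stress digits, the lookup tries the code as written and then falls back to the code without its digit.

```python
def load_arpabet_to_ipa(lines) -> dict[str, str]:
    """Build {ARPAbet code: IPA symbol} from an iterable of two-token lines."""
    table = {}
    for line in lines:
        tokens = line.split()
        if len(tokens) == 2:  # skip blank or malformed lines
            code, symbol = tokens
            table[code] = symbol
    return table


def to_ipa(phonemes: list[str], table: dict[str, str]) -> str:
    """Concatenate the IPA symbols for a sequence of ARPAbet codes."""
    symbols = []
    for code in phonemes:
        # Assumption: try the code as written, then without its stress digit.
        key = code if code in table else code.rstrip('012')
        symbols.append(table[key])
    return ''.join(symbols)


# The sample entries shown above, standing in for the real file.
sample_lines = ['SH ʃ', 'ZH ʒ', 'HH h', 'M m', 'N n', 'NG ŋ']
table = load_arpabet_to_ipa(sample_lines)
print(to_ipa(['SH', 'M', 'NG'], table))
```

Passing an iterable of lines, rather than a filename, means the same function works with an open file object or with a small in-memory list while testing.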
Assignment
You shall write a Python module named cmudict_ipa that offers the following functionality:
- Function cmudict_ipa.ipa() returns a list of IPA pronunciations of a word.
- Running cmudict_ipa.py as a script receives words as input and prints their pronunciations on standard output.
Both aspects are specified in detail below.
Function cmudict_ipa.ipa()
Here is a stub for this function:
def ipa(word: str) -> list[str]:
    '''
    Returns a list of IPA pronunciations of a given word, if available.
    Calling this function shall not result in opening any data files.

    :param word: the word in question, case-insensitive
    :return: a list of IPA pronunciations of the word in lexicographic order,
             e.g. ipa('roof') == ['ruf', 'rʊf']. Returns an empty list if
             the CMUdict dataset lacks an entry for the word.
    '''
    pass
Note that lexicographic order here means the natural ordering of Python strings, i.e. the order that sorting them produces. More formally, in a sorted list of strings named pronunciations, the following expression is True:
all(pronunciations[i] <= pronunciations[i+1] for i in range(len(pronunciations) - 1))
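As a small illustration of that property, using only the 'roof' example from the docstring: Python compares strings by Unicode code point, so 'ruf' sorts before 'rʊf' because 'u' (U+0075) precedes 'ʊ' (U+028A). The helper name is_sorted is hypothetical.

```python
def is_sorted(items: list[str]) -> bool:
    """Check the ordering condition stated above."""
    return all(items[i] <= items[i + 1] for i in range(len(items) - 1))


unordered = ['rʊf', 'ruf']
pronunciations = sorted(unordered)

print(is_sorted(unordered))       # the unsorted input violates the condition
print(pronunciations)             # sorted() puts 'ruf' first
print(is_sorted(pronunciations))  # the sorted list satisfies it
```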
Script Functionality and Sample Executable
Ensure that your module has a main block that receives input words and prints their pronunciations to standard output (one pronunciation per line):
- If there are command-line arguments present, use them as the input words.
- Otherwise, use any whitespace-delimited tokens on standard input.
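The two input rules above amount to a single conditional. The sketch below shows one possible way to express them; the function name input_words is hypothetical, and it is exercised here with stand-in inputs rather than the real command line and standard input.

```python
import io


def input_words(argv: list[str], stdin) -> list[str]:
    """Return the command-line words if any, else whitespace-delimited tokens from stdin."""
    if argv:
        return list(argv)
    return stdin.read().split()


# Demonstration with stand-in inputs rather than the real sys.argv / sys.stdin.
print(input_words(['roof', 'well'], io.StringIO('ignored')))
print(input_words([], io.StringIO('should we\ntell her')))
```

In a real script, the main block would presumably pass sys.argv[1:] and sys.stdin to such a function; str.split() with no arguments already treats spaces, tabs, and newlines alike as delimiters.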
A sample executable named cs20p_cmudict_ipa is available on our server, and demonstrates the expected behavior of your script. In the following examples, substituting ./cmudict_ipa.py for cs20p_cmudict_ipa should generate the same output.
% cs20p_cmudict_ipa <<<roof
ruf
rʊf
% tail -5 /srv/datasets/cat-in-the-hat.txt | grep -oP "[\w']+" | cs20p_cmudict_ipa
ʃʊd
wi
tel
hɜr
əbaʊt
ɪt
naʊ
hwʌt
wʌt
ʃʊd
wi
du
wel
hwʌt
wʌt
wʊd
ju
du
ɪf
jɔr
jʊr
mʌðɜr
æskt
æst
ju
FYI, you can use the diff utility to compare your script's output with that of the sample executable. In the following example, we will use all the words from John Muir's essay “Mountain Thoughts” as input. If the outputs are identical, you will see no output on screen. If there are differences, lines beginning with - are unique to the sample executable, lines beginning with + are unique to your script, and lines without any prefix are identical.
diff -u \
  <(grep -oP "[\w']+" ~/datasets/muir-mountain-thoughts.txt | cs20p_cmudict_ipa) \
  <(grep -oP "[\w']+" ~/datasets/muir-mountain-thoughts.txt | ./cmudict_ipa.py)
Suggested/Possible Module Structure
Since this is a multi-purpose module meant to serve both as an importable module with a callable function and as a standalone script, it's worth thinking about various ways to structure the code.
I would suggest defining two or more internal functions (i.e., with names beginning with an underscore):
- One or more functions responsible for populating dictionaries with relevant data from the two data files, for use within function ipa(), to be called from anywhere within the module, likely returning values to be assigned to variables within the module.
- A function that implements the script functionality, i.e. to be called from within the main block.
Here is a screenshot of PyCharm's “Structure” view showing the function and variable attributes extant in my module:
Leaderboard
As submissions are received, this leaderboard will be updated with the top-performing fully functional solutions, with regard to execution speed.
Rank | Test Time (s) | Memory Usage (kB) | SLOC (lines) | User
---|---|---|---|---
Submission
Submit cmudict_ipa.py via turnin.
Feedback Robot
This project has a feedback robot that will run some tests on your submission and provide you with a feedback report via email within roughly one minute.
Please read the feedback carefully.
Due Date and Point Value
Due at 23:59:59 on the date listed on the syllabus.
Assignment 03 is worth 60 points.
Possible point values per category:
----------------------------------------
Correct output from cmudict_ipa.ipa()  30
Script functionality                   30
Possible deductions:
Style and practices                10–20%
Possible extra credit:
Submission via Git                     5%
----------------------------------------