cs20pf22as03: IPA Pronunciation

Goals

  • Continue practicing with Python built-in types and structures.
  • Create a module that is both importable to provide functionality and executable as a script.
  • Learn a bit more about natural language processing.

Prerequisites

This assignment requires familiarity with the lecture materials presented in class through week 03.


Background

CMUdict

The Carnegie Mellon University Pronouncing Dictionary is an open-source machine-readable pronunciation dictionary for North American English that contains over 134,000 words and their pronunciations.

I have downloaded and slightly post-processed the CMUdict dataset; the result is available on our server in file /srv/datasets/cmudict-0.7b-words (also available via HTTP).

The format of file cmudict-0.7b-words is as follows:

Lines consist of a word followed by its pronunciation as a sequence of phonemes in ARPAbet format, e.g.:

THINKING  TH IH1 NG K IH0 NG

Each additional pronunciation of a word is listed on its own line, with the word suffixed by (n), e.g.:

SUBJECTS  S AH1 B JH IH0 K T S
SUBJECTS(1)  S AH0 B JH EH1 K T S
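As a minimal sketch of how one such line might be taken apart, the following splits a line into its word and phoneme list and drops any (n) suffix. The helper name parse_cmudict_line is hypothetical, and the sketch assumes every line is non-blank and whitespace-delimited as in the examples above.

```python
import re

# Matches a trailing variant marker such as '(1)' at the end of a word.
_VARIANT_SUFFIX = re.compile(r'\(\d+\)$')


def parse_cmudict_line(line: str) -> tuple[str, list[str]]:
    """Split one dataset line into (word, phonemes), dropping any '(n)' suffix."""
    word, *phonemes = line.split()
    return _VARIANT_SUFFIX.sub('', word), phonemes


print(parse_cmudict_line('SUBJECTS  S AH1 B JH IH0 K T S'))
print(parse_cmudict_line('SUBJECTS(1)  S AH0 B JH EH1 K T S'))
```

Both calls yield the same word, SUBJECTS, so the two phoneme lists can be collected under a single dictionary key.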

Each vowel phoneme carries an integer suffix that indicates the stress of its syllable:

  • Unstressed syllables: 0
  • Primary stressed syllables: 1
  • Secondary stressed syllables: 2
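To illustrate the suffixes listed above, here is a small sketch that separates a phoneme into its base code and stress digit. The function name split_stress is hypothetical; whether your solution needs the digit at all depends on how the ARPAbet-to-IPA translation file is keyed, which you should check for yourself.

```python
from typing import Optional


def split_stress(phoneme: str) -> tuple[str, Optional[int]]:
    """Separate an ARPAbet phoneme into (base code, stress digit or None)."""
    if phoneme[-1] in '012':
        return phoneme[:-1], int(phoneme[-1])
    return phoneme, None  # consonants carry no stress marker


for phoneme in 'TH IH1 NG K IH0 NG'.split():
    print(phoneme, split_stress(phoneme))
```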

International Phonetic Alphabet

The International Phonetic Alphabet (IPA) is an alphabetic system of phonetic notation based primarily on the Latin script. It was devised by the International Phonetic Association in the late 19th century as a standardized representation of speech sounds in written form. The IPA is used by lexicographers, foreign language students and teachers, linguists, speech–language pathologists, singers, actors, constructed language creators and translators.

The IPA is designed to represent those qualities of speech that are part of lexical (and to a limited extent prosodic) sounds in oral language: phones, phonemes, intonation and the separation of words and syllables.[1] To represent additional qualities of speech, such as tooth gnashing, lisping, and sounds made with a cleft lip and cleft palate, an extended set of symbols, the extensions to the International Phonetic Alphabet, may be used.

File /srv/datasets/arpabet-to-ipa on our server (also available via HTTP) describes a translation from ARPAbet codes to IPA symbols. Each line contains two whitespace-delimited tokens, viz. an ARPAbet code and its equivalent IPA symbol, e.g.:

SH ʃ
ZH ʒ
HH h
M m
N n
NG ŋ
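The two-token format lends itself to a dictionary. The sketch below builds one from the six sample lines shown above; the function name load_arpabet_to_ipa is hypothetical, and it accepts any iterable of lines so that it can be tried without access to the server (an open file object would work the same way).

```python
from typing import Iterable


def load_arpabet_to_ipa(lines: Iterable[str]) -> dict[str, str]:
    """Build a mapping from ARPAbet code to IPA symbol."""
    mapping = {}
    for line in lines:
        tokens = line.split()
        if len(tokens) == 2:  # skip blank or malformed lines
            code, symbol = tokens
            mapping[code] = symbol
    return mapping


sample_lines = ['SH ʃ\n', 'ZH ʒ\n', 'HH h\n', 'M m\n', 'N n\n', 'NG ŋ\n']
table = load_arpabet_to_ipa(sample_lines)
print(table['NG'])
```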

Assignment

You shall write a Python module named cmudict_ipa that offers the following functionality:

  • Function cmudict_ipa.ipa() returns a list of IPA pronunciations of a word.
  • When run as a script, cmudict_ipa.py receives words as input and prints their pronunciations on standard output.

Both aspects are specified in detail below.

Function cmudict_ipa.ipa()

Here is a stub for this function:

def ipa(word: str) -> list[str]:
  '''
  Returns a list of IPA pronunciations of a given word, if available.
  Calling this function shall not result in opening any data files.
 
  :param word: the word in question, case-insensitive
  :return: a list of IPA pronunciations of the word in lexicographic order,
           e.g. ipa('roof') == ['ruf', 'rʊf']. Returns an empty list if the
           CMUdict dataset lacks an entry for the word.
  '''
  pass

Note that lexicographic order here means the natural ordering of Python strings, i.e. the order that sorting produces. More formally, in a sorted list of strings named pronunciations, the following is True:

all(pronunciations[i] <= pronunciations[i+1] for i in range(len(pronunciations) - 1))
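Python compares strings code point by code point, which is why 'ruf' precedes 'rʊf' in the docstring example. A short sketch that checks this with the two pronunciations of "roof":

```python
unsorted_pronunciations = ['rʊf', 'ruf']
pronunciations = sorted(unsorted_pronunciations)
print(pronunciations)

# The strings first differ at index 1: 'u' is U+0075 and 'ʊ' is U+028A.
print([hex(ord(p[1])) for p in pronunciations])

# The property stated above holds for the sorted list.
print(all(pronunciations[i] <= pronunciations[i + 1]
          for i in range(len(pronunciations) - 1)))
```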

Script Functionality and Sample Executable

Ensure that your module has a main block that receives input words and prints their pronunciations to standard output (one pronunciation per line):

  1. If there are command-line arguments present, use them as the input words.
  2. Otherwise, use any whitespace-delimited tokens on standard input.
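The two rules above might be sketched as follows. The function name input_words is hypothetical; it takes the argument list and input stream as parameters so that the behavior can be tried with in-memory data, whereas the script itself would presumably pass sys.argv and sys.stdin.

```python
import io


def input_words(argv: list[str], stdin) -> list[str]:
    """Return command-line words if any are present, else tokens from stdin."""
    if len(argv) > 1:          # argv[0] is the script name itself
        return argv[1:]
    return stdin.read().split()  # split() with no argument splits on any whitespace


# Rule 1: arguments are present, so standard input is ignored.
print(input_words(['cmudict_ipa.py', 'roof'], io.StringIO('ignored')))
# Rule 2: no arguments, so tokens come from standard input.
print(input_words(['cmudict_ipa.py'], io.StringIO('should we\ntell her\n')))
```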

A sample executable named cs20p_cmudict_ipa is available on our server, and demonstrates the expected behavior of your script. In the following examples, substituting ./cmudict_ipa.py for cs20p_cmudict_ipa should generate the same output.

% cs20p_cmudict_ipa <<<roof
ruf
rʊf
% tail -5 /srv/datasets/cat-in-the-hat.txt | grep -oP "[\w']+" | cs20p_cmudict_ipa
ʃʊd
wi
tel
hɜr
əbaʊt
ɪt
naʊ
hwʌt
wʌt
ʃʊd
wi
du
wel
hwʌt
wʌt
wʊd
ju
du
ɪf
jɔr
jʊr
mʌðɜr
æskt
æst
ju

FYI, you can use the diff utility to compare your script's output with that of the sample executable. The following example uses all the words from John Muir's essay “Mountain Thoughts” as input. If the outputs are identical, diff prints nothing. If they differ, lines beginning with - appear only in the sample executable's output, lines beginning with + appear only in your script's output, and lines with neither prefix are common to both.

diff -u \
  <(grep -oP "[\w']+" ~/datasets/muir-mountain-thoughts.txt | cs20p_cmudict_ipa) \
  <(grep -oP "[\w']+" ~/datasets/muir-mountain-thoughts.txt | ./cmudict_ipa.py)
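If you want to see what the - and + prefixes look like without running both programs, the standard library's difflib module produces the same unified format. In this sketch both output lists are made up for illustration; the second one contains a deliberate mistake.

```python
import difflib

expected = ['ruf', 'rʊf']  # hypothetical output of the sample executable
actual = ['ruf', 'rUf']    # hypothetical output of a buggy script

diff = list(difflib.unified_diff(expected, actual,
                                 fromfile='sample', tofile='mine', lineterm=''))
print('\n'.join(diff))

# Identical inputs produce no diff lines at all.
print(list(difflib.unified_diff(expected, expected, lineterm='')))
```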

Suggested/Possible Module Structure

Since this is a multi-purpose module meant to serve both as an importable module with a callable function and as a standalone script, it's worth thinking about various ways to structure the code.

I would suggest defining two or more internal functions (i.e., with names beginning with an underscore):

  • One or more functions that populate dictionaries with the relevant data from the two data files, for use within function ipa(). These can be called from anywhere within the module, and would likely return values to be assigned to module-level variables.
  • A function that implements the script functionality, i.e. to be called from within the main block.

Here is a screenshot of PyCharm's “Structure” view showing the function and variable attributes extant in my module:

[Screenshot: PyCharm's “Structure” view of the module's function and variable attributes]

Leaderboard

As submissions are received, this leaderboard will be updated with the top-performing fully functional solutions, with regard to execution speed.

Leaderboard columns: Rank, Test Time (s), Memory Usage (kB), SLOC (lines), User. The live table refreshes every 60 seconds.


Submission

Submit cmudict_ipa.py via turnin.

Feedback Robot

This project has a feedback robot that will run some tests on your submission and provide you with a feedback report via email within roughly one minute.

Please read the feedback carefully.

Due Date and Point Value

Due at 23:59:59 on the date listed on the syllabus.

Assignment 03 is worth 60 points.

Possible point values per category:
----------------------------------------
Correct output from cmudict_ipa.ipa() 30
Script functionality                  30
Possible deductions:
  Style and practices            10–20%
Possible extra credit:
  Submission via Git                 5%
----------------------------------------
cs20pf22as03.txt · Last modified: 2023-03-27 17:22 by 127.0.0.1