cs19f22as05: Leveraging the STL to Estimate English Text Readability

Goals

Practice taking advantage of various parts of the STL.
Learn a bit about simple readability formulas.

Prerequisites

This assignment requires familiarity with the lecture materials presented in class through week 05.

Background

There are many ways to estimate the readability of a text. Such estimates are often of interest to educators when determining the appropriate age or grade level for books and other reading material.

Definitions

In this assignment you will implement several automated readability tests on English text. We will work with the following definitions:

A word shall be any whitespace-delimited token that remains non-empty after removing all characters other than the following (case-insensitive) and stripping single-quotes and hyphens from both ends:
```
abcdefghijklmnopqrstuvwxyz'-
```
A sentence shall be denoted by any whitespace-delimited token that ends with any of the following, ignoring any single or double quotes at the end of the token:
```
.?!
```
A character shall be any character in a word, as defined above.
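
The word and sentence definitions above might be sketched as follows. The helper names `normalize_word` and `ends_sentence` are hypothetical, and lowercasing the result is an assumption made here so that later lookups can ignore case; it is one reading of the definitions, not necessarily how the sample executable works.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: reduce a whitespace-delimited token to a word.
// Returns an empty string when the token does not contain a word.
std::string normalize_word(const std::string& token) {
  std::string kept;
  for (unsigned char c : token) {
    // Keep only letters, single-quotes, and hyphens (lowercased by assumption).
    if (std::isalpha(c) || c == '\'' || c == '-')
      kept += static_cast<char>(std::tolower(c));
  }
  // Strip single-quotes and hyphens from both ends.
  const std::string edge = "'-";
  const auto first = kept.find_first_not_of(edge);
  if (first == std::string::npos) return "";
  const auto last = kept.find_last_not_of(edge);
  return kept.substr(first, last - first + 1);
}

// Hypothetical helper: does this token mark the end of a sentence?
// Trailing single and double quotes are ignored before checking for . ? !
bool ends_sentence(const std::string& token) {
  const auto last = token.find_last_not_of("'\"");
  if (last == std::string::npos) return false;
  const char c = token[last];
  return c == '.' || c == '?' || c == '!';
}
```

With these helpers, a counting loop could read tokens with `std::cin >> token`, add one to the word count and `normalize_word(token).size()` to the character count whenever the result is non-empty, and add one to the sentence count whenever `ends_sentence(token)` is true.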
The number of syllables in a word shall be:
- The number of syllables indicated in file /srv/datasets/syllables.txt, if the file contains an entry for the word.
  - Some words have multiple syllable counts. In that case, opt for the highest count.
- If the above file does not contain an entry for the word, estimate that each sequence of five characters in a word represents a syllable, and round to the nearest syllable count for the word.
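
The syllable rule above could look something like this. The layout of syllables.txt is not described here, so the sketch leaves parsing out and assumes each (word, count) entry is handed to a hypothetical `record_syllables` function. Note that a literal reading of the fallback gives 0 syllables for words of one or two characters; the assignment does not state a minimum, so the sample executable's -v output is the reference for that case.

```cpp
#include <algorithm>
#include <cmath>
#include <string>
#include <unordered_map>

using SyllableTable = std::unordered_map<std::string, int>;

// Record one entry, keeping the highest count seen so far for the word.
// operator[] starts a new word at 0, so std::max keeps any positive count.
void record_syllables(SyllableTable& table, const std::string& word, int count) {
  int& slot = table[word];
  slot = std::max(slot, count);
}

// Use the table when the word is present; otherwise estimate one syllable
// per five characters, rounded to the nearest whole number.
int syllable_count(const SyllableTable& table, const std::string& word) {
  const auto it = table.find(word);
  if (it != table.end()) return it->second;
  return static_cast<int>(std::round(word.size() / 5.0));
}
```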

If you are unsure of how a segment of text should be classified with regard to the above definitions, consider running the sample executable with the -v flag as described below.

Readability Test Formulas

You will be implementing the following readability tests, which I have slightly simplified in some cases.

Automated Readability Index

The automated readability index (ARI) produces an estimate of the US grade level needed to comprehend a text. Here is the formula:

$$ 4.71 \left (\frac{\mbox{characters}}{\mbox{words}} \right) + 0.5 \left (\frac{\mbox{words}}{\mbox{sentences}} \right) - 21.43 $$

characters is the total number of characters in all words, words is the total number of words, and sentences is the total number of sentences.
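
As a minimal sketch, the ARI formula translates directly into a function of the three counts. The function name is hypothetical; the one detail worth noticing is the conversion to `double` before dividing, since dividing two integer counts would truncate.

```cpp
// Automated readability index from raw counts.
// Assumes words > 0 and sentences > 0; the caller must guard empty input.
double automated_readability_index(long characters, long words, long sentences) {
  const double w = static_cast<double>(words);  // force floating-point division
  return 4.71 * (characters / w) + 0.5 * (w / sentences) - 21.43;
}
```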

Coleman-Liau Index

The Coleman-Liau index produces an estimate of the US grade level needed to comprehend a text. Here is the formula:

$$ 0.0588{L} - 0.296{S} - 15.8\,\! $$

L is the average number of characters per 100 words and S is the average number of sentences per 100 words.
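
The per-100-words averages can be derived from the same raw counts used by the other tests. The sketch below (function name hypothetical) computes L and S that way and then applies the formula.

```cpp
// Coleman-Liau index from raw counts.
// Assumes words > 0; the caller must guard empty input.
double coleman_liau_index(long characters, long words, long sentences) {
  const double w = static_cast<double>(words);
  const double L = characters / w * 100.0;  // characters per 100 words
  const double S = sentences / w * 100.0;   // sentences per 100 words
  return 0.0588 * L - 0.296 * S - 15.8;
}
```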

Score	Age	Grade Level
1	5-6	Kindergarten
2	6-7	First Grade
3	7-8	Second Grade
4	8-9	Third Grade
5	9-10	Fourth Grade
6	10-11	Fifth Grade
7	11-12	Sixth Grade
8	12-13	Seventh Grade
9	13-14	Eighth Grade
10	14-15	Ninth Grade
11	15-16	Tenth Grade
12	16-17	Eleventh Grade
13	17-18	Twelfth Grade
14	18-22	College student

Dale-Chall Readability Score

The Dale-Chall readability score produces a numeric measure of how difficult a text is to comprehend. Here is the formula:

$$ 0.1579 \left (\frac{\mbox{difficult words}}{\mbox{words}}\times 100 \right) + 0.0496 \left (\frac{\mbox{words}}{\mbox{sentences}} \right) $$

difficult words is the number of words that are not present in a list of 3000 “easy” words (available in file /srv/datasets/dale-chall_familiar_words.txt), words is the total number of words, and sentences is the total number of sentences.
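
One possible way to count difficult words is a set lookup per word, as sketched below. The function name is hypothetical, and the sketch assumes that the easy-word list has already been loaded into an `std::unordered_set` and that both the list and the words are lowercase. Holding every word in a vector is for clarity only; a running count would use less memory.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Dale-Chall readability score (as simplified in this assignment).
// Assumes words is non-empty and sentences > 0.
double dale_chall_score(const std::vector<std::string>& words,
                        const std::unordered_set<std::string>& easy_words,
                        long sentences) {
  long difficult = 0;
  for (const std::string& word : words) {
    // A word is difficult when it is absent from the easy-word list.
    if (easy_words.count(word) == 0) ++difficult;
  }
  const double n = static_cast<double>(words.size());
  return 0.1579 * (difficult / n * 100.0) + 0.0496 * (n / sentences);
}
```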

Score	Notes
4.9 or lower	easily understood by an average 4th-grade student or lower
5.0–5.9	easily understood by an average 5th or 6th-grade student
6.0–6.9	easily understood by an average 7th or 8th-grade student
7.0–7.9	easily understood by an average 9th or 10th-grade student
8.0–8.9	easily understood by an average 11th or 12th-grade student
9.0–9.9	easily understood by an average 13th to 15th-grade (college) student

Flesch-Kincaid Grade Level

The Flesch-Kincaid grade level produces an estimate of the US grade level needed to comprehend a text. Here is the formula:

$$ 0.39 \left ( \frac{\mbox{words}}{\mbox{sentences}} \right ) + 11.8 \left ( \frac{\mbox{syllables}}{\mbox{words}} \right ) - 15.59 $$

words is the total number of words, sentences is the total number of sentences, and syllables is the total number of syllables.
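
A minimal sketch of the formula as a function (the name is hypothetical). With the counts from the sample run in the Sample Executable section (9 words, 1 sentence, 10 syllables), the expression evaluates to about 1.0311, matching the 1.031 printed there.

```cpp
// Flesch-Kincaid grade level from raw counts.
// Assumes words > 0 and sentences > 0.
double flesch_kincaid_grade_level(long words, long sentences, long syllables) {
  const double w = static_cast<double>(words);  // avoid integer division
  return 0.39 * (w / sentences) + 11.8 * (syllables / w) - 15.59;
}
```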

Gunning Fog Index

The Gunning fog index produces an estimate of the number of years of education needed to comprehend a text. Here is the formula:

$$ 0.4\left[ \left(\frac{\mbox{words}}{\mbox{sentences}}\right) + 100\left(\frac{\mbox{complex words}}{\mbox{words}}\right) \right] $$

words is the total number of words, sentences is the total number of sentences, and complex words is the total number of words consisting of three or more syllables.
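
The formula can be sketched as a function of the three counts (the name is hypothetical). Using the counts from the sample run on cat-in-the-hat.txt in the Sample Executable section (1620 words, 243 sentences, 2 complex words), it evaluates to about 2.7160, in line with the 2.716 shown there.

```cpp
// Gunning fog index from raw counts.
// Assumes words > 0 and sentences > 0.
double gunning_fog_index(long words, long sentences, long complex_words) {
  const double w = static_cast<double>(words);  // avoid integer division
  return 0.4 * ((w / sentences) + 100.0 * (complex_words / w));
}
```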

Fog Index	Reading level by grade
17	College graduate
16	College senior
15	College junior
14	College sophomore
13	College freshman
12	High school senior
11	High school junior
10	High school sophomore
9	High school freshman
8	Eighth grade
7	Seventh grade
6	Sixth grade

SMOG Grade

The SMOG ("Simple Measure of Gobbledygook") grade produces an estimate of the number of years of education needed to comprehend a text. Here is the formula:

$$ \mbox{grade} = 1.0430 \sqrt{\mbox{complex words}\times{30 \over \mbox{sentences}} } + 3.1291 $$

sentences is the total number of sentences, and complex words is the total number of words consisting of three or more syllables.
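
A minimal sketch of the SMOG formula (function name hypothetical). Writing the scale factor as `30.0 / sentences` keeps the arithmetic in floating point.

```cpp
#include <cmath>

// SMOG grade from raw counts.
// Assumes sentences > 0; the caller must guard empty input.
double smog_grade(long complex_words, long sentences) {
  return 1.0430 * std::sqrt(complex_words * (30.0 / sentences)) + 3.1291;
}
```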

Assignment

You shall write a C++ program that calculates the readability of text on standard input using one of the tests described above. The test applied shall be decided based on the value of a command-line argument, as follows:

ari — computes the automated readability index
cli — computes the Coleman-Liau index
dcrs — computes the Dale-Chall readability score
fkgl — computes the Flesch-Kincaid grade level
gfi — computes the Gunning fog index
smog — computes the SMOG grade
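
One way to turn the argument into a choice of test is a small lookup like the sketch below. The `Test` enum and `parse_test` function are hypothetical names. What the program should do when the argument is missing or unrecognized is not specified above, so the sketch only reports `Unknown` and leaves that decision to the caller.

```cpp
#include <string>

enum class Test { Ari, Cli, Dcrs, Fkgl, Gfi, Smog, Unknown };

// Map the command-line argument (e.g. argv[1]) to a readability test.
// Matching is exact and case-sensitive, which is an assumption.
Test parse_test(const std::string& arg) {
  if (arg == "ari") return Test::Ari;
  if (arg == "cli") return Test::Cli;
  if (arg == "dcrs") return Test::Dcrs;
  if (arg == "fkgl") return Test::Fkgl;
  if (arg == "gfi") return Test::Gfi;
  if (arg == "smog") return Test::Smog;
  return Test::Unknown;
}
```

In `main`, the result could drive a `switch` that calls the matching formula once the counts have been gathered from standard input.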

The output from the program shall consist of one line on standard output containing the result of the test. No need to format the number in any particular way, but if you do, make sure to round to no fewer than 3 digits after the decimal point. (Full credit for each test requires accuracy to the thousandths, but partial credit is available with several thresholds of less precision.)

Here are the library headers I used in my solution:

#include <algorithm>
#include <cmath>
#include <fstream>
#include <iostream>
#include <string>
#include <tuple>
#include <unordered_map>
#include <unordered_set>
#include <vector>

And here are the members of the standard namespace I used in my solution:

std::cerr
std::cin
std::cout
std::fixed
std::getline
std::ifstream
std::max
std::ofstream
std::round
std::sqrt
std::string
std::transform
std::tuple
std::unordered_map
std::unordered_set
std::vector

Sample Executable

The executable cs19_readability is available on the server. It demonstrates the expected behavior of your program and includes an option to help you check your work. Your program (when compiled with -Ofast) will be expected to terminate in no more than three times the runtime of the sample executable, given the same input.

% cs19_readability smog </srv/datasets/dewey-moral-principles-education.txt 
14.566
 
% cs19_readability smog </srv/datasets/cat-in-the-hat.txt
3.647

Adding a -v flag as the second command-line argument will print the relevant counts in the text to stderr. You don't need to implement this in your program, but feel free to use it for testing purposes, e.g.:

% cs19_readability fkgl -v <<<"This is a *relatively* short text with nine words."
9 words 1 sentence 10 syllables
1.031
 
% cs19_readability gfi -v </srv/datasets/cat-in-the-hat.txt
1620 words 243 sentences 2 complex words
2.716

If the robot tells you to check the result of using a certain number of lines from a given file as a body of text, consider piping the output of the head utility into the sample executable and your program, e.g. to get the automated readability index of the first 500 lines of file /srv/datasets/shakespeare-macbeth.txt:

head -500 /srv/datasets/shakespeare-macbeth.txt | cs19_readability ari

Leaderboards

As submissions are received, these leaderboards will be updated with the fastest fully functional/near-perfect solutions for each readability test.

Submission

Submit your source-code file(s) via turnin. If you submit multiple source-code files, make sure there is only one main() function.

Feedback Robot

This project has a feedback robot that will run some tests on your submission and provide you with a feedback report via email within roughly one minute.

Please read the feedback carefully.

Due Date and Point Value

Due at 23:59:59 on the date listed on the syllabus.

Assignment 05 is worth 60 points, though 120 points (i.e., 60 points of extra credit) are possible before other deductions/credits.

Possible point values per category:
---------------------------------------
Automated readability index          20
Coleman-Liau index                   20
Dale-Chall readability score         20
Flesch-Kincaid grade level           20
Gunning fog index                    20
SMOG grade                           20

Possible deductions:
  Style and practices            10–20%
---------------------------------------
