cs19f22as05: Leveraging the STL to Estimate English Text Readability
Goals
- Practice taking advantage of various parts of the STL.
- Learn a bit about simple readability formulas.
Prerequisites
This assignment requires familiarity with the lecture materials presented in class through week 05.
Background
There are many ways to estimate the readability of text. Such estimates are often of interest to educators when determining the appropriate age or grade level for a book.
Definitions
In this assignment you will implement several automated readability tests on English text. We will work with the following definitions:
- A word shall be any whitespace-delimited token that remains non-empty after removing all characters other than the following (case-insensitive) and stripping single-quotes and hyphens from both ends:
abcdefghijklmnopqrstuvwxyz'-
- A sentence shall be denoted by any whitespace-delimited token that ends with any of the following, ignoring any single or double quotes at the end of the token:
.?!
- A character shall be any character in a word, as defined above.
- The number of syllables in a word shall be:
  - The number of syllables indicated in file /srv/datasets/syllables.txt, if the file contains an entry for the word.
    - Some words have multiple syllable counts. In that case, opt for the highest count.
  - If the above file does not contain an entry for the word, estimate that each sequence of five characters in a word represents a syllable, and round to the nearest syllable count for the word.
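The syllable rule above can be sketched as a table lookup with a fallback. This is a minimal sketch assuming C++17; `SyllableTable`, `record_syllables`, and `syllable_count` are hypothetical names, and how the table is filled from /srv/datasets/syllables.txt is left out because the file's format is not described here.

```cpp
#include <algorithm>
#include <cmath>
#include <string>
#include <unordered_map>

// Hypothetical container: word -> syllable count.
using SyllableTable = std::unordered_map<std::string, int>;

// Records one (word, count) entry, keeping the highest count seen so far.
void record_syllables(SyllableTable& table, const std::string& word, int count) {
  auto [it, inserted] = table.try_emplace(word, count);
  if (!inserted) it->second = std::max(it->second, count);
}

// Uses the table entry if there is one; otherwise estimates one syllable per
// five characters, rounded to the nearest whole number.
int syllable_count(const SyllableTable& table, const std::string& word) {
  const auto it = table.find(word);
  if (it != table.end()) return it->second;
  return static_cast<int>(std::round(word.size() / 5.0));
}
```

Read literally, the fallback rounds very short words down to zero syllables. The definitions do not say whether a minimum applies, so it may be worth comparing against the sample executable's -v output.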
If you are unsure of how a segment of text should be classified with regard to the above definitions, consider running the sample executable with the -v flag as described below.
Readability Test Formulas
You will be implementing the following readability tests, which I have slightly simplified in some cases.
Automated Readability Index
The automated readability index (ARI) produces an estimate of the US grade level needed to comprehend a text. Here is the formula:
$$ 4.71 \left (\frac{\mbox{characters}}{\mbox{words}} \right) + 0.5 \left (\frac{\mbox{words}}{\mbox{sentences}} \right) - 21.43 $$
characters is the total number of characters in all words, words is the total number of words, and sentences is the total number of sentences.
Coleman-Liau Index
The Coleman-Liau index produces an estimate of the US grade level needed to comprehend a text. Here is the formula:
$$ 0.0588{L} - 0.296{S} - 15.8\,\! $$
L is the average number of characters per 100 words and S is the average number of sentences per 100 words.
Score | Age | Grade Level |
---|---|---|
1 | 5-6 | Kindergarten |
2 | 6-7 | First Grade |
3 | 7-8 | Second Grade |
4 | 8-9 | Third Grade |
5 | 9-10 | Fourth Grade |
6 | 10-11 | Fifth Grade |
7 | 11-12 | Sixth Grade |
8 | 12-13 | Seventh Grade |
9 | 13-14 | Eighth Grade |
10 | 14-15 | Ninth Grade |
11 | 15-16 | Tenth Grade |
12 | 16-17 | Eleventh Grade |
13 | 17-18 | Twelfth Grade |
14 | 18-22 | College student |
Dale-Chall Readability Score
The Dale-Chall readability score produces a numeric measure of how difficult a text is to comprehend. Here is the formula:
$$ 0.1579 \left (\frac{\mbox{difficult words}}{\mbox{words}}\times 100 \right) + 0.0496 \left (\frac{\mbox{words}}{\mbox{sentences}} \right) $$
difficult words is the number of words that are not present in a list of 3000 “easy” words (available in file /srv/datasets/dale-chall_familiar_words.txt), words is the total number of words, and sentences is the total number of sentences.
Score | Notes |
---|---|
4.9 or lower | easily understood by an average 4th-grade student or lower |
5.0–5.9 | easily understood by an average 5th or 6th-grade student |
6.0–6.9 | easily understood by an average 7th or 8th-grade student |
7.0–7.9 | easily understood by an average 9th or 10th-grade student |
8.0–8.9 | easily understood by an average 11th or 12th-grade student |
9.0–9.9 | easily understood by an average 13th to 15th-grade (college) student |
Flesch-Kincaid Grade Level
The Flesch-Kincaid grade level produces an estimate of the US grade level needed to comprehend a text. Here is the formula:
$$ 0.39 \left ( \frac{\mbox{words}}{\mbox{sentences}} \right ) + 11.8 \left ( \frac{\mbox{syllables}}{\mbox{words}} \right ) - 15.59 $$
words is the total number of words, sentences is the total number of sentences, and syllables is the total number of syllables.
Gunning Fog Index
The Gunning fog index produces an estimate of the number of years of education needed to comprehend a text. Here is the formula:
$$ 0.4\left[ \left(\frac{\mbox{words}}{\mbox{sentences}}\right) + 100\left(\frac{\mbox{complex words}}{\mbox{words}}\right) \right] $$
words is the total number of words, sentences is the total number of sentences, and complex words is the total number of words consisting of three or more syllables.
Fog Index | Reading level by grade |
---|---|
17 | College graduate |
16 | College senior |
15 | College junior |
14 | College sophomore |
13 | College freshman |
12 | High school senior |
11 | High school junior |
10 | High school sophomore |
9 | High school freshman |
8 | Eighth grade |
7 | Seventh grade |
6 | Sixth grade |
SMOG Grade
The SMOG ("Simple Measure of Gobbledygook") grade produces an estimate of the number of years of education needed to comprehend a text. Here is the formula:
$$ \mbox{grade} = 1.0430 \sqrt{\mbox{complex words}\times{30 \over \mbox{sentences}} } + 3.1291 $$
sentences is the total number of sentences, and complex words is the total number of words consisting of three or more syllables.
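A sketch of the SMOG formula using `std::sqrt`; the function name `smog_grade` is hypothetical.

```cpp
#include <cmath>

// SMOG grade from raw counts. Assumes sentences > 0.
double smog_grade(double complex_words, double sentences) {
  return 1.0430 * std::sqrt(complex_words * (30.0 / sentences)) + 3.1291;
}
```

Using the counts shown for cat-in-the-hat.txt in the sample session later in this document (243 sentences, 2 complex words), this gives about 3.6474, consistent with the sample output of 3.647 for the same file.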
Assignment
You shall write a C++ program that calculates the readability of text on standard input using one of the tests described above. The test applied shall be decided based on the value of a command-line argument, as follows:
- ari: computes the automated readability index
- cli: computes the Coleman-Liau index
- dcrs: computes the Dale-Chall readability score
- fkgl: computes the Flesch-Kincaid grade level
- gfi: computes the Gunning fog index
- smog: computes the SMOG grade
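One way to turn the command-line argument into a choice of test is a small lookup table. This is only a sketch assuming C++17; the `Test` enum and `parse_test` are hypothetical, and what the program should do with an unrecognized argument is not specified here.

```cpp
#include <optional>
#include <string>
#include <unordered_map>

enum class Test { ari, cli, dcrs, fkgl, gfi, smog };

// Maps a command-line argument to a test, or std::nullopt if unrecognized.
std::optional<Test> parse_test(const std::string& arg) {
  static const std::unordered_map<std::string, Test> names = {
      {"ari", Test::ari},   {"cli", Test::cli}, {"dcrs", Test::dcrs},
      {"fkgl", Test::fkgl}, {"gfi", Test::gfi}, {"smog", Test::smog}};
  const auto it = names.find(arg);
  if (it == names.end()) return std::nullopt;
  return it->second;
}
```

In `main`, the result of `parse_test(argv[1])` (after checking `argc`) could drive a `switch` that only gathers the counts the chosen test needs, which may help with the runtime requirement.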
The output from the program shall consist of one line on standard output containing the result of the test. No need to format the number in any particular way, but if you do, make sure to round to no fewer than 3 digits after the decimal point. (Full credit for each test requires accuracy to the thousandths, but partial credit is available with several thresholds of less precision.)
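If you do choose to format the result, fixed notation with three decimal places satisfies the rounding requirement above. The sketch below uses `std::setprecision` from `<iomanip>`, which is not among the headers listed for the reference solution, so treat it as one option rather than the intended approach; `format_result` is a hypothetical helper.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Formats a value in fixed notation with three digits after the decimal point.
std::string format_result(double value) {
  std::ostringstream out;
  out << std::fixed << std::setprecision(3) << value;
  return out.str();
}
```

The same manipulators can be applied to `std::cout` directly, e.g. `std::cout << std::fixed << std::setprecision(3) << result << '\n';`.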
Here are the library headers I used in my solution:
#include <algorithm>
#include <cmath>
#include <fstream>
#include <iostream>
#include <string>
#include <tuple>
#include <unordered_map>
#include <unordered_set>
#include <vector>
And here are the members of the standard namespace I used in my solution:
std::cerr
std::cin
std::cout
std::fixed
std::getline
std::ifstream
std::max
std::ofstream
std::round
std::sqrt
std::string
std::transform
std::tuple
std::unordered_map
std::unordered_set
std::vector
Sample Executable
Executable cs19_readability exists on the server and demonstrates the expected behavior of your program, along with an option to help you check your work. Your program (when compiled with -Ofast) will be expected to terminate in no more than three times the runtime duration of the sample executable, given the same input.
% cs19_readability smog </srv/datasets/dewey-moral-principles-education.txt
14.566
% cs19_readability smog </srv/datasets/cat-in-the-hat.txt
3.647
Adding a -v flag as the second command-line argument will print the relevant counts in the text to stderr. You don't need to implement this in your program, but feel free to use it for testing purposes, e.g.:
% cs19_readability fkgl -v <<<"This is a *relatively* short text with nine words."
9 words 1 sentence 10 syllables
1.031
% cs19_readability gfi -v </srv/datasets/cat-in-the-hat.txt
1620 words 243 sentences 2 complex words
2.716
If the robot tells you to check the result of using a certain number of lines from a given file as a body of text, consider piping the output of the head utility into the sample executable and your program, e.g. to get the automated readability index of the first 500 lines of file /srv/datasets/shakespeare-macbeth.txt:
head -500 /srv/datasets/shakespeare-macbeth.txt | cs19_readability ari
Leaderboards
As submissions are received, these leaderboards will be updated with the top-performing fully functional/near-perfect solutions for each readability test, with regard to execution speed.
Submission
Submit your source-code file(s) via turnin. If you submit multiple source-code files, make sure there is only one main() function.
Feedback Robot
This project has a feedback robot that will run some tests on your submission and provide you with a feedback report via email within roughly one minute.
Please read the feedback carefully.
Due Date and Point Value
Due at 23:59:59 on the date listed on the syllabus.
Assignment 05 is worth 60 points, though 120 points (i.e., 60 points of extra credit) are possible before other deductions/credits.
Possible point values per category:
---------------------------------------
Automated readability index          20
Coleman-Liau index                   20
Dale-Chall readability score         20
Flesch-Kincaid grade level           20
Gunning fog index                    20
SMOG grade                           20
Possible deductions:
Style and practices              10–20%
---------------------------------------