This assignment requires familiarity with the lecture materials presented in class through week 05.
There are many ways to estimate the readability of a text, which is often of interest to educators determining the appropriate age or grade level for a book.
In this assignment you will implement several automated readability tests on English text. We will work with the following definitions:
abcdefghijklmnopqrstuvwxyz'-
.?!
/srv/datasets/syllables.txt, if the file contains an entry for the word.
If you are unsure how a segment of text should be classified under the above definitions, consider running the sample executable with the -v flag as described below.
You will be implementing the following readability tests, which I have slightly simplified in some cases.
The automated readability index (ARI) produces an estimate of the US grade level needed to comprehend a text. Here is the formula:
$$ 4.71 \left (\frac{\mbox{characters}}{\mbox{words}} \right) + 0.5 \left (\frac{\mbox{words}}{\mbox{sentences}} \right) - 21.43 $$
characters is the total number of characters in all words, words is the total number of words, and sentences is the total number of sentences.
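The ARI formula above could be sketched as a small function over the three counts. This is only an illustration, not the reference solution: the function name is hypothetical, and it assumes the counts have already been gathered according to the definitions in this assignment.

```cpp
#include <cassert>
#include <cstddef>

// Automated readability index from raw counts.
// The function name is hypothetical; the parameters mirror the formula's terms.
double automated_readability_index(std::size_t characters, std::size_t words,
                                   std::size_t sentences) {
  assert(words > 0 && sentences > 0);  // both ratios need a non-zero denominator
  // Cast before dividing so the ratios are not truncated by integer division.
  const double chars_per_word = static_cast<double>(characters) / words;
  const double words_per_sentence = static_cast<double>(words) / sentences;
  return 4.71 * chars_per_word + 0.5 * words_per_sentence - 21.43;
}
```

With made-up counts of 450 characters, 100 words, and 5 sentences, this evaluates to 4.71 × 4.5 + 0.5 × 20 − 21.43 = 9.765.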
The Coleman-Liau index produces an estimate of the US grade level needed to comprehend a text. Here is the formula:
$$ 0.0588 L - 0.296 S - 15.8 $$
L is the average number of characters per 100 words and S is the average number of sentences per 100 words.
Score | Age | Grade Level |
---|---|---|
1 | 5-6 | Kindergarten |
2 | 6-7 | First Grade |
3 | 7-8 | Second Grade |
4 | 8-9 | Third Grade |
5 | 9-10 | Fourth Grade |
6 | 10-11 | Fifth Grade |
7 | 11-12 | Sixth Grade |
8 | 12-13 | Seventh Grade |
9 | 13-14 | Eighth Grade |
10 | 14-15 | Ninth Grade |
11 | 15-16 | Tenth Grade |
12 | 16-17 | Eleventh Grade |
13 | 17-18 | Twelfth Grade |
14 | 18-22 | College student |
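The Coleman-Liau formula above could be sketched as follows. The function name is hypothetical, and the sketch assumes that L and S are obtained by scaling the per-word averages by 100, as the definitions of L and S suggest.

```cpp
#include <cassert>
#include <cstddef>

// Coleman-Liau index from raw counts (hypothetical function name).
double coleman_liau_index(std::size_t characters, std::size_t words,
                          std::size_t sentences) {
  assert(words > 0);  // both averages are taken per 100 words
  // L: average number of characters per 100 words.
  const double L = static_cast<double>(characters) / words * 100.0;
  // S: average number of sentences per 100 words.
  const double S = static_cast<double>(sentences) / words * 100.0;
  return 0.0588 * L - 0.296 * S - 15.8;
}
```

With made-up counts of 450 characters, 100 words, and 5 sentences, L is 450 and S is 5, so the result is 26.46 − 1.48 − 15.8 = 9.18.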
The Dale-Chall readability score produces a numeric measure of how difficult a text is to comprehend. Here is the formula:
$$ 0.1579 \left (\frac{\mbox{difficult words}}{\mbox{words}}\times 100 \right) + 0.0496 \left (\frac{\mbox{words}}{\mbox{sentences}} \right) $$
difficult words is the number of words that are not present in a list of 3000 “easy” words (available in file /srv/datasets/dale-chall_familiar_words.txt
), words is the total number of words, and sentences is the total number of sentences.
Score | Notes |
---|---|
4.9 or lower | easily understood by an average 4th-grade student or lower |
5.0–5.9 | easily understood by an average 5th or 6th-grade student |
6.0–6.9 | easily understood by an average 7th or 8th-grade student |
7.0–7.9 | easily understood by an average 9th or 10th-grade student |
8.0–8.9 | easily understood by an average 11th or 12th-grade student |
9.0–9.9 | easily understood by an average 13th to 15th-grade (college) student |
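The Dale-Chall computation above could be sketched in two steps: count the words missing from the familiar-word list, then apply the formula. Both function names are hypothetical. The sketch assumes the words have already been extracted and normalized the same way as the entries in the list (for example, the same letter case), which this document does not spell out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Counts the words that do not appear in the familiar-word set.
// Assumes `words` and `familiar` use the same normalization (an assumption).
std::size_t count_difficult_words(const std::vector<std::string>& words,
                                  const std::unordered_set<std::string>& familiar) {
  std::size_t difficult = 0;
  for (const std::string& word : words) {
    if (familiar.count(word) == 0) ++difficult;
  }
  return difficult;
}

// Dale-Chall readability score as given in this assignment (hypothetical name).
double dale_chall_score(std::size_t difficult_words, std::size_t words,
                        std::size_t sentences) {
  assert(words > 0 && sentences > 0);  // both ratios need a non-zero denominator
  const double percent_difficult =
      static_cast<double>(difficult_words) / words * 100.0;
  const double words_per_sentence = static_cast<double>(words) / sentences;
  return 0.1579 * percent_difficult + 0.0496 * words_per_sentence;
}
```

With made-up counts of 10 difficult words, 100 words, and 5 sentences, the score is 0.1579 × 10 + 0.0496 × 20 = 2.571. An unordered set keeps each membership check at roughly constant time, which matters when every word of the input is looked up.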
The Flesch-Kincaid grade level produces an estimate of the US grade level needed to comprehend a text. Here is the formula:
$$ 0.39 \left ( \frac{\mbox{words}}{\mbox{sentences}} \right ) + 11.8 \left ( \frac{\mbox{syllables}}{\mbox{words}} \right ) - 15.59 $$
words is the total number of words, sentences is the total number of sentences, and syllables is the total number of syllables.
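The Flesch-Kincaid formula above could be sketched as a function over the three counts. The function name is hypothetical, and the counts are assumed to come from the definitions in this assignment.

```cpp
#include <cassert>
#include <cstddef>

// Flesch-Kincaid grade level from raw counts (hypothetical function name).
double flesch_kincaid_grade_level(std::size_t words, std::size_t sentences,
                                  std::size_t syllables) {
  assert(words > 0 && sentences > 0);  // both ratios need a non-zero denominator
  const double words_per_sentence = static_cast<double>(words) / sentences;
  const double syllables_per_word = static_cast<double>(syllables) / words;
  return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59;
}
```

For 9 words, 1 sentence, and 10 syllables, this gives 0.39 × 9 + 11.8 × (10 / 9) − 15.59 ≈ 1.031.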
The Gunning fog index produces an estimate of the number of years of education needed to comprehend a text. Here is the formula:
$$ 0.4\left[ \left(\frac{\mbox{words}}{\mbox{sentences}}\right) + 100\left(\frac{\mbox{complex words}}{\mbox{words}}\right) \right] $$
words is the total number of words, sentences is the total number of sentences, and complex words is the total number of words consisting of three or more syllables.
Fog Index | Reading level by grade |
---|---|
17 | College graduate |
16 | College senior |
15 | College junior |
14 | College sophomore |
13 | College freshman |
12 | High school senior |
11 | High school junior |
10 | High school sophomore |
9 | High school freshman |
8 | Eighth grade |
7 | Seventh grade |
6 | Sixth grade |
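The Gunning fog formula above could be sketched as follows. The function name is hypothetical; identifying which words count as complex (three or more syllables) is assumed to happen elsewhere.

```cpp
#include <cassert>
#include <cstddef>

// Gunning fog index from raw counts (hypothetical function name).
double gunning_fog_index(std::size_t words, std::size_t sentences,
                         std::size_t complex_words) {
  assert(words > 0 && sentences > 0);  // both ratios need a non-zero denominator
  const double words_per_sentence = static_cast<double>(words) / sentences;
  const double percent_complex =
      100.0 * static_cast<double>(complex_words) / words;
  return 0.4 * (words_per_sentence + percent_complex);
}
```

For 1620 words, 243 sentences, and 2 complex words, this gives 0.4 × (6.667 + 0.123) ≈ 2.716.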
The SMOG ("Simple Measure of Gobbledygook") grade produces an estimate of the number of years of education needed to comprehend a text. Here is the formula:
$$ \mbox{grade} = 1.0430 \sqrt{\mbox{complex words}\times{30 \over \mbox{sentences}} } + 3.1291 $$
sentences is the total number of sentences, and complex words is the total number of words consisting of three or more syllables.
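The SMOG formula above could be sketched as follows. The function name is hypothetical, and the counts are assumed to be gathered as defined in this assignment.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// SMOG grade from raw counts (hypothetical function name).
double smog_grade(std::size_t complex_words, std::size_t sentences) {
  assert(sentences > 0);  // the complex-word count is scaled by 30 / sentences
  const double scaled =
      static_cast<double>(complex_words) * 30.0 / sentences;
  return 1.0430 * std::sqrt(scaled) + 3.1291;
}
```

For 2 complex words in 243 sentences, this gives 1.0430 × √(60 / 243) + 3.1291 ≈ 3.647.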
You shall write a C++ program that calculates the readability of text on standard input using one of the tests described above. The test applied shall be decided based on the value of a command-line argument, as follows:
ari: computes the automated readability index
cli: computes the Coleman-Liau index
dcrs: computes the Dale-Chall readability score
fkgl: computes the Flesch-Kincaid grade level
gfi: computes the Gunning fog index
smog: computes the SMOG grade
The output from the program shall consist of one line on standard output containing the result of the test. No need to format the number in any particular way, but if you do, make sure to round to no fewer than 3 digits after the decimal point. (Full credit for each test requires accuracy to the thousandths, but partial credit is available with several thresholds of less precision.)
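One way to handle the command-line argument and the output line could be sketched as below. Everything named here (the enum, `parse_test`, `format_result`) is hypothetical, and the use of `<iomanip>` and `<sstream>` for rounding to three decimals is simply one choice among several, not something this assignment prescribes.

```cpp
#include <cassert>
#include <cmath>
#include <iomanip>
#include <sstream>
#include <string>

// The six tests selectable on the command line, plus a catch-all.
enum class ReadabilityTest { kAri, kCli, kDcrs, kFkgl, kGfi, kSmog, kUnknown };

// Maps the command-line argument to a test (hypothetical helper).
ReadabilityTest parse_test(const std::string& name) {
  if (name == "ari") return ReadabilityTest::kAri;
  if (name == "cli") return ReadabilityTest::kCli;
  if (name == "dcrs") return ReadabilityTest::kDcrs;
  if (name == "fkgl") return ReadabilityTest::kFkgl;
  if (name == "gfi") return ReadabilityTest::kGfi;
  if (name == "smog") return ReadabilityTest::kSmog;
  return ReadabilityTest::kUnknown;
}

// Formats a result with exactly three digits after the decimal point.
std::string format_result(double value) {
  assert(std::isfinite(value));  // a zero count upstream would yield inf or NaN
  std::ostringstream out;
  out << std::fixed << std::setprecision(3) << value;
  return out.str();
}
```

A `main` function could then call `parse_test(argv[1])` after checking `argc`, report an unknown name on `std::cerr`, and print `format_result(score)` followed by a newline on `std::cout`.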
Here are the library headers I used in my solution:
#include <algorithm>
#include <cmath>
#include <fstream>
#include <iostream>
#include <string>
#include <tuple>
#include <unordered_map>
#include <unordered_set>
#include <vector>
And here are the members of the standard namespace I used in my solution:
std::cerr std::cin std::cout std::fixed std::getline std::ifstream std::max std::ofstream std::round std::sqrt std::string std::transform std::tuple std::unordered_map std::unordered_set std::vector
Executable cs19_readability exists on the server and demonstrates the expected behavior of your program, along with an option to help you check your work. Your program (when compiled with -Ofast) will be expected to terminate in no more than three times the runtime duration of the sample executable, given the same input.
% cs19_readability smog </srv/datasets/dewey-moral-principles-education.txt
14.566
% cs19_readability smog </srv/datasets/cat-in-the-hat.txt
3.647
Adding a -v flag as the second command-line argument will print the relevant counts in the text to stderr. You don't need to implement this in your program, but feel free to use it for testing purposes, e.g.:
% cs19_readability fkgl -v <<<"This is a *relatively* short text with nine words."
9 words
1 sentence
10 syllables
1.031
% cs19_readability gfi -v </srv/datasets/cat-in-the-hat.txt
1620 words
243 sentences
2 complex words
2.716
If the robot tells you to check the result of using a certain number of lines from a given file as a body of text, consider piping the output of the head utility into the sample executable and your program, e.g. to get the automated readability index of the first 500 lines of file /srv/datasets/shakespeare-macbeth.txt:
head -500 /srv/datasets/shakespeare-macbeth.txt | cs19_readability ari
As submissions are received, these leaderboards will be updated with the top-performing fully functional/near-perfect solutions for each readability test, with regard to execution speed.
Submit your source-code file(s) via turnin. If you submit multiple source-code files, make sure there is only one main() function.
Feedback Robot
This project has a feedback robot that will run some tests on your submission and provide you with a feedback report via email within roughly one minute.
Please read the feedback carefully.
Due at 23:59:59 on the date listed on the syllabus.
Assignment 05 is worth 60 points, though 120 points (i.e., 60 points of extra credit) are possible before other deductions/credits.
Possible point values per category:
---------------------------------------
Automated readability index          20
Coleman-Liau index                   20
Dale-Chall readability score         20
Flesch-Kincaid grade level           20
Gunning fog index                    20
SMOG grade                           20
Possible deductions:
Style and practices              10–20%
---------------------------------------