====== cs19f22as05: Leveraging the STL to Estimate English Text Readability ======
===== Goals =====
* Practice taking advantage of various parts of the STL.
* Learn a bit about simple [[https://en.wikipedia.org/wiki/Readability|readability]] formulas.
{{:readability_image.jpg?nolink|Image displaying the word "Readability" with a magnifying glass in the foreground and nondescript text in the background.}}
-----
====== Prerequisites ======
This assignment requires familiarity with the [[:lecture materials/]] presented in class through [[:lecture materials/week 05]].
-----
====== Background ======
There are many ways to estimate the [[https://en.wikipedia.org/wiki/Readability|readability]] of text. Such estimates are often of interest to educators determining the appropriate age or grade level for books and other reading material.
===== Definitions =====
In this assignment you will implement several automated readability tests on English text. We will work with the following definitions:
* A **word** shall be any whitespace-delimited token that remains non-empty after removing all characters other than the following (case-insensitive) and stripping single-quotes and hyphens from both ends: abcdefghijklmnopqrstuvwxyz'-
* A **sentence** shall be denoted by any whitespace-delimited token that ends with any of the following, ignoring any single or double quotes at the end of the token: .?!
* A **character** shall be any character in a **word**, as defined above.
* The number of **syllables** in a word shall be:
  * The number of syllables indicated in file ''[[http://jeff.cis.cabrillo.edu/datasets/syllables.txt|/srv/datasets/syllables.txt]]'', if the file contains an entry for the word.
    * Some words have multiple syllable counts. In that case, opt for the highest count.
  * If the above file does not contain an entry for the word, estimate that each sequence of **five characters** in a word represents a syllable, and round to the nearest syllable count for the word (i.e., the word's length divided by five, rounded to the nearest integer).
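The syllable rule above could be sketched roughly as follows. The names ''record_count'' and ''syllable_count'' are hypothetical, and the sketch assumes the entries of ''syllables.txt'' have already been parsed into (word, count) pairs, since the file's format is not described here. It also assumes no minimum is applied to the fallback estimate, which the definition does not address.

```cpp
#include <algorithm>
#include <cmath>
#include <string>
#include <unordered_map>

// Hypothetical helper: remember the highest syllable count seen for a word,
// since some words are listed with more than one count.
void record_count(std::unordered_map<std::string, int>& known,
                  const std::string& word, int count) {
  int& slot = known[word];  // a new key starts at 0
  slot = std::max(slot, count);
}

// Hypothetical helper: use the listed count if there is one; otherwise
// estimate one syllable per five characters, rounded to the nearest integer.
// Assumption: no minimum is enforced, so a very short unlisted word yields 0.
int syllable_count(const std::string& word,
                   const std::unordered_map<std::string, int>& known) {
  const auto it = known.find(word);
  if (it != known.end()) return it->second;
  return static_cast<int>(std::lround(word.size() / 5.0));
}
```

Under this sketch an unlisted eight-character word is estimated at two syllables and an unlisted seven-character word at one. Comparing against the sample executable's ''-v'' output is the way to confirm how edge cases are actually counted.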
If you are unsure of how a segment of text should be classified with regard to the above definitions, consider running the sample executable with the ''-v'' flag as described [[#sample executable|below]].
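As a rough illustration of the **word** and **sentence** definitions above, here is a minimal sketch of classifying a single whitespace-delimited token. The function names ''clean_word'' and ''ends_sentence'' are hypothetical, and lowercasing the result is an assumption made for convenient lookups; the definition only says the comparison is case-insensitive.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: returns the cleaned form of a token, or an empty
// string if nothing remains (in which case the token is not a word).
std::string clean_word(const std::string& token) {
  std::string kept;
  for (unsigned char ch : token) {
    const char lower = static_cast<char>(std::tolower(ch));
    if ((lower >= 'a' && lower <= 'z') || lower == '\'' || lower == '-') {
      kept += lower;  // assumption: store words in lowercase
    }
  }
  // Strip single quotes and hyphens from both ends.
  const auto first = kept.find_first_not_of("'-");
  if (first == std::string::npos) return "";
  const auto last = kept.find_last_not_of("'-");
  return kept.substr(first, last - first + 1);
}

// Hypothetical helper: true if the token ends with . ? or ! once any
// trailing single or double quotes are ignored.
bool ends_sentence(const std::string& token) {
  const auto last = token.find_last_not_of("'\"");
  if (last == std::string::npos) return false;
  const char ch = token[last];
  return ch == '.' || ch == '?' || ch == '!';
}
```

With these helpers, the token ''*relatively*'' cleans to ''relatively'', while a token such as ''--'' or ''123'' cleans to an empty string and is not counted as a word.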
===== Readability Test Formulas =====
You will be implementing the following readability tests, which I have slightly simplified in some cases.
==== Automated Readability Index ====
The [[https://en.wikipedia.org/wiki/Automated_readability_index|automated readability index (ARI)]] produces an estimate of the US grade level needed to comprehend a text. Here is the formula:
$$
4.71 \left (\frac{\mbox{characters}}{\mbox{words}} \right) + 0.5 \left (\frac{\mbox{words}}{\mbox{sentences}} \right) - 21.43
$$
//characters// is the total number of characters in all words, //words// is the total number of words, and //sentences// is the total number of sentences.
==== Coleman-Liau Index ====
The [[https://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index|Coleman-Liau index]] produces an estimate of the US grade level needed to comprehend a text. Here is the formula:
$$
0.0588{L} - 0.296{S} - 15.8\,\!
$$
//L// is the average number of characters per 100 words and //S// is the average number of sentences per 100 words.
^ Score ^ Age ^ Grade Level ^
| 1 | 5-6 | Kindergarten |
| 2 | 6-7 | First Grade |
| 3 | 7-8 | Second Grade |
| 4 | 8-9 | Third Grade |
| 5 | 9-10 | Fourth Grade |
| 6 | 10-11 | Fifth Grade |
| 7 | 11-12 | Sixth Grade |
| 8 | 12-13 | Seventh Grade |
| 9 | 13-14 | Eighth Grade |
| 10 | 14-15 | Ninth Grade |
| 11 | 15-16 | Tenth Grade |
| 12 | 16-17 | Eleventh Grade |
| 13 | 17-18 | Twelfth Grade |
| 14 | 18-22 | College student |
==== Dale-Chall Readability Score ====
The [[https://en.wikipedia.org/wiki/Dale%E2%80%93Chall_readability_formula|Dale-Chall readability score]] produces a numeric measure of how difficult a text is to comprehend. Here is the formula:
$$
0.1579 \left (\frac{\mbox{difficult words}}{\mbox{words}}\times 100 \right) + 0.0496 \left (\frac{\mbox{words}}{\mbox{sentences}} \right)
$$
//difficult words// is the number of words that are not present in a list of 3000 "easy" words (available in file ''[[http://jeff.cis.cabrillo.edu/datasets/dale-chall_familiar_words.txt|/srv/datasets/dale-chall_familiar_words.txt]]''), //words// is the total number of words, and //sentences// is the total number of sentences.
^ Score ^ Notes ^
| 4.9 or lower | easily understood by an average 4th-grade student or lower |
| 5.0–5.9 | easily understood by an average 5th or 6th-grade student |
| 6.0–6.9 | easily understood by an average 7th or 8th-grade student |
| 7.0–7.9 | easily understood by an average 9th or 10th-grade student |
| 8.0–8.9 | easily understood by an average 11th or 12th-grade student |
| 9.0–9.9 | easily understood by an average 13th to 15th-grade (college) student |
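Counting difficult words is a natural fit for ''std::unordered_set''. The sketch below uses a hypothetical function name and assumes the familiar-word list has already been loaded into a set and that the words have been cleaned and lowercased to match it (an assumption about the list's casing, not something stated above). It implements the formula exactly as given on this page.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: Dale-Chall score for a list of already-cleaned words.
// A word is "difficult" if it is absent from the familiar-word set.
double dale_chall_score(const std::vector<std::string>& words,
                        const std::unordered_set<std::string>& familiar,
                        double sentences) {
  std::size_t difficult = 0;
  for (const auto& word : words) {
    if (familiar.count(word) == 0) ++difficult;
  }
  const double total = static_cast<double>(words.size());
  return 0.1579 * (difficult / total * 100.0) + 0.0496 * (total / sentences);
}
```

With four words in one sentence, one of which is not in the familiar set, the score is 0.1579 × 25 + 0.0496 × 4 = 4.1459.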
==== Flesch-Kincaid Grade Level ====
The [[https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests|Flesch-Kincaid grade level]] produces an estimate of the US grade level needed to comprehend a text. Here is the formula:
$$
0.39 \left ( \frac{\mbox{words}}{\mbox{sentences}} \right ) + 11.8 \left ( \frac{\mbox{syllables}}{\mbox{words}} \right ) - 15.59
$$
//words// is the total number of words, //sentences// is the total number of sentences, and //syllables// is the total number of syllables.
==== Gunning Fog Index ====
The [[https://en.wikipedia.org/wiki/Gunning_fog_index|Gunning fog index]] produces an estimate of the number of years of education needed to comprehend a text. Here is the formula:
$$
0.4\left[ \left(\frac{\mbox{words}}{\mbox{sentences}}\right) + 100\left(\frac{\mbox{complex words}}{\mbox{words}}\right) \right]
$$
//words// is the total number of words, //sentences// is the total number of sentences, and //complex words// is the total number of words consisting of three or more syllables.
^ Fog Index ^ Reading level by grade ^
| 17 | College graduate |
| 16 | College senior |
| 15 | College junior |
| 14 | College sophomore |
| 13 | College freshman |
| 12 | High school senior |
| 11 | High school junior |
| 10 | High school sophomore |
| 9 | High school freshman |
| 8 | Eighth grade |
| 7 | Seventh grade |
| 6 | Sixth grade |
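If the syllable count of each word is kept in a container, the complex-word count falls out of ''std::count_if''. This is a sketch under that assumption; the function name and the per-word vector are hypothetical, not a required design.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: Gunning fog index given one syllable count per word.
double gunning_fog_index(const std::vector<int>& syllables_per_word,
                         double sentences) {
  // A complex word has three or more syllables.
  const auto complex_words =
      std::count_if(syllables_per_word.begin(), syllables_per_word.end(),
                    [](int syllables) { return syllables >= 3; });
  const double words = static_cast<double>(syllables_per_word.size());
  return 0.4 * (words / sentences +
                100.0 * (static_cast<double>(complex_words) / words));
}
```

For nine words in one sentence, exactly one of which has three or more syllables, the result is 0.4 × (9 + 100/9) ≈ 8.044.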
==== SMOG Grade ====
The [[https://en.wikipedia.org/wiki/SMOG|SMOG ("Simple Measure of Gobbledygook") grade]] produces an estimate of the number of years of education needed to comprehend a text. Here is the formula:
$$
\mbox{grade} = 1.0430 \sqrt{\mbox{complex words}\times{30 \over \mbox{sentences}} } + 3.1291
$$
//sentences// is the total number of sentences, and //complex words// is the total number of words consisting of three or more syllables.
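A minimal sketch of the SMOG formula using ''std::sqrt'' follows. The function name is hypothetical, and input with zero sentences is not handled.

```cpp
#include <cmath>

// Hypothetical helper: SMOG grade from pre-computed counts.
double smog_grade(double complex_words, double sentences) {
  return 1.0430 * std::sqrt(complex_words * (30.0 / sentences)) + 3.1291;
}
```

For example, 8 complex words across 15 sentences gives 1.0430 × √16 + 3.1291 = 7.3011.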
-----
====== Assignment ======
You shall write a C++ program that calculates the readability of text on standard input using one of the tests described above. The test applied shall be decided based on the value of a command-line argument, as follows:
* ''ari'' — computes the [[#automated readability index]]
* ''cli'' — computes the [[#Coleman-Liau index]]
* ''dcrs'' — computes the [[#Dale-Chall readability score]]
* ''fkgl'' — computes the [[#Flesch-Kincaid grade level]]
* ''gfi'' — computes the [[#Gunning fog index]]
* ''smog'' — computes the [[#SMOG grade]]
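One possible way to map the command-line argument to a test is a small lookup table, sketched below. The ''Test'' enum and ''parse_test'' function are hypothetical names, ''std::optional'' requires C++17, and returning an empty result for an unrecognized argument is an assumption, since the expected behavior in that case is not specified here.

```cpp
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical enumeration of the six supported tests.
enum class Test { kAri, kCli, kDcrs, kFkgl, kGfi, kSmog };

// Hypothetical helper: translate a command-line argument into a Test.
// Assumption: unknown arguments yield std::nullopt for the caller to handle.
std::optional<Test> parse_test(const std::string& arg) {
  static const std::unordered_map<std::string, Test> kTests = {
      {"ari", Test::kAri},   {"cli", Test::kCli}, {"dcrs", Test::kDcrs},
      {"fkgl", Test::kFkgl}, {"gfi", Test::kGfi}, {"smog", Test::kSmog},
  };
  const auto it = kTests.find(arg);
  if (it == kTests.end()) return std::nullopt;
  return it->second;
}
```

In ''main()'', the result of ''parse_test(argv[1])'' could then drive a ''switch'' that calls the matching formula, after checking that ''argc'' is large enough.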
The output from the program shall consist of **one line on standard output** containing the result of the test. No need to format the number in any particular way, but if you do, make sure to round to no fewer than 3 digits after the decimal point. (Full credit for each test requires accuracy to the thousandths, but partial credit is available with several thresholds of less precision.)
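If you do choose to format the result, one option is fixed-point notation with three digits after the decimal point. The sketch below builds the string with ''std::ostringstream'' so it is easy to test. The function name is hypothetical, and ''std::setprecision'' is simply one way to set the precision, not necessarily what the sample solution uses.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: format a score with exactly three digits after the
// decimal point.
std::string format_score(double score) {
  std::ostringstream out;
  out << std::fixed << std::setprecision(3) << score;
  return out.str();
}
```

The same manipulators work directly on ''std::cout'', e.g. ''std::cout << std::fixed << std::setprecision(3) << score << '\n';''.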
Here are the library headers I used in my solution:
#include
#include
#include
#include
#include
#include
#include
#include
#include
And here are the members of the standard namespace I used in my solution:
std::cerr
std::cin
std::cout
std::fixed
std::getline
std::ifstream
std::max
std::ofstream
std::round
std::sqrt
std::string
std::transform
std::tuple
std::unordered_map
std::unordered_set
std::vector
===== Sample Executable =====
Executable ''cs19_readability'' exists on the server, and demonstrates the expected behavior of your program, along with an option to help you check your work. Your program (when compiled with ''-Ofast'') will be expected to terminate in no more than three times the runtime duration of the sample executable, given the same input.
% cs19_readability smog
Adding a ''-v'' flag as the second command-line argument will print the relevant counts in the text to stderr. You don't need to implement this in your program, but feel free to use it for testing purposes, e.g.:
% cs19_readability fkgl -v <<<"This is a *relatively* short text with nine words."
9 words 1 sentence 10 syllables
1.031
% cs19_readability gfi -v
If the robot tells you to check the result of using a certain number of lines from a given file as a body of text, consider piping the output of the ''head'' utility into the sample executable and your program, e.g. to get the automated readability index of the first 500 lines of file ''/srv/datasets/shakespeare-macbeth.txt'':
head -500 /srv/datasets/shakespeare-macbeth.txt | cs19_readability ari
===== Leaderboards =====
As submissions are received, these leaderboards will be updated with the top-performing fully functional/near-perfect solutions for each readability test, with regard to execution speed.
-----
====== Submission ======
Submit your source-code file(s) via [[info:turnin]]. If you submit multiple source-code files, make sure there is only one ''main()'' function.
{{https://jeff.cis.cabrillo.edu/images/feedback-robot.png?nolink }} //**Feedback Robot**//
This project has a feedback robot that will run some tests on your submission and provide you with a feedback report via email within roughly one minute.
Please read the feedback carefully.
====== Due Date and Point Value ======
Due at 23:59:59 on the date listed on the [[:syllabus|syllabus]].
''Assignment 05'' is worth 60 points, though **120 points (i.e., 60 points of extra credit)** are possible before other deductions/credits.
Possible point values per category:
^ Category ^ Points ^
| Automated readability index | 20 |
| Coleman-Liau index | 20 |
| Dale-Chall readability score | 20 |
| Flesch-Kincaid grade level | 20 |
| Gunning fog index | 20 |
| SMOG grade | 20 |
Possible deductions:
^ Deduction ^ Amount ^
| Style and practices | 10–20% |