====== cs19f22as07: Protein Bioinformatics Lite ======
===== Goals =====
* Develop a class that represents some basic properties of the primary structure of proteins.
* Practice defining types that provide an idiomatic C++ interface.
-----
====== Prerequisites ======
This assignment requires familiarity with the [[:lecture materials/]] presented in class through [[:lecture materials/week 07]].
-----
====== Background ======
[[https://en.wikipedia.org/wiki/Protein|Proteins]] constitute a class of molecules that are the primary drivers of biochemical activity in living organisms on Earth. Every known cellular life form stores its genetic information in DNA (inside the nucleus, in eukaryotic cells), and through the process of [[https://en.wikipedia.org/wiki/Protein_biosynthesis|protein biosynthesis]], cells [[https://en.wikipedia.org/wiki/Transcription_(biology)|transcribe]] DNA into RNA, then [[https://en.wikipedia.org/wiki/Translation_(biology)|translate]] that RNA into protein products.
{{:447px-protein_primary_structure.svg.png?nolink|Diagram of the primary structure of a protein, showing a chain of amino acids}}
Protein biosynthesis involves assembling a chain of [[https://en.wikipedia.org/wiki/Amino_acid|amino acids]], each of which is coded for by a sequence of three RNA nucleotides. The resulting molecule is a form of [[https://en.wikipedia.org/wiki/Peptide|polypeptide]] known as a protein. We refer to this sequence of amino acids as the **[[https://en.wikipedia.org/wiki/Protein_primary_structure|primary structure]]** of a protein. The [[https://en.wikipedia.org/wiki/Protein_secondary_structure|secondary]], [[https://en.wikipedia.org/wiki/Protein_tertiary_structure|tertiary]] and [[https://en.wikipedia.org/wiki/Protein_quaternary_structure|quaternary]] structures refer to the complex shapes formed as hydrophobic interactions, intramolecular hydrogen bonds and van der Waals forces cause the primary structure of a protein to [[https://en.wikipedia.org/wiki/Protein_folding|fold]] around itself and other proteins, giving the protein its essential biological functionality.
{{:levels-of-protein-structure-1.jpg?nolink|Diagram of primary, secondary, tertiary and quaternary protein structures}}
There are 20 proteinogenic amino acids in the standard genetic code. We usually use single-letter abbreviations to refer to them (''A'' for alanine, ''C'' for cysteine, etc.), constituting an alphabet for describing protein structure:
A C D E F G H I K L M N P Q R S T V W Y
File ''/srv/datasets/amino-monoisotopic-mass'' (also available [[http://jeff.cis.cabrillo.edu/datasets/amino-monoisotopic-mass|via HTTP]]) contains a simple table of the [[https://en.wikipedia.org/wiki/Monoisotopic_mass|monoisotopic masses]] (measured in [[https://en.wikipedia.org/wiki/Dalton_(unit)|daltons]]) of each of the 20 proteinogenic amino acids, along with the letters that we usually use to refer to them. Here are the first five lines of the file:
A 71.03711
C 103.00919
D 115.02694
E 129.04259
F 147.06841
As in [[https://rosalind.info/problems/prtm/|Rosalind problem PRTM]], we can use this data to compute the weight/mass of a protein, given its primary structure: the mass is the sum of the monoisotopic masses of the amino acids in the sequence. This information can be useful for the purposes of [[https://en.wikipedia.org/wiki/Protein_mass_spectrometry|protein mass spectrometry]], e.g. in identifying and categorizing disease-related proteins such as [[https://en.wikipedia.org/wiki/Coronavirus_spike_protein|those produced by viruses like SARS-CoV-2]]. Such work underpins the development of [[https://en.wikipedia.org/wiki/MRNA_vaccine|mRNA vaccines]] such as those that were quickly developed in response to COVID-19.
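To make the mass computation concrete, here is a minimal sketch of one way to read the table format shown above and sum the masses over a sequence. The function names ''read_mass_table'', ''monoisotopic_mass'' and ''prtm_example_mass'' are hypothetical and are not part of the assignment's specification; the ''K'', ''S'' and ''Y'' masses in the inline excerpt are taken from the Rosalind monoisotopic mass table rather than from the five lines quoted above.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: parse "letter mass" pairs (one per line) into a lookup table.
std::map<char, double> read_mass_table(std::istream &in) {
  std::map<char, double> table;
  char letter;
  double mass;
  while (in >> letter >> mass) table[letter] = mass;
  return table;
}

// Hypothetical helper: sum the residue masses over a primary structure.
// Throws std::domain_error if a letter is missing from the table.
double monoisotopic_mass(const std::string &residues, const std::map<char, double> &table) {
  double total = 0.0;
  for (char residue : residues) {
    auto found = table.find(residue);
    if (found == table.end())
      throw std::domain_error(std::string("Unknown amino acid: ") + residue);
    total += found->second;
  }
  return total;
}

// Usage sketch: the Rosalind PRTM sample protein, with an inline excerpt of the
// table standing in for the real file (an assumption made to keep this self-contained).
double prtm_example_mass() {
  std::istringstream excerpt(
      "A 71.03711\nD 115.02694\nE 129.04259\n"
      "K 128.09496\nS 87.03203\nY 163.06333\n");
  const std::map<char, double> table = read_mass_table(excerpt);
  assert(table.size() == 6);
  return monoisotopic_mass("SKADYEK", table);
}
```

In a real program the table would presumably be read once from ''/srv/datasets/amino-monoisotopic-mass'' via a ''std::ifstream'' and then reused, rather than being reopened for every computation.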
-----
====== Assignment ======
You shall define a class named ''Protein'' in the ''cs19'' namespace, instances of which represent the primary structure of a protein using the amino-acid alphabet. For simplicity's sake, define the class entirely within file ''cs19_protein.h'' (i.e. no separation of interface from implementation this time).
''cs19::Protein'' provides an idiomatic C++ interface, overloading various operators for the purposes of inspecting and modifying a ''cs19::Protein'' object, as well as computing its expected mass for the purposes of spectrometry.
Design your class to match [[https://jeff.cis.cabrillo.edu/datasets/docs_cs19_protein/classcs19_1_1Protein.html|this specification]] according to the description of each member function. You must implement at least [[https://en.wikipedia.org/wiki/Method_stub|stub]] versions of **all** of the specified constructors and member functions in order to receive a grade.
You will save yourself a substantial amount of work if you start with a copy of the ''Dna'' class from [[:lecture materials/week 07]] and modify it to represent proteins instead of DNA sequences. Much of the code could conceivably be quite similar.
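As a rough illustration of what an operator-based interface can look like, here is a minimal sketch of a sequence type that validates its letters against the amino-acid alphabet. The ''sketch::Peptide'' class, its members and the ''round_trip'' function are hypothetical and deliberately do **not** reproduce the required ''cs19::Protein'' interface; consult the linked specification for the actual constructors and member functions.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>

namespace sketch {

// Hypothetical type for illustration only; not the assignment's cs19::Protein.
class Peptide {
 public:
  Peptide() = default;
  // Throws std::domain_error if any character is outside the amino-acid alphabet.
  explicit Peptide(const std::string &residues) : residues_(validated(residues)) {}

  std::size_t size() const { return residues_.size(); }
  // Bounds-checked element access (std::string::at throws std::out_of_range).
  char operator[](std::size_t index) const { return residues_.at(index); }
  bool operator==(const Peptide &that) const { return residues_ == that.residues_; }
  bool operator!=(const Peptide &that) const { return !(*this == that); }
  // Append another sequence to the end of this one.
  Peptide &operator+=(const Peptide &that) {
    residues_ += that.residues_;
    return *this;
  }

  friend std::ostream &operator<<(std::ostream &out, const Peptide &peptide) {
    return out << peptide.residues_;
  }
  // Reads one whitespace-delimited token; an invalid token throws std::domain_error.
  friend std::istream &operator>>(std::istream &in, Peptide &peptide) {
    std::string token;
    if (in >> token) peptide = Peptide(token);
    return in;
  }

 private:
  static const std::string &validated(const std::string &residues) {
    static const std::string alphabet = "ACDEFGHIKLMNPQRSTVWY";
    for (char residue : residues)
      if (alphabet.find(residue) == std::string::npos)
        throw std::domain_error(std::string("Invalid amino acid: ") + residue);
    return residues;
  }

  std::string residues_;
};

}  // namespace sketch

// Usage sketch: parse a descriptor from text, then format it back out.
// Assumes the text contains at least one token.
std::string round_trip(const std::string &text) {
  std::istringstream in(text);
  sketch::Peptide peptide;
  in >> peptide;
  assert(!in.fail());
  std::ostringstream out;
  out << peptide;
  return out.str();
}
```

Validating in one place (here the private ''validated'' helper) means that the constructor and ''operator>>'' reject bad input in the same way, which is one possible way to keep the class invariant simple.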
===== Testing =====
Here is a small amount of code that uses assertions to test the constructors and the ''mass()'' function, then reads strings from stdin, assumes those strings are protein descriptors, and prints a couple of facts about each protein. I've also created a sample executable named ''cs19_protein'' that is a working version of this program, if you'd like to consult it for testing purposes.
#include <cassert>
#include <cmath>
#include <initializer_list>
#include <iostream>
#include <stdexcept>
#include <string>
#include "cs19_protein.h"

int main(int argc, char **argv) {
  // from Rosalind PRTM: https://rosalind.info/problems/prtm/
  const std::initializer_list<char> prtm_prot{'S', 'K', 'A', 'D', 'Y', 'E', 'K'};
  constexpr double prtm_mass = 821.392;
  // Every constructor should yield a protein with the same expected mass
  assert(std::abs(cs19::Protein(prtm_prot).mass() - prtm_mass) < .001);
  assert(std::abs(cs19::Protein(std::string(prtm_prot)).mass() - prtm_mass) < .001);
  assert(std::abs(cs19::Protein(std::string(prtm_prot).c_str()).mass() - prtm_mass) < .001);
  assert(std::abs(cs19::Protein(prtm_prot.begin(), prtm_prot.end()).mass() - prtm_mass) < .001);
  // Read protein descriptors from stdin until input is exhausted
  cs19::Protein test;
  while (std::cin) {
    try {
      if (std::cin >> test)
        std::cout << test << ' ' << test.size() << ' ' << test.mass() << '\n';
    } catch (const std::domain_error &error) {
      // Report an invalid descriptor, then continue with the next one
      std::cerr << error.what() << '\n';
    }
  }
}
For example, try running the sample executable and giving it some text on stdin or the name of a file, such as ''[[http://jeff.cis.cabrillo.edu/datasets/ebola_orf_products|/srv/datasets/ebola_orf_products]]'', which contains 220 protein descriptors from the segments of the [[http://jeff.cis.cabrillo.edu/datasets/ebola|Ebola virus reference genome]] that constitute open reading frames, or potential protein products of Ebola:
% cs19_protein <<
-----
====== Submission ======
Submit ''cs19_protein.h'' via [[info:turnin]].
{{https://jeff.cis.cabrillo.edu/images/feedback-robot.png?nolink }} //**Feedback Robot**//
This project has a feedback robot that will run some tests on your submission and provide you with a feedback report via email within roughly one minute.
Please read the feedback carefully.
====== Due Date and Point Value ======
Due at 23:59:59 on the date listed on the [[:syllabus|syllabus]].
''Assignment 07'' is worth 60 points.
Possible point values per category:
---------------------------------------
Correctly implemented member         60
functions, split roughly evenly
(constructors and to_string() must
work to receive much credit)
Possible deductions:
  Repeated/no access to
  amino-monoisotopic-mass           25%
  Style and practices            10–20%
Possible extra credit:
  Submission via Git                 5%
---------------------------------------