Added more documentation

YACS-RCOS · Dec 3, 2023 · 7fd741e
1 parent 32ae5a5
Showing 8 changed files with 80 additions and 19 deletions.
diff --git a/rpi_data/modules/READEME.md b/rpi_data/modules/READEME.md
@@ -10,11 +10,12 @@ selenium
 # Course Parser
 Hopefully this will be the last one.
 The relevant files in the folder are csv_to_course.py, course.py, headless_login.py, new_parse.py, and parse_runner.py
-The other files in the folder are legacy code that was used in the old web scraper. For now it they will remain in here as there are some pieces of code that could be useful if future edge cases pop up.
+The other files in the folder are legacy code that was used in the old web scraper. For now they will remain here, as some pieces of that code could be useful if future edge cases pop up.
 
 # How to run
-run parse_runner.py with a term as the command argument. Rhe term is formatted as termYEAR. If a csv doesn't exist yet it'll do a full parse, if one does it'll immediately start updating.
-# Common issues
+Run parse_runner.py with a term as the command-line argument. The term is formatted as termYEAR. If the specified csv doesn't exist, it'll do a full parse; if it does, it'll immediately start updating.
+
+# Common issues with SIS scraping
 
 ------------------------------------------------------------------------------------------------------------------------------------------
 Sel | CRN | Maj | Cod | Sec | Cmp | Crd | Nme | Dys | Tme | Cap | Act | Rem | WLC | WLA | WLR | XLC | XLA | XLR | Prof | Date | Loc | Attr
@@ -41,4 +42,49 @@ SR  |90453|BMED |6940 |01   | T   | 1-9 | REB | TBA       | 0 | 0   | 0  | 0   |
 -----------------------------------------------------------------------------------------------------------------------------------------
 because of the use of colspan in the days column. This means that our indices are off when we start formatting and processing stuff, which crashes the web scraper. We get around this by inserting a TBA for the value of the colspan that we see.
 
-These are the two most common offenders, but other issues can pop up, so generally, a row should always have exactly 21 things in it before we begin processing. Many issues that pop up with the webscraper are related to the rows not having a length of 21.
+We also split some fields into two different fields, namely the start and end times of a class (eg. 2:00pm - 3:50pm) and the start and end dates of the class (eg. 01/08-04/24).
+However, courses often have these fields as TBA or blank, eg. ADMN 1030, which looks like this:
+
+------------------------------------------------------------------------------------------------------------------------------------------
+SR  |93972|ADMN |1030 |01   | T   | 0.00 | AXPA |   |  TBA |1000| 443| 557| 0  | 0   | 0   | 0   | 0  | 0   |Cary  |01-04 |TBA  |
+------------------------------------------------------------------------------------------------------------------------------------------
+
+As you can see, the time field is TBA, which we can't easily divide into two separate fields. At the moment we just add in two TBAs in this case. However, if another web scraper is needed or this one breaks, this may be a source of failure.
+
+These are some of the more common offenders, but other issues can pop up, so generally, a row should always have exactly 21 things in it before we begin processing. Many issues that pop up with the webscraper are related to the rows not having a length of 21.
+
+# Common Issues with catalog scraping
+
+Unfortunately, SIS scraping is relatively simple compared to catalog scraping, which has many issues.
+Most of these issues will probably (hopefully) disappear when the catalog API is implemented.
+However, I am under the assumption that a speedup is all we can expect from the API.
+SIS will not give us everything that we want, in particular the prerequisites and corequisites of a course. To get those we need to scrape the catalog, in particular this link:
+https://sis.rpi.edu/rss/bwckctlg.p_disp_course_detail?cat_term_in=?&subj_code_in=?&crse_numb_in=?
+where you replace the ?'s with a basevalue (the integer representation of a semester, eg. Spring 2024 -> 202401), the major, and the course code.
+
+Because there is notably less information to parse, there are fewer issues with the catalog at present, though some of the issues are more severe.
+![Alt text](image-1.png)
+
+This is as close to the ideal course as one can find: there is a clear list of prerequisites and corequisites, as well as a description. (There is a slight issue where it's listed as "Prerequisites/Corequisites: Prerequisite:" instead of "Prerequisites/Corequisites: Prerequisites:" like other courses, but it's pretty good beyond that.)
+
+However, there are many courses that do not follow this format, for example:
+
+![Alt text](image-3.png)
+
+Even though capstone is listed as having prerequisites or corequisites, it only has prerequisites, and that is difficult for a computer to distinguish, namely because it is missing the "Prerequisites:" or "Corequisites:" label that other courses have. For reasons mentioned below, this is not too big of an issue for prerequisites, as there is a consistent way to get those, but getting corequisites consistently is difficult.
+
+![Alt text](image-2.png)
+
+This is another case of weird prerequisite and corequisite formatting, where it is difficult to parse the two.
+
+![Alt text](image-4.png)
+
+This is RCOS for next semester. However, if you did not already know the course code, it would be nearly impossible to tell that this was RCOS. So, when parsed, there will be no prerequisites, corequisites, or description for the course, even though that does not reflect the actual course.
+
+It is worth mentioning that there are two prerequisite fields. One is called prerequisites, ie "Prerequisites/Corequisites: Prerequisites: CSCI 1200 and Introduction to Calculus (MATH 1010 or MATH 1500 or MATH 1020 or MATH 2010); MATH 1020 is strongly recommended.", and the other is called raw, or raw prerequisites in the database. Raw is this section of the webpage:
+
+![Alt text](image-5.png)
+
+When you click on a course in the explore page, this is the information that is displayed as the prerequisites of a course. Notably, raw is actually reliable, so the issues with prerequisites and corequisites mentioned here mostly concern corequisites.
+
+However, aside from raw, all of the other situations are unique problems that have no solution, or only a limited one, in the webscraper. This is especially true of the corequisite problem.
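
As an illustration of the catalog link described in the README text above, here is a minimal sketch of filling in the three `?` placeholders. The helper name `build_catalog_url` and the constant `CATALOG_URL` are hypothetical and not part of the repository; only the URL, its three query parameters, and the ITWS 2110 example come from this commit.

```python
from urllib.parse import urlencode

# Base of the catalog detail page, as given in the README.
CATALOG_URL = "https://sis.rpi.edu/rss/bwckctlg.p_disp_course_detail"


def build_catalog_url(basevalue: int, major: str, code: str) -> str:
    """Fill in the three '?' placeholders of the catalog link.

    basevalue is the integer form of a semester (Spring 2024 -> 202401),
    major is the subject code (eg. "ITWS"), code is the course number.
    """
    query = urlencode({
        "cat_term_in": basevalue,
        "subj_code_in": major,
        "crse_numb_in": code,
    })
    return f"{CATALOG_URL}?{query}"


if __name__ == "__main__":
    print(build_catalog_url(202401, "ITWS", "2110"))
```

Using `urlencode` rather than string formatting is only a design preference here: it escapes any unexpected characters in the major or course code.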
diff --git a/rpi_data/modules/image-1.png b/rpi_data/modules/image-1.png
diff --git a/rpi_data/modules/image-2.png b/rpi_data/modules/image-2.png
diff --git a/rpi_data/modules/image-3.png b/rpi_data/modules/image-3.png
diff --git a/rpi_data/modules/image-4.png b/rpi_data/modules/image-4.png
diff --git a/rpi_data/modules/image-5.png b/rpi_data/modules/image-5.png
diff --git a/rpi_data/modules/image.png b/rpi_data/modules/image.png
diff --git a/rpi_data/modules/new_parse.py b/rpi_data/modules/new_parse.py
@@ -120,9 +120,13 @@ def findAllSubjectCodes(driver) -> dict():
     return code_school_dict
 
 
-#For cases where courses are missing a significant amount of information. Exclusevly (need to test this though to see if anything slips through)
-#used for test blocks, recitations, and labs
+#For courses that do not have a crn. 99% of the time, these will be lab blocks, test blocks, or recitations
 def processSpecial(info, prevrow) -> list[str]:
+    #If this is ever called on an incorrect course.
+    #Shouldn't happen but who knows
+    if prevrow is None:
+        print("course has no crn but first in major")
+        return info #Maybe just exit the program instead?
     tmp = formatTimes(info)
     tmp[18] = formatTeachers(tmp[18])
     info = prevrow
@@ -189,10 +193,11 @@ def formatCredits(info):
     return int(float(info[4]))
 
 #Given a row in sis, process the data in said row including crn, course code, days, seats, etc
+#Note that we remove a lot of info from the row in SIS, this is to match the format that the csv is expecting
+#See Fall 2023 or any other csv in the repository.
 def processRow(data: list[str], prevrow: list[str], year: int) -> list[str]:
     info = []
-    #Ignore the 1st element because that's the status on whether a course is open to register or not, and other parts of the app
-    #Can tell that information to users
+    #The first and last elements of the row are not useful to us, as other parts of the application handle those
     for i in range(1, len(data) - 1, 1):
         #Edge case where the registrar decides to make a column an inconsistent width.
         #See MGMT 2940 - Readings in MGMT in spring 2024.
@@ -204,17 +209,19 @@ def processRow(data: list[str], prevrow: list[str], year: int) -> list[str]:
         else:
             info.append(data[i].text)
     if len(info) != 21:
-        print("error in: ")
+        print("Length error in: ")
         print(info[0])
     # info[0] is crn, info[1] is major, 2 - course code, 3- section, 4 - if class is on campus (most are), 5 - credits, 6 - class name 
-    #info[7] is days, info[8] is time, info[9] - info[17] are seat cap, act, rem, waitlist, and crosslist
+    #info[7] is days, info[8] is time, info[9] is total course capacity, info[10] is number of students enrolled,
+    #info[11] is the number of seats left, info[12] - waitlist capacity, info[13] - waitlist enrolled, info[14] - waitlist spots left,
+    #info[15] - crosslist capacity, info[16] - crosslist enrolled, info[17] - crosslist seats left,
     #info[18] are the profs, info[19] are days of the sem that the course spans, and info[20] is location
     #Remove index[4] because most classes are on campus, with exceptions for some grad and doctoral courses.    
     info.pop(4)
 
     #Note that this will shift the above info down by 1 to
     # info[0] crn, info[1]  major, 2 - course code, 3- section, 4 - credits, 5 - class name 
-    #info[6] is days, info[7] time, info[8] - info[16] seat cap, act, rem, waitlist, and crosslist
+    #info[6] is days, info[7] time, info[8] - info[16] are capacity, enrolled, and remaining for the course, waitlist, and crosslist
     #info[17] profs, info[18] days of sem, and info[19] location
     #The above info is what we are working with for the rest of the method
 
@@ -277,24 +284,32 @@ def getCourseInfo(driver, year:str, schools : dict) -> list:
                 c.addSchool("Interdisciplinary and Other")
             courses.append(c)
     return courses
-#Given a url for a course, as well as the course code and major, return a list of prereqs, coreqs, and raw
+#Given the web response for a course's catalog page, as well as the course code and major, return a list of prereqs, coreqs, raw, and description of the course
+#Eg. ITWS 2110 - https://sis.rpi.edu/rss/bwckctlg.p_disp_course_detail?cat_term_in=202401&subj_code_in=ITWS&crse_numb_in=2110 
+# Prereqs - ITWS 1100
+# Coreqs - CSCI 1200
+# Raw -  Undergraduate level CSCI 1200 Minimum Grade of D and Undergraduate level ITWS 1100 Minimum Grade of D or Prerequisite Override 100 
+# Description - This course involves a study of the methods used to extract and deliver dynamic information on the World Wide Web.
+# The course uses a hands-on approach in which students actively develop Web-based software systems.
+# Additional topics include installation, configuration, and management of Web servers.
+# Students are required to have access to a PC on which they can install software such as a Web server and various programming environments.
 def getReqFromLink(webres, courseCode, major) -> list:
     page = webres.content
     soup = bs(page, "html.parser")
     body = soup.find('td', class_='ntdefault')
+    #The page is full of \n\n's for some reason, and this nicely splits it into sections
     classInfo = body.text.strip('\n\n').split('\n\n')
     for i in range(0,len(classInfo),1):
         while '\n' in classInfo[i]:
+            #Some \n's can make it into the parsed data, so we need to get rid of them.
             classInfo[i] = classInfo[i].replace('\n','')
     key = "Prerequisites/Corequisites"
     preKey = "Prerequisite"
     prereqs = ""
     coreqs = ""
     raw = ""
     desc = classInfo[0]
-    #If the description starts with a number, set it to nothing.
-    #Only happens is the course does not have a description and skips into credit value or something else.
-    #
+    #If the course does not have a description, usually this means that classInfo[0] will be the credit value.
     if desc.strip()[0].isdigit():
         desc = ""
     for i in range(1, len(classInfo)):
@@ -311,23 +326,23 @@ def getReqFromLink(webres, courseCode, major) -> list:
                 prereqs = combo[len(preKey):]
             else:
                 #Default case where someone forgets the words we're looking for
-                #Note that there are still more edge cases(looking at you csci 6560 and 2110)
+                #Note that there are still more edge cases (looking at you CSCI 6560 and 2110 in spring 2024)
                 prereqs = combo
             prereqs = prereqs[prereqs.find(' '):255].strip()
             coreqs = coreqs[coreqs.find(' '):255].strip()
         if classInfo[i].strip() == (preKey + "s:"):
             raw = classInfo[i+1].strip()
     retList = [prereqs, coreqs, raw, desc]
     return retList
-#Add the prereqs for a course
+#Add the prereqs for a course to that course
 def getReqForClass(course: Course) -> None:
     semester = getSemester(course)
     url = "https://sis.rpi.edu/rss/bwckctlg.p_disp_course_detail?cat_term_in={}&subj_code_in={}&crse_numb_in={}".format(semester, course.major, course.code)
     session = requests.session()
     webres = session.get(url)
     course.addReqsFromList(getReqFromLink(webres, course.code, course.major))
-#Given a course, get the integer representation of that course's semester
-def getSemester(course: Course):
+#Given a course, return the basevalue of that course's semester, eg 2024-01 is returned as 202401
+def getSemester(course: Course) -> int:
     dates = course.sdate.split("-")
     month = dates[1]
     year = dates[0]
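
The hunk ends before the return statement of `getSemester`, so the rest of the function is not shown here. A minimal sketch of the conversion its comment describes (2024-01 returned as 202401) might look like the following. It assumes the start date is a string beginning with `YYYY-MM`, and the function name `basevalue_from_start_date` is hypothetical; the real function may handle the month differently.

```python
def basevalue_from_start_date(sdate: str) -> int:
    """Turn a start date such as "2024-01-08" into a basevalue such as 202401.

    Assumption: sdate begins with a four-digit year and a two-digit month
    separated by "-", mirroring the split("-") seen in getSemester.
    """
    dates = sdate.split("-")
    year = dates[0]
    month = dates[1]
    # Concatenate the strings first so the leading zero of the month is kept.
    return int(year + month)


if __name__ == "__main__":
    print(basevalue_from_start_date("2024-01-08"))
```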